Work on 2015-07-20

I spent the last few days learning about the openSUSE Ceph installation process. I ran into some issues, and I’m not done yet, so these are just my working notes for now. Once complete, I’ll write up the process on my regular blog.

Prerequisite: build a tool to create and destroy small clusters quickly

I needed a way to quickly provision and destroy virtual machines that were well suited to run small Ceph clusters. I mostly run libvirt / kvm in my home lab, and I didn’t find any solutions tailored to that platform, so I wrote ceph-libvirt-clusterer.

Ceph-libvirt-clusterer lets me clone a template virtual machine and attach as many OSD disks as I’d like in the process. I’m really happy with the tool considering that I only have a day’s worth of work in it, and I got to learn some details of the libvirt API and python bindings in the process.
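
For context, the tool just automates what I'd otherwise do by hand with the stock libvirt CLI tools. A rough sketch of the manual equivalent (domain names and image paths here are made up, and the tool itself drives the python bindings rather than these commands):

# clone the template, then create and attach an 8GB OSD disk
virt-clone --original tumbleweed-template --name tinyceph-00 --auto-clone
qemu-img create -f qcow2 /var/lib/libvirt/images/tinyceph-00-osd0.qcow2 8G
virsh attach-disk tinyceph-00 /var/lib/libvirt/images/tinyceph-00-osd0.qcow2 vdb --subdriver qcow2 --persistent
virsh start tinyceph-00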

Build a template machine

I built a template machine with openSUSE Tumbleweed and completed the following preliminary configuration (roughly the commands sketched after this list):

  • created ceph user
  • ceph user has an SSH key
  • ceph user’s public key is in the ceph user’s authorized_keys file
  • ceph user is configured for passwordless sudo
  • emacs is installed (not strictly necessary :-) )
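
Roughly speaking, the template prep boiled down to commands along these lines (a sketch from memory, not a transcript):

# as root on the template
useradd -m ceph
echo 'ceph ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/ceph
zypper install emacs

# as the ceph user: create a key and trust it for intra-cluster SSH
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys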

Provision a cluster

I used ceph-libvirt-clusterer to create a four-node cluster; each node had two 8GB OSD drives attached.

Install Ceph with ceph-deploy

Once the machines were built, I followed the SUSE Enterprise Storage documentation.

The ceph packages aren’t yet in the mainline repositories, so I added the filesystems:ceph OBS repository to the admin node:

$ sudo zypper ar -f http://download.opensuse.org/repositories/filesystems:/ceph/openSUSE_Tumbleweed/ ceph
$ sudo zypper update
Retrieving repository 'ceph' metadata ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------[\]
 
New repository or package signing key received:
 
Repository: ceph
Key Name: filesystems OBS Project <filesystems@build.opensuse.org>
Key Fingerprint: B1FB5374 87204722 05FA6019 98C97FE7 324E6311
Key Created: Mon 12 May 2014 10:34:19 AM EDT
Key Expires: Wed 20 Jul 2016 10:34:19 AM EDT
Rpm Name: gpg-pubkey-324e6311-5370dbeb
 

Do you want to reject the key, trust temporarily, or trust always? [r/t/a/? shows all options] (r): a
Retrieving repository 'ceph' metadata .........................................................................................................................................................................[done]
Building repository 'ceph' cache ..............................................................................................................................................................................[done]
Loading repository data...
Reading installed packages...


And the ceph packages were visible:

tim@linux-7d21:~> zypper search ceph
Loading repository data...
Reading installed packages...
 
S | Name               | Summary                                            | Type
--+--------------------+----------------------------------------------------+-----------
  | ceph               | User space components of the Ceph file system      | package
  | ceph               | User space components of the Ceph file system      | srcpackage
  | ceph-common        | Ceph Common                                        | package
  | ceph-deploy        | Admin and deploy tool for Ceph                     | package
  | ceph-deploy        | Admin and deploy tool for Ceph                     | srcpackage
  | ceph-devel-compat  | Compatibility package for Ceph headers             | package
  | ceph-fuse          | Ceph fuse-based client                             | package
  | ceph-libs-compat   | Meta package to include ceph libraries             | package
  | ceph-radosgw       | Rados REST gateway                                 | package
  | ceph-test          | Ceph benchmarks and test tools                     | package
  | libcephfs1         | Ceph distributed file system client library        | package
  | libcephfs1-devel   | Ceph distributed file system headers               | package
  | python-ceph-compat | Compatibility package for Cephs python libraries   | package
  | python-cephfs      | Python libraries for Ceph distributed file system  | package


First issue: python was missing on the other nodes

When I installed ceph-deploy on the admin node, python was also installed. The other nodes were still running with a bare-minimum configuration from the Tumbleweed install, so python was missing, and ceph-deploy’s install step failed.

I installed Ansible to correct the problem on all nodes simultaneously, but Ansible requires python on the remote side, too. That meant I had to manually install python on the remaining three nodes just like sysadmins had to do years ago.
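
In practice that was a quick ssh loop from the admin node, something like this (the addresses are the three non-admin nodes used later in this post):

for node in 192.168.122.122 192.168.122.123 192.168.122.124; do
    ssh ceph@$node 'sudo zypper --non-interactive install python'
done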

Second issue: all nodes need the OBS repository

I didn’t add the OBS repository to the remaining three nodes because I wanted to see if ceph-deploy would add it automatically. I didn’t expect that to be the case, but since this version of ceph-deploy came directly from SUSE, there was a chance.

It didn’t add the repository, but fortunately Ansible works now, so I added it to every node at once:

ceph@linux-7d21:~/tinyceph> ansible -i ansible-inventory all -a "sudo zypper ar -f http://download.opensuse.org/repositories/filesystems:/ceph/openSUSE_Tumbleweed/ ceph"
192.168.122.122 | success | rc=0 >>
Adding repository 'ceph' [......done]
Repository 'ceph' successfully added
Enabled : Yes
Autorefresh : Yes
GPG Check : Yes
URI : http://download.opensuse.org/repositories/filesystems:/ceph/openSUSE_Tumbleweed/
 
# and three more nodes worth of output...
 
ceph@linux-7d21:~/tinyceph> ansible -i ansible-inventory all -a "sudo zypper --gpg-auto-import-keys update"


Once both of these commands completed, ceph-deploy install worked as expected.
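
For reference, that step is just ceph-deploy’s install subcommand pointed at every node; in my case it was along the lines of:

ceph@linux-7d21:~/tinyceph> ceph-deploy install 192.168.122.121 192.168.122.122 192.168.122.123 192.168.122.124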

Third issue: I was using IP addresses

ceph-deploy new complains when provided with IP addresses:

ceph@linux-7d21:~/tinyceph> ceph-deploy new 192.168.122.121 192.168.122.122 192.168.122.123 192.168.122.124
usage: ceph-deploy new [-h] [--no-ssh-copykey] [--fsid FSID]
                       [--cluster-network CLUSTER_NETWORK]
                       [--public-network PUBLIC_NETWORK]
                       MON [MON ...]
ceph-deploy new: error: 192.168.122.121 must be a hostname not an IP


In the future, it’d be pretty cool if ceph-libvirt-clusterer supported updating DNS records so I didn’t need to resort to the /etc/hosts Ansible playbook that I used today:

---
- hosts: all
  sudo: yes
  tasks:
    - name: add tinyceph-00
      lineinfile: dest=/etc/hosts line='192.168.122.121 tinyceph-00'
    - name: add tinyceph-01
      lineinfile: dest=/etc/hosts line='192.168.122.122 tinyceph-01'
    - name: add tinyceph-02
      lineinfile: dest=/etc/hosts line='192.168.122.123 tinyceph-02'
    - name: add tinyceph-03
      lineinfile: dest=/etc/hosts line='192.168.122.124 tinyceph-03'
- hosts: 192.168.122.121
  sudo: yes
  tasks:
    - name: update hostname
      lineinfile: dest=/etc/hostname line='tinyceph-00' state=present regexp=linux-7d21
- hosts: 192.168.122.122
  sudo: yes
  tasks:
    - name: update hostname
      lineinfile: dest=/etc/hostname line='tinyceph-01' state=present regexp=linux-7d21
- hosts: 192.168.122.123
  sudo: yes
  tasks:
    - name: update hostname
      lineinfile: dest=/etc/hostname line='tinyceph-02' state=present regexp=linux-7d21
- hosts: 192.168.122.124
  sudo: yes
  tasks:
    - name: update hostname
      lineinfile: dest=/etc/hostname line='tinyceph-03' state=present regexp=linux-7d21

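For the record, applying the playbook is just the usual invocation (the playbook filename is whatever I happened to save it as):

ceph@linux-7d21:~/tinyceph> ansible-playbook -i ansible-inventory tinyceph-hosts.yml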

Fourth issue: Tumbleweed uses systemd, but ceph-deploy doesn’t expect that

The monitor creation step tried to start the mon through the old init script:

[ceph_deploy.mon][INFO  ] distro info: openSUSE 20150714 x86_64
[tinyceph-03][DEBUG ] determining if provided host has same hostname in remote
[tinyceph-03][DEBUG ] get remote short hostname
[tinyceph-03][DEBUG ] deploying mon to tinyceph-03
[tinyceph-03][DEBUG ] get remote short hostname
[tinyceph-03][DEBUG ] remote hostname: tinyceph-03
[tinyceph-03][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[tinyceph-03][DEBUG ] create the mon path if it does not exist
[tinyceph-03][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-tinyceph-03/done
[tinyceph-03][DEBUG ] create a done file to avoid re-doing the mon deployment
[tinyceph-03][DEBUG ] create the init path if it does not exist
[tinyceph-03][INFO ] Running command: sudo /etc/init.d/ceph -c /etc/ceph/ceph.conf start mon.tinyceph-03
[tinyceph-03][ERROR ] Traceback (most recent call last):
[tinyceph-03][ERROR ] File "/usr/lib/python2.7/site-packages/remoto/process.py", line 94, in run
[tinyceph-03][ERROR ] reporting(conn, result, timeout)
[tinyceph-03][ERROR ] File "/usr/lib/python2.7/site-packages/remoto/log.py", line 13, in reporting
[tinyceph-03][ERROR ] received = result.receive(timeout)
[tinyceph-03][ERROR ] File "/usr/lib/python2.7/site-packages/execnet/gateway_base.py", line 701, in receive
[tinyceph-03][ERROR ] raise self._getremoteerror() or EOFError()
[tinyceph-03][ERROR ] RemoteError: Traceback (most recent call last):
[tinyceph-03][ERROR ] File "<string>", line 1033, in executetask
[tinyceph-03][ERROR ] File "<remote exec>", line 12, in _remote_run
[tinyceph-03][ERROR ] File "/usr/lib64/python2.7/subprocess.py", line 710, in __init__
[tinyceph-03][ERROR ] errread, errwrite)
[tinyceph-03][ERROR ] File "/usr/lib64/python2.7/subprocess.py", line 1335, in _execute_child
[tinyceph-03][ERROR ] raise child_exception
[tinyceph-03][ERROR ] OSError: [Errno 2] No such file or directory
[tinyceph-03][ERROR ]
[tinyceph-03][ERROR ]
[ceph_deploy.mon][ERROR ] Failed to execute command: /etc/init.d/ceph -c /etc/ceph/ceph.conf start mon.tinyceph-03
[ceph_deploy][ERROR ] GenericError: Failed to create 4 monitors


Sure enough, a little manual inspection revealed that there is no file at /etc/init.d/ceph and that the package is integrated with systemd instead:

ceph@tinyceph-00:~/tinyceph> ls -la /etc/init.d/ceph
ls: cannot access /etc/init.d/ceph: No such file or directory
ceph@tinyceph-00:~/tinyceph> sudo service ceph status
* ceph.target - ceph target allowing to start/stop all ceph*@.service instances at once
Loaded: loaded (/usr/lib/systemd/system/ceph.target; disabled; vendor preset: disabled)
Active: inactive (dead)
 
Jul 19 23:50:35 tinyceph-00 systemd[1]: Reached target ceph target allowing to start/stop all ceph*@.service instances at once.
Jul 19 23:50:35 tinyceph-00 systemd[1]: Starting ceph target allowing to start/stop all ceph*@.service instances at once.
Jul 19 23:50:47 tinyceph-00 systemd[1]: Stopped target ceph target allowing to start/stop all ceph*@.service instances at once.
Jul 19 23:50:47 tinyceph-00 systemd[1]: Stopping ceph target allowing to start/stop all ceph*@.service instances at once.
ceph@tinyceph-00:~/tinyceph> sudo service ceph start
ceph@tinyceph-00:~/tinyceph> sudo service ceph status
* ceph.target - ceph target allowing to start/stop all ceph*@.service instances at once
Loaded: loaded (/usr/lib/systemd/system/ceph.target; disabled; vendor preset: disabled)
Active: active since Mon 2015-07-20 00:24:01 EDT; 4s ago
 
Jul 20 00:24:01 tinyceph-00 systemd[1]: Reached target ceph target allowing to start/stop all ceph*@.service instances at once.
Jul 20 00:24:01 tinyceph-00 systemd[1]: Starting ceph target allowing to start/stop all ceph*@.service instances at once.


I learned that this is a known bug, and I’ll try all of this again with an older version of openSUSE.

… and that’s where I’m calling it a night. I’ll be back at it this week.

Work on 2014-07-30

Last time I worked on Kallithea’s CI, I got some errors. On a fresh Ubuntu 14.04 VM without Docker, I get the following test results:

In a Virtual Machine

sqlite: 0 errors, 2 skipped

mysql: 0 errors, 2 skipped

postgresql: 1 error, 2 skipped

details:

======================================================================
ERROR: test_index_with_anonymous_access_disabled (kallithea.tests.functional.test_home.TestHomeController)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/packer/src/kallithea-pg/kallithea/tests/functional/test_home.py", line 43, in test_index_with_anonymous_access_disabled
    status=302)
  File "/home/packer/src/kallithea/.venv/local/lib/python2.7/site-packages/WebTest-1.4.3-py2.7.egg/webtest/app.py", line 759, in get
    expect_errors=expect_errors)
  File "/home/packer/src/kallithea/.venv/local/lib/python2.7/site-packages/WebTest-1.4.3-py2.7.egg/webtest/app.py", line 1121, in do_request
    self._check_status(status, res)
  File "/home/packer/src/kallithea/.venv/local/lib/python2.7/site-packages/WebTest-1.4.3-py2.7.egg/webtest/app.py", line 1160, in _check_status
    "Bad response: %s (not %s)", res_status, status)
AppError: Bad response: 200 OK (not 302)

----------------------------------------------------------------------
Ran 1482 tests in 311.450s

FAILED (SKIP=2, errors=1)

In a Docker Container

sqlite

I’m betting that these messages are a canary that will help figure out the sqlite failures:

kallithea_1 | not trusting file /code/.hg/hgrc from untrusted user 1000, group 1000
kallithea_1 | not trusting file /tmp/rc_test_lPm4Rl/vcs_test_hg/.hg/hgrc from untrusted user 502, group root
kallithea_1 | not trusting file /tmp/rc_test_lPm4Rl/vcs_test_hg/.hg/hgrc from untrusted user 502, group root
kallithea_1 | not trusting file /tmp/rc_test_lPm4Rl/vcs_test_hg/.hg/hgrc from untrusted user 502, group root

Here’s the full list of error details:

kallithea_1 | ======================================================================
kallithea_1 | ERROR: test_index_with_anonymous_access_disabled (kallithea.tests.functional.test_home.TestHomeController)
kallithea_1 | ----------------------------------------------------------------------
kallithea_1 | Traceback (most recent call last):
kallithea_1 |   File "/code/kallithea/tests/functional/test_home.py", line 43, in test_index_with_anonymous_access_disabled
kallithea_1 |     status=302)
kallithea_1 |   File "/usr/local/lib/python2.7/dist-packages/webtest/app.py", line 759, in get
kallithea_1 |     expect_errors=expect_errors)
kallithea_1 |   File "/usr/local/lib/python2.7/dist-packages/webtest/app.py", line 1121, in do_request
kallithea_1 |     self._check_status(status, res)
kallithea_1 |   File "/usr/local/lib/python2.7/dist-packages/webtest/app.py", line 1160, in _check_status
kallithea_1 |     "Bad response: %s (not %s)", res_status, status)
kallithea_1 | AppError: Bad response: 200 OK (not 302)
kallithea_1 |
kallithea_1 | ======================================================================
kallithea_1 | FAIL: test_create_non_ascii (kallithea.tests.functional.test_admin_repos.TestAdminReposControllerGIT)
kallithea_1 | ----------------------------------------------------------------------
kallithea_1 | Traceback (most recent call last):
kallithea_1 |   File "/code/kallithea/tests/functional/test_admin_repos.py", line 103, in test_create_non_ascii
kallithea_1 |     self.assertEqual(response.json, {u'result': True})
kallithea_1 | AssertionError: {u'result': False} != {u'result': True}
kallithea_1 | - {u'result': False}
kallithea_1 | ?             ^^^^
kallithea_1 |
kallithea_1 | + {u'result': True}
kallithea_1 | ?             ^^^
kallithea_1 |
kallithea_1 |     """Fail immediately, with the given message."""
kallithea_1 | >>  raise self.failureException("{u'result': False} != {u'result': True}\n- {u'result': False}\n?             ^^^^\n\n+ {u'result': True}\n?             ^^^\n")
kallithea_1 |
kallithea_1 |
kallithea_1 | ======================================================================
kallithea_1 | FAIL: test_delete_non_ascii (kallithea.tests.functional.test_admin_repos.TestAdminReposControllerGIT)
kallithea_1 | ----------------------------------------------------------------------
kallithea_1 | Traceback (most recent call last):
kallithea_1 |   File "/code/kallithea/tests/functional/test_admin_repos.py", line 420, in test_delete_non_ascii
kallithea_1 |     self.assertEqual(response.json, {u'result': True})
kallithea_1 | AssertionError: {u'result': False} != {u'result': True}
kallithea_1 | - {u'result': False}
kallithea_1 | ?             ^^^^
kallithea_1 |
kallithea_1 | + {u'result': True}
kallithea_1 | ?             ^^^
kallithea_1 |
kallithea_1 |     """Fail immediately, with the given message."""
kallithea_1 | >>  raise self.failureException("{u'result': False} != {u'result': True}\n- {u'result': False}\n?             ^^^^\n\n+ {u'result': True}\n?             ^^^\n")
kallithea_1 |
kallithea_1 |
kallithea_1 | ======================================================================
kallithea_1 | FAIL: test_create_non_ascii (kallithea.tests.functional.test_admin_repos.TestAdminReposControllerHG)
kallithea_1 | ----------------------------------------------------------------------
kallithea_1 | Traceback (most recent call last):
kallithea_1 |   File "/code/kallithea/tests/functional/test_admin_repos.py", line 103, in test_create_non_ascii
kallithea_1 |     self.assertEqual(response.json, {u'result': True})
kallithea_1 | AssertionError: {u'result': False} != {u'result': True}
kallithea_1 | - {u'result': False}
kallithea_1 | ?             ^^^^
kallithea_1 |
kallithea_1 | + {u'result': True}
kallithea_1 | ?             ^^^
kallithea_1 |
kallithea_1 |     """Fail immediately, with the given message."""
kallithea_1 | >>  raise self.failureException("{u'result': False} != {u'result': True}\n- {u'result': False}\n?             ^^^^\n\n+ {u'result': True}\n?             ^^^\n")
kallithea_1 |
kallithea_1 |
kallithea_1 | ======================================================================
kallithea_1 | FAIL: test_delete_non_ascii (kallithea.tests.functional.test_admin_repos.TestAdminReposControllerHG)
kallithea_1 | ----------------------------------------------------------------------
kallithea_1 | Traceback (most recent call last):
kallithea_1 |   File "/code/kallithea/tests/functional/test_admin_repos.py", line 420, in test_delete_non_ascii
kallithea_1 |     self.assertEqual(response.json, {u'result': True})
kallithea_1 | AssertionError: {u'result': False} != {u'result': True}
kallithea_1 | - {u'result': False}
kallithea_1 | ?             ^^^^
kallithea_1 |
kallithea_1 | + {u'result': True}
kallithea_1 | ?             ^^^
kallithea_1 |
kallithea_1 |     """Fail immediately, with the given message."""
kallithea_1 | >>  raise self.failureException("{u'result': False} != {u'result': True}\n- {u'result': False}\n?             ^^^^\n\n+ {u'result': True}\n?             ^^^\n")
kallithea_1 |
kallithea_1 |
kallithea_1 | ----------------------------------------------------------------------
kallithea_1 | Ran 1479 tests in 281.475s
kallithea_1 |
kallithea_1 | FAILED (SKIP=2, errors=1, failures=4)

Boo… file permissions were not the problem.

I just removed the following from my fig config:

volumes:
 - .:/code/

And added the following to my Dockerfile:

RUN useradd -d /home/kallithea -m -s /bin/bash -u 2000 kallithea
RUN chown -R kallithea /code
USER kallithea

… to ensure that there was no weirdness induced by running as root against files owned by a different UID than the test process. I got the same four errors, so something’s up when running in this container.

All the failing tests include the string “non_ascii” in their names.

Let’s see what locale tells us on the virtual machine:

packer@example:~/src/kallithea$ locale
LANG=en_US.utf8
LANGUAGE=en_US:
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=

… and in the Docker container:

kallithea@56a8a9afa48d:/code$ locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

The Docker container doesn’t include en_US.utf8, but it does include C.UTF-8… let’s give that a spin.
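
The quick way to test that theory is to force the locale before running the suite; a sketch (whether it ultimately belongs in the Dockerfile or in the test wrapper is a separate question):

# run the tests under C.UTF-8 instead of the POSIX default
export LANG=C.UTF-8 LC_ALL=C.UTF-8
nosetests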

kallithea_1 | Ran 1479 tests in 286.144s
kallithea_1 |
kallithea_1 | OK (SKIP=2)
kallithea_kallithea_1 exited with code 0

WOO!

Work on 2014-07-12

The unit tests now run inside containers managed by fig. I wrote two scripts to facilitate the execution:

  • integration-configs/fig-config-glue.py: reads environment variables set by fig to create a SQLAlchemy URL and update an ini file with it.
  • integration-configs/execute_tests.sh: runs the above script to update test.ini, sleeps for 10 seconds while the database starts, and then runs nosetests (a rough sketch follows this list).
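
For reference, the wrapper is nothing fancy; a rough sketch of what execute_tests.sh amounts to (exact paths and arguments may differ):

#!/bin/bash
# rewrite the sqlalchemy URL in test.ini from the env vars fig injected
python integration-configs/fig-config-glue.py
# crude wait for the linked database container to finish starting
sleep 10
nosetests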

I had been switching between the various databases just by using a different fig configuration file, but that is insufficient. Fig must be invoked with both of the following arguments:

fig -f fig-${DB_TYPE} -p kallithea-${DB_TYPE}

If a project name isn’t specified, fig won’t differentiate between the various database containers (all named “db” in the configs).

Of course, different numbers of tests fail in each configuration (including some sqlite tests that don’t fail when I run them directly on my machine without Docker in the middle), so there’s still some testing and adjusting to complete.

Another annoyance: when the tests complete, fig shuts down both containers, but fig’s exit code is always zero even if one of the containers exited with a non-zero return code. Going to ask the team if that’s by design or if they’re open to changing it. As is, I’ll need to either parse the nose output or parse the output of fig ps to give an appropriate exit code to the build server.
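
If I go the fig ps route, the sketch in my head is to map container IDs back to exit codes with docker inspect, roughly like this (assuming fig ps -q prints container IDs the way docker ps -q does):

# fail the build if any container in the project exited non-zero
for id in $(fig -f fig-${DB_TYPE} -p kallithea-${DB_TYPE} ps -q); do
    code=$(docker inspect --format '{{ .State.ExitCode }}' "$id")
    [ "$code" -ne 0 ] && exit "$code"
done
exit 0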

Work on 2014-07-11

New info regarding Docker, ENTRYPOINT, and CMD.

Read the CMD docs carefully

I thought I could use ENTRYPOINT to load up environment variables and do some voodoo with the ini file to get the database connection configured correctly:

ENTRYPOINT ["/bin/bash", "--rcfile", "/code/.figrc", "-c"]

So then anything I ran would run after /code/.figrc loaded. Except that doesn’t happen: read the CMD docs again and you’ll see that if the first argument is an executable, it gets executed via /bin/sh -c, so /code/.figrc is never loaded.

/me sighs

/me comes back after 15 minutes

These are not general purpose images. They’re specifically to run test suites. Why do I care about them working for every case? I can just write a wrapper script and call it a day.
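
The wrapper can be as dumb as sourcing the file and exec-ing whatever command was requested, something like:

#!/bin/bash
# load the fig-related environment setup, then run the requested command
source /code/.figrc
exec "$@"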

Work on 2014-07-10

I spent about an hour tonight working on flexible configurations for testing Kallithea against various databases using Fig and Docker.

Fig handles some of the dirty work of linking Docker containers together. Linked containers get environment variables that define the endpoints of the other containers. The Fig docs use Django as an example, and things look pretty easy there: since Django’s configuration file is written in Python, we can just call os.environ.get('DB_1_PORT_5432_TCP_PORT').

No such luck with Pylons and Pyramid: there we use an ini file for configuration, and I ran into a few bumps.

The paster serve command provides an avenue for command line configuration: var=value can be repeated to pass in configuration options on the command line, and the named vars can be referenced in the ini file with %(var)s. That’s good.
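
In other words, the ini file carries a placeholder and the concrete value arrives on the command line; roughly like this (the option name is illustrative, not Kallithea’s real key):

# with test.ini containing a line such as:
#   sqlalchemy.url = postgres://kallithea:secret@%(dbhost)s/kallithea_test
paster serve test.ini dbhost=192.168.122.50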

Fig doesn’t seem to support environment variables in its YAML configuration files, so paster serve test.ini dbhost=$DB_1_PORT_5432_TCP_ADDR results in the literal string “DB_1_PORT_5432_TCP_ADDR” in the configuration. That’s bad, but it can be fixed with a wrapper script.

Kallithea’s setup-db command doesn’t support the same var=value setting on the command line that paster serve supports. That’s bad, but the wrapper script can rewrite the configuration files rather than pass in values via arguments. That’s where I’m leaving off for tonight.

One other dangling question: I tried putting my Dockerfile and Fig YAML configurations in a subdirectory to keep the project root uncluttered, but it didn’t look like Docker liked being given .. in place of . as the build path. I need to confirm that: there’s a chance that something else was out of line that I didn’t notice.

EDIT: turns out that relative paths really aren’t allowed. That didn’t take long to find.

Work on 2014-07-04

Investigating how to build authentic testing hosts that look and act just like the ones Nodepool builds.

When Nodepool is updating base images, it copies all files found at openstack-infra/config/modules/openstack_project/files/nodepool/scripts to /opt/nodepool-scripts.

Each base image type, defined in Nodepool’s YAML configuration file, includes a setup attribute that names one of those scripts. The matching script is executed in an environment that includes any NODEPOOL_ variables present in the Nodepool daemon’s environment. From all that I’ve seen, this typically only includes NODEPOOL_SSH_KEY. (See jenkins_dev.pp)

So to build replica nodes for personal use, I should just need to copy the scripts to /opt/nodepool-scripts and execute the right one in my packer provisioning configuration.
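
In packer terms I expect that to boil down to a shell provisioner doing roughly this (the script name and key are placeholders I still need to confirm):

# copy the scripts the same way nodepool does
sudo mkdir -p /opt/nodepool-scripts
sudo cp -r openstack-infra/config/modules/openstack_project/files/nodepool/scripts/. /opt/nodepool-scripts/
# provide the one NODEPOOL_ variable that seems to matter, then run the image's setup script
export NODEPOOL_SSH_KEY="$(cat ~/.ssh/id_rsa.pub)"
sudo -E bash /opt/nodepool-scripts/prepare_devstack.sh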

Work on 2014-06-18

Building a system at work to create disposable machines for developer and QA use that will match production. The process looks like this:

  • Query Foreman for Puppet classes used by a hostgroup
  • Create a Puppetfile with some knowledge about where we keep modules and the names that Foreman provides
  • Run r10k to sync the modules

I did a bunch of work to extract some metadata from the Modulefiles, but that was completely unnecessary and thus stupid. Never hurts to take a step back and draw a process before jumping down a rabbit hole.

I have enough information that I could skip creating the Puppetfile altogether and just clone the repo myself, but I think I like the idea of using r10k to install modules from the Puppetfile for two reasons (a rough sketch follows the list):

  1. It ensures that stale modules are purged gracefully. (I would have just scorched the Earth and deleted the entire modules directory)
  2. It produces an artifact that can be used by other standard tools if necessary.
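
The moving parts are small; a rough sketch of the generated Puppetfile and the r10k step (module names and git URLs here are invented):

# Puppetfile generated from the Foreman class list
cat > Puppetfile <<'EOF'
mod 'apache', :git => 'https://git.example.com/puppet/apache.git'
mod 'ntp',    :git => 'https://git.example.com/puppet/ntp.git'
EOF
# sync ./Puppetfile into ./modules, purging anything stale
r10k puppetfile install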

Work on 2014-06-11

Investigating the Devstack Gate documentation. Seems to be more out of date than I originally expected, or I don’t understand where things run. It mentions a matrix job called devstack-update-vm-image, but reviewing the job list shows that jobs matching that description haven’t run in 9-10 months:

  • devstack-update-vm-image-hpcloud-az1
  • devstack-update-vm-image-hpcloud-az2
  • devstack-update-vm-image-hpcloud-az3
  • devstack-update-vm-image-rackspace
  • devstack-update-vm-image-rackspace-dfw
  • devstack-update-vm-image-rackspace-ord

Looking further, all devstack-update-vm-image and devstack-check-vms- jobs are disabled.

I suspect that all of this work migrated to the Nodepool project.

nodepool.openstack.org contains a directory listing with logs that have recent time stamps. Promising.

The log output (warning, 20MB) seems to confirm my suspicion.

Found the provider configurations minus credentials and the bootstrap scripts.

I sent an email to the openstack-infra mailing list for clarification.