I spent the last few days learning about the openSUSE Ceph installation process. I ran into some issues, and I’m not
done yet, so these are just my working notes for now. Once complete, I’ll
write up the process on my regular blog.
Prerequisite: a tool to build and destroy small clusters quickly
I needed a way to quickly provision and destroy
virtual machines that were well suited to run small Ceph clusters. I mostly
run libvirt / kvm
in my home lab, and I didn’t find any solutions tailored to that platform, so
I wrote ceph-libvirt-clusterer.
Ceph-libvirt-clusterer
lets me clone a template virtual machine and attach as many OSD disks
as I’d like in the process. I’m really happy with the tool
considering that I only have a day’s worth of work in it, and I got to
learn some details of the libvirt API and python bindings in the process.
Build a template machine
I built a template machine with openSUSE Tumbleweed and completed the following preliminary configuration steps (sketched as commands after the list):
created a ceph user
gave the ceph user an SSH key
added the ceph user's public key to its own authorized_keys file
configured the ceph user for passwordless sudo
installed emacs (not strictly necessary :-) )
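Roughly, the template prep amounts to commands like these (the exact shell, group, and package details are assumptions):
# run as root on the template VM
useradd -m ceph
echo 'ceph ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/ceph
install -d -m 700 -o ceph /home/ceph/.ssh
sudo -u ceph ssh-keygen -t rsa -N '' -f /home/ceph/.ssh/id_rsa
install -m 600 -o ceph /home/ceph/.ssh/id_rsa.pub /home/ceph/.ssh/authorized_keys
zypper --non-interactive install emacs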
Provision a cluster
I used ceph-libvirt-clusterer to create a four-node cluster, and each node had two 8 GB OSD drives attached.
The ceph packages aren’t yet in the mainline repositories, so I added the OBS repository to the admin node:
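The repository step looked roughly like this; the OBS project URL is a placeholder for wherever the packages were actually published:
# placeholder URL: substitute the real OBS project for the Ceph packages
sudo zypper addrepo http://download.opensuse.org/repositories/filesystems:/ceph/openSUSE_Tumbleweed/ ceph
sudo zypper refresh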
And ceph packages were visible:
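For example:
zypper search ceph
# the listing should now include ceph, ceph-common, ceph-deploy, ...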
First issue: python was missing on the other nodes
When I installed ceph-deploy on the admin node, python was also
installed. The other nodes were still running with a bare minimum
configuration from the tumbleweed install, so python was missing, and
ceph-deploy’s install step failed.
I installed Ansible to correct the problem on all
nodes simultaneously, but Ansible requires python on the remote side, too.
That meant I had to manually install python on the remaining three nodes just
like sysadmins had to do years ago.
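The manual step amounted to something like this, assuming the nodes answer to ceph-node2 through ceph-node4:
for host in ceph-node2 ceph-node3 ceph-node4; do
    ssh ceph@$host 'sudo zypper --non-interactive install python'
done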
Second issue: all nodes need the OBS repository
I didn’t add the OBS repository to the remaining three nodes because I
wanted to see if ceph-deploy would add it automatically. I didn’t expect
that to be the case, but since this version of ceph-deploy came directly from
SUSE, there was a chance.
It didn’t, but fortunately Ansible works now:
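Something along these lines, reusing the placeholder repository URL from above and assuming an inventory file named hosts:
ansible all -i hosts --sudo -m command \
    -a "zypper addrepo http://download.opensuse.org/repositories/filesystems:/ceph/openSUSE_Tumbleweed/ ceph"
ansible all -i hosts --sudo -m command -a "zypper --non-interactive refresh"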
Once both of these commands completed, ceph-deploy install worked as expected.
Third issue: I was using IP addresses
ceph-deploy new complains when provided with IP addresses:
In the future, it’d be pretty cool if ceph-libvirt-clusterer supported updating DNS records so I didn’t need to resort to the hosts-file Ansible playbook that I used today:
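The gist of that playbook, expressed here as ad-hoc commands with made-up addresses and hostnames:
for i in 1 2 3 4; do
    ansible all -i hosts --sudo -m lineinfile \
        -a "dest=/etc/hosts line='192.168.122.1${i} ceph-node${i}'"
done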
Fourth issue: tumbleweed uses systemd, but ceph-deploy doesn’t expect that
Sure enough, a little manual inspection revealed no file at /etc/init.d/ceph, just systemd integration:
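For example:
ls /etc/init.d/ceph
# ls: cannot access /etc/init.d/ceph: No such file or directory
systemctl list-unit-files | grep -i ceph
# the ceph units show up here instead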
I learned that this is a known bug,
and I’ll try all of this again with an older version of openSUSE.
… and that’s where I’m calling it a night. I’ll be back at it this week.
Last time I worked on Kallithea’s CI, I got some errors. On a fresh
Ubuntu 14.04 VM without docker, I get the following test results:
In a Virtual Machine
sqlite: 0 errors, 2 skipped
mysql: 0 errors, 2 skipped
postgresql: 1 error, 2 skipped
details:
======================================================================
ERROR: test_index_with_anonymous_access_disabled (kallithea.tests.functional.test_home.TestHomeController)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/packer/src/kallithea-pg/kallithea/tests/functional/test_home.py", line 43, in test_index_with_anonymous_access_disabled
status=302)
File "/home/packer/src/kallithea/.venv/local/lib/python2.7/site-packages/WebTest-1.4.3-py2.7.egg/webtest/app.py", line 759, in get
expect_errors=expect_errors)
File "/home/packer/src/kallithea/.venv/local/lib/python2.7/site-packages/WebTest-1.4.3-py2.7.egg/webtest/app.py", line 1121, in do_request
self._check_status(status, res)
File "/home/packer/src/kallithea/.venv/local/lib/python2.7/site-packages/WebTest-1.4.3-py2.7.egg/webtest/app.py", line 1160, in _check_status
"Bad response: %s (not %s)", res_status, status)
AppError: Bad response: 200 OK (not 302)
----------------------------------------------------------------------
Ran 1482 tests in 311.450s
FAILED (SKIP=2, errors=1)
In a Docker Container
sqlite
I’m betting that these messages are a canary that will help figure out the sqlite failures:
kallithea_1 | not trusting file /code/.hg/hgrc from untrusted user 1000, group 1000
kallithea_1 | not trusting file /tmp/rc_test_lPm4Rl/vcs_test_hg/.hg/hgrc from untrusted user 502, group root
kallithea_1 | not trusting file /tmp/rc_test_lPm4Rl/vcs_test_hg/.hg/hgrc from untrusted user 502, group root
kallithea_1 | not trusting file /tmp/rc_test_lPm4Rl/vcs_test_hg/.hg/hgrc from untrusted user 502, group root
I added a non-root user to the Dockerfile …
RUN useradd -d /home/kallithea -m -s /bin/bash -u 2000 kallithea
RUN chown -R kallithea /code
USER kallithea
… to ensure that there was no weirdness induced by running as root against files owned by a different UID than the test process. I got the same four errors, so something’s up when running in this container.
All the failing tests include the string “non_ascii” in their names.
Let’s see what locale tells us on the virtual machine:
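The values below are illustrative rather than actual output; the point of comparison is whether LANG and LC_ALL name a UTF-8 locale:
locale
# on the Ubuntu VM, typically something like:
#   LANG=en_US.UTF-8
#   LC_CTYPE="en_US.UTF-8"
# a bare container usually falls back to the POSIX/C locale instead,
# which would line up with only the "non_ascii" tests failing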
The unit tests now run inside of containers managed by fig. I wrote
two scripts to facilitate the execution:
integration-configs/fig-config-glue.py: reads environment variables set by fig to create a sqlalchemy URL and update an ini file with it.
integration-configs/execute_tests.sh: runs the above script, updating test.ini, then sleeps for 10 seconds while the database starts, then runs nosetests (sketched below)
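A rough sketch of execute_tests.sh based on the description above; the exact invocation of the glue script is a guess:
#!/bin/bash
set -e
# build a sqlalchemy URL from the env vars fig sets and write it into test.ini
python integration-configs/fig-config-glue.py
# give the linked database container time to start accepting connections
sleep 10
nosetests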
I had been switching between the various databases just by using a
different fig configuration file, but that is insufficient. Fig must
be invoked with both of the following arguments:
fig -f fig-${DB_TYPE} -p kallithea-${DB_TYPE}
If a project name isn’t specified, fig won’t differentiate between the
various database containers (all named “db” in the configs).
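For example (project and file names follow the fig-${DB_TYPE} pattern above):
# with distinct project names, each configuration gets its own container namespace
fig -f fig-mysql -p kallithea-mysql up
fig -f fig-postgresql -p kallithea-postgresql up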
Of course, different numbers of tests fail in each configuration
(including some sqlite tests that don’t fail when I run them directly on
my machine without docker in the middle), so there’s still some
testing and adjusting to complete.
Another annoyance: when the tests complete, fig shuts down both containers,
but fig’s exit code is always zero even if one of the containers exited
with a non-zero return code. Going to ask the team if that’s by design
or if they’re open to changing it. As is, I’ll need to either parse
the nose output or parse the output of fig ps to give an appropriate
exit code to the build server.
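One candidate workaround, assuming fig's usual <project>_<service>_<index> container naming and a service called kallithea:
# run the suite, then ask docker for the test container's real exit code
fig -f fig-sqlite -p kallithea-sqlite up
rc=$(docker inspect --format '{{ .State.ExitCode }}' kallitheasqlite_kallithea_1)
exit $rc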
I thought I could use ENTRYPOINT to load up environment variables and
do some voodoo with the ini file to get the database connection
configured correctly:
So then anything I ran would run after the /code/.figrc loaded.
Except that doesn’t happen. Read the CMD docs again, and you’ll see
that if the first argument is an executable, it will be executed
via /bin/sh -c, so /code/.figrc never gets sourced.
/me sighs
/me comes back after 15 minutes
These are not general purpose images. They’re specifically to run test
suites. Why do I care about them working for every case? I can just
write a wrapper script and call it a day.
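A minimal sketch of such a wrapper; /code/.figrc comes from above, everything else is assumed:
#!/bin/bash
# load the environment the fig setup provides, then hand off to whatever
# command the container was asked to run
source /code/.figrc
exec "$@"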
I spent about an hour tonight working on flexible configurations for
testing Kallithea against various databases using Fig and Docker.
Fig handles some of the dirty work of linking together Docker
containers. Linked containers get environment variables set to define
endpoints of the other containers. The Fig docs use Django as an example,
and things look pretty easy: since the configuration file is written
in Python, we can just call os.environ.get('DB_1_PORT_5432_TCP_PORT').
No such luck with Pylons and Pyramid, though: there we use an ini file
for configuration, and I ran into a few bumps.
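For reference, the link variables inside the Kallithea container look roughly like this (values are made up):
# run inside the container that declares a link to the "db" service
env | grep '^DB_1_PORT'
# DB_1_PORT=tcp://172.17.0.5:5432
# DB_1_PORT_5432_TCP_ADDR=172.17.0.5
# DB_1_PORT_5432_TCP_PORT=5432
# DB_1_PORT_5432_TCP_PROTO=tcp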
The paster serve command provides an avenue for command line
configuration: var=value can be repeated to pass in configuration
options on the command line, and the named vars can be referenced in
the ini file with %(var)s. That’s good.
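For example, with an ini key that interpolates the variable (the key below is illustrative; dbhost matches the variable I tried):
# test.ini:
#   sqlalchemy.db1.url = postgresql://kallithea:kallithea@%(dbhost)s/kallithea-test
paster serve test.ini dbhost=172.17.0.5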
Fig doesn’t seem to support environment variables in its YAML
configuration files, so paster serve test.ini
dbhost=$DB_1_PORT_5432_TCP_ADDR results in a literal string
“DB_1_PORT_5432_TCP_ADDR” in the configuration. That’s bad,
but it can be fixed with a wrapper script.
Kallithea’s setup-db command doesn’t support the same var=value
setting on the command line that paster serve supports. That’s bad,
but the wrapper script can rewrite the configuration files rather than
pass in values via arguments. That’s where I’m leaving off for tonight.
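A rough sketch of such a rewrite (the ini key and URL are assumptions):
# point the ini file at the linked database, then let setup-db read it
sed -i "s|^sqlalchemy.db1.url.*|sqlalchemy.db1.url = postgresql://kallithea:kallithea@${DB_1_PORT_5432_TCP_ADDR}:${DB_1_PORT_5432_TCP_PORT}/kallithea-test|" test.ini
paster setup-db test.ini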
One other dangling question: I tried putting my Dockerfile and
Fig YAML configurations in a subdirectory to keep the project root
uncluttered, but it didn’t look like Docker liked using .. in place
of . as the build context. I need to confirm that: there’s a chance that
something else was out of line that I didn’t notice.
Each base image type,
defined in Nodepool’s YAML configuration file,
includes a setup attribute that matches one of the scripts. The
matching script is executed in an environment that includes any
NODEPOOL_ variables present in the Nodepool daemon’s environment. From
all that I’ve seen, this typically only includes NODEPOOL_SSH_KEY.
(See jenkins_dev.pp)
So to build replica nodes for personal use, I should just need to copy
the scripts to /opt/nodepool-scripts and execute the right one in my
packer provisioning configuration.
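A sketch of what the packer shell provisioner could run; the script name and key are placeholders:
# assumed to run as root on the image being built
mkdir -p /opt/nodepool-scripts
cp -r nodepool-scripts/* /opt/nodepool-scripts/
cd /opt/nodepool-scripts
export NODEPOOL_SSH_KEY="ssh-rsa AAAA...not-a-real-key"
./prepare_node_devstack.sh   # whichever script the image type's setup attribute names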
Building a system at work to create disposable machines for developer
and QA use that will match production. The process looks like this (steps two and three are sketched after the list):
Query Foreman for Puppet classes used by a hostgroup
Create a Puppetfile with some knowledge about where we keep modules and the names that Foreman provides
Run r10k to sync the modules
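A sketch of steps two and three, with a made-up module name and internal git URL:
# write a Puppetfile from what Foreman reported for the hostgroup
cat > Puppetfile <<'EOF'
mod 'apache',
  :git => 'https://git.example.com/puppet-modules/apache.git'
EOF
# bring ./modules in sync with the Puppetfile (r10k purges stale modules)
r10k puppetfile install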
I did a bunch of work to extract some metadata from the Modulefiles,
but that was completely unnecessary and thus stupid. Never hurts to
take a step back and draw a process before jumping down a rabbit hole.
I have enough information that I could skip creating the Puppetfile
altogether and just clone the repo myself, but I think I like the idea
of using r10k to install modules from the Puppetfile for two reasons:
It ensures that stale modules are purged gracefully. (I would have just scorched the Earth and deleted the entire modules directory)
It produces an artifact that can be used by other standard tools if necessary.
Investigating the
Devstack Gate
documentation. Seems to be more out of date than I originally
expected, or I don’t understand where things run. It mentions a
matrix job called devstack-update-vm-image, but reviewing the
job list shows that
jobs matching that description haven’t run in 9-10 months:
devstack-update-vm-image-hpcloud-az1
devstack-update-vm-image-hpcloud-az2
devstack-update-vm-image-hpcloud-az3
devstack-update-vm-image-rackspace
devstack-update-vm-image-rackspace-dfw
devstack-update-vm-image-rackspace-ord
Looking further, all devstack-update-vm-image-* and
devstack-check-vms-* jobs are disabled.
I suspect that all of this work migrated to the
Nodepool project.
nodepool.openstack.org contains a
directory listing with logs that have recent time stamps. Promising.
The log output
(warning, 20MB) seems to confirm my suspicion.