Building a virtual machine image for one target hypervisor probably
doesn’t cut it anymore. If your organization is like most today, you
run VMware in production, and you’re investigating AWS or OpenStack
for burstability or a full blown migration, and your developers all
run VirtualBox on their laptops. Except for Marvin. Marvin runs KVM
because he’s a contributor to the project.
With Packer, we can write cross-hypervisor
machine definitions.
With Jenkins, we can run Packer on a variety of machines to provide
full hypervisor coverage.
Packer and Jenkins make a great team. They’re like peanut butter and
chocolate. Or coffee and chocolate. Or pastry and chocolate.
/me grabs a chocolate bar from his drawer and gets back on track.
Many organizations use a single Jenkins server to run continuous
integration builds, but Jenkins supports any number of slave nodes to
distribute load. Further, we can assign labels to those nodes that
define capabilities or other characteristics (host OS, installed
software, special hardware architecture, etc).
Jobs can be pinned to nodes with specific labels, and the NodeLabel
Parameter
Plugin
lets us use node labels as a parameter when building a job.
I give each of my build nodes a label of packer-${builder-name}, then
I can build any machine in my Packer Open Source
Appliances
repository on all of my available local hypervisors and cloud
accounts. My personal list includes qemu, docker, virtualbox-iso, and
openstack (with Rackspace).
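In a parameterized Jenkins job, the build step then boils down to
something like this (the parameter name and template path are
placeholders, not anything from the appliance repository):

$ packer build -only="${PACKER_BUILDER}" templates/base.json

The -only flag restricts the run to the named builder, so one template
covers every hypervisor while each node only builds what it can
actually host.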
I’m pretty excited about this, and I’ll share more details and
possible next steps in a future post. Until then, you can snag
the job definition from my repository.
Kallithea is a new source code
management system based on the GPL origins of RhodeCode. The project
needs a continuous integration service running in the open to
sufficiently test incoming patches across a matrix of configuration
variables. I’ve started looking into adding SSH authentication
capabilities, so a solid automated testing platform is personally interesting.
I am building a Jenkins server for this purpose. The server build
process is automated with Packer and
Puppet so that we can easily host the
system in a large variety of environments.
We can build rich, single-use test environments that launch quickly,
and the system can grow as our testing needs grow.
At the least, I imagine test jobs running the following:
Unit tests
Integration tests with various infrastructure components (for instance: sqlite, MySQL, PostgreSQL)
Application upgrade tests
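As a sketch, the integration job could loop over backends with
something like this (the variable and script names are placeholders
for whatever the jobs end up using, not Kallithea’s actual test
harness):

# run the suite once per database backend
for db in sqlite mysql postgresql; do
    TEST_DB_BACKEND="$db" ./run_tests.sh || exit 1
done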
I’m happy to pay for the hosting of the CI service, but the open and
automated deployment definitions will allow anyone to build and run
their own system as well.
The first scraps of configuration are in the following two repositories:
Please let me know if you have any thoughts, questions, or concerns.
In addition to the mailing list, you can find me in the #kallithea IRC
channel as timfreund.
I’m evaluating OpenStack commercial
partners, and I’m interested in their ability to blend with our
current infrastructure. One vendor touted their automatic
provisioning and configuration management system, and I wondered if it
was just a re-brand of something like Puppet or Chef under the
covers. Puppet would be fantastic since we’re invested in that
technology already.
Google pointed me to a clarifying article:
… [PRODUCT] doesn’t rely on third-party products such as Puppet or
Chef to handle deployments. He claimed there was risk involved with using
such tools, especially with Puppet’s ties to VMWare, and suggested the
learning curve was higher using them than with his company’s package.
Risk happens. We dance with risk all day every day. With no more
specific information regarding the risks inherent in using an existing
provisioning and configuration management system, I must assume those
same risks are present using their home grown system. Nobody from
Puppet Labs will stop by the home office of this vendor and demand
exclusive use of VMware hypervisors. The vendor is still free to
write their own Puppet modules for managing their OpenStack
implementations. The vendor is free to ignore Puppet altogether
while still choosing a completely competent and capable configuration
management tool already in use by the community.
Concerning the learning curve problem, there’s only one explanation
that makes sense to me. I think he left out some words. I think he
meant something like “… the learning curve [for him and his PRODUCT
team] was higher using them than with his company’s package.” That
makes much more sense to me. It’s a tale as old as software
development. I’ve seen it in my own lifetime more than I care to
remember. It’s called NIH Syndrome, and it goes a little something
like this: “we don’t have time to learn [INDUSTRYSTANDARDPACKAGE],
so we’re going to write our own instead.”
Because ignoring thousands of person hours worked by experts in the field
of that tiny portion of the domain will surely pay off in the long run
when your small team of generalists attacks the same problem for two weeks.
I’ve also heard trust arguments against existing components. “We
didn’t write it ourselves, so how can we trust it?” Right. A package
in use all over the world is obviously untrustworthy. That’s why I
write most of my software on a custom home built computer that I
soldered together myself. We’re going to need to make a bulk
transistor order from Radio Shack so I can build enough equipment to
run at scale. It’s the only way to be certain.
KITT captured my imagination when
I was four years old. If you weren’t a child of the 80’s, you may not
know the depths of KITT’s awesomeness. Gaze carefully at his
feature list and try to
control the swell of emotions that stir deep within you. A “Molecular
Bonded Shell”, a “Third Stage Aquatic Synthesizer”, “Passive Laser
Restraint System”, a built-in ATM, and an artificially intelligent
personality voiced by
Mr. Feeny. I wanted a
Firebird or Trans Am for years after seeing KITT in action.
Around the same time that KITT was on television, my dad was shopping
for a new vehicle to replace his incredibly crappy Buick Skylark.
Although he eventually purchased a totally plain and perfectly
suitable Dodge Ram pickup truck, he considered a couple of different
options before signing on any dotted lines. I’m fairly certain he
test drove a Dodge Daytona, and it mesmerized me. It had a bright,
digital dashboard reminiscent of KITT’s. The steering apparatus was a
totally boring wheel, but I was willing to overlook it. Not every car
can have a yoke.
Decades later I rented a car with a digital dashboard. Once the
novelty wore off, I was disappointed with the utility of it all. The
two digit speed display required reading. I couldn’t keep tabs on my
speed with my peripheral vision quite like I could in cars with
analog dials.
An analog gauge, whether physically analog or digitally rendered, provides
the moment in time value along with the context of expected and potential
ranges. For instance, the tachometer below shows a potential range
of 0 through 16 thousand RPM, but the area between 14K and 16K is clearly
marked to indicate potential trouble from running within that range:
Sidney Dekker talks about this briefly in chapter 14 of The Field Guide to Understanding Human Error. He claims that numeric digital readouts
“can hide interesting changes and events, or system anomalies.”
Why is it, then, that I have written so many monitoring scripts that
produce little more than a two, or three, or four digit number as
output? The raw numbers are necessary but not sufficient for use in a
production support situation. The numbers must feed into a system
that will graphically provide context over time.
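One low-effort way to get there is to have each script push its
number into something like Graphite, which keeps the history and
graphs it for you. A sketch, with a hypothetical check script and
server name:

# carbon's plaintext protocol: "metric value timestamp" on port 2003
value=$(/usr/local/bin/check_queue_depth.sh)
echo "app.orders.queue_depth ${value} $(date +%s)" | nc graphite.example.com 2003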
There’s an HTTP monitor in F5’s BIG-IP Local Traffic Managers.
Common opinion around the office said that any request returning
200 OK was successful in the eyes of the monitor. We found out
today that this isn’t the case.
We deployed a blank index.html file at the root of the web server
since our application runs under a different context, and the health
monitor started to fail. Once we put some content in the file (the
text “Nothing to see here” in our case), everything turned green again.
This information is probably in the docs. I’m publishing it here
mainly for the sake of my own memory.
Let’s talk about installing Veewee to build virtual machines on
multiple platforms from templates with little to no Ruby background.
Ruby professionals will probably do just fine with the install
instructions
provided by the project. Stick around if you’re a system
administrator, operations engineer, or non-Ruby developer interested
in installing Veewee.
After installing required operating system packages we come to the
first fork in the road. Two options with a silent third option
exist for managing our Ruby environment:
RVM
rbenv
No environment management (preferred by no one sane)
Number three is an option in the technical sense only. Don’t think
it’s a wise idea to stink up the system Ruby installation with
custom gems. You’re entering a world of pain.
RVM felt a little weird to me, but I chalked that up to being a Ruby
rube. Joe told me that he dropped RVM years ago for rbenv and hasn’t
looked back. Those two factors pushed me to switch to rbenv, but your
mileage may vary. For the python folks in the room, you can equate
(RVM || rbenv) + Bundler to virtualenv + pip.
I followed the rest of the
installation,
command
options,
and
basics
pages. Everything worked, and I created several virtual machines using
the KVM plugin, but it didn’t quite feel right. I didn’t see anywhere in
the docs an explanation of how to separate my work from their work.
Templates and definitions lived inside of the veewee source directory,
and “bundle exec veewee” only worked within that directory.
Aside from the ugliness of polluting a directory full of upstream
source with my own stuff that I never intended to contribute back, I
really wanted a nice way to manage separate repositories of templates
and definitions for my personal use and work use.
I saw reference to Veeweefiles in the docs, but not much information
on how to use them to customize runtime configuration. Digging into
the code, environment.rb specifically, there was a reference to a
VEEWEE_DIR environment variable. Setting that allowed me to run
bundle exec veewee from the veewee source directory, but use
templates, definitions, and isos in a separate directory. Excited
about this new development, I shared the news with some folks in IRC.
Joe again stepped in to let me know that I was just barely catching
up to the crowd, and not even in a very ideal way.
“make a gemfile that includes veewee, run bundle it, then the above
command [works in your personal directory for templates and definitions]”
I ended up with a directory structure like so:
With a very short Gemfile:
And a minimal Veeweefile
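Roughly like this, with names of your own choosing; the Veeweefile
just points veewee at the local definitions, templates, and iso
directories (the exact settings are covered in the veewee docs):

$ mkdir -p ~/veewee-work/{definitions,templates,iso}
$ cd ~/veewee-work
$ cat > Gemfile <<EOF
source 'https://rubygems.org'
gem 'veewee'
EOF
$ bundle install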
And it seems to all work. Just to be clear:
Create a new directory for your veewee work.
Drop the Gemfile and Veeweefile above in the directory.
Run bundle install
Optionally alias veewee=”bundle exec veewee”
Create templates and define boxes to your heart’s content.
The djsd binary wasn’t installed with the executable bit set when I
installed it from the
dotjs-ubuntu
repository. I’m running Ubuntu 12.04 on my desktop. Before
submitting a drive-by pull request I tested 13.04, and it worked fine.
At this point I assumed it was either a version difference or
something wonky in my environment.
Testing on a clean 12.04 virtual machine pointed directly at my
environment. What was the difference? The machines that worked were
using the system provided ruby interpreter. The machine that didn’t
was using an RVM supplied ruby interpreter. I’m sure there’s more to
it than that, but as an infrequent ruby coder that’s all I needed to
know before deleting my dotjs-ubuntu fork and calling off the hounds.
And if you aren’t using dotjs, well, you should. It’s awesome. You
can inject JavaScript on a site-by-site basis. For instance, here’s
what keeps me from spending stupid amounts of time browsing imgur:
That file lives at ~/.js/imgur.com.js, and it removes links to
other images on any page served by the site.
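Mine amounts to a line or two of jQuery (dotjs makes jQuery available
to these scripts); the selector below is illustrative, since imgur’s
markup changes:

$ cat ~/.js/imgur.com.js
// hide links to other images so there's nothing tempting to click
$('a[href^="/gallery/"]').remove();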
It’s a compromise: In the past I’d block the site in my hosts file,
but friends would send me links to images. I’d oblige by removing the
host entry to view the link then regret that decision later. Willpower
is a finite resource, and I can’t click on links that aren’t visible.
Gather around kids, and listen to a story about a user looking for
help. Years ago there was a bug report submitted to the MPlayer
project reporting a crash. The user gave a large amount of data to
increase the chances of finding a fix. He gave so much data, in fact,
that his bug report garnered 15 minutes of Internet fame because of
some explicit detail. A Shark’s
Tale was on his play list, and
if that wasn’t embarrassing enough, he also had some extremely NSFW
material right next to it.
MPlayer’s bugzilla installation migrated to a new host years ago, and
I suspect they consciously chose to remove or hide the blush-worthy
bug. I also suspect that once the childish giggling died down that
guy got his bug fixed. Why? Because he provided detail for
examination by the developers.
What happens when a bug or support request arrives without enough detail?
I received a vague bug report this afternoon that didn’t contain
enough detail to recreate. Over the course of 20 minutes we extracted
enough information to solve the problem, but this was only after the
ambiguous bug spent the better part of a month in queue with everyone
raising an eyebrow, shrugging, and moving on to more fertile debugging pastures.
There’s hardly such a thing as too much detail when submitting support
requests. Providing sufficient detail sends two signals:
You’ve put forth effort to fix things yourself, and, short of that,
collect data for the people that will fix things.
You value the time of others. The experts assisting you will know
which bits to ignore, but they can’t even get started if they don’t
get enough info in the first place.
And if you insist on seeing the forum threads reporting on the crazy
MPlayer bug report, just search for: mplayer bug report Sharks Tale.
But not at work. Or really ever. It’s silly Internet shenanigans.
The PassLib New Application Quickstart
Guide
reveals more information about password hashing than I knew existed
before reading the page. Hashing and cryptography are two different
things, but, as with crypto code, it’s best to leave password hashing
to someone who knows the subject front and back.
I spent some time reading through the hashing routine in an old
application to build an administrative password reset tool. We salt,
and we use a decent hashing algorithm, but the code isn’t nearly as
sophisticated as what we’d get with the default context provided by PassLib.
Here’s a preferential order of password storage algorithms:
-2. Passwords stored in plain text
-1. Passwords reversibly encrypted
0. Passwords naively hashed with a weak algorithm
1. Passwords salted and hashed
2. Passwords salted and hashed with a carefully chosen algorithm and procedure
Users will reuse passwords in your application. Storing weakly
hashed or encrypted passwords opens your users’ email, social media,
shopping, and banking accounts to fraud and abuse should your
application ever be compromised. Users hate that. Don’t let
your users hate you over something so easy to do well with the
help of open source libraries.
Today we configure a DD-WRT router to allow servers on a KVM
virtual routed network access to the local network and the internet.
If you’ve found entry #3 in this libvirt networking FAQ and you’re now
searching for the solution, you may be close.
First, we need to let the DD-WRT router know about the new network.
This is under Setup -> Advanced Routing on my Buffalo device, and
hopefully not far from there on your device.
Under “Static Routing” fill in appropriate values:
Route Name: your choice
Destination LAN NET: the network address of your KVM virtual routed network.
Subnet Mask: Probably 255.255.255.0, but you’ll know better if it isn’t.
Gateway: The LAN address of the host that runs your KVM virtual routed network.
Interface: LAN&WLAN
Save and apply those settings, then try to ping the router from a
virtual machine on the virtual network and vice versa. Assuming both
pings work, we’re ready for phase two.
Anything on the LAN can now talk to anything else on the LAN, but the
virtual machines on the routed virtual network can’t talk to the
Internet yet. Any attempts will time out. This assumes your DD-WRT
router does NAT (Network Address Translation) for hosts on your LAN
when they access the internet, but you’ll know better if it doesn’t and
you probably wouldn’t be reading this right now.
The DD-WRT router knows to apply NAT to any traffic heading to the
internet originating from the local network configured in the “Setup”
-> “Basic Setup” -> “Network Setup” -> “Router IP” configuration area.
It does not apply NAT rules to any traffic originating on any other
network that happens to come its way.
That configuration happens in the “Administration” -> “Commands” area
of the administrative interface.
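The exact rule depends on your addresses. With a virtual network of
192.168.100.0/24 and a static WAN address of 203.0.113.10 (both
illustrative), it amounts to a single NAT rule:

iptables -t nat -I POSTROUTING -s 192.168.100.0/24 -j SNAT --to 203.0.113.10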
Click “Save Firewall” so the commands will run after each reboot, and
click “Run Commands” to execute the commands immediately. Once
complete, log in to one of your virtual machines and try to access any
site on the Internet. With any luck it now works. I’m perhaps a bit
lucky because I have a static IP address at home. I’m not sure how to
dynamically update that command to include the correct WANIP if you
use DHCP. I’ll offer up the following bit of shell scripting in case
it helps:
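Something along these lines (DD-WRT’s BusyBox userland includes both
route and awk):

route -n | awk '$1 == "0.0.0.0" { print $2 }'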
That dumps the routing table and extracts the default gateway for the
device. I’ll update this post if you can tell me a better way to get
that information.
(I just upgraded to 1.2.1, and I’m updating this doc accordingly)
Getting Apt-Cacher NG
installed with such little fanfare and such huge payoff put me in a
mood to cache more stuff.
We have an embarrassment of riches when it comes to caching PyPI packages
for a network of hosts. I found four contenders, and I didn’t look very hard.
The long activity history as well as the well written
documentation pushed me toward devpi.
It’s working so well that I feel no need to try the alternatives. It
installed in a virtualenv, and it behaves under upstart management.
Then update your pip.conf file:
And your .pydistutils.cfg file:
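Both files point at the local devpi index; the hostname is whatever
you gave the cache box, and 3141 is the devpi default port:

# ~/.pip/pip.conf
[global]
index-url = http://devpi.example.com:3141/root/pypi/+simple/

# ~/.pydistutils.cfg
[easy_install]
index_url = http://devpi.example.com:3141/root/pypi/+simple/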
(You can run this on your workstation. I have it running on a
separate cache server VM because I have several hosts that make use of it.)
Disposable systems eliminate the variability and cruft that
accumulates over time. I look at some of the long lived dev and test
systems at work, and I cringe because a parade of developers have each
left their mark, and each special touch pushes those environments
further from production. I did it too, and we’re making things
better. Please don’t throw stones at my glass house.
CERN’s Tim Bell put it best when he compared long lived systems to pets
and disposable systems to cattle.
To make the disposable cattle model work, the orchestration system
must create and configure new systems in less time than it’d take to
retrofit a long lived pet machine. We won’t wait for new machines
just like we won’t wait for a slow unit test suite to run.
Apt-Cacher NG sits on the
local network and silently proxies and caches requests for Debian or
Ubuntu packages. The first installation of a package will go as quickly
as the Apt-Cacher’s connection to the upstream mirror, but subsequent
installations will run as quickly as the local network link. That’s
typically an order of magnitude or three in throughput.
Apt-Cacher NG is included in the default repositories for Ubuntu, and
installation is ridiculously easy. Once it is running, client
configuration is also ridiculously easy.
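On each client, a one-line apt configuration fragment points package
downloads at the cache (the hostname is illustrative; 3142 is the
Apt-Cacher NG default port):

# /etc/apt/apt.conf.d/02proxy
Acquire::http::Proxy "http://apt-cacher:3142";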
People who use preseed.cfg files (VeeWee, Foreman, or other automatic system
provisioning tools) should add or update the following line:
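(The hostname here is illustrative; use your cache server’s name.)

d-i mirror/http/proxy string http://apt-cacher:3142/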
I did not pick Apt-Cacher NG over Apt-Cacher just because of the NG
designation. My first cache server attempt was Apt-Cacher as laid
out in the Ubuntu community
docs. It
installed easily, but it locked up hard when I tried to run apt-get
update against it.
I knew of but chose to ignore apt-mirror
because I use a very tiny subset of packages. Apt-mirror may be the way
to go if you’re supporting a wide range of systems and use cases.
It’s not perfect. It’s not gospel. But it is pretty solid. Spend
the 10 to 20 minutes it will take you to read it if deploying code
induces cold sweats.
Maybe once you’re done you can convince me that sending all logs to
stdout is the right thing to do as outlined in Factor
11. It advocates offloading log management
to an external process, and I don’t understand the benefit of that
over a mature logging framework built in to the application. I’m not
convinced, but I’m not unconvincable.
But every increase in sensitivity resulted in a large increase in the
expected number of false alarms. The designers did not understand
that in the monitoring of a test ban treaty, a high false-alarm rate
would be far more troublesome politically than a low detection rate.
… This is the paradox: too much verification may be as bad as too little.
Scientists and engineers worked on the technical implementation of a
nuclear test ban treaty with little regard for politics. These
professionals made up the top echelon of thought across their areas of
expertise, and they aimed for accuracy. No expense would be spared in
the quest for accurate detection of nuclear blasts regardless of their
size or location.
The rapid escalation inherent in nuclear conflict raised the cost of
inaccurate blast detection to such a level that missing a blast would
actually be preferable to accidentally classifying an earthquake or
other natural event as a blast. Exponentially so if the detection
system was ever wired into a rapid response system, automated or not.
How rapid is rapid escalation? Most of the people reading this grew
up in the United States. Think back to grade school. Do you remember
the nuclear shelter area in your school if there was one at all? At
Queen of the Holy Rosary grade school it was the 400 square foot art
classroom. It was in the basement, and there were no windows, but
there was also no seal on the door, and no ventilation system, and no
stockpile of food or water. It was a tight fit for art class, and
there were only 21 kids in my grade. That fallout shelter checked a
box on a long and meticulously designed civil defense form, but it
sure as hell wasn’t going to provide any shelter from fallout for the
160 kids and 20 staff members in the building if the need arose.
That ridiculously undersized and understocked basement room wasn’t an
oversight or a corrupt attempt to save money. It was a conscious
decision driven by the strategies and attitudes of the Cold War.
Shelters are perceived to be futile because the assured destruction
strategy demands that they be ineffective. [Capable] Shelters are
perceived to be threatening because they suggest an intention to make
the operation of assured destruction unilateral.
“A strange game. The only winning move is not to play.”
Monitoring enterprise IT beats us all down at times, but no one dies
in the course of our experimentation. Tweet about how much
#monitoringsucks all you want. Get it out of your system and then
recall that we have the power and time to make things better without
killing most of the human population.
Small changes made frequently make monitoring suck less, so make
some changes. Just don’t kill anyone.
I’ve been working under an assumption: WSGI abstracts away all of the
details of serving up a python web application so well that WSGI web
server implementations are totally interchangeable without thought or worry.
WSGI is awesome, but not quite that awesome. Abstractions are leaky;
it’s a law.
IronBlogger development
has picked back up thanks to Julython, and
the deployment went as planned until I restarted the process. It got
stuck in an infinite loop of restarting itself, with exceptions thrown
in the database initialization code.
I used the pip freeze command to dump lists of installed software on
each host, and gunicorn was the only additional bit of code in the
production environment. And, sensibly, the only difference outside of
turning off live debugging was the gunicorn configuration:
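The relevant difference amounts to serving the app with gunicorn
instead of the framework’s development server, something along these
lines (the module path and worker count are placeholders):

$ pip install gunicorn
$ gunicorn --workers 4 --bind 127.0.0.1:8000 ironblogger:app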
I haven’t figured out exactly what the problem is, but I’m getting close. This
is primarily a public service announcement: test your code with the same server
software that it’ll run on in production. Abstractions are leaky, and it
may work for a while, but one day it won’t.
FPM changed the way I think
about packaging. I wrote so many lines of custom deployment code just
to skip packaging our in-house application in the past, and it was
completely justifiable given my previous experience with packaging tools.
I mentioned last time
that we’re living in the future. Things have changed for the better.
We now use Puppet for provisioning and
configuration management, and that extends to our in-house
applications running on those servers. We also have internal package
repositories. Packaging pain kept us from fully using this
infrastructure for our in-house applications, but FPM changes that.
Say we have a custom application to package. This isn’t an official
release, it’s just a build triggered by the latest commit, and we have
it laid out on disk just like it should be deployed. Easy:
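With fpm, one command turns that directory into a package; the name
and paths here are illustrative:

$ fpm -s dir -t rpm -n ourapp \
      -v "$(date +%Y%m%d%H%M%S).$(git rev-parse --short HEAD)" \
      -C /tmp/ourapp-build .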
The time stamp ensures that the system can compare versions correctly
since git version hashes aren’t alphabetically ordered.
And for the Mercurial projects out there:
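Same idea, with the hash pulled from Mercurial instead:

$ fpm -s dir -t rpm -n ourapp \
      -v "$(date +%Y%m%d%H%M%S).$(hg id -i)" \
      -C /tmp/ourapp-build .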
Note that we’re creating RPM packages above. Also note that the -t
argument defines the OUTPUT_TYPE or target. Swap in deb for rpm
to get a Debian themed party started. Solaris packages are an option,
too. That’s right, some people still deploy more than just Oracle
products on Solaris machines.
Libvirt + QEMU + KVM allows easy virtualization
on Linux. Virtual machines are placed on a virtual network that comes
complete with a DHCP server and DNS forwarder. There’s no reason to
give up these default conveniences until your work involves building
and configuring DNS and DHCP servers. Running two DHCP servers on the
same virtual network is a recipe for frustration.
The configuration files for libvirt are stored in /etc/libvirt by
default, but pretend like they aren’t even there. Read any of the files
and you’ll find a warning:
They aren’t kidding. Don’t change these files. There’s one other trick:
Running the net-edit command alone won’t really do anything either if the network
is already running. The following operations will do the trick:
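Something along these lines does the trick; the change only sticks
once the network is torn down and brought back up:

$ virsh net-edit default      # remove the <dhcp> block from the network XML
$ virsh net-destroy default   # stop the running network
$ virsh net-start default     # start it again with the new definition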
Rumor has it that later versions of the virt-manager GUI support
these operations, but this info will come in handy if you’re still on
an LTS system.
Reconnect any hosts that were on the virtual network, because they
aren’t anymore. Configure your own DHCP server or configure all hosts
with static interfaces as well. Remember that you’re on your own for
DNS resolution in addition to DHCP at this point. An easy solution is
to set domain-name-servers in your dhcpd.conf file to your router
or access point.
Now that we’ve wrapped up our business here, I’d like to take a minute
to tell you how happy all this stuff makes me. I’m running a small
colony of machines on my middle of the line desktop computer, and I
can practice the exact same work that happens in real data centers.
Back when I was a newbie every machine was a real, physical machine,
and PXE booting required careful hardware selection.
We’re living in the future. We just need the Hoverboard design team to
catch up.
I’m a bash user, so I can set $PS1 to contain a variety of useful
information, and my prompt will vary depending on where I’m at. We
have enough servers and services at $DAYJOB that the following makes a
lot of sense:
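Mine boils down to a timestamp plus user, host, and working
directory; a minimal version looks something like this:

export PS1='[\t] \u@\h:\w\$ '

The \t escape renders the current time on every prompt, which is all
the magic the rest of this post depends on.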
People tell me that showing the time is silly. I like it because
there are times when I run a long-running process and walk away while
it churns. It’s often helpful to know how long something took, and
having timestamps on each line lets me know.
It’s helpful to me personally as well. There are many times when I’m
drawn away from scheduled work into an urgent support situation, and
I’ll leave my scheduled work as-is and work the urgent situation in a new
terminal tab or window. That’s exactly what happened today, and I was
a little shocked at the story the timestamps had to tell:
Over four hours. I’ll admit that wasn’t all urgent support. I ate
lunch and checked email, too. It took a look at my important task
list to jog my memory back to the work I was doing at 10:28. “Oh,
right, that’s what I really need to finish today.”
Working for 8 solid hours is an illusion. It’s not going to happen.
Unless work happens in a cabin off the grid away from civilization,
interruptions will wreck your day. Almost every day. More important
than blocking off 8 solid hours is to put in place systems to
continually nudge ourselves toward the important tasks once the storms
of unavoidable urgent work have passed.
My favorites include:
Pomodoro timers (I wrote one, it’s big, and color coded)
Ignaz Semmelweis, the doctor that recommended hand washing between
patients in 1847, died in an asylum. His fellow doctors were appalled
that anyone would think they were dirty. They were educated
gentlemen. How rude of him to imply that they could contribute to
ailments and disease. They were the cure, not the cause!
We’re 166 years out from Dr. Semmelweis’s revolutionary research, and
we’re all better off for it. Every desk in my office has a bottle of
hand sanitizer on it. It’s in grocery stores near the dirty, filthy
carts. Moms carry it in purses. Yet just 6 months ago the following
was published:
“There will be a day when it will be so automatic for health care
workers to clean their hands,” Pittet said. “It will be a lot easier
at that time for patients, in case health care workers forgot, to
remind them.”
A third of doctors surveyed did not want patients reminding them to
wash up. If they were so smart they wouldn’t be sick.
Good thing that smug arrogance is an isolated incident limited to
personal cleanliness. Oh, right, cholera:
In 1854 London physician Dr John Snow discovered that [cholera] was
transmitted by drinking water contaminated by sewage after an epidemic
centered in Soho, but this idea was not widely accepted.
That’s from a Wikipedia article named Great
Stink. The name alone
warrants a quick skim, and you’ll also read about more doctors
ignoring research.
For instance, Filippo
Pacini discovered the
bacteria that causes cholera in the same year that Dr. Snow theorized
that transmission was via sewage, but nobody cared. Physicians knew
beyond doubt that cholera wasn’t caused by bacteria, but instead by
miasma. That’s bad air for those of us living with bacterial
awareness in 2013.
The same bacteria was rediscovered 30 years later by Robert Koch. He
also isolated the bacteria behind tuberculosis. I suspect other
physicians only took him seriously because he looked like a man that
you shouldn’t screw with.
At least now we’re more data driven. With the systematic, cost-conscious
approaches pioneered by modern health management organizations, surely now
data and procedures rule.
Let’s talk central lines, or central venous catheters, to use terms
more complex than I commonly use when discussing health and
medicine. A patient in need of a central line is in rough shape.
Complications from a botched or infected central line spread rapidly
by the very nature of the procedure and equipment. Oh, right, central lines.
Dr. Pronovost
instituted a checklist for the care and maintenance of central lines.
In the broad scope of modern medicine, it’s a relatively simple
procedure. Go read it. There are 5 steps, most of which boil down to
“be clean.”
The checklist focused on proper sterilization procedures, and it was
so effective that they had to extend the trial period just to believe
the results. At the end of the experiment, the hospital estimated
that 43 infections and 8 deaths were avoided, to the tune of two
million dollars in savings.
This is the stuff that medicine is all about. Sterile drapes and
thorough hand washing aren’t sexy like a new MRI machine or remote
robotic surgery device, but they save lives, pain, and money.
Despite his initial checklist results, takers were slow to come.
…
There were various reasons. Some physicians were offended by the
suggestion that they needed checklists. Others had legitimate doubts
about Pronovost’s evidence.
(From The Checklist Manifesto)
Doctors and staff can’t be the cause of trouble, especially if
it boils down to cleanliness. They trained too long to make such
simple mistakes.
Damn it.
Pilots get this stuff right. They’re creatures of checklist habit,
and they’re always, always improving those lists and procedures.
Over 12000 B-17 Flying Fortress aircraft were produced, and they
played a major part in Allied operations during World War II. This is
an amazing accomplishment for a plane that many deemed too complex to
fly. Airmen began the process of building and using checklists
because of the B-17, and because of the checklists the B-17 became a
powerful, flyable, weapon of war.
Why can’t doctors behave methodically like pilots? Is it because
pilots put themselves on the line and adherence to methodology is the
only thing that brings them back alive?
And why can’t we, the software developers and sysadmins, get it right?
Our systems lend themselves to checklists. Scratch that, our systems
lend themselves to fully automated checklists in which we should only
need to fix the outliers and amend the list. There is a thrill to being
a code hero that must outweigh the simple joy of consistently shipping
software in a reliable and reportable fashion. At some point that thrill
of riding in to save the day cowboy style loses its charm.
We’re getting better, but we’re still more doctor than pilot in our
ego and methods.
Six groups of four students huddled around their respective lab
tables, waiting for their dissection subject. A cat. A kitty. A
dead, dyed, and preserved Felis catus domestica. Some kids hid their
anxiety behind smiles, others gaped in silence. Cat dissection was the
big deal in Biology II at Bishop Miege High School. Everyone
experienced it, even if they never took the class. Formaldehyde
wafted through the halls, following Bio II students through the rest
of their day.
(The photo is from Flickr. Not my class, or my school, or even my decade…)
I wielded the scalpel about half of the time in my group. I’m not
sure if they were conscientious objectors or just grossed out by the
smell, but two of our four did as little cutting as possible. As the
blade dug through tissue while we traced the route of blood vessels
heading to the kidneys, a thought flashed through my mind: “Holy shit!
We’re hacking the crap out of this cat, and we have the benefit of
dyed blood vessels. Doctors do this to real people? And they live?!”
I never had a strong desire to become a doctor, but that cut crossed
it off my list for good. I liked my science exact, thank you very much.
Little surprise that I ended up slinging bytes for fun and money.
Boolean logic is so orderly, and everything always lines up along
powers of 2 or groups of 8. Unambiguous facts lead us through
problems and root causes as long as we watch closely. Even the
deepest and most difficult problems can be reasoned through with facts
derived from direct observation, although we may retreat from the hunt
with a sigh and a reboot at times.
There’s a downside to all this exactitude for people who play fast and
loose while interacting with tech workers. “All of the processes
restarted” is a verifiable statement, trivially so. Don’t think the
sysadmin won’t check. “Every transaction is failing” will win no
friends within the support ranks once they see that 2 out of 3
transactions are successful. The ISP support line operator knows when
you’ve been bad or good, so reboot your modem for goodness sake.
Facts drive the growth and maintenance of technical projects and
infrastructure. Gather data, analyze it, and act upon it to move
forward. Running on gut feel works only as long as your gut is right.
Scheduled jobs tend to suck a bit. They’re usually written after they’re
needed and dropped into place with little testing and no plans for fixing
them when things go pear-shaped.
This is the third in the series. Here’s the full list that we’ll cover:
If you ever say these words to me as I’m cleaning up a mess caused by
you making a “non-impacting change” to a scheduled job that isn’t
under version control, then you’re going to see a look come across my
face. Behind that look, my brain is calmly keeping my hands from
reaching for the nearest blunt object while mentally filing you away
as “someone who’s lucky to have a job in IT” and trying to figure out
how to best extract ourselves from the mess we’re in. My brain’s
multitasking just isn’t good enough to do those things and maintain a
poker face. Sorry.
This is what happens when you say that version control isn’t necessary.
There are 17 copies of this one script. That’s dumb. This is a fictionalized
real world example, so I’m going to stand up straight and take credit for these
two here:
The sarcasm was lost on this system’s maintainer. Either that, or his
brain’s multitasking managed to keep a straight face while he
pondered the quickest way to punch me in the nuts and get away with it.
Version Control is too Easy
Version control is too easy not to use for the scripts and
configurations that keep your shop humming. You may not trust that
statement if your last encounter with version control involved CVS or,
to a lesser extent, Subversion, but I swear to you that it is true.
Distributed version control systems let you set up repositories on a
local system without ever interacting with a central server. You
don’t need to request a new repository from the version control
gatekeepers, you just need to init a new repository in place. For
bonus points you’re going to back this repository up to a remote
location, but that’s outside of our scope here today.
Right now you just need to pick one of two tools: Git or Mercurial.
They both work just about everywhere that matters today, and they behave
nearly identically for the stuff we’ll be doing. If you have developers
on staff or as friends, ask them what they use. Otherwise, flip a coin.
Once you have the binaries installed, you’re only a few commands away
from version controlled bliss. We’ll walk through creating a
repository, seeing the status of the repository, committing unsaved
changes, and reviewing the change log below.
With mercurial:
$ hg init
$ hg status
? really_important_script.sh
$ hg add
adding really_important_script.sh
$ hg commit -m "Committing really_important_script.sh so we don't lose changes like last time."
$ hg log
changeset: 0:0d058a3f5c18
tag: tip
user: Tim Freund <tim@freunds.net>
date: Sat Oct 13 16:09:06 2012 -0500
summary: Committing really_important_script.sh so we don't lose changes like last time.
And with git:
$ git init
Initialized empty Git repository in /Users/tim/src/dvcsdemo/git/.git/
$ git status
# On branch master
#
# Initial commit
#
# Untracked files:
# (use "git add <file>..." to include in what will be committed)
#
# really_important_script.sh
nothing added to commit but untracked files present (use "git add" to track)
$ git add really_important_script.sh
$ git commit -m "Committing really_important_script.sh so we don't lose changes like last time."
[master (root-commit) 9d9f3d1] Committing really_important_script.sh so we don't lose changes like last time.
0 files changed, 0 insertions(+), 0 deletions(-)
create mode 100644 really_important_script.sh
$ git log
commit 9d9f3d12eaa3d56a9bf0e12364c411baafed2753
Author: Tim Freund <tim@freunds.net>
Date: Sat Oct 13 16:10:02 2012 -0500
Committing really_important_script.sh so we don't lose changes like last time.
And that’s really all there is to it. Every time you make a change to
your scripts, simply commit the changes before you move on to your
next task. You’ll eventually want to learn how to revert to previous
versions and share your repositories with others, but just doing what
we’ve worked through today will set you up for success when making
changes to scripts.
Scheduled jobs tend to suck a bit. They’re usually written after they’re
needed and dropped into place with little testing and no plans for fixing
them when things go pear-shaped.
This is the third in the series. Here’s the full list that we’ll cover:
Gah!!! On the surface these crontab files are built to generate a file and transmit
it to a bank, but each of the following made me die a little bit on the inside:
Scripts running out of a developer’s home directory
Magic number parameters (“Tim!”, you yell, “the command
has a help function and it’s up to date. That parameter is obvious
and self documenting!” You’re a dirty liar, and I hate you)
Running one process across multiple hosts without justification when all steps could run on one host
Gross. All of it. We aren’t even here to talk about any of the above,
but you should get in touch if you’d like to know why any of the above
are terrible ideas.
We’re here to talk about trying to orchestrate a complex work flow
with individually scheduled cron jobs. We are not here to talk about
orchestrating a complex work flow with individually scheduled cron
jobs, because that’s not really possible in any realistic way. We are
here to discuss trying (and failing) to do the same.
The road to Hell is paved with good intentions
Our completely fictionalized developer, Victor Richards, means well.
He really wants to be helpful and write things in a way that makes
problem solving easy. Much like breaking up a monster function that
spans pages of editor space into multiple easily composed functions,
he’s broken up the creation and transmission of
ACH files into discrete steps. That’s
legitimately thoughtful. It allows the ops team to retransmit a file
if the bank’s FTP server was down without regenerating it from scratch.
He’s built all the individual components required to make a fairly
robust system that will churn out bank files each business day, but
gluing them together with cron is a recipe for pain and an active
support queue.
Turning up the heat
Victor’s good intentions turn Hellish toward the end of the month.
Website traffic and orders go through the roof. It’s not unusual to
double typical order volume at month end, and quarter and year end
get even crazier.
That first job in the chain, make_the_ach_file.pl, gets a full 20
minutes to run in the cron schedule. It’s a pretty beastly script
that pulls all of yesterday’s orders out of the database to generate a
bank file, and the queries used aren’t optimized all that well. The
20 minute window usually provides a 5 minute cushion before the next
job, edit_the_ach_file.pl starts to run, at least until the
calendar creeps up to the end of a month. The larger month end order
volume means that the file generation either just barely completes in
time or is late, and the multi-car pile up begins.
Traffic Control
We have three jobs that are tightly related, with each subsequent job
being completely dependent on the job that runs before it. What we
need is some way to ensure that the jobs run serially instead of
achieving accidental parallelism when one of the jobs runs longer than
expected. That turns out to be so easy that you’ll be forgiven for
missing the obvious. It just takes one script:
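A minimal sketch; the first two script names appear earlier in the
post, the third stands in for the transmission step, and the paths
are made up:

#!/bin/bash
# run_ach_jobs.sh -- one cron entry runs this; each step waits on the last
set -e
/opt/ach/make_the_ach_file.pl
/opt/ach/edit_the_ach_file.pl
/opt/ach/send_the_ach_file.pl

Schedule that one script in cron, delete the three separate entries,
and the accidental parallelism disappears.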
As a bonus, we still have the individual scripts available for use
during troubleshooting exercises should they be necessary.
This is still the poor man’s form of process orchestration, but we’ve
eliminated an entire class of support issues by ensuring that all the
individual jobs run in the correct sequence, regardless of how long
any of those individual jobs takes to complete.
But wait! There’s more!
This is the third in the series.
Subscribe or come back
tomorrow to see more.
Scheduled jobs tend to suck a bit. They’re usually written after they’re
needed and dropped into place with little testing and no plans for fixing
them when things go pear-shaped.
This is the second in the series. Here’s the full list that we’ll cover:
Regardless of the return code or output of that script, you’ll never
see a peep from it. All the output is dumped squarely into the bit bucket.
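The entries in question look something like this (the path and script
are hypothetical; the redirection at the end is the point):

0 2 * * * /usr/local/bin/nightly_cleanup.sh > /dev/null 2>&1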
How do jobs end up configured like this? They’re usually written,
tested, and watched closely for some period of time: a week, a month,
or more. Once the author is convinced that the job is invincible, she
gets tired of seeing the output in her inbox each day, so she adds
the redirection to make things easy for herself.
Only marginally better are the admins that set up an elaborate maze
of mail rules to filter any and all output from scheduled jobs into
some rarely read folder. Sure, the output is there, but it may as
well not be there if it’s getting ignored completely. Yes, it’s
nice to be able to confirm the user error reports that will start
to trickle in, sometimes weeks after the problem started, but it’s
not exactly the professional way to handle things.
I suppose one step beyond that is the admin that rolls into the office
at 10:45, opens Outlook, hits CTRL+A then Delete, and then gets up to
retrieve his morning cup of joe. Cron output problems AND user
problems solved all in one quick key chord. Shameful, but that dude is
so far off the reservation that I’m sure he’s not here among us. Right?
So how do we dodge this?
Find a cron wrapper script that only notifies the team on errors.
Cronic makes this super easy to pull
off (see the sketch after this list). It’s a shell script, so it should run just about anywhere
that matters without compiling anything.
Edit jobs to write their results to a centralized location for
intelligent monitoring and notifications. This has its merits, but
you’re going to need some additional infrastructure and development
time. A big enough shop will justify this. Email can only scale
so far.
Wait for the users to call. They know when the important stuff isn’t
happening. If a job fails in the server room and no one complains,
does it really need to be fixed?
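To make the first option concrete, wrapping a job with Cronic is as
simple as prefixing the command in the crontab (the job itself is
illustrative):

# mail only goes out when the job exits non-zero or writes to stderr
0 6 * * * cronic /usr/local/bin/generate_nightly_report.sh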
The astute among you will notice that we’re reusing the same list from
our last installment. Two serious suggestions and one suggestion that
is unfortunately in use in far too many places. These two
anti-patterns really are identical twins that were separated at birth. Some
might even combine the two into one, but let’s consider them both
separately lest we over-correct from one straight into the other.
But wait! There’s more!
This is the second in the series.
Subscribe or come back
tomorrow to see more.
Scheduled jobs tend to suck a bit. They’re usually written after they’re
needed and dropped into place with little testing and no plans for fixing
them when things go pear-shaped.
This is the first in the series. Here’s the full list that we’ll cover:
The tricky thing about configuring a scheduled job for the first time
is making sure it actually works. The job is going to run in the dark
recesses of the machine, and you’re going to wonder if it’s really
working at first. Email’s a super easy solution to that problem.
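With cron, that can be one line at the top of the crontab (the
address and job are illustrative):

MAILTO=ops@example.com
0 6 * * * /usr/local/bin/process_sales.sh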
Or the sales and refunds processing jobs didn’t run for three days
last week because a disk was full. Customers are angry, and all of
that bile and vitriol is landing right on top of your boss. These
broken jobs are the most important problem in the world right now on
the second floor of Spacely Sprockets, and the sternly worded
interoffice memo makes that painfully clear through exuberant use of
exclamation points and terrible grammar.
Marching orders in hand, every single scheduled job in the environment
is configured to send an email upon completion or failure.
The only thing that progresses faster than the email notification
configuration is a set of corresponding email rules blazing through
the systems and operations group to filter all of that crap out of
their inboxes.
Oh, look, we’re back to square one! As my manager likes to say: if
everything is important, then nothing is important.
So how do we dodge this?
Find a cron wrapper script that only notifies the team on errors.
Cronic makes this super easy to pull
off. It’s a shell script, so it should run just about anywhere
that matters without compiling anything.
Edit jobs to write their results to a centralized location for
intelligent monitoring and notifications. This has its merits, but
you’re going to need some additional infrastructure and development
time. A big enough shop will justify this. Email can only scale
so far.
Dedicate full time staff to sift through email boxes full of cron
notifications and manually re-trigger jobs. There are people
that actually do this. And if you think it’s a sound solution, then
you’re probably in the wrong job. There’s probably a bank someplace
that would love to hire you as a computer operator.
What if the notification systems break? Or the scheduler just stops?
Oh, they’ll break. Give it time. They’ll break spectacularly.
The big problem with any job monitoring system is watching the
watchmen. Your people or machines have the ability to go completely
off the rails and fail at their jobs for innumerable
reasons. That’s what sent us down this path of notifying on
everything in the first place, remember? People take sick days, and
machines have hard drives that fill up. OK, the hard drives shouldn’t
fill up, but if you’re so smart, then why are you sending emails from
every single one of your scheduled jobs?
Whatever system we use to notify the team of job failures, we also need
to ensure that the scheduled job runner and notification systems have a
snug and warm place carved out in the NOC’s monitoring system.
Any decent method for sorting the wheat from chaff plus solid monitoring
of the servers that do the grinding should set the course for a sane
scheduled job environment.
But wait! There’s more!
This is just the first in the series.
Subscribe or come back
tomorrow to see more.