Scheduled Job Anti-Patterns - Workflow Orchestration with Cron
Tags: programming devops
Scheduled jobs tend to suck a bit. They’re usually written after they’re needed and dropped into place with little testing and no plans for fixing them when things go pear-shaped.
This is the third in the series. Here’s the full list that we’ll cover:
- Everything is important, email all results
- Nothing will break, so why worry
- Workflow orchestration with cron
- It’s just a script, we don’t need version control
Workflow Orchestration with Cron
I’m going to show you a small file built primarily for generating support tickets. There’s a crontab file on appserver01 that looks like this:
05 0 * * 1-5 /home/vrichards/salesdb/make_the_ach_file.pl 31 > /dev/null 2>&1
25 0 * * 1-5 /home/vrichards/salesdb/edit_the_ach_file.pl 31 > /dev/null 2>&1
27 0 * * 1-5 /home/vrichards/salesdb/copy_the_ach_file.pl 31 > /dev/null 2>&1
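For anyone rusty on crontab syntax, the five time fields in the first entry decode like this (this is standard cron, nothing specific to Victor’s setup):

```shell
# 05 0 * * 1-5  ->  run at 00:05, Monday through Friday
# |  | | | +--- day of week (1-5 = Mon-Fri)
# |  | | +----- month (* = every month)
# |  | +------- day of month (* = every day)
# |  +--------- hour (0)
# +------------ minute (05)
```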
Sad yet? No? What if I told you there were 30 other sets of such jobs? Still not sad? What about when you log in to appserver02 and find:
30 0 * * 1-5 /home/vrichards/salesdb/send_the_ach_file.pl 31 > /dev/null 2>&1
Gah!!! On the surface these crontab files are built to generate a file and transmit it to a bank, but each of the following made me die a little bit on the inside:
- Scripts running out of a developer’s home directory
- Magic number parameters (“Tim!”, you yell, “the command has a help function and it’s up to date. That parameter is obvious and self documenting!” You’re a dirty liar, and I hate you)
- Redirecting all output to /dev/null
- Running one process across multiple hosts without justification when all steps could run on one host
Gross. All of it. We aren’t even here to talk about any of the above, but you should get in touch if you’d like to know why any of the above are terrible ideas.
We’re here to talk about trying to orchestrate a complex workflow with individually scheduled cron jobs. Note the word trying: actually orchestrating a workflow this way isn’t realistically possible, so what we’re really discussing is trying, and failing, to do it.
The road to Hell is paved with good intentions
Our completely fictionalized developer, Victor Richards, means well. He really wants to be helpful and write things in a way that makes problem solving easy. Much like breaking up a monster function that spans pages of editor space into multiple easily composed functions, he’s broken up the creation and transmission of ACH files into discrete steps. That’s legitimately thoughtful. It allows the ops team to retransmit a file if the bank’s FTP server was down without regenerating it from scratch.
He’s built all the individual components required to make a fairly robust system that will churn out bank files each business day, but gluing them together with cron is a recipe for pain and an active support queue.
Turning up the heat
Victor’s good intentions turn Hellish toward the end of the month. Website traffic and orders go through the roof. It’s not unusual to double typical order volume at month end, and quarter and year end get even crazier.
That first job in the chain, make_the_ach_file.pl, gets a full 20 minutes to run in the cron schedule. It’s a pretty beastly script that pulls all of yesterday’s orders out of the database to generate a bank file, and the queries used aren’t optimized all that well. The 20 minute window usually provides a 5 minute cushion before the next job, edit_the_ach_file.pl, starts to run, at least until the calendar creeps up to the end of a month. The larger month end order volume means that the file generation either just barely completes in time or is late, and the multi-car pile-up begins.
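Nothing in Victor’s crontab uses it, but flock(1) is worth knowing as a stop-gap for exactly this pile-up: a cron entry wrapped in a non-blocking flock refuses to start while a previous run still holds the lock, so a slow job fails fast instead of overlapping the next one. A minimal demo (the lock path is illustrative):

```shell
#!/usr/bin/env bash
# Demo: the second flock invocation bails out while the first
# still holds the lock, instead of running concurrently.
lock=/tmp/ach_demo.lock

# Simulate a long-running job holding the lock in the background.
flock "$lock" -c 'sleep 2' &
sleep 0.5   # give the background job time to acquire the lock

# -n means non-blocking: fail immediately if the lock is taken.
if flock -n "$lock" -c 'echo running second copy'; then
    overlap=allowed
else
    overlap=prevented
fi
wait
echo "$overlap"
```

In crontab form that would look like prefixing the command with `flock -n /some/lock/path`. It papers over the symptom rather than fixing the scheduling, though, which brings us to the real fix.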
Traffic Control
We have three jobs that are tightly related, with each subsequent job being completely dependent on the job that runs before it. What we need is some way to ensure that the jobs run serially instead of achieving accidental parallelism when one of the jobs runs longer than expected. That turns out to be so easy that you’ll be forgiven for missing the obvious. It just takes one script:
#!/usr/bin/env bash
customer_id="$1"
/home/vrichards/salesdb/make_the_ach_file.pl "$customer_id"
/home/vrichards/salesdb/edit_the_ach_file.pl "$customer_id"
/home/vrichards/salesdb/copy_the_ach_file.pl "$customer_id"
Then our crontab file on appserver01 looks like:
05 0 * * 1-5 /home/vrichards/salesdb/ach_file_suite.sh 31 > /dev/null 2>&1
As a bonus, we still have the individual scripts available for use during troubleshooting exercises should they be necessary.
This is still the poor man’s form of process orchestration, but we’ve eliminated an entire class of support issues by ensuring that all the individual jobs run in the correct sequence, regardless of how long any of those individual jobs takes to complete.
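The wrapper can be pushed a bit further without reaching for a real scheduler: fail fast so a broken file is never edited or transmitted, and log output somewhere useful instead of /dev/null. A sketch of that idea (the step functions below are stand-ins for the real Perl scripts, and the log path is illustrative):

```shell
#!/usr/bin/env bash
# Sketch of a hardened ach_file_suite.sh: stop at the first failed
# step and keep output for troubleshooting. The make/edit/copy
# functions are stand-ins for the real scripts.
set -euo pipefail

customer_id="${1:-31}"
log="/tmp/ach_file_suite_${customer_id}.log"

run_step() {
    # Record each step as it starts; with set -e, a failing step
    # aborts the whole suite rather than feeding garbage downstream.
    echo "$(date -u +%FT%TZ) running: $1" >> "$log"
    "$1"
}

make_the_ach_file() { echo "generated for $customer_id" > "/tmp/ach_${customer_id}"; }
edit_the_ach_file() { echo "edited" >> "/tmp/ach_${customer_id}"; }
copy_the_ach_file() { cp "/tmp/ach_${customer_id}" "/tmp/ach_${customer_id}.sent"; }

run_step make_the_ach_file
run_step edit_the_ach_file
run_step copy_the_ach_file
```

Whether a half-finished run should halt or retry is a judgment call for the team that owns the file, but either beats silently copying a file that was never generated.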
But wait! There’s more!
This is the third in the series. Subscribe or come back tomorrow to see more.