Scheduled Job Anti-Patterns - Everything is Important

Tags: programming devops

Scheduled jobs tend to suck a bit. They’re usually written after they’re needed and dropped into place with little testing and no plans for fixing them when things go pear-shaped.

This is the first post in a series. Here's the full list of anti-patterns we'll cover:

  • Everything is important, email all results
  • Nothing will break, so why worry
  • Workflow orchestration with cron
  • It’s just a script, we don’t need version control

Everything is important, email all the results

The tricky thing about configuring a scheduled job for the first time is making sure it actually works. The job is going to run in the dark recesses of the machine, and you’re going to wonder if it’s really working at first. Email’s a super easy solution to that problem.

Or maybe the sales and refund processing jobs didn't run for three days last week because a disk filled up. Customers are angry, and all of that bile and vitriol is landing right on top of your boss. These broken jobs are the most important problem in the world right now on the second floor of Spacely Sprockets, and the sternly worded interoffice memo makes that painfully clear through exuberant use of exclamation points and terrible grammar.

Marching orders in hand, you configure every single scheduled job in the environment to send an email on completion or failure.

The only thing that spreads faster than the new notifications is the set of email rules blazing through the systems and operations group to filter all of that crap out of their inboxes.

Oh, look, we’re back to square one! As my manager likes to say: if everything is important, then nothing is important.

So how do we dodge this?

  1. Find a cron wrapper that only notifies the team on errors. Cronic makes this super easy to pull off. It's a shell script, so it should run just about anywhere that matters without compiling anything. (There's a sample crontab entry after this list.)
  2. Edit jobs to write their results to a centralized location for intelligent monitoring and notifications. This has its merits, but you're going to need some additional infrastructure and development time. A big enough shop will justify it; email can only scale so far. (There's a sketch of that below, too.)
  3. Dedicate full-time staff to sift through mailboxes full of cron notifications and manually re-trigger jobs. There are people who actually do this. If you think it's a sound solution, you're probably in the wrong job. There's probably a bank someplace that would love to hire you as a computer operator.
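
Here's what option 1 looks like in practice. A minimal sketch of a crontab entry, assuming a made-up job script and on-call address; cronic stays quiet when the command exits cleanly and hands the output back to cron (and thus to MAILTO) only when something fails:

    # /etc/cron.d/refunds -- hypothetical job
    # cron only sends mail when cronic sees a non-zero exit or output on stderr
    MAILTO=oncall@example.com
    15 2 * * *  appuser  cronic /usr/local/bin/process-refunds.sh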
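
Option 2 can start small, too. Here's a rough Python sketch of a wrapper that runs a job and posts the result to a central store; the endpoint, the script path, and the job name are all hypothetical, and the real notification logic lives on the other end of that HTTP call:

    # job_report.py -- hypothetical wrapper: run a job, record the result centrally,
    # and let something smarter than your inbox decide who gets paged.
    import json
    import subprocess
    import time
    import urllib.request

    def report_run(name, cmd, endpoint="http://jobs.internal/api/runs"):
        start = time.time()
        proc = subprocess.run(cmd, capture_output=True, text=True)
        record = {
            "job": name,
            "exit_code": proc.returncode,
            "duration_sec": round(time.time() - start, 2),
            "stderr_tail": proc.stderr[-1000:],  # just enough to start debugging
        }
        req = urllib.request.Request(
            endpoint,
            data=json.dumps(record).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # the central system decides who, if anyone, gets paged
        return proc.returncode

    if __name__ == "__main__":
        raise SystemExit(report_run("refunds", ["/usr/local/bin/process-refunds.sh"]))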

What if the notification systems break? Or the scheduler just stops?

Oh, they’ll break. Give it time. They’ll break spectacularly.

The big problem with any job monitoring system is watching the watchmen. Your people and machines can go completely off the rails and fail at their jobs for any number of reasons. That's what sent us down this path of notifying on everything in the first place, remember? People take sick days, and machines have hard drives that fill up. OK, the hard drives shouldn't fill up, but if you're so smart, why are you sending emails from every single one of your scheduled jobs?

Whatever system we use to notify the team of job failures, we also need to make sure the job scheduler and the notification pipeline have a snug and warm place carved out in the NOC's monitoring.
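
One cheap way to watch the watchmen is a heartbeat check: the job wrapper touches a file every time it runs, and a check the NOC's monitoring already runs screams when that file goes stale. A minimal Python sketch using Nagios-style exit codes; the path and the threshold are whatever makes sense for your schedule:

    # heartbeat_check.py -- hypothetical watchdog the monitoring system runs every
    # few minutes; it fails loudly if the cron wrapper hasn't checked in lately.
    import os
    import sys
    import time

    HEARTBEAT = "/var/run/cron-wrapper.heartbeat"  # the wrapper touches this on every run
    MAX_AGE = 30 * 60                              # complain after 30 quiet minutes

    def main():
        try:
            age = time.time() - os.path.getmtime(HEARTBEAT)
        except FileNotFoundError:
            print("CRITICAL: heartbeat file is missing")
            return 2
        if age > MAX_AGE:
            print(f"CRITICAL: last wrapper run was {int(age // 60)} minutes ago")
            return 2
        print("OK: job wrapper checked in recently")
        return 0

    if __name__ == "__main__":
        sys.exit(main())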

Any decent method for sorting the wheat from the chaff, plus solid monitoring of the servers that do the grinding, should set the course for a sane scheduled job environment.

But wait! There’s more!

This is just the first in the series. Subscribe or come back tomorrow to see more.