Tuesday, October 25, 2005

Automation for a Living

I get to work on a lot of automation projects. We get a file from somewhere, do something with the file, and then drop another one.

The trouble with working on automation is that, because you are trying to eliminate humans from the equation, the designer tends to be the one who gets contacted when something breaks.

Documentation and logging are two ways to address this problem. A clear explanation of how the process works will help troubleshoot errors (in theory). However, anyone who has worked in this business knows that documentation is anathema: people don't like to write it, and even if it gets written, nobody wants to read it.

For this reason, I try to make the documentation I write at least marginally amusing, so that on the off chance somebody comes along and actually reads page 1, they'll read the rest.

Today we had several "problems" with a layer of our overnight automation. All of those problems were pretty reasonably explained in the processing log files. Basically, the failures occurred OUTSIDE of the automation layer that falls under our umbrella of responsibility. Our processes fall in the middle of a dependency chain, and other links in that chain broke.
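For the curious, here is roughly what "explained in the log files" can look like. This is only a minimal sketch in Python, not our actual code: the names `run_step` and `UpstreamFailure`, the "UPSTREAM"/"OURS" log prefixes, and the uppercase "transformation" are all made up for illustration. The idea is simply that a step in the middle of a chain should say in its log whether the input never showed up (somebody else's link broke) or whether the input was fine and the step itself fell over.

```python
import logging
from pathlib import Path

log = logging.getLogger("overnight")


class UpstreamFailure(Exception):
    """Hypothetical error: the input we depend on is missing or empty."""


def run_step(inbox: Path, outbox: Path) -> Path:
    """Pick up one input file, do something with it, and drop another one."""
    if not inbox.exists() or inbox.stat().st_size == 0:
        # The previous link in the chain did not deliver; log that plainly.
        log.error("UPSTREAM: expected input %s is missing or empty", inbox)
        raise UpstreamFailure(str(inbox))
    try:
        # Placeholder transformation standing in for the real work.
        result = inbox.read_text().upper()
    except Exception:
        # The input was there and we still failed, so this one is on us.
        log.exception("OURS: failed while processing %s", inbox)
        raise
    outbox.write_text(result)
    log.info("OK: wrote %s", outbox)
    return outbox
```

With something like this in place, whoever picks up the phone in the morning can search the log for "UPSTREAM" before searching for the designer.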

What I'm getting at is that working on automation is a lot like being the New England Patriots. When you win (everything works)... well, that's what is expected to happen. But when you lose (something doesn't work), everybody starts calling sports talk radio and declaring the end of the world.

That means that the best thing to do is write manual harnesses around EVERYTHING and make somebody push a button so that you can shift the blame onto them when something goes wrong.

*shudder*

I feel dirty for just thinking that.

2 comments:

Anonymous said...

This sounds like stuff Brian Billick would say if he was a computer programmer instead of a football coach.

Anonymous said...

It's a model I've been slow to buy into but I think you're right. We need tools to 1) allow more and *immediate* manual intervention by OPS, and 2) allow *us* to say "hey, why didn't you run the backup when you saw a problem in automation??"

(sigh)

In the words of the Great Metallica... "... you know it's sad but true..."