Nightmares before Halloween: bad dreams of the CTO/CIO

In honor of the season, I thought I’d share a few recurring nightmares, ones that unfortunately don’t seem to confine themselves to the fall time frame. All of these are chronic worries that have truly kept me up at night; most of them stem from actual real-life situations I’ve encountered.

1. Your CEO calls you and asks you why the web site is down… and you didn’t know it was!

When the company’s web site (or any other mission-critical system) is down, escalation mechanisms need to inform you and inform you fast. Of course, the site should rarely, if ever, be down other than for scheduled maintenance, so putting yourself in that notification loop (subject to calls in the middle of the night) shouldn’t prove too frequent or too painful. If you’re not informed of these situations, I’d argue that either your team isn’t sufficiently on top of detecting them, or they’re “sparing you the pain” of being told. In truth, the pain has to be spread around. The onus of notifying management is one mighty incentive to make sure that the need to do so arises as seldom as possible. Don’t tolerate being part of (much less at the helm of) an organization that purposely or through omission sweeps things under the rug.
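To make the idea of an escalation mechanism concrete, here is a minimal sketch in Python of a time-based escalation policy. The role names, the minute thresholds, and the names `EscalationStep`, `POLICY`, and `contacts_to_notify` are all hypothetical, chosen only for illustration; a real setup would hang this logic off whatever monitoring and paging tools you already run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # notify once the outage has lasted at least this long
    contact: str        # hypothetical role name, not a real paging target


# Assumed policy: thresholds are illustrative, not a recommendation.
POLICY = [
    EscalationStep(0, "on-call engineer"),
    EscalationStep(5, "engineering manager"),
    EscalationStep(15, "CTO"),
]


def contacts_to_notify(minutes_down, policy=POLICY):
    """Return every contact whose threshold the outage has already reached."""
    return [step.contact for step in policy if minutes_down >= step.after_minutes]
```

The point of the design is that the CTO sits in the chain by default: nobody has to decide, mid-crisis, whether to "spare you the pain."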

2. Sudden night sweats: “I’ve spent too much on infrastructure!” Or, just as bad, “I’ve spent too little on infrastructure!”

Do you know for sure that you’re not in one or the other of these situations? Are you really prepared to scale to a degree that keeps pace with your growth? Do you have metrics and processes that support a methodical capacity plan in general? In my view, the “C” in CTO should stand for “Cheap”, because it’s remarkably easy to be the Expensive Technology Officer, rather than the Cheap Technology Officer. But in truth, the C should also stand for “Commensurate”: i.e., you want to build for your anticipated growth, but not be too far ahead of that curve. Anything else (too little, too much) is the making of nightmares.

3. Your software architecture looks fine now but won’t scale.

This scenario covers the possibility that one day you’ll cross some unknown threshold (data volume, for example) where response time or other key metrics go massively and suddenly south. How do you mitigate this nightmare? Well, the paranoid among us will never lose it entirely, but the best (perhaps the only) approach you can take is to conduct intense design reviews, evaluating for potential performance and throughput bottlenecks that could crop up. Another approach (often ignored) is to plan proactively for data “deadwooding” operations. Example: a high-volume web site that has people register should periodically eliminate those registrants who never logged in again within (say) a year of registering. Business users will often clamor for retaining everything, but there’s a real cost to doing so, and a nightmare looming behind it.
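The registrant example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Registrant` record, its field names, and the `find_deadwood` helper are hypothetical, and I'm assuming that any login after registration spares the account. In practice this would be a scheduled job against your own user store.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Registrant:
    user_id: int
    registered_at: datetime
    last_login_at: Optional[datetime]  # None if they never came back


def find_deadwood(registrants, now, grace=timedelta(days=365)):
    """Return registrants past the grace period who never logged in again."""
    deadwood = []
    for r in registrants:
        if now < r.registered_at + grace:
            continue  # still inside the grace window, leave them alone
        if r.last_login_at is None:
            deadwood.append(r)
    return deadwood
```

Returning the candidates rather than deleting them outright leaves room for a review or archive step, which tends to make the conversation with business users easier.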

4. The system goes down and you … can’t…. get…. it…. back…. up.

Today’s systems are frightfully complex, with lots of interconnecting moving parts. I’ve had times, while sitting through a middle-of-the-night crisis conference call discussing a badly non-functional system, where this kind of desperation started to set in. No matter what we tried, the system wouldn’t respond. At times like this, you start to wonder if it ever will. Mitigation? Remember that it always does come back, one way or another. That thought may not cure the nightmare, though. This one’s a doozy.

I commiserate with all of you who may have experienced these and others. Sominex only goes so far to soften the impact of these nightmares. Rather, careful and methodical risk mitigation planning beats pharmaceuticals every time, in my experience.

Comments

  1. Worse is getting word from your boss that the site has been hacked. I had forgotten to restrict commenting on our Joomla site to registered users, so we were getting spammed by some pretty salacious, not-appropriate-for-high-schools sources. Not a real great way to showcase your CMS solution for the boss to see, I tell ya.

    My nightmare, which I have already lived through twice, is for the Exchange server to crash. The first time, with Exchange 5.5, I was able to recover using exmerge. The second time, with 2000, we had to do a brick-level restoration when a power glitch chunked the hard drive. I now back up those Exchange databases and logs three different ways, but the nightmare hasn’t gone away.
