“Hot stove” lessons, part II: development and operations

I noted last time, once again, that “IT is hard. In fact, it’s so hard that it seems most people have to learn certain core lessons by themselves. It seems like everyone needs to burn his or her own hand on the hot stove.” I went through some examples of these “hot stove” lessons particular to management; this time, let’s talk about similar “hot stove” lessons and myths I’ve observed in other IT areas, most notably development and operations.

  • Source code control and release management. One of the traits of a superlative programmer is the ability to maintain a complete logical model of the system or program in his or her head. The really good ones are in fact really good at this. Unfortunately, that consummate skill sometimes leads them, ironically, to resist tools that help with the logistical pitfalls that arise as system complexity increases. Source code control is the most important such tool. Programmers have even told me, “I can just keep track of what I’ve changed and what’s where.” In a way, I suppose this is a species of the (typically male) trait of refusing to ask for directions: it’s a reluctance to embrace appropriate tools that are designed to avoid common screw-ups and to facilitate overall team success. Suddenly, though, complexity mushrooms: releases overlap, patches are made in the heat of the moment, and without impeccable source code control and release management, bugs reappear, QA takes longer, and so on. Typically, I’ve seen an organization not really learn this “hot stove” lesson until a source code control failure causes a notable customer-facing issue with a major release.

  • Avoiding reporting against production. Here, the going-in position of many developers is that “it’s fine to run reports against our production data; we’ll change it if it becomes a problem.” This is another example of development activities incurring implicit technical debt. What often happens, of course, is that certain ad hoc reports suddenly become regular and popular throughout the business, with business decisions resting on what they reveal. New variations get created over time, and overall demand (and frequency of access) soars. Then unanticipated and unpleasant ripples occur, often without an obvious link to their cause: system performance problems, unpredictable timeouts, disk shortages, and so on. Once things reach that point, it’s actually quite difficult to step back and re-architect seamlessly, without a hiccup. In other words, by the time everyone becomes glaringly aware of the problem, you’ve painted yourself into a corner in terms of the ability to fix it quickly.

  • Separation of development and operations. As companies grow from small numbers of employees to larger enterprises, it’s a common rite of passage for the organization to wrestle with the role of development in ongoing operations. [See my post on Speed vs. bureaucracy: management issues confronted by companies in transition.] Originally in such companies, of course, “everybody did everything,” and people are often proud of and wistful for those halcyon days. Reducing people’s long-held access to systems is often seen as bureaucratic and authoritarian. Yet it’s necessary. The more you institute a healthy separation of duties and systems, the more likely you are to ensure clean handoffs from development into production, and the more likely you are to avoid the sorts of production problems that ensue when developers make and apply “just one little quick fix” in the heat of the moment. In truth, only minimal developer intervention should ever be needed for production purposes. Developers should have no operational responsibilities and no regular access to production data. I once witnessed a crisis moment at a mid-level startup in which the CEO wanted a developer to code a patch to our enterprise software product in the middle of the night and push it out that same night to all other customers as well. That kind of freewheeling approach simply can’t succeed over time, and that company didn’t. In that case, the “hot stove” resulted in the whole company and its investors getting burned.

  • Full disclosure of problems and issues. Maybe it’s human nature, but I’ve seen a common trend among IT staffers who haven’t learned this particular “hot stove” lesson. It’s a general feeling that they should keep potentially damaging information quiet, and that “what people don’t know won’t hurt us.” That philosophy shows up when they resist informing the company widely of system outages (often applying an IT kind of twist to the Five Second Rule). Or when they are aware of critical bugs but resist wider debate over them because they believe the customer won’t actually notice. Or when system performance problems in testing aren’t escalated, again in the belief that it’ll all turn out to be OK in production environments. In my IT executive experience, there is no lesson more critical than the need for IT to create and actively encourage an atmosphere of full trust and open disclosure about system issues across the enterprise. All it takes to undermine that trust, and undermine it for a very long time, is one unfortunate instance of IT sweeping something under the rug and getting “caught” at it. Then a cycle often ensues: the reaction to the error itself is so strong and painful that IT folks are even more reluctant, the next time around, to disclose anything similar. IT management can best help here by insisting on airing the dirty laundry, as I’ve discussed before.
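To make the “avoiding reporting against production” point a bit more concrete, here is a minimal sketch of one way to keep reports off the production database from day one: every workload asks a single function where to connect, and anything flagged as a report is sent to a separate reporting copy. All of the names here (`PRIMARY_DSN`, `REPORTING_DSN`, `Workload`, `dsn_for`) and the connection strings are hypothetical, invented purely for illustration; how the reporting copy gets populated (replication, nightly extract, and so on) is assumed and out of scope.

```python
from dataclasses import dataclass

# Hypothetical connection targets; names and DSN strings are illustrative only.
PRIMARY_DSN = "postgresql://app@primary.internal/app"
REPORTING_DSN = "postgresql://reports@replica.internal/app"


@dataclass(frozen=True)
class Workload:
    """A named consumer of the database (hypothetical type for this sketch)."""
    name: str
    is_report: bool


def dsn_for(workload: Workload) -> str:
    """Return the database a workload should connect to.

    Reports are never handed the primary DSN, so ad hoc queries cannot
    compete with customer-facing transactions for the same resources.
    """
    return REPORTING_DSN if workload.is_report else PRIMARY_DSN


checkout = Workload("checkout", is_report=False)
weekly_sales = Workload("weekly_sales_report", is_report=True)
```

The design choice is simply that the routing decision lives in one place. If an ad hoc report later becomes popular, its load lands on the reporting copy rather than surfacing as mysterious production timeouts.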

Note here that the underlying theme of these “hot stove” lessons is that people resist learning them. Advice and counsel alone do not seem to hammer them home. IT management, if appropriately aware of these pitfalls, can (and should) insist on instituting mechanisms that will prevent the worst of their impacts, but be aware that doing so will generally meet some resistance, both up and down the management chain. In essence, it may just be that people sometimes need to, well, get burned.
