Thursday, August 20, 2009

Discipline (or lack thereof) and Operational Fatigue

“Half of life is luck; the other half is discipline - and that’s the important half, for without discipline you wouldn’t know what to do with luck”- Carl Zuckmeyer

Creative and nonconformists
SaaS companies are mostly composed of a group of highly capable software engineers. These techies are, by nature, creative, imaginative, out-of-the-box engineers, inventing new ideas or new ways of achieving better results. They tend to adopt the latest and greatest technologies and are always looking forward to the next best thing.
Naturally, these engineers are nonconformists and not inclined to follow rules or to stick to routine.
Almost always, they do not come from an enterprise IT environment, where rules and regulations are stricter and operational practices are followed almost religiously.
With the nascent state of SaaS, if the engineers have prior experience, it would mostly come from on-premise, product companies that emphasize features, versatility and usability.
They rarely had to deal with customers, and bugs that were found were handled according to their priority to be fixed in the next release (which could be months away).
Therefore, typical SaaS engineers lack the necessary discipline to run a 24X7 service, and are usually hostile to restrictions imposed on them.

What, me worry?
The lack of discipline manifests itself mainly in Change Management and consequently in Asset Management and, then, consequently in Incident Management.
This refers to what changes are allowed to be done when (‘hey, just to let you guys know, I installed the new patch during lunch break’), how are they approved and communicated (‘yeah, no prob, I tested the code on my laptop – it is foolproof, just a small change in the parsing engine’) how they are recorded and rolled back if necessary (‘don’t worry, I keep all changes in a dedicated notepad on my machine’).
There usually are no rules about touching production. Typically, every engineer has full SUDO access to all servers in the data center, using a single super-user login, so that activities cannot be traced to any specific person.
One-offs can be installed on a particular server and not be documented. Months later when a new version is installed or a server replaced, things fail to work and it may take hours for someone to remember that a special component is not functioning any more.
Lack of a fully functional staging environment may cause an engineer to ‘temporarily test’ some feature on a production machine that either causes service disruption or is forgotten until the fan turns brown.


Operational Fatigue
Operational Fatigue is a term I coined after years in the trenches, of waking up at 3:00 AM to deal with the same problem that hit us three weeks ago; of the stress of dealing with an incident at peak time when Management is hysterical, when Sales are complaining, when Support is overwhelmed with frustrated customers; of making the calls to the high profile customers, explaining, apologizing, promising; of having to explain to the Board why we lost so many customers this quarter.
It gets to you. You discover new gray hair and develop a fear of answering the phone.

The point is – it is avoidable. Instilling the practices and discipline can make life so much easier and allow the ops team to plan and improve instead of fighting fires all the time.

Educating the young
Like toddlers, engineers crave for guidance and discipline, but as most parents would testify, they will make every attempt to break the rules and stretch the envelope to test the boundaries of their environment. Experienced parents will tell you that the young children feel much more secure when they know the rules and when the rules are being enforced. It has been my experience that when I introduced a new set of regulations such as in Change Management, there is always an initial push-back, mumbling about bureaucracy and attempts to circumvent the rules in the beginning. But I have always seen a quick adoption of the new regulations, followed by a realization that life would be so much better if we only stick to the rules – these guys are smart, you know. Many a disaster was avoided by playing the game by the new rules and I found out how quickly the engineers embraced the discipline and started devising ways to improve on and automate the processes.

Just do it!
I recently participated in a round table hosted by HP on the subject of Change Management. Most of the participants were from large IT shops and were talking about adapting to new Change Management processes in terms of six to twelve months. I was astonished. I concede that my background has been with much smaller groups, and I had the full backing of the executive management, but twelve months? Jeez!

The process in my experience was:
· Prepare the documents, templates and work-flows.
· Make a compelling Power Point presentation.
· Present to the Engineering, Ops and Support groups.
· Emphasize the consequences of not following the practice (genitalia hanging at high altitude)
And Voila - It works! A few weeks later you have a spiritual following of admirers, because the fruits of the labor are so obvious in a very short time.


6 comments:

Guy Nirpaz said...

Dani,

Interesting post.
I found a lot of resemblance between Ops management and development management.
Engineers are 'good doers' by nature. Nobody is trying to harm any system and distabalize it.
It's a matter of local vs. global optimizations.
Local optimization in many cases result in worse system service level performance.
Once teams have enough transparency and visibility they will make the right choices.

Anonymous said...

Genial dispatch and this mail helped me alot in my college assignement. Thank you for your information.

Anonymous said...

Easily I assent to but I think the post should have more info then it has.

Dani Shomron said...

I could agree, but I think the comment should have more info than it has.
What kind of info are you looking for?

David said...

I'm sold on standard change management processes, whether conventional or more agile based. The same goes for incident management. They are two critical parts of the operational process. What I am really interested in, though, is how to manage time and workflow for the operations teams. Do you have any blogs on that?

David said...

I'm already sold on change management (and incident management, by the way). What I am really looking for is a better tool than ticketing to track workflow in SaaS operations teams. I've tried tickets, I've tried Kanban. Scrum doesn't really work. Do you have any blogs or musings on this topic?