Thursday, August 20, 2009

Discipline (or lack thereof) and Operational Fatigue

“Half of life is luck; the other half is discipline - and that’s the important half, for without discipline you wouldn’t know what to do with luck”- Carl Zuckmeyer

Creative and nonconformists
SaaS companies are mostly composed of a group of highly capable software engineers. These techies are, by nature, creative, imaginative, out-of-the-box engineers, inventing new ideas or new ways of achieving better results. They tend to adopt the latest and greatest technologies and are always looking forward to the next best thing.
Naturally, these engineers are nonconformists and not inclined to follow rules or to stick to routine.
Almost always, they do not come from an enterprise IT environment, where rules and regulations are stricter and operational practices are followed almost religiously.
With the nascent state of SaaS, if the engineers have prior experience, it would mostly come from on-premise, product companies that emphasize features, versatility and usability.
They rarely had to deal with customers, and bugs that were found were handled according to their priority to be fixed in the next release (which could be months away).
Therefore, typical SaaS engineers lack the necessary discipline to run a 24X7 service, and are usually hostile to restrictions imposed on them.

What, me worry?
The lack of discipline manifests itself mainly in Change Management and consequently in Asset Management and, then, consequently in Incident Management.
This refers to what changes are allowed to be done when (‘hey, just to let you guys know, I installed the new patch during lunch break’), how are they approved and communicated (‘yeah, no prob, I tested the code on my laptop – it is foolproof, just a small change in the parsing engine’) how they are recorded and rolled back if necessary (‘don’t worry, I keep all changes in a dedicated notepad on my machine’).
There usually are no rules about touching production. Typically, every engineer has full SUDO access to all servers in the data center, using a single super-user login, so that activities cannot be traced to any specific person.
One-offs can be installed on a particular server and not be documented. Months later when a new version is installed or a server replaced, things fail to work and it may take hours for someone to remember that a special component is not functioning any more.
Lack of a fully functional staging environment may cause an engineer to ‘temporarily test’ some feature on a production machine that either causes service disruption or is forgotten until the fan turns brown.

Operational Fatigue
Operational Fatigue is a term I coined after years in the trenches, of waking up at 3:00 AM to deal with the same problem that hit us three weeks ago; of the stress of dealing with an incident at peak time when Management is hysterical, when Sales are complaining, when Support is overwhelmed with frustrated customers; of making the calls to the high profile customers, explaining, apologizing, promising; of having to explain to the Board why we lost so many customers this quarter.
It gets to you. You discover new gray hair and develop a fear of answering the phone.

The point is – it is avoidable. Instilling the practices and discipline can make life so much easier and allow the ops team to plan and improve instead of fighting fires all the time.

Educating the young
Like toddlers, engineers crave for guidance and discipline, but as most parents would testify, they will make every attempt to break the rules and stretch the envelope to test the boundaries of their environment. Experienced parents will tell you that the young children feel much more secure when they know the rules and when the rules are being enforced. It has been my experience that when I introduced a new set of regulations such as in Change Management, there is always an initial push-back, mumbling about bureaucracy and attempts to circumvent the rules in the beginning. But I have always seen a quick adoption of the new regulations, followed by a realization that life would be so much better if we only stick to the rules – these guys are smart, you know. Many a disaster was avoided by playing the game by the new rules and I found out how quickly the engineers embraced the discipline and started devising ways to improve on and automate the processes.

Just do it!
I recently participated in a round table hosted by HP on the subject of Change Management. Most of the participants were from large IT shops and were talking about adapting to new Change Management processes in terms of six to twelve months. I was astonished. I concede that my background has been with much smaller groups, and I had the full backing of the executive management, but twelve months? Jeez!

The process in my experience was:
· Prepare the documents, templates and work-flows.
· Make a compelling Power Point presentation.
· Present to the Engineering, Ops and Support groups.
· Emphasize the consequences of not following the practice (genitalia hanging at high altitude)
And Voila - It works! A few weeks later you have a spiritual following of admirers, because the fruits of the labor are so obvious in a very short time.

Thursday, August 13, 2009

Transparency in SaaS Service Operations

“Life is filigree work. What is written clearly is not worth much, it's the transparency that counts.” - Louis-Ferdinand Celine

Companies like to boast about their transparency, but in practice, information dissemination is highly controlled. At an on-demand company, hiding the backstage operations seems like a smart thing to do. As long as you are servicing the customer, and as long at the customers do not complain, why should you wash your dirty laundry in the public?
So what about SLAs? The guiding principle seems to be ‘Don’t worry about them if your customers do not demand them’. And even when they do, there are SLAs and then there are SLAs. There are so many ways to interpret these elusive numbers (assuming you even know the real ones) that most companies will portray better results than those that reflect reality.

Varying degrees
There are different modes of Transparency communications; from the non existent to the reactive, the proactive and full disclosure.

The reactive type is the common case where there are service disruptions and customers call in to complain. In this case you will determine how much information you would like to divulge. This could be done with a customer call, an RFO (Reason for Outage) that is sent to particular customers or a message on the corporate site.

A proactive approach would have a Service Status Page depicting the current service availability of the various production systems.

A full disclosure mode will provide customers with a historical view of production systems availability and response time such at Salesforce’s Trust or SAManage’s Status Page .

Advantages of Transparency
My experience has been that the more transparent you are with your customers, the better relationship you will foster with them and the more forgiving they will be when things turn sour. And things do turn sour; it is unavoidable.
Your customers are not dumb (in general, that is – I can relate many amusing stories of individuals that should have not been awarded fourth grade graduation, but that is another story). The people on the other end generally understand that you are dealing with a complex environment with many factors that are not always under your control. They will be willing to accept that scheisse happens, but they also must know that you are ready to accept responsibility and learn from these events. There should be a closure process for each event including Incident Recording, Post Mortem, RFO communication (more on that in Incident Management).
Of course, nothing beats a good, reliable, available and responsive service. If you are not able to provide that, you will end up loosing your customers regardless of how much camouflage and finger pointing are used to cover the smell.

How transparent should you be?
I am not advocating that you have to run out and tell the guys every time you messed up or that you should bombard the customers with a technical exposition as part of the RFO document.
Striking the balance is an art that comes with practice and common sense. If an incident occurred that did not disrupt services, you must undergo the full Incident life-cycle practice to ensure that lessons are learned and the incident will not repeat. But you do not necessarily have to go and boast about it.
As for the RFO, in my days I have been asked to put my signature on many customer facing documents that had a bland, general, canned message that meant nothing to the reader. (“service was lost do to a system failure”). I realized that customers will not trust the messaging and choose to either ignore it while snorting in disgust or have a techie call in and start drilling the poor customer service rep for technical details which would be hard to provide.
I have also seen RFOs that contained multiple pages and read like a PHd dissertation in electronic engineering. I do not know who approved these RFOs and if the purpose was to wear down the suffering reader so that further RFOs will never be requested.

Company Culture
And finally, keep in mind that if the company’s culture tolerates half-truths and spins when facing the customer, you run the risk of it percolating through the company’s internal activities and reports. Don’t you expect your employees to be truthful, accountable and not shy away from reporting mistakes, even if it makes them look not too great? Your customers have to expect your company to do the same. And, if the results of truthful reporting will cost you a customer then something was probably wrong with the relationship to begin with, and the customer may have been looking for an excuse to break away.