Dani's Perspective on SaaS: SaaS Incident Management Best Practices

“If you want a happy ending, that depends, of course, on where you stop your story.” (Orson Welles)

This is the final post, covering Incident Management in a SaaS Operational Environment.

The previous posts (the Prologue, Act I and Act II) covered the preparations, how to react, and then how to act during an incident.

As much as one would like to sit back and sip a cool Coke/beer once the incident is over and service is restored, “it ain’t over till it’s over”. There is much work to be done to get closure.

Notify - Update concerned parties of service restoration

All parties - internal, partners and customers - should be notified that the incident is over.
The first action is to post an all-clear message on the Service Status Page.
Use the same mailing lists (internal and external) that were used to notify of the incident, to spread the good word that service is restored.

Update “Top 10” customers. The concept of the Top 10 is used in other STORM™ practices and is used metaphorically, since the list may contain only seven or as many as twenty-seven customers. This is a small group of key functionaries at a select collection of customers with whom the company has developed special relations. These customers tend to be the larger ones, the most profitable ones, or strategic in one way or another. These key players should be handled with Tender Loving Care. If the problem was severe enough and the customer is important enough, the executives of the SaaS provider may make the calls themselves. The Top 10 must be updated ASAP, so that they do not hear of the issues and the resolution from other sources. If the incident was prolonged, there is a good chance that some members of the Top 10 were already contacted at the early stages, in the Notify phase.

Record Incident

Recording the incident has a number of goals/benefits.

Allows the company to manage the SLAs in order to determine who was affected and for how long
An important KPI – allows the company to measure progress across time
Used in the Review and simplifies the creation of the RFO (below)
Enables the company to improve across processes, components and Incident Management in the future.
An important datum of OI (Operations Intelligence), to be used in analytics, prediction and cost cutting.

Each incident must be documented with as much detail as possible and should include the following:

Events leading to outage
Time-line of events and actions
Components that were affected
Customers, or customer groups that were affected
Resolution - what was done to resolve the problem (may include technical data such as commands that were run)
Indication of full/partial/no outage
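
The fields listed above map naturally onto a small record type. The following is a minimal sketch in Python, not a prescribed schema: the class names, field names and the `downtime_per_customer` helper are all hypothetical, chosen only to illustrate how a recorded incident could feed the SLA question of who was affected and for how long.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import Dict, Iterable, List


class OutageLevel(Enum):
    """Indication of full/partial/no outage (names are illustrative)."""
    FULL = "full"
    PARTIAL = "partial"
    NONE = "none"


@dataclass
class TimelineEntry:
    """One event or action on the incident time-line."""
    at: datetime
    note: str


@dataclass
class IncidentRecord:
    """Hypothetical record holding the fields listed in the text."""
    leading_events: str          # events leading to the outage
    started_at: datetime
    restored_at: datetime
    outage_level: OutageLevel
    resolution: str              # what was done, may include commands run
    timeline: List[TimelineEntry] = field(default_factory=list)
    affected_components: List[str] = field(default_factory=list)
    affected_customers: List[str] = field(default_factory=list)

    def duration(self) -> timedelta:
        """Time between the start of the incident and service restoration."""
        return self.restored_at - self.started_at


def downtime_per_customer(records: Iterable[IncidentRecord]) -> Dict[str, timedelta]:
    """Sum outage time per affected customer, skipping 'no outage' incidents."""
    totals: Dict[str, timedelta] = {}
    for record in records:
        if record.outage_level is OutageLevel.NONE:
            continue
        for customer in record.affected_customers:
            totals[customer] = totals.get(customer, timedelta()) + record.duration()
    return totals
```

Whether partial outages should count in full toward an SLA is a contract decision; this sketch simply treats them the same as full outages.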

NOTE: Any changes to components in the system must be reported to the Asset Management Database and Change Management tables.

Review - Post Mortem

The value of the incident review cannot be overstated. The staff attending the Post Mortem should include everyone who was involved in the incident, or at least a representative from each function. For example, if multiple CSRs or multiple Ops engineers were involved, it will suffice if one member of each group attends, but that person needs to collect all the relevant information from that group’s perspective.
The review should take place as soon as the relevant information is available, and at most two days following the incident.
A successful Post Mortem should result in:

A clear view of the events, activity and timeline
Understanding the root cause
Understanding the damages
Analyzing what worked out and what did not during the incident
Verifying that Known Problems and Workarounds are updated on the Knowledge Base
Verifying that all notifications and customer-facing activities have been performed
Remediation steps
Lessons learned and what should be done differently next time
Agreement on the messaging and/or distribution of the RFO (below)

RFO – Reason for Outage

The final step in the closure is filling out and sending the RFO, using a pre-defined template.
The RFO is a document, expected by the customers, that describes the incident, its cause and what was done to minimize future recurrences.
Sending out the RFO should be done with careful thought. First, the wording and messaging are important. One needs to be as transparent as possible without seeming like complete fools. Also, if only a portion of the customers were affected, perhaps it is wise not to advertise the service degradation to those who had no idea that anything was amiss. This decision is left to company policy or decided per incident at the Post Mortem.

The RFO should include these fields:

Meaningful title
Date of Outage
Time and duration
Incident description
Affected services
Root cause
Resolution
Next steps
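
As an illustration of a pre-defined template, here is a minimal Python sketch that renders the fields above into plain text and refuses to produce an RFO if any of them is empty. The template wording, the field keys and the `render_rfo` function are assumptions made for this example; an actual template would follow company messaging policy.

```python
from string import Template

# Keys are hypothetical names for the RFO fields listed in the text.
REQUIRED_FIELDS = (
    "title",
    "date_of_outage",
    "time_and_duration",
    "incident_description",
    "affected_services",
    "root_cause",
    "resolution",
    "next_steps",
)

RFO_TEMPLATE = Template(
    "Reason for Outage: $title\n"
    "Date of outage: $date_of_outage\n"
    "Time and duration: $time_and_duration\n"
    "Incident description: $incident_description\n"
    "Affected services: $affected_services\n"
    "Root cause: $root_cause\n"
    "Resolution: $resolution\n"
    "Next steps: $next_steps\n"
)


def render_rfo(fields):
    """Render the RFO text, raising ValueError if a required field is blank."""
    missing = [
        name for name in REQUIRED_FIELDS
        if not str(fields.get(name, "")).strip()
    ]
    if missing:
        raise ValueError("RFO is missing fields: " + ", ".join(missing))
    return RFO_TEMPLATE.substitute(fields)
```

Failing loudly on a blank field is a deliberate choice in this sketch: an RFO sent without a root cause or next steps is likely to raise more questions than it answers.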

This is the fourth and last article in the Incident Management series. The book, when published, will contain more details and useful templates.

May the STORM™ be with you.

Thursday, August 28, 2014

SaaS Incident Management Best Practices - Epilogue (The STORM™ Series)
