Thursday, August 28, 2014

SaaS Incident Management Best Practices - Epilogue (The STORM™ Series)

“If you want a happy ending, that depends, of course, on where you stop your story.” (Orson Welles)

This is the final post, covering Incident Management in a SaaS Operational Environment.

The previous posts (the Prologue, Act I and Act II) covered the preparations, then how to react and how to act during an incident.

As much as one would like to sit back and sip a cool Coke/beer once the incident is over and service is restored, “it ain’t over till it’s over”. There is much work to be done to get closure.

Notify - Update concerned parties of service restoration

All parties - internal, partners and customers - should be notified that the incident is over.
The first action is to post an all-clear message on the Service Status Page.
Use the same mailing lists (internal and external) that were used to notify of the incident, to spread the good word that service is restored.

Update “Top 10” customers. The concept of the Top 10 appears in other STORM™ practices and is used metaphorically, since the list may contain only seven customers, or twenty-seven. This is a small group of key functionaries at a select collection of customers with whom the company has developed special relationships. These customers tend to be the largest ones, the most profitable ones, or strategic in one way or another. These key players should be handled with Tender Loving Care. Following some incidents, if the problem was severe enough and the customer important enough, the SaaS provider’s executives should be the ones making the calls. The Top 10 must be updated ASAP, so that they do not hear of the issue and its resolution from other sources. If the incident was prolonged, there is a good chance that some members of the Top 10 were already contacted at an early stage, during the Notify phase.

Record Incident 

Recording the incident has a number of goals/benefits.

  1. Allows the company to manage the SLAs in order to determine who was affected and for how long
  2. An important KPI – allows the company to measure progress across time
  3. Used in the Review and simplifies the creation of the RFO (below)
  4. Enables the company to improve across processes, components and Incident Management in the future.
  5. An important datum of OI – Operations Intelligence - to be used in analytics, prediction and cost cutting.

Each incident must be documented with as much detail as possible and should include the following:

  • Events leading to outage 
  • Time-line of events and actions
  • Components that were affected 
  • Customers, or customer groups that were affected 
  • Resolution - what was done to resolve the problem (may include technical data such as commands that were run) 
  • Indication of full/partial/no outage 
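To make this concrete, here is a minimal sketch of such a record in Python (the field names are illustrative, not part of any STORM™ template):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple

@dataclass
class IncidentRecord:
    """Structured record of a single incident (illustrative field names)."""
    title: str
    preceding_events: List[str]               # events leading to the outage
    timeline: List[Tuple[datetime, str]]      # (timestamp, action) pairs
    affected_components: List[str]
    affected_customers: List[str]
    resolution: str                           # what was done, incl. commands run
    outage_level: str                         # "full", "partial" or "none"

# Invented sample data, for illustration only
record = IncidentRecord(
    title="Reporting DB replication lag",
    preceding_events=["Schema change deployed to primary"],
    timeline=[(datetime(2014, 8, 28, 9, 15), "Alert fired"),
              (datetime(2014, 8, 28, 9, 40), "Replication restarted")],
    affected_components=["reporting-db"],
    affected_customers=["acme-corp"],
    resolution="Restarted replication; re-synced from snapshot",
    outage_level="partial",
)
```

A structured record like this is what later makes the SLA accounting, the Post Mortem and the RFO almost mechanical to produce.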

NOTE: Any changes to components in the system must be reported to the Asset Management Database and the Change Management tables

Review - Post Mortem

The value of the incident review cannot be overstated. The staff attending the Post Mortem should include everyone who was involved in the incident, or at least a representative from each function. E.g. if multiple CSRs or multiple Ops engineers were involved, it suffices for one member of each group to attend, but that person needs to collect all the relevant information from the group’s perspective.
The review should take place as soon as the relevant information has been gathered, and at most two days following the incident.
A successful Post Mortem should result in:

  • A clear view of the events, activity and timeline 
  • Understanding the root cause
  • Understanding the damages
  • Analyzing what worked out and what did not during the incident
  • Verifying that Known Problems and Workarounds are updated on the Knowledge Base
  • Verifying that all notifications and customer facing activity has been performed
  • Remediation steps
  • Lessons learned and what should be done differently next time
  • Agreement on the messaging and/or distribution of the RFO (below)

RFO – Reason for Outage

The final step in the closure is filling out and sending the RFO, using a pre-defined template.
The RFO is a document, expected by the customers, describing the incident, its cause and what was done to minimize future recurrences.
Sending out the RFO should be done with careful thought. First, the wording and messaging are important: one needs to be as transparent as possible without seeming like complete fools. Also, if only a portion of the customers were affected, perhaps it is wise not to advertise the service degradation to those who had no idea that anything was amiss. This decision is left to company policy, or decided per incident at the Post Mortem.

The RFO should include these fields:

  • Meaningful title
  • Date of Outage
  • Time and duration
  • Incident description
  • Affected services
  • Root cause
  • Resolution
  • Next steps
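As a sketch, the pre-defined template can be as simple as a fill-in-the-blanks document; here is a minimal Python illustration (the field names follow the list above; the sample values are invented):

```python
from string import Template

# Pre-defined RFO template; wording and layout are illustrative
RFO_TEMPLATE = Template("""\
RFO: $title
Date of outage: $date
Time and duration: $time_and_duration
Incident description: $description
Affected services: $affected_services
Root cause: $root_cause
Resolution: $resolution
Next steps: $next_steps
""")

rfo = RFO_TEMPLATE.substitute(
    title="Degraded report generation",
    date="2014-08-28",
    time_and_duration="09:15-10:05 UTC (50 minutes)",
    description="Scheduled reports were delayed for a subset of customers.",
    affected_services="Reporting",
    root_cause="Replication lag after a schema change on the primary DB.",
    resolution="Replication restarted and re-synced from snapshot.",
    next_steps="Schema changes will be rehearsed on a staging replica first.",
)
```

Having the template pre-defined means the only work left after the Post Mortem is agreeing on the wording of each field.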

This is the fourth and last article of the Incident Management series. The book, when published, will contain more details and useful templates.

May the STORM™ be with you.

Wednesday, August 13, 2014

SaaS Incident Management Best Practices - Act II (The STORM™ Series)

“A thought which does not result in an action is nothing much, and an action which does not proceed from a thought is nothing at all” (Georges Bernanos)

This post is the third, covering Incident Management in a SaaS Operational Environment.
(This post first appeared earlier this week on SaaS Addict)

The previous post, covering the initial activities of an incident, discussed the more reactive tasks, namely Detection, Recording and Classification. This post will discuss the proactive stages leading to resolution.

Notification – Inform everybody of the incident.

There are three groups that must be made aware of the incident as soon as it is classified:

  • Internal staff. A predefined list of who gets notified within the company must exist. Whether notification is done via email, chat, WhatsApp, phone call or carrier pigeon should be determined ahead of time, according to the classification (urgency and impact). You do NOT want a situation where a major customer is the one informing the sales rep of a problem.
  • Customers. Sometimes the classification will determine that no customers are impacted right now and that service can be restored shortly; in this case there is no advantage in creating mass hysteria. The Status Page (as described in the first post) should be updated first. Then, depending on many circumstances, there are options of sending an email to all customers, affected customers, highly valued customers, etc. Under a certain set of rules, account managers may call their customers to inform them personally. If the application (is not down and) has a notification box, this is a good opportunity to inform actual users of problems.
  • Partners / Channels. Don’t forget your partners. Sometimes in the heat of an incident they are not notified. It may affect them and their customers.
The points I am trying to nail down are:
  1. Do not risk having customers discover on their own that there are problems – if they are likely to find out, make sure you are the one informing them.
  2. Try to determine all this activity prior to the incident, not while you’re in the middle of it.
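Determining all this ahead of time can be as simple as a lookup table keyed on the classification. A minimal sketch, with invented channel and audience names:

```python
# Map an (urgency, impact) classification to channels and audiences.
# All names and combinations here are illustrative and decided ahead of time.
NOTIFICATION_MATRIX = {
    ("high", "high"): {"channels": ["phone", "email", "status_page"],
                       "audiences": ["internal", "customers", "partners"]},
    ("high", "low"):  {"channels": ["email", "status_page"],
                       "audiences": ["internal", "affected_customers"]},
    ("low", "high"):  {"channels": ["email", "status_page"],
                       "audiences": ["internal", "customers"]},
    ("low", "low"):   {"channels": ["email"],
                       "audiences": ["internal"]},
}

def notification_plan(urgency, impact):
    """Return the predefined notification plan for a classified incident."""
    return NOTIFICATION_MATRIX[(urgency, impact)]

plan = notification_plan("high", "high")
```

The point of the table is that nobody has to improvise the distribution list in the middle of an outage.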

Note: Status Page
This is part of the Notification process, but it merits its own section.
The first Status Page I implemented was at a SaaS provider whose service was business critical. Before we implemented it, each event, real or imaginary, would generate hundreds of calls to the support center. The lines would clog up and the customers would leave angry or frustrated messages. They would try again later and still get the ‘please leave your message’. After the event was over the exhausted CSRs would have to open a helpdesk ticket for every recorded message, and call back the users. This was not only wasted effort and time consuming, but we ended up with many frustrated customers.

Once the Status Page was implemented, it took a few weeks to get the customers used to checking it, and the number of calls we got during an incident was reduced by two orders of magnitude!
Keep in mind that the Status Page should be updated regularly, with a timestamp attached. Any information that can be provided to the customers will boost their confidence and give them a sense of how soon the problem will be resolved.
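A minimal sketch of such timestamped updates (the in-memory list here stands in for whatever actually feeds your Status Page):

```python
from datetime import datetime, timezone

status_log = []  # stand-in for the store behind the public Status Page

def post_status_update(message):
    """Append a timestamped update; the timestamp tells customers the info is fresh."""
    entry = {"at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
             "message": message}
    status_log.append(entry)
    return entry

post_status_update("Investigating degraded report generation.")
post_status_update("Root cause identified; ETA to restore: 30 minutes.")
```

Even a crude log like this, rendered on a public page, is what turned hundreds of support calls into a trickle.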

Escalate - Get the relevant people working on the problem ASAP

Having planned the Escalation Path in advance, as recommended in the previous post, this should be a straightforward process. Some issues may be resolved by a level-1 operator, but assume that in major incidents everybody will be involved. It is important to stick to the escalation path so as not to hinder the Investigation process.

It is imperative that an Incident Manager be assigned to the particular event. It may be decided in advance or ad-hoc. The IM gathers the relevant staff in the War Room (below) and manages the whole process, assigns tasks, collects information and ensures that the whole process is recorded.

Investigate - Determine the root cause

At this point we should have the following:

  • An assigned Incident Manager
  • People of relevance gathered together in the ’War Room’, whether physical or virtual
  • Understanding of the problem – what is not functioning
  • Understanding of the impact – who is suffering from it and how urgent is it
  • Understanding of the affected component – sometimes it is obvious from the onset that a major component is down, via monitoring or a report from a service provider, but in complex systems this is not always possible. Sometimes a problem in one sub-system will manifest itself as a problem in another dependent sub-system. The Known Problems in the knowledgebase should be very helpful.
  • The Component-Customer mapping, as described in the previous post, can help determine the culprit.
  • Assuming you have been following the practices of STORM™ Change Management, you will have at your fingertips a query of all changes made to the system in the last X hours. There is a very high correlation between changes and failures, so it is safe to assume that the problem will become obvious. Keep in mind that changes should include everything in your production domain, including your service providers and your customers.
  • The Knowledgebase, as described in the previous post, might point to similar cases encountered in the past.
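A sketch of such a last-X-hours query, assuming the change records are available as simple timestamped entries (the data here is invented):

```python
from datetime import datetime, timedelta

# Illustrative change log; in practice this comes from the Change Management tables
change_log = [
    {"at": datetime(2014, 8, 13, 6, 0),  "component": "reporting-db", "change": "schema migration"},
    {"at": datetime(2014, 8, 12, 22, 0), "component": "web-frontend", "change": "CSS tweak"},
    {"at": datetime(2014, 8, 10, 9, 0),  "component": "billing",      "change": "rate update"},
]

def recent_changes(log, now, hours):
    """Changes within the last `hours` -- the prime suspects for the incident."""
    cutoff = now - timedelta(hours=hours)
    return [c for c in log if c["at"] >= cutoff]

suspects = recent_changes(change_log, now=datetime(2014, 8, 13, 9, 0), hours=12)
```

Running this as the first investigative step exploits the change-failure correlation the text describes: most of the time the culprit is in the short list it returns.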

Note: War Room
As described in the Prologue, a quiet environment, containing only people who might contribute to the process, is vital. The War Room should include up-to-date information on all aspects and allow open communication between all parties. There should be a single entity, the Incident Manager, running the show, gathering information and assigning tasks to the various participants.
It is important to keep out of the room any person who might add unnecessary pressure and the IM should feel confident enough to kick the CEO out of the room if it is deemed necessary.
Remember that a customer support representative is present as well. The CSRs’ job is to report on any new developments from the customers’ point of view and to communicate to the customer base any progress, preferably through the Status Page.

Restore Service - Allow your customers to continue working

While still in the War Room, the process of restoring the service is done. There are usually three options:
  1. Resolving the problem. Sometimes the issue is straightforward and can be resolved by firing up a backup server, restarting a Windows service, switching to the last reliable version, or even relaunching the application. If there is a high probability that such an action will bring the service back within minutes (this is open to interpretation), that is obviously the preferable route. A knowledgebase of Known Solutions would be a great asset at this point. Predefined scripts, as part of the KB, would be even better.
  2. Workaround. When the problem is not well understood and there is no guarantee that any remedial action will bring the service up, or that the problem will not reoccur within a short time, there should be a workaround solution. Such a solution might be temporary (such as reverting to the last working version, or database) and may include reduced functionality, but it will at least allow the customers to get back to work until the problem is resolved.
  3. Failover. Assuming redundancy across production systems (locations?) or a DR site is available, there is always the option of failing over to the backup service. This is not an easy decision and not without its costs, but if a workaround is not available and resolving the problem at the production site is going to take long, restoring service to your customers is paramount. 
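The choice between these three options can be captured as a simple decision helper; a sketch, with an illustrative order of preference:

```python
def restore_strategy(known_solution, workaround_available, failover_ready):
    """Pick a restore path per the three options above (order of preference is illustrative)."""
    if known_solution:
        return "resolve"     # e.g. restart a service, revert to the last good version
    if workaround_available:
        return "workaround"  # temporary, possibly with reduced functionality
    if failover_ready:
        return "failover"    # switch to the DR / redundant site
    return "escalate"        # no predefined path; keep investigating
```

Encoding even this trivial preference order in the KB removes one debate from the War Room while the clock is ticking.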

Throughout the whole process the Status Page should be updated, and obviously, when service has been restored to a satisfactory level, that should be communicated. It is up to the Incident Manager to verify that this is being done, and up to the CSR to perform it. The Incident Manager should not be assigned tasks herself; her responsibility is to make sure that everything is being documented and to calmly coordinate the activity in the War Room.

In the next post - the Epilogue - we will look at the events and activities that take place after service has been restored.

Thursday, August 07, 2014

SaaS Incident Management Best Practices - Act I (The STORM™ Series)

"It is not stress that kills us; it is our reaction to it" (Hans Selye)

This is the second posting of the Incident Management Best Practices. The first part covered the Prologue - the preparations that will improve your survival rate for the next incident that is just waiting to happen.
(This post first appeared on SaaS Addict earlier this week)
The main story contains two acts. This, the first one, deals with the reactive part of the incident. The next post, Act II, will deal with the proactive part of managing an incident.

How Soon?
  • How soon did you find out there was a problem?
  • How soon did you inform your customers?
  • How soon did you know the extent of the problem?
  • How soon did you know the impact?
  • How soon did you start handling the incident?
  • How soon did you get a workaround functioning?
  • How soon did you resolve it?
  • How soon were your customers informed about the resolution and causes?

Incident management is all about minimizing the damage, doing so in the shortest time possible and taking steps to improve the service going forward.
If you followed the practices recommended in the previous post, you will be much better prepared to deal with whatever hits you, be it as small as a weekly report not being produced or as bad as a full crash of the system.

ITIL Incident Management
This post does not aim to replace an ITIL certification course (STORM™ was inspired by ITIL and many of the terms are borrowed from ITIL/ITSM), but it follows, to an extent, the activities as they appear in much of the ITSM literature.
The idea behind this approach is to keep a level head, to ensure that all details are captured and to recover ASAP.

The stages are:
  • Detect
  • Record
  • Classify
  • Notify
  • Escalate
  • Investigate
  • Restore

Detection - Initiation of the process
When an incident is detected in the organization, it could originate from a monitoring system that alerts the staff just as it happens. In the best case scenario, the problem could be resolved before most customers are even aware of a problem. Alas, the incidents we are discussing usually don’t have this “lived happily ever after” ending.

In reality, many of the really bad problems are those you did not anticipate, and you are made aware of them by a customer calling in, or perhaps a member of the staff noticing something wrong while doing routine work or demoing the product. Often your monitoring system will alert you on an issue that is derived from the real problem, without detecting the problem itself. These cases are misleading and will result in a longer time to resolution.

Regardless of how the issue was detected, the process must begin at the same single starting point - the helpdesk or its equivalent in the company. It may be a dispatcher, or the support person who is on call at home.

Recording - Keeping track of events

When a nasty problem arises, it may be very low on your priority list to do clerical work, so one may be tempted to postpone this activity to a later stage.
It will become crucial later to have that information, so it is important that you pause for a minute and record it. You might scribble it on a piece of paper to be entered into the system later, or enter it directly, but capture the following:
  • Date and time of incident first recorded
  • Means of detection (customer call, internal monitoring alert, external monitoring, vendor call, internal staff, etc.)
  • Manifestation – how did the problem appear at first
Recording should continue throughout the incident. The Status Page allows capturing some of the information.
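A minimal sketch of capturing these initial details (field names are illustrative):

```python
from datetime import datetime, timezone

def open_incident(detected_via, manifestation):
    """Capture the minimum at detection time; details can be filled in later."""
    return {
        "first_recorded": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "detected_via": detected_via,   # customer call, monitoring alert, staff, ...
        "manifestation": manifestation, # how the problem appeared at first
        "log": [],                      # appended to throughout the incident
    }

incident = open_incident("internal monitoring alert",
                         "Login page returning HTTP 500")
```

Thirty seconds spent on this at the start saves the Post Mortem from reconstructing the opening minutes from memory.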

Classification – Determine the impact on customers / components and the urgency.
When the single datacenter is down, it will be rather simple to determine impact and urgency, but many a time, only some customers are affected, or only some functionality is missing.

It is highly important to determine the impact of the incident to allow a proper reaction. Say you have an affected production system in the UK, your product is mainly a 9-to-5 solution, and it is evening in Europe: the urgency will not be as high as if your East Coast production were down at 11:00 AM EST.
Perhaps the synchronized reporting database is not accessible. It is not as bad as if the transaction database was out of commission, basically shutting down your operation.
The classification will determine many factors in how you manage the incident and therefore it is paramount that you don’t get it wrong.
Remember the ‘customer/component mapping’ from the Prologue posting? This is a good time to utilize it to determine affected component or customers.
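A sketch of how that mapping can drive classification, with invented component and customer names and deliberately simplistic thresholds:

```python
# Illustrative component -> customers mapping, maintained per the Prologue
COMPONENT_CUSTOMERS = {
    "transaction-db": ["acme", "globex", "initech"],
    "reporting-db":   ["acme", "globex"],
}

def classify(component, business_hours):
    """Derive impact from the mapping and urgency from whether customers are working now."""
    affected = COMPONENT_CUSTOMERS.get(component, [])
    impact = "high" if len(affected) > 2 else "low"
    urgency = "high" if business_hours and affected else "low"
    return {"affected": affected, "impact": impact, "urgency": urgency}

c = classify("transaction-db", business_hours=True)
```

A real classification scheme would weigh far more factors (customer tiers, partial functionality, SLA clocks), but the shape is the same: mapping in, (impact, urgency) out.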

The next post, Act II, will deal with the more proactive aspects, namely Notification, Escalation, Investigation and Restoration.