Sunday, January 31, 2010

SLA Consequences to Service Operations

"What, me worry?" (Alfred E. Neuman)

In my previous post I discussed some basic concepts about SLAs, SLOs and penalties. As promised, I am addressing the ‘who cares?’ question.

From a Service Operations point of view, you may shrug your shoulders and claim that these are issues for the Legal and Finance departments. Most likely you were brought on board later in the game and never looked at an SLA until you were forced to do so.

As the person responsible for keeping all the services up and running, you may think it best to keep the SLA to a minimum. After all, a document containing vague language, with little commitment and liability, would be hard to wave in front of your face when the service levels drop.

I will argue that vagueness will work against you. A tough SLA will require the company to adhere to the high service levels it has committed to, and yes, pay the penalties for breaching these agreements. Keep in mind that if your service level drops one time too many, the legalese you are hiding behind will not save your butt when customers drop the service or simply do not renew.

I would take it even one step further. I advocate that Service Ops managers' bonuses be tied to achieving those SLOs that will keep a smile on the customers’ faces (typically uptime and response time, but in some cases there are other objectives that are crucial to the customers). The carrot and the stick should work nicely to ensure that you are doing your utmost to live up to the agreements.

Commitments
Another issue that concerns you (Service Operations) is that Sales is making commitments that you are supposed to keep, usually without your ever knowing about them. Operations needs to initiate a fact-finding effort to learn what has been promised. You also need to know what you are capable of providing. Everybody likes to state that they are five nines (99.999% uptime), but how many companies out there really are? You need to monitor and test your service over a substantial period of time before you commit to those numbers.
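
To put numbers on what an uptime commitment really means, here is a minimal sketch that converts an uptime percentage into the downtime it allows per year. The function name and the 365-day year are my own assumptions for illustration, not part of any particular SLA.

```python
# Illustrative arithmetic only: how much downtime does an uptime target allow?
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(uptime_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget (in minutes) implied by an uptime target."""
    return period_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Five nines leaves you roughly five and a quarter minutes a year, which is less than most planned maintenance windows. That is the number to keep in mind before anyone puts it in a contract.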

Another point in favor of having a good grasp of your SLA is that you, as a consumer of services, would be conscious of your requirements vis-à-vis your own service providers.

That will include the hosting services, your ISP, your communications provider, and whatever cloud services you are using. In my past positions as VP Service Operations I have been appalled by the contracts that my predecessors had signed with service providers. Some of them had no consequences for service level degradation. Others had ridiculous clauses such as 'for every hour of downtime, the credit would be for one hour of prorated service cost', which meant that there was no real penalty. Another contract stated that we could get out of the agreement if for three months in a row(!) the service provided was available for less than 75% of the time.
We would have been out of business by then.
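
A quick back-of-the-envelope sketch shows why those two clauses are toothless. The monthly fee of 10,000 and the 30-day month are assumed figures for illustration only; the contracts I mentioned had their own numbers.

```python
# Illustrative arithmetic only; the fee and month length are assumptions.
HOURS_PER_MONTH = 30 * 24  # assume a 30-day month: 720 hours


def prorated_credit(monthly_fee, downtime_hours, hours_in_month=HOURS_PER_MONTH):
    """Credit under an 'hour for hour' clause: you only get back what you paid for the lost time."""
    return monthly_fee * downtime_hours / hours_in_month


def downtime_hours_at(availability_percent, hours_in_month=HOURS_PER_MONTH):
    """Hours of downtime in a month at a given availability percentage."""
    return hours_in_month * (1 - availability_percent / 100)


if __name__ == "__main__":
    fee = 10_000  # hypothetical monthly fee
    print(f"Credit for a 1-hour outage: {prorated_credit(fee, 1):.2f}")
    print(f"Downtime per month at 75% availability: {downtime_hours_at(75):.0f} hours")
```

Under these assumptions, a one-hour outage earns a credit of about 13.89 on a 10,000 monthly bill, and 75% availability means 180 hours (seven and a half days) of downtime in every one of those three months.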

Where are those damn SLAs?
As we have seen, SLAs that are broad and meaningful will be complex. Add to that various service levels such as Standard, Gold and Platinum and the fact that some customers have negotiated special terms for themselves, and you are dealing with a mean, slimy problem.

To compound that problem, nine times out of ten these documents are sitting on someone’s laptop in PDF format, with perhaps a hard copy in a dusty folder in the cabinet below the espresso machine.

Imagine the exercise of figuring out if an SLA was breached for a particular customer, and if that breach carries a penalty.

I have painfully gone through that exercise too many times, and believe you me - I had much better things to attend to following a service outage. The process was extremely slow: finding the various documents, looking up the terms and comparing the events against them.

Then a calculation was needed as to how much credit was due. And all this was done for a single customer. Multiply that by the number of customers that may have been affected and you have just wasted many good hours of Solitaire.
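
Once the terms are in a structured form, the exercise above shrinks to a few lines. The following is a minimal sketch, not a definitive implementation: the record fields, the fees and the flat "percentage of monthly fee" credit rule are all hypothetical, since real SLAs define their own penalty schedules.

```python
from dataclasses import dataclass

MINUTES_PER_MONTH = 30 * 24 * 60  # assume a 30-day measurement period


@dataclass
class CustomerSla:
    name: str
    uptime_target: float   # committed uptime, in percent
    monthly_fee: float     # hypothetical figure
    credit_percent: float  # hypothetical rule: percent of the monthly fee credited on breach


def measured_uptime(downtime_minutes, period_minutes=MINUTES_PER_MONTH):
    """Uptime percentage actually delivered over the period."""
    return 100 * (1 - downtime_minutes / period_minutes)


def credit_due(sla, downtime_minutes):
    """Return the credit owed to one customer, or 0.0 if the SLO was met."""
    if measured_uptime(downtime_minutes) >= sla.uptime_target:
        return 0.0
    return sla.monthly_fee * sla.credit_percent / 100


if __name__ == "__main__":
    customers = [
        CustomerSla("Customer A", uptime_target=99.9, monthly_fee=5_000, credit_percent=10),
        CustomerSla("Customer B", uptime_target=99.99, monthly_fee=10_000, credit_percent=10),
    ]
    outage_minutes = 30  # a single 30-minute outage that hit everyone
    for sla in customers:
        print(sla.name, credit_due(sla, outage_minutes))
```

With these made-up numbers, the same 30-minute outage leaves the 99.9% customer within its commitment and breaches the 99.99% one. That is exactly the kind of per-customer difference that is painful to work out by hand.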

SLA Management Tools
There are multiple tools out there (some are offered as SaaS) to manage your SLAs. Many of them provide a full cycle of defining SLOs, creating SLAs, generating the documents, monitoring performance against obligations, computing compensation and generating reports. I have not used any of them (although I used to work at an SLM ISV), so I am not about to promote any single one, but there are very slick solutions available.

If you are at an early stage, it would be a hard sell to justify to management that you need to start paying for a service that possibly no one in the company comprehends.

Typically, when a SaaS company starts out, it has very simple, non-binding, fixed SLAs, so very little attention is paid to this aspect of the business.

As with any aspect of SaaS Service Operations, scalability issues hit you when you least expect them.

As most (all?) SaaS companies do not start with Service Level Management software, by the time it becomes a burden they will have many dozens, or even hundreds, of such SLAs. The effort of converting them to an automated system is daunting.

Therefore, you can start by structuring your existing and future SLAs in a simple Excel spreadsheet or a database, so that they are easily accessible and comparable.

A typical SLA would be stored in a table such as the one below.
The values for the various SLOs in the table were automatically populated from the definitions in the pre-defined Platinum and Gold tables (which state the default values for these SLAs). They may be overridden by specific values, following negotiations for a particular customer.


Cust.                        Calia             Google
Cust ID                      123               213
SLA                          Gold              Platinum
Uptime (%)                   99.9              99.99
Response time                under 6 sec       under 4 sec
Support Response time        2 hours           30 min
Support Avail.               12x6              24x7
Major Outage Resolution      1 hour            30 min
Partial Outage Resolution    4 hours           2 hours
Minor Outage Resolution      12 hours          6 hours
Maint. Notification          10 days           2 weeks
FTP                          12 hours          6 hours
Outage Notif.                Email, 1 hour     Email + call, 30 min
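
The "defaults plus overrides" idea behind the table can be sketched in a few lines. This is only an illustration under my own assumptions: the field names are invented, only a handful of the SLOs are included, and the negotiated 3-second response time for the second customer is a made-up override, not a value from the table.

```python
# Default SLO values per service level (a subset of the table, for illustration).
TIER_DEFAULTS = {
    "Gold": {
        "uptime_percent": 99.9,
        "response_time_sec": 6,
        "support_response": "2 hours",
        "support_availability": "12x6",
    },
    "Platinum": {
        "uptime_percent": 99.99,
        "response_time_sec": 4,
        "support_response": "30 min",
        "support_availability": "24x7",
    },
}


def build_sla(customer, cust_id, tier, overrides=None):
    """Start from the tier defaults, then apply any customer-specific negotiated terms."""
    record = {"customer": customer, "cust_id": cust_id, "tier": tier}
    record.update(TIER_DEFAULTS[tier])
    record.update(overrides or {})
    return record


if __name__ == "__main__":
    calia = build_sla("Calia", 123, "Gold")
    # Hypothetical negotiated term: a tighter response time than the Platinum default.
    google = build_sla("Google", 213, "Platinum", overrides={"response_time_sec": 3})
    for sla in (calia, google):
        print(sla)
```

Keeping the defaults in one place means that a change to the Gold or Platinum definition flows to every customer who did not negotiate something different, and the exceptions stay visible as explicit overrides.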


In the book I will elaborate on the structures and the tools and how to automate the compensation computations.