Thursday, November 12, 2009

SLA Management for SaaS

“God does not ask about our ability, but our availability.” (Source unknown)

(Yet another chapter in the book - keep the feedback coming!)

As the second ‘S’ of SaaS indicates, the on-demand company is all about providing a service, so one would expect Service Level Agreements (SLAs) to be well defined and understood in this industry. The facts tell another story. Few SaaS companies pay much attention to their SLAs or really invest in them, and most customers are quite clueless about them as well.

SLAs are tricky. Every SaaS provider is supposed to adhere to its service level commitments, but on the whole the SLA is a document that most providers tend to keep out of the limelight and out of the conversation with customers. Judging from my experience, many SaaS companies use a single, non-binding, standard SLA for all customers, keeping their commitments, and the consequences of missing them, to a minimum.

An SLA, as its name suggests, is an agreement between the service provider and its customers. It consists of sections, each stating a service level commitment that will be met or exceeded.
Each section is defined as a Service Level Objective (SLO).

A typical SaaS SLA should have the following SLOs:
  • Service Availability – the availability of the service, expressed as a percentage (e.g. 99.95% uptime)
  • System Response Time – the response time of various transactions, expressed in seconds (e.g. login should take no more than 9 seconds)
  • Customer Service Response Time – the maximum time allotted for responding to customer enquiries, per type of request (e.g. enabling a service for a new group should take less than two business days)
  • Customer Service Availability – the hours during which customer service is available, expressed in an ‘hours X days’ notation (e.g. 11X5 for regular customers, 24X7 for platinum customers)
  • Service Outage Resolution Time – the time it takes to restore a service after an outage has been reported, expressed in minutes and hours (e.g. 30 minutes for a full system outage)
  • Failover Window For Disaster Recovery – how long it will take to restore the service at a disaster recovery site if a disaster disables the main datacenter
  • Reclaiming Customer Data – a commitment to transfer all (agreed) data in an agreed format in case the customer leaves the service
  • Maintenance Notification – how far in advance the provider will notify customers of planned service outages, expressed in days (e.g. a planned downtime of more than one hour requires 10 business days’ notice)
  • Proactive Service Outage Notification – the time it takes for the provider to inform the customer that there are service issues, expressed in minutes
  • RFO (Reason for Outage) – a report to customers following a service outage, explaining the circumstances, the incident and the steps taken to remedy the problem. (For more information see the chapter on Incident Management.) Some customers require an RFO automatically; other SLAs state that an RFO will be generated only following a specific customer request. Usually the company commits to delivering the RFO within three business days of the service disruption.
Note the emphasis on ‘should’ when introducing the SLOs above. The SLA provided by most on-demand companies consists of two or three paragraphs at most, covering uptime, customer service availability and perhaps one other item from the list.
Many providers have additional services such as daily reports, daily data aggregations, or FTP services. Each one of these services merits an SLO that should be part of the document.
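
To make the Service Availability example concrete, here is a minimal Python sketch that converts an uptime percentage into a downtime budget, and a measured downtime back into a percentage. The function names and the 30-day month are my own assumptions; a real SLA also has to state whether planned maintenance counts against the budget, which this sketch ignores.

```python
def allowed_downtime_minutes(availability_pct: float, days_in_period: int = 30) -> float:
    """Minutes of downtime an availability target permits in the period."""
    total_minutes = days_in_period * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def measured_availability_pct(downtime_minutes: float, days_in_period: int = 30) -> float:
    """Availability actually delivered, as a percentage of the period."""
    total_minutes = days_in_period * 24 * 60
    return 100 * (total_minutes - downtime_minutes) / total_minutes


if __name__ == "__main__":
    # A 99.95% target over an assumed 30-day month leaves about 21.6 minutes.
    print(round(allowed_downtime_minutes(99.95), 1))
    # A single 30-minute outage in that month already misses the target.
    print(measured_availability_pct(30) < 99.95)
```

Seen this way, a 99.95% commitment is used up by one half-hour outage, which is worth knowing before the number goes into the agreement.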

Some SLOs override others. In the event of a service outage, for example, the Availability SLO takes precedence over the Response Time SLO, as you would not expect the performance of the system to be up to par when the system is down. On the other hand, the outage will trigger other SLOs such as Outage Notification, Resolution Time and Support Response Time.

Customer Expectations
Not all SaaS companies are created equal. They will vary by maturity, by the vertical they are serving, by the company size they cater for and, of course, by the type of application.
Some applications are core and some are peripheral. Some applications are used around the clock, like metering or call centers, and their customers have zero tolerance for downtime. Other applications are rarely used outside of office hours (e.g. payroll, talent management), and if the system is down, the price is a handful of irritated end-users who will need to take their coffee break earlier than planned.
Larger customers tend to have more rigorous demands, while lower-paying customers will usually be more tolerant of the system’s performance and support availability.
Therefore, your SLA should reflect the relative position of your service along the following three vectors:
  1. Customer size (reflecting subscription [potential] size)
  2. Core vs. periphery
  3. Downtime tolerance
So if you are providing a mission-critical application to a large customer, for whom downtime will cost real dollars, your SLA should be taken very seriously.

Service Level Breaches and Penalties
We have seen the promises that come with the SLAs, but many of these agreements fail to state the consequences to the provider of not meeting the terms.
Each SLO should also define the penalties for breaching the service level commitment.
Penalties are typically specified as a prorated credit for the following month’s subscription fees.
From the customers’ point of view, the penalties should not be a flat rate but should increase as the service deteriorates, so that the second outage carries a heavier penalty than the first. Customers rarely insist on this point, and those that do will need to negotiate these terms separately.

There is typically a maximum: it is unusual for accumulated penalties to exceed the monthly subscription cost. There is a catch here. As an extreme example, if your service were down for the whole month, the customer would merely be exempt from paying that month’s service fee, which is ridiculous of course. The damage to your customers is typically orders of magnitude higher than the subscription costs.
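
The penalty rules described above (a prorated credit, heavier for each successive outage, capped at the monthly fee) can be sketched in a few lines of Python. The function name, the doubling factor and the 30-day month are hypothetical values chosen for illustration; actual rates are whatever the two parties negotiate.

```python
from typing import Sequence


def monthly_credit(monthly_fee: float,
                   outage_minutes: Sequence[float],
                   days_in_month: int = 30,
                   escalation: float = 2.0) -> float:
    """Credit against next month's fee for this month's outages.

    Each outage is prorated by its length. The weight grows with every
    successive outage (1x, 2x, 4x, ... with the assumed escalation of 2.0),
    and the total never exceeds the monthly fee.
    """
    fee_per_minute = monthly_fee / (days_in_month * 24 * 60)
    credit = 0.0
    for index, minutes in enumerate(outage_minutes):
        weight = escalation ** index  # later outages cost the provider more
        credit += fee_per_minute * minutes * weight
    return min(credit, monthly_fee)  # the usual cap: one month's fee


if __name__ == "__main__":
    # Two one-hour outages on a 4,320 fee: roughly 6 for the first, 12 for the second.
    print(monthly_credit(4320.0, [60, 60]))
    # A month-long outage is still capped at a single month's fee.
    print(monthly_credit(4320.0, [43200, 43200]))
```

The second call shows the catch in numbers: however bad the month, the provider's exposure under such a clause stops at one month's subscription.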
Many SaaS customers commit up front to a year or more of service in return for a reduced subscription price. A good SLA will include a section that allows the customer to exit the extended commitment if the provider fails to adhere to the service levels for, say, three consecutive months.
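
As a rough illustration of such an exit clause, the check below looks for a run of consecutive months in which measured availability fell short of the target. The function name, the 99.95% target and the three-month threshold are assumptions taken from the examples in this chapter, not a standard.

```python
from typing import Iterable


def may_terminate(monthly_availability_pct: Iterable[float],
                  target_pct: float = 99.95,
                  consecutive_months: int = 3) -> bool:
    """True if the target was missed for enough months in a row."""
    streak = 0
    for pct in monthly_availability_pct:
        # A month that meets the target resets the run of breaches.
        streak = streak + 1 if pct < target_pct else 0
        if streak >= consecutive_months:
            return True
    return False


if __name__ == "__main__":
    # Three misses in a row: the customer may walk away.
    print(may_terminate([99.99, 99.90, 99.80, 99.94]))
    # Three misses, but interrupted by a good month: the commitment stands.
    print(may_terminate([99.90, 99.99, 99.90, 99.90]))
```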

The next chapter will outline what all of this means to the Service Operations group and why you should care about issues that initially seem to be in the domain of Sales, Legal and Finance.

6 comments: said...

The second S = service! Nice article, we all can learn from.

Elaine Blechman said...

Unbelievable, I needed help with creating an SLA for a SAAS operation and Dani gave me all the guidance I needed. Elaine

Dani Shomron said...

Not so fast, Elaine.
The next chapter deals with the difficulties that arise from a 'good' SLA. You may think twice about following the guidance :)

GrandSLA said...

Good article. SLA is related to governing the relationship between the customer and the provider and improving alignment between business requirements and services. It is definitely the next big thing since most of the services are/will be on the cloud or outsourced. That is why we are building a company that will provide Service Level Governance as a service :)

Taylor said...

Web Hosting play an important role for your website to get a good response and user traffic that’s why you should give more attention while choosing hosting service for your website.

James Inderson said...

Nice post with great details. i really like it. Thanks for sharing.