Thursday, November 12, 2009

SLA Management for SaaS

“God does not ask about our ability, but our availability.” (Source unknown)

(Yet another chapter in the book - keep the feedback coming!)

As the second ‘S’ of SaaS indicates, the on-demand company is all about providing a service and therefore one would expect Service Level Agreements to be well defined and understood in this industry, but the facts tell another story. Few SaaS companies pay much attention to the SLAs, few companies really invest in it and most customers are quite clueless about it as well.

SLAs are tricky. Every SaaS provider is supposed to adhere to its service level commitments but on the whole, it is a document that most providers tend to keep out of the limelight and out of the conversation with customers. Judging from my experience, many SaaS companies use a single, non-abiding, standard SLA for all customers, keeping to a minimum their commitments and consequences.

An SLA, as its name suggests, is an agreement between the service provider and the consumers, consisting of sections regarding the various commitments to service levels that will be matched or exceeded.
Each section is defined as a Service Level Objective (SLO).

A typical SaaS SLA should have the following SLOs:
  • Service Availability – define the availability of the service represented in percentage (e.g. 99.95% uptime)
  • System Response Time – define response time of various transactions represented in seconds. (e.g. login should not take more than 9 seconds)
  • Customer Service Response Time – a response on customer enquiries should take no more than an allotted time for various services (e.g. enabling a service for a new group should take less than two business days)
  • Customer Service Availability – hours of availability of customer service represented in a ‘hours per day’ notation. (e.g. 11X5 for regular customers, 24X7 for platinum customers)
  • Service Outage Resolution Time – the times it takes to restore a service after an outage has been reported. Represented in minutes and hours (e.g. 30 minutes for a full system outage)
  • Failover Window For Disaster Recovery - how long will it take to restore the service in a disaster recovery site, if disaster disables the main datacenter.
  • Reclaiming Customer Data – a commitment to transfer all (agreed) data in an agreed format in case the customer leaves the service.
  • Maintenance Notification – the advance notice that the provider will notify customers of planned service outages, represented in days. (e.g. a planned downtime that will take more than one hour requires 10 business days notification)
  • Proactive Service Outage Notification - the time it takes for the provider to inform the customer that there are service issues, represented in minutes.
  • RFO (Reason for Outage) – a report to customers following a service outage explaining the circumstance, the incident and steps taken to remedy the problem. (For more information see the chapter on Incident Management). Some customers require an RFO automatically; in some SLAs it is written that an RFO will be generated only following a specific customer request. Usually the company commits to three business days following the service disruption.
Note the emphasis on should when referring to the SLOs of the document. The SLA provided by most on-demand companies consists of two or three paragraphs at most, regarding uptime, customer service availability and perhaps another one of the items above.
Many providers have additional services such as daily reports, daily data aggregations, or FTP services. Each one of these services merits an SLO that should be part of the document.

Some SLOs override others. In the example of an service outage, the Availability SLO takes precedence over the Response Time SLO, as you would not expect the performance of the system to be up to par when the system is down. On the other hand, this will kick start other SLOs such as Outage Notification, Resolution Time and Support Response Time.

Customer Expectations
Not all SaaS companies are created equal. They will vary by maturity, by the vertical they are serving, by the company size they cater for and, of course, by the type of application.
Some applications are core and some are peripheral. Some applications are used around the clock, like metering or call centers and the customers have zero tolerance for downtime. Other applications are rarely used outside of office hours, (e.g. payroll, talent management) and if the system is down, the price is a handful of irritated end-users that will need to take a coffee break earlier than they planned.
Larger customers tend to have more rigorous demands while lower paying customers will usually be more tolerant of the system’s performance and support availability.
Therefore, your SLA should reflect the relative position of your service along the following three vectors:
  1. Customer size (reflecting subscription [potential] size)
  2. Core vs. periphery
  3. Downtime tolerance
So if you are providing a mission critical application to a large customer, whose downtime will cost the customer real dollars, your SLA should be taken very seriously.

Service Level Breaches and Penalties
We have seen the promises that come with the SLAs, but many of these agreements fail to state the consequences to the provider of not meeting the terms.
Each SLO should also define the penalties for breaching the service level commitment.
Penalties are typically specified as a prorated credit for the following month’s subscription fees.
From the customers’ point of view, the penalties should not be flat rated but increase as the service deteriorates, so that the second outage will carry a heavier penalty than the first outage. It is rare that customers insist on this point but those that do will need to negotiate these terms separately.

There is typically a maximum. It is unusual that accumulated penalties will top the monthly subscription costs. There is a catch here. As an extreme example, if your service was down for the duration of the whole month, the customer will be exempt from paying a full month’s service fee – but this is ridiculous of course. The damage to you customers is typically orders of magnitude higher than the subscription costs.
Many SaaS customers commit up front to a year or more of service, for a reduced subscription price. A good SLA will include a section that allows the customer to breach the extended commitment if the provider failed to adhere to the service levels for, say, three consecutive months.

The next chapter will outline what all of this means to the Service Operations group and why should you care about issues that initially seem to be in the domain of Sales, Legal and Finance.

Sunday, November 08, 2009

Inter-department Communications

(Yet another chapter in my upcoming book on SaaS Service Operations - Your feedback has been great so far; thanx and keep it coming.)

"The Problem with Communication is the illusion that it has been accomplished" - George Bernard Shaw

While it is true of any institution, communications between the various silos of the organization is particularly vital for the successful operations of a SaaS company.

The reason are that things happen much faster in an on-demand company, customers are in constant contact with the company and expectations are high for a fast turn around.

At a product company, when bad things happen to the application, nine times out of ten, the software company doesn’t even know about it, and the customer’s IT deals with it. The end user is rarely in touch with the product provider. The product salespeople tend to ‘shoot and forget’ once the commission has been paid. If things go bad, the customer can mostly blame itself for not deploying or maintaining the software correctly or for not doing its due diligence.

Multiple channel interaction
At a service company, on the other hand, a typical customer will interact through multiple channels continuously. The CIO may have a direct line to the SaaS CEO. The IT department may be in touch with professional services, and managers of the service on the customer end could be speaking with the Program Management group. Members of the Operations group will inevitably be in touch with supervisors or IT managers on the customer side, and Sales will have developed personal relationships with managers on the customer’s side, as they nurture the relationship to expand the sales in-house. And, of course, the end users might be in daily contact with Customer Support.

Customers, naturally, will be irritated when things aren’t going smoothly regarding any one of multiple scenarios. It may concern a delayed service initialization, an undelivered bug fix, an incomplete customization, an unsatisfactory report, or (ouch) a service outage. Part of the allure of on-demand service is a much faster turn around time in every aspect. The customers believe it and expect it.
Imagine the customer’s frustration when they call in any one of their contacts within the company to inquire about unresolved issues, and that person has no idea what they are talking about.

Disconnect between the groups
Typically, a SaaS company will be using a CRM that serves Sales and Customer Service. In many organizations the Sales view is radically different from the Support view and information available to one is not available to the other.
It is rare that other members have access to the CRM. Operations, Engineering, Professional Services and Program Management keep their own records in different systems for various reasons and are not trained in using a CRM.
Not surprisingly, the different silos do not have much knowledge of what each department is doing, and I have seen continuous tension between various groups and quite a lot of finger pointing when bad things happen.
It is also typical to see a startup company, where everybody occupies a single open space office, yet where so little communication takes place between the groups and political affiliations begin to form.
Resources are always limited and the demands are constantly growing; how does one prioritize the tasks and attention to a particular customer?

Service Outage and Communication
To illustrate through an acute, but none too rare, example: Many a time I had experienced a service outage that, for obvious reasons, took everybody’s focus and energy. A couple of offices down the hall sat the Sales team and across the continent were various regional Sales reps. They were not informed of the outage since they play no role in detecting, classifying or resolving the issue, and all those that knew about it were busy trying to fix it, or taking customer calls. Often the customers, especially the senior members who have established a close relationship with the sales reps, would call Sales or Program Management immediately asking for updates. The uninitiated sales rep would answer that they are not aware of any outage and perhaps the problem is local to the customer (This would usually trigger a nasty remark about the incompetency of the provider). The experienced sales rep would mutter something in embarrassment and then storm over to the Ops group demanding an explanation why, once again, Sales was not notified of the outage. Not only does the company look bad, but it also raises unnecessary tension between the groups
(This issue will be addressed in the chapter on Incident Management)

Recurring Mandated Meetings
Inter-department communication is the answer. If the managers of the different departments talk to each other on a regular and formal basis, issues can be addressed before they get out of control, plans can be communicated and a deeper understanding of the challenges of each department can be better understood.
Since Operations is at the center of it all at the end of the day, and since Operations will take the blame for whatever incident that occurs, VP Ops group should initiate these meetings. This initiative and meetings will also serve as an important PR tool for the service operations group.
Following are the inter-department sessions that should be standard in a SaaS organization to improve communication and visibility and to help prioritize tasks and address issues before they boil over.

Name: Daily Operations Sync
Frequency: Daily (15-20 min)
Suggested Time: Late afternoon
Participants: Operations, Support, Program Mgmt
Agenda: Burning issues, Service outages, Planned maintenance, Delayed deliveries, Staffing

Name: Customer Success
Frequency: Weekly
Suggested Time: Monday
Participants: Sales, Program Mgmt, Support, Operations, Professional Services, R&D
Agenda: Customer Success Score sheet, Updates, Delays, Priorities. Address Red and Orange flags

Name: Operations-Engineering Sync
Frequency: Bi-Weekly
Suggested Time: Anytime
Participants: Operations, Engineering, QA
Agenda: Requirements, Releases, Known issues, Bugs, Dev/staging environment

Name: Company Fridays
Frequency: Bi-Weekly
Suggested Time: Friday Afternoon
Participants: All employees + food & beer
Agenda: Announcements, updates and department presentations

Name: SPOF Analysis
Frequency: Quarterly
Suggested Time: Anytime
Participants: Operations, Engineering, QA, Product, Support
Agenda: Single Point of  Failure Analysis(In the book, these meeting would be discussed in more detail)

I cannot emphasize enough the importance of these meetings. Not only do they facilitate the smooth operations of the company, but they also foster better relations between the company’s groups.

Monday, November 02, 2009

Introduction to the book on SaaS Service Operations

"You can't handle the truth!" (Col. Jessep in 'A Few Good Men')

(Note: This article is part of the STORM™ methodology)

As I have mentioned in a previous post, I am working my way through writing a book on SaaS Service Operations. Using the web as a collaborative tool, I have decided to share my work, bit by bit (three chapters, so far) to test it within the community and get live feedback from those who matter, potentially those that would read and recommend it.
Following is the (draft) introduction chapter. I would dearly appreciate your feedback on content, style, typos, grammar and whether you might find such a book an interesting read.
My initial thoughts about the title are along the lines of 'Survival Guide' or 'A day in a SaaS Emergency Room'.
I am not fishing for compliments - it will beat the purpose, and yes, I can handle the truth.
Many thanx,

Introduction – or Why am I Writing This Book.

Well, someone has to write it. Numerous words have been exhausted over the years on matters SaaS, but I have seen very little being written about SaaS Service Operations, and there are no books on this subject that I am aware of.

As SaaS is becoming mainstream, it has also become the most visible and mature service in the Cloud stack. Consumer expectations have elevated such that they are demanding fast response times and a service that delivers on the availability slogan of ‘anytime-anywhere’. These expectations do not refer only to the application; but also it is expected of the customer and professional services as well. SaaS companies often excel when it relates to the first ‘S’ of SaaS, i.e. Software, but fair quite poorly with regards to the second ‘S’ – Service.

What started as an experiment of the few and the brave, will soon become the major force in the software market, and what will differentiate one company from the rest is no longer the on-demand allure or the feature set, but the level of service it provides.

I am a war veteran in this respect and have many scars to parade. There are probably very few mistakes that I have not made. Being a descendant of Homo Sapiens Sapiens, I like to think of myself as one who has learned from his mistakes and taken steps to remedy them.

‘Operational Fatigue’ is a term I coined after the umpteenth time I was awoken in the wee hours of the morning to handle an outage that occurred yet once again, after having seemingly fixed the problem two weeks prior. I could have just as well created this phrase after the two hour scheduled downtime to upgrade the service. The upgrade turned into a nine hour nightmare that was finally resolved (a couple of minutes before our major customers started their workday) by some engineering heroics. As always, these were followed by heart wrenching phone calls to the CEOs of our customers to explain what went wrong (again) and why it would not repeat.
No wonder I grind my teeth at night.

Throughout my years of practice in this space I have discovered a number of traits across the industry:
  • Most SaaS companies are structured and behave in a similar fashion
  • Most SaaS companies lack the discipline, the tools and the practices to provide an efficient and effective service operation
  • Most SaaS companies, therefore, end up paying the price of not meeting their SLAs, which leads to customer dissatisfaction, customer churn and ‘Operational Fatigue’
The intended audience for this book is whomever is responsible for the quality of customer service. That includes the CEO, the CTO, VP Engineering, VP-Director-Manager of Operations and VP-Director-Manager of customer service. All of these functions must work in unison to ensure a smooth operation both outwardly and internally.

This book is divided into four sections:
  1. The first section introduces concepts about SaaS, the evolution of the market and why the model is here to stay. Enough has been written about the subject so I will stick to some of my observations without going into a long dissertation.
  2. The second section contains insights on service operations in an SaaS company. It includes various posts published on my blog (‘Dani’s Perspective on SaaS’), over the past year. It discusses typical SaaS operations, discipline, transparency, outsourcing in the Cloud, metrics, inter-department communications, etc.
  3. The third section covers Operational Support Systems that might or might not be supported by the product. They include: Billing, On-Boarding, De-provisioning, Integration, Retention Policy, Communication and more.
  4. The final section is instructional and lays out the principles of my adaptation of ITIL for SaaS Service Operations™ . It explains what ITIL is and why I chose ITIL as a basis for defining the practices of running an efficient and effective service operation. It covers six practices that I have developed and refined throughout the years at various companies with whom I worked either as an employee or as a consultant.
By following the practices, following the workflows and deploying the tools outlined in this book, SaaS companies can instill the discipline needed to reap the benefits in a surprisingly short time.

It is not complicated, it is not expensive, nor is there sorcery involved - it only requires awareness and leadership.