Dani's Perspective on SaaS: SaaS Management Culture

Saturday, March 15, 2014

SaaS Customer Success – Best Practices (Part 2)

“There are two rules for success:
1. Never tell everything that you know” (Roger H. Lincoln)

Company Culture and Customer Success
In my previous post, I discussed perspectives that dealt with the technical and formal issues of improving customer success. It suffices to remember one fact – companies are more successful when their customers are more successful. In this article I will talk about a crucial facet of SaaS: company culture, which affects not only Customer Success but many other aspects of running a successful SaaS operation.
(I have written about Culture in a number of articles in the past – some posts are pointed out)

Multiple Touch Points
In the on-premise business model, the touch points with the customer moved slowly from one part of the organization to another: Sales and Pre-sales were engaged with the customer at the beginning; then Professional Services engaged with the customer until implementation was complete. Following that stage, the ISV did not interact much with the customer. There may have been a yearly status call; sometimes the customer inquired about bugs or perhaps sent a request for features that was pushed into a waiting list. (The product may even have been sold through a VAR, leading to an even larger disconnect).

In the SaaS model, at any given moment there could be multiple touch points with the customer: The end users may be speaking with Support, while the customer IT is talking with the Ops group; the customer business manager is engaged in a call with Sales, while the CIO is speaking with the SaaS CEO about the latest outage, and Finance on both sides are figuring out last month’s bill. Some SaaS companies assign Technical Account Managers to large customers, but that would not alleviate the problem if end users continue calling Support, or IT happens to be talking with the network manager in the Ops group. In any case, TAMs are only assigned to a few, large enterprise customers and this will not work out if the company has hundreds or thousands of active customers.
Mapping the touch points of the customer is therefore essential, as is making sure that everyone who might have contact with the customer is using the CRM (or equivalent - I know, it is not cheap).

Need for Speed
Everything switches to Fast-Forward in SaaS. The sales cycles are shortened by at least an order of magnitude. The release cycles are shorter through agile methodologies and DevOps technologies; the impact of bugs is immediate (the new version is released at night and the following morning thousands of users interact with the software) and the response from the customers is almost instantaneous. Customers expect a fast turnaround, and information about success or failure travels as fast as the latest Tweet. Numerous customers may be added in a single day through self-service, or might churn at the end of the month.

Therefore, SaaS companies must shift into high gear, especially if they come from the old, legacy, on-premise world, where the staff is used to much longer cycles in every aspect (I have been giving seminars at these old-world companies to emphasize that fact and push for cultural change). Quickness of response and of decision making, and proactivity, result from the elements discussed below: Transparency, Communications, and Openness.

Transparency
I have written in the past about Transparency and it is still a valid article. I do not wish to recycle old material, but suffice it to say that customers will not only appreciate the openness and honesty as a virtue; they can also benefit from the knowledge while helping your company improve. When service is down, or slow, or erroneous, covering it up will not make your customers feel more confident. If you discuss the issues with them (s**t happens, you know), you may work out together how to deal with service degradation in the future; that will help them become more successful.

When you promise a certain feature that you know will not be delivered on time or as expected, you may gain a few weeks of industrial peace, but that will clearly not make your customers more successful. If, instead, you openly share the status of your release, your customers would be better prepared. Working together, you may formulate an alternative solution that will allow the customer to succeed while reinforcing the ties between the companies.

Communications and Openness
The fact that almost anybody in the company might be in touch with someone on the customer’s end and that everything has sped up means that inter-department communication is crucial and internal transparency is a must. There is no time for ego games, or gaining an edge by holding important information close to one’s chest.
Transparency within the company is crucial (there usually is a correlation between external and internal transparency). If you promote a culture of openness and acceptance of mistakes, problems will be reported earlier and dealt with using the maximum available information, therefore reducing their negative impact. Both the company and its customers will benefit.
(Recommendations for periodic meetings can be found in my article on Inter Department Communications)

Service Level Agreements
The Support staff usually take the first hit when bad things happen, but every employee must understand the impact of service degradation. Make sure people empathize with what it is like not to be able to complete the task that the end user is evaluated on.
SLAs are usually regarded as a necessary evil, forced upon the SaaS provider by a standard expectation in the market, and are usually reduced to the minimal accepted level (read more about SaaS SLAs).
But imagine a company that takes its SLAs seriously, that thinks about the Service Levels that it provides its customers, that is proud of what it can offer to the end users and is open about it.
Imagine that someone within that company actually reads the SLA and thinks about how to improve the Service Levels, not only the dead horse of the Availability section. Imagine what that might do to enhance customer success!

A Point about Priorities
Some companies emphasize that the ‘Customer is Always right’ and keeping the customer satisfied is the top priority of the establishment. With other companies, making the shareholders happy is the major driving force.

The following observation is based on my subjective experience and understanding and has no scientific or statistical corroboration:
If you manage to make your Employees Proud, Dedicated and Happy, they will make your Customers Paying, Loyal and Happy, which will make your Shareholders Fat, Supportive and Happy.

This is why company culture is so crucial. Getting people excited and unified behind a common goal of excellence and customer success will result in Excellence and Customer Success.

Thursday, December 01, 2011

The Black Swan Event in SaaS Operations

"I find that the harder I work the more luck I seem to have." - Thomas Jefferson

Nassim Taleb’s eye-opening books 'Black Swan' and (to a lesser extent) 'Fooled by Randomness' discuss the rare, unexpected and almost impossible to predict events that have a major impact (and usually tend to be disastrous). He calls these events Black Swan events, and gives examples such as World War I, stock market crashes, the PC, the Internet, and 9/11.
Interestingly enough, all the Black Swan events are easily rationalized after the event, in hindsight.

The Black Swan analogy is borrowed from the notion that while one can induce a hypothesis from observational data - e.g. all swans are white - one cannot prove that hypothesis, since after observing numerous white swans, it takes only a single black swan to refute it. Karl Popper, the science philosopher, made that notion popular in his discussion of the Scientific Method (The Logic of Scientific Discovery).

SaaS and the Black Swan
Have you ever lost your database only to find out that the backup files were deleted the previous day? Have you ever hit a major problem with a component in the system, only to find out that the support contract expired last month?

My own experience and the experience of the numerous companies I have worked with have taught me that the next Black Swan is just around the corner, lurking in the dark, and will hit you when you least expect it. Heck, that’s the nature of a Black Swan.

The systems we deal with are so complex and interdependent that one could never analyze (let alone predict) the interconnections that govern the behavior of the services we offer. Luckily, statistics are on our side, so that most SaaS applications are stable most of the time and on average, we can predict the behavior over time. But that is just what creates a Black Swan – we observe a certain behavior for so long, that we tend to accept it as a scientific fact; until it bites us in the behind.

Running a complex SaaS operation with dozens (or hundreds) of servers, network boxes, configuration files, erratic software and all the dependencies we have on our infrastructure providers (power, internet, hardware, communications) is like driving a high speed car on a congested highway, blindfolded. We have no appreciation of how much Lady Luck is involved.

Keep in mind that the longer good things happen, the harder the Black Swan event hits - remember the dot-com and the real-estate bubbles; most of us are still licking the wounds.

The Butterfly Effect
All it takes is an overflowing log file that incapacitates the disk and brings the system down. Or a minor, forgotten gadget installed on one of the servers whose license has expired. A pipeline of requests starts filling up and there goes the system.
How about setting up an image of a new VM whose IP and DNS IP were reversed by mistake? Put it in production and slowly the wrong DNS IP starts propagating in the system. After a while the servers are not communicating with each other and the system freezes.
These tend to be catastrophic events, since they are so hard to detect and resolve. Many times, restarting the whole system is the chosen quick solution, praying that the problem will resolve itself. But in these cases, the system will behave just as badly, and by the time one realizes what is happening, major damage to the customers and your brand has been done.
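
The overflowing log file is the kind of failure that a very small check can flag early. Below is a minimal sketch, assuming a 90% alert threshold and the function names `disk_alert` and `check_path` (both are illustrative choices, not taken from any particular monitoring tool):

```python
import shutil


def disk_alert(used_bytes: int, total_bytes: int, threshold: float = 0.90) -> bool:
    """Return True when the used fraction of a disk reaches the threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes >= threshold


def check_path(path: str = "/", threshold: float = 0.90) -> bool:
    """Check the disk that holds `path` (the threshold is an assumed default)."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free
    return disk_alert(usage.used, usage.total, threshold)
```

Run periodically against the volume that holds the logs, a check like this gives the Event Management practice something to react to before the disk is full.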

Words of Wisdom
Do not despair. I am not suggesting that since a Black Swan event is unpredictable, there’s nothing you can do about it. The opposite is true.
The first step is to internalize the fact that it will occur, as the famous quote goes “s**t happens”.

“Prepare for Failure” is my motto. Take into account that at any given moment something might break.

A number of practices should be implemented early on:
Change Management: To ensure that the events are indeed rare and that one may recover quickly with the knowledge of what went wrong.

Event Management: To be able to detect early on what is hitting the fan, and respond to it.

Availability Management: Analyze your Single Points of Failure and impact of component failure. Build your backups, your DRP and practice recovery.

Incident Management: Make sure you cover these practices: Detection, Recording, Classification, Notification, Escalation, Investigation, Diagnosis, Restoration and Closure.

The Wise and the Smart ones
I was approached by a few (emphasis on few) CEOs and COOs who felt uncomfortable about the fact that everything was going smoothly. Some were on the verge of fast growth and wanted to assure themselves that they were better prepared to hit the highway. Others had a feeling in their bones that “too good for too long” was a recipe for disaster, even if they did not read Nassim Taleb’s book.

But many potential customers I spoke with assured me that they really do not need my services since they are doing very well, thank you. Some are still doing very well and others had a large hat to eat and many letters of regret to write their customers.

Monday, May 30, 2011

Organizational Culture and Company DNA – What Makes a Successful SaaS Company.

“No people come into possession of a culture without having paid a heavy price for it” (James A. Baldwin)

It was always clear to me that success with SaaS was not about technology, but about execution. This week I got a clear reminder.

The SaaS CEO Forum
For the past few months I have been running a SaaS CEO Forum that meets every six weeks or so, each time hosted at a different SaaS company by a member of the Forum. The forum consists of a select group of successful SaaS companies that have been selling their service for a number of years and are dealing with issues such as growth, operations, sales, marketing and customer satisfaction. Every meeting has a theme such as running a fabulous inside sales team, SaaS Service Operations, knowledge-as-a-service, etc. Yesterday the forum was hosted by Avinoam Nowogrodski, founder and CEO of Clarizen, a fast-growing market leader in collaborative project management. Beyond the very interesting review of the company’s clockwork marketing and sales operation, Avinoam gave a presentation on what makes a SaaS company successful.

Successful SaaS Company
No one can argue with the success of Clarizen, having grown 400% year over year, with an ever-growing community of happy customers, so it was worthwhile listening to Avinoam’s credo.

Clarizen’s CEO was talking about managing a company where execution is paramount and where Customer Success always comes first. He has been careful selecting an executive team that he regards as ‘A’ players and nurturing a culture of Respect, Modesty, Openness, and Accountability.

Among the factors Avinoam mentioned were “checking your ego at the door”, willingness to take risks, and acceptance of mistakes as part of the ever-changing environment and conditions. He also listed Delegating Authority, being Hands-on in your domain, Transparency and, above all, Measure, Measure, Measure every aspect of the company’s operation: sales, marketing, conversion between each stage of the pipeline, responsiveness, costs.

The SaaS Angle – Fast Forward
While I agree wholeheartedly with all the above criteria being critical for a successful company, I asked for the SaaS angle. The answer was obvious even before I finished asking the question: “Pace”. In a SaaS company everything is fast-forwarded. The cycles in almost every aspect shorten and therefore the margins of error are ever so narrow. In a company that caters for the SMB in a low-touch model, the sales cycles are measured in days, not quarters, and the software release cycles are shortened to weeks. The discovery of bugs usually occurs within hours (or minutes) after a new version is introduced. Hence, Openness and Transparency are paramount and there is no time for ego games or controlling vital information (I have written about these aspects in the past: Transparency & Communications).

To reiterate – in a fast-paced, ever-changing, 24X7 environment, the need to feel the operational pulse, the need for responsiveness and open communications, and the need for a listening ability and accountability are vital for success. Time spent on BS, on analysis-paralysis, on political games, on territorial squabbles, is time spent away from making sure your customers are successful, and that will be evident on the company’s bottom line and eventually, on the quarterly bonuses.

Friday, May 21, 2010

The SaaS VP Operations as Product Manager

SaaS Operations Support Systems

“'Tis not enough to help the feeble up, but to support them after” - William Shakespeare

Every SaaS company needs to deal with numerous functions that are not necessarily part of the technological stack that originally came with the application, namely the Operations Support Systems.

If a SaaS start-up had unlimited time and funds to plan and build the perfect solution, they would probably all be in the Caribbean islands doing what people with unlimited time and funds do.

Just Do It!
Because of the nature of monetizing SaaS, companies try to get to market as soon as possible, getting subscription fees streaming in. Sometimes, they even launch with a half-baked solution that will provide added value to the customers at the expense of future operational headaches.
The logic behind this is that dealing with a scalability problem is a good thing. In other words - who doesn’t want to reach the stage when too many customers are taxing the team? We’ll deal with that when it becomes a problem.

These days, numerous PaaS offerings or SaaS frameworks offer built-in operational support systems functionality, but many of the necessary features are not supported.
Most of the SaaS applications in the market were not built with these frameworks, for various reasons. The most prevalent are that the frameworks were not available a few years ago, and that engineers have a tendency to build everything from scratch, or to use frameworks (e.g. LAMP, Java, .Net, RoR) that they are familiar with.

When the typical SaaS service is launched, it lacks functionality that would support the scalability of the service operation. It is left to the Service Operations team to deal with all the 'maturity' functionality that the product lacks.

Let us examine some of the Operations Support Systems functions that are typically not handled by the application.

On-boarding new customers
One would expect most SaaS systems to have automatic provisioning. While that is true in many cases, a lot of the systems allow the customer to define users, but creating a new customer entity is left to the Support or Operations team. That may require generating a new database or schema, or setting up storage, etc. If the company is adding one customer a week, it may be manageable, but at a higher rate this is a taxing and error-prone job.

De-provisioning
While it is expected that some level of automatic provisioning is provided with the product, very few SaaS applications provide a simple (never mind automatic) mechanism for removing existing customers. Very few SaaS developers design with the prospect of losing a customer in mind. Beyond the task of removing dependencies from the database, there are issues of releasing resources, and removing customizations.

Billing
Such a basic function (one would think, for a company that lives or dies by subscriptions) is typically lacking from most SaaS infrastructures. While the situation is improving dramatically with the advent of PaaS and SaaS development frameworks, many systems still start out with Excel sheets and a lot of manual work. Customer data must be extracted by some ad-hoc solution from the application database or is duplicated in the CRM, which causes endless synchronization errors. Ad-hoc solutions are implemented (either in-house or SaaS billing solutions) as the billing becomes ever more complex, but until that occurs the brunt of the work falls upon the Operations team.
Metering is a whole new layer of complexity, if the billing is more complex than a fixed amount per seat per month.

Retention Policy
Some SaaS companies plan ahead for resource consumption by their customers, but many start dealing with the issue only when they begin running out of storage, or when their storage costs are starting to hurt. Most customers want to retain their data forever on the provider’s disks, but that is impossible. So a retention policy must be defined and followed (such as delete files that are older than X months or larger than Y gigabytes).
The problem is that the application does not support that, so manual work is required or ad-hoc solutions have to be built around the product.
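
A rule such as "older than X months or larger than Y gigabytes" is simple to express once the file metadata is available. The sketch below is only an illustration: the `StoredFile` record and the `files_to_purge` function are hypothetical names, and the age limit is given in days rather than months to keep the date arithmetic simple.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StoredFile:
    """Hypothetical metadata record for one customer file."""
    name: str
    size_bytes: int
    modified: datetime


def files_to_purge(files, now, max_age_days, max_size_bytes):
    """Select files that are older than the age limit OR larger than the size limit."""
    cutoff = now - timedelta(days=max_age_days)
    return [f for f in files if f.modified < cutoff or f.size_bytes > max_size_bytes]
```

Keeping the selection separate from the actual deletion makes it possible to review (or report to the customer) what the policy would remove before anything is touched.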

Failover and Backup
Automatic recovery and failover are rarely built into the initial solution and are usually managed by the Operations team via building complex solutions with networking boxes.
This is true for backup and recovery mechanisms as well. The Operations team frequently has to build mechanisms around the production to support backup and recovery and most often they require much manual labor.

Application Monitoring
Well-designed software is packed with instrumentation that is easy to turn on and off and easy to monitor and interpret. Not all SaaS systems have that built-in capability and the Operations team has to create an application-specific monitoring infrastructure that can detect and react quickly to service degradation.

Seamless Upgrade
Upgrades are the recurring nightmare of any Operations team, especially on applications that have a demanding uptime SLA. Few SaaS systems are designed with that goal in mind. It usually takes a level of maturity and a sizeable customer base to get Engineering to consider revising the code to allow some sub-systems a no-downtime upgrade.
The problem is that it usually requires major revisions of the code if the system was not designed a-priori to handle a silent upgrade.

SLA Management
Whether you are using automated SLM or not (chances are you are not), you need to take into consideration the various aspects of your service that need to be metered, tracked and compared against a set of Service Level Objectives.

End User Broadcasting
Sometimes it is necessary to communicate with all of your users, or a subgroup of them, in real time. What better option is there than to pop up a message on the end-user’s browser that you can compose on the spot or pull from a list of canned messages? The Operations team needs a solution integrated with the product to be able to do that.

Operations Console
And to tie it all up, a separate application that controls all the operational aspects of the service is needed. Included should be: provisioning/de-provisioning customers, password management, real-time login view of current users, real-time view of application usage, customer communication console, production environment control, etc.

There are many other operational features that are typically not found in a SaaS application as it leaves the factory floor. Just to mention a few: Integration, Reporting engine and reporting Database, Security, Scale-up and Scale-out capabilities, Sandbox, Status Page.

Development and Product management Experience Required

The VP Operations (or whatever title the job carries) is required to be a product manager of sorts, and a background in software development is almost a must. The Operations manager should be highly involved in the product roadmap and insist on having a say in defining future releases. There will be a contention between investing in Serviceability versus Functionality. Since the paying customers require more functionality, it is usually an uphill battle to gain service upgrades to the product.

While building an organic set of solutions into the product may practically take years, Operations cannot continue to throw bodies at solving scalability and downtime issues. So beyond influencing the product group on the direction in which the application should be developed, the Operations manager needs to build a set of tools addressing the Operations Support Systems needs as stated above. One option is to nurture a relationship with VP Engineering and get resources from her group to build well defined solutions that could each be completed in a couple of weeks of work. This is especially true in early stage companies where the Operations team is small and the engineers take the brunt of many of the operations’ tasks. The VP Engineering would appreciate the need as members of her team are feeling the pain as well.
In a more established company, the VP Operations must make sure that there are coding/scripting capabilities in the team, so simple projects and tools could be developed within the team, with minimal aid from the Engineering group.

Thursday, March 18, 2010

Change Management and the Sanctity of Production

“Most people are afraid of change. We love it!” (Sign of a beggar on a street in San Francisco, 2004)

(Note: This article is part of the STORM™ methodology)

Change is the greatest cause of service interruption in any IT operation. Period. Stop. Exclamation mark.
In the extreme, one might argue that change is the cause for every service interruption if one counts hardware or software malfunctions as a change as well.
SaaS operations tend to suffer more from a lack of proper change management for two reasons: First, the consequences of a service outage for a company whose entire existence depends on its service are dire. Second, SaaS engineers, as I have written in numerous posts, lack the discipline that is more inherent in IT departments.

And yet, my experience has been that in most SaaS companies, changes are unsupervised, undocumented, unauthorized, unplanned, (sometimes unnecessary), underestimated and un_____ (fill in the blanks).

The importance of Change Management cannot be overstated and it is the first practice that I have implemented at companies that I worked at (or for). I wince when I recall the casualness which I have witnessed at various SaaS companies about making changes in the production system. I can quote my former boss, Mansur Salame, CEO of Contactual, saying that “production should be treated as sacred, with utmost respect appropriate to holy places” (or something of the sort). And, boy, were we sacrilegious back in those days!

In the chapter on Change Management in the upcoming book, I will present my STORM™ adaptation in much detail.

In this post I will outline some guidelines for sane change management.

Sixty Seconds on Change Management
Below are listed the objects that comprise a comprehensive Change Management practice.

RFC – Request for Change document. Must initiate the process, regardless of how small or major the changes are. It should include the what, the why, the when, the risk, the potential impact (on customers or components), and a checklist of notifications and tests that should/should not be done.

A Change Window must be defined, clearly stating which types of changes to which subsystems are allowed on which days, during which hours.

Change Calendar – which might be implemented in a number of static or automatic formats, must represent the ‘Change Window’, and depict all planned changes by the company, service providers and customers, and must be part of the RFC process.

Change Advisory Board or CAB is the pre-determined group of people who scrutinize and approve the RFC. The CAB may be large or small but it should include at least one person who is not involved in the request and planning process. The CAB may meet on a recurring schedule or as needed.

A Change Record is a record describing a change that occurred in production or the eco-system. It could be implemented as a database, an Excel sheet, or within a ticketing system. It should include the what, when and impact (on customers or components). Much important information could be derived from this data store that pertains to the Incident and Availability Management practices.

The Maintenance Plan is a detailed document defining the pre-requisite tasks, the maintenance tasks, rollback tasks and post-maintenance tasks. Each task should have a description, an owner, a time and duration. In most cases the plan must be scrutinized to the lowest detail level, and practiced in a Pre-Production environment that should mimic the production environment as much as possible.
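
To make the relationship between an RFC and the Change Window concrete, here is a minimal sketch. The field names follow the RFC description above, while the `ChangeWindow` and `RFC` classes and the `within_window` helper are hypothetical illustrations and not part of the STORM™ templates. It assumes a window does not cross midnight.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class ChangeWindow:
    """Allowed days and hours for changes to one subsystem (assumed structure)."""
    subsystem: str
    weekdays: frozenset  # 0 = Monday ... 6 = Sunday
    start: time
    end: time

    def allows(self, subsystem: str, when: datetime) -> bool:
        return (
            subsystem == self.subsystem
            and when.weekday() in self.weekdays
            and self.start <= when.time() < self.end
        )


@dataclass
class RFC:
    """The what, why, when, risk and impact of a requested change."""
    what: str
    why: str
    when: datetime
    risk: str
    impact: str
    subsystem: str


def within_window(rfc: RFC, windows) -> bool:
    """True if the planned time of the RFC falls inside any defined window."""
    return any(w.allows(rfc.subsystem, rfc.when) for w in windows)
```

A check like this could run when the RFC is submitted, so the CAB only sees requests that are already scheduled inside an allowed window, or explicit exceptions.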

The sixty seconds are over. This was just a teaser. Obviously there are templates, workflows and a slew of details that tie all of these objects into a well-oiled practice. The book will expand on the details and include the workflows, the templates and methods for automation.
To summarize: as the market matures and competition thrives, the big differentiator will be the second ‘S’ in SaaS and customers will become less and less forgiving. If a SaaS company does not maintain a robust Change Management practice it will end up paying in frustrated staff and customer churn.

Sunday, January 31, 2010

SLA Consequences to Service Operations

"What, me worry?" (Alfred E. Neuman)

In my previous post I discussed some basic concepts about SLAs, SLOs and penalties. As promised, I am addressing the ‘who cares?’ question.

From a Service Operations point of view, you may shrug your shoulders and claim that these are issues for the Legal and Finance departments. Most likely you were brought on board later in the game and never viewed an SLA until you were forced to do so.

As the person responsible for keeping all the services up and running, it may seem best to keep the SLA to a minimum. After all, a document containing vague language, with little commitment and liability, would be hard to wave in front of your face when the service levels drop.

I will argue that vagueness will play against you. A tough SLA will require the company to adhere to the high service levels they are committed to, and yes, pay the penalties for breaching these agreements. Keep in mind that if your service level drops one time too many, the legalese you will be hiding behind will not save your butt when customers drop from the service or simply do not renew.

I would take it even one step further. I advocate that the Service Ops managers’ bonuses be tied to achieving those SLOs that will keep a smile on the customers’ faces (typically up-time and response time, but in some cases there are other objectives that are crucial to the customers). The carrot and the stick should work nicely to assure that you are doing the utmost to live up to the agreements.

Commitments
Another issue that concerns you (Service Operations) is that Sales are making commitments that you are supposed to keep, usually without you ever knowing about it. Operations needs to initiate a fact-finding effort to learn what is there. You need to know what you are capable of providing. Everybody likes to state that they are five nines (99.999% uptime) but how many companies out there really are? You need to monitor and test your service over a substantial period of time before you commit to those numbers.
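
It helps to translate "nines" into the downtime they actually permit. The following is a small worked calculation; the function name `allowed_downtime_minutes` is only an illustrative choice, and a 365-day year is assumed.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float = 365.0) -> float:
    """Minutes of downtime permitted over the period at a given uptime percentage."""
    return (1 - uptime_percent / 100) * period_days * 24 * 60


# Five nines leaves a little over five minutes of downtime per year;
# three nines (99.9%) leaves about 525.6 minutes, i.e. under nine hours.
five_nines = allowed_downtime_minutes(99.999)
three_nines = allowed_downtime_minutes(99.9)
```

Seen this way, a single botched maintenance window can consume a full year of five-nines budget, which is why the numbers should be measured before they are promised.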

Another point in favor of having a good grasp of your SLA is that you, as a consumer of services, would be conscious of your requirements vis-à-vis your service providers.

That will include the hosting services, your ISP, your communications provider, and whatever cloud services you are using. In my past positions as VP Service Operations I have been appalled by the contracts that my predecessors had signed with service providers. Some of them had no consequences for service level degradation. Others had ridiculous clauses such as 'for every hour of downtime, the credit would be for one hour prorated service cost' which meant that there was no real penalty. Another contract stated that we could get out of the agreement if for three months in a row(!) the service provided was available for less than 75% of the time.
We would have been out of business by then.
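
The arithmetic behind the "one hour prorated" clause shows why it carries no real penalty. This is a sketch with made-up numbers: the monthly fee of 7,200 and the 720-hour (30-day) month are assumptions chosen to keep the figures round.

```python
def prorated_credit(monthly_fee: float, downtime_hours: float,
                    hours_in_month: float = 720.0) -> float:
    """Credit equal to the service cost of the hours that were lost."""
    return monthly_fee * downtime_hours / hours_in_month


# Ten hours of downtime on an assumed fee of 7,200 yields a credit of 100,
# which is under 1.4% of the monthly bill.
credit = prorated_credit(7200.0, 10.0)
credit_share = credit / 7200.0
```

The provider simply refunds the hours it did not deliver. As for the 75% clause, availability below 75% in a 720-hour month means more than 180 hours, or seven and a half days, of downtime.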

Where are those damn SLAs?
As we have seen, SLAs that are broad and meaningful will be complex. Add to that various service levels such as Standard, Gold and Platinum and the fact that some customers have negotiated special terms for themselves, and you are dealing with a mean, slimy problem.

To compound that problem, nine times out of ten, these documents are sitting on someone’s laptop in a PDF format with perhaps a hard copy in a dusty folder, in the cabinet below the espresso machine.

Imagine the exercise of figuring out if an SLA was breached for a particular customer, and if that breach carries a penalty.

I have painfully gone through that exercise too many times, and believe you me - I had much better things to attend to following a service outage. The process was extremely slow, finding the various documents, looking up the terms and comparing the events with them.

Then a calculation was needed as to how much credit was due. And all this was done for a single customer. Multiply that by the number of customers that may have been affected and you have just wasted many good hours of Solitaire.

SLA Management Tools
There are multiple tools out there (some are offered as SaaS) to manage your SLAs. Many of them provide a full cycle of defining SLOs, creating SLAs, generating the documents, monitoring performance against obligations, computing compensation and generating reports. I have not used any of them (although I used to work at an SLM ISV), so I am not about to promote any single one, but there are very slick solutions available.

If you are at an early stage, it would be a hard sell to justify to management that you need to start paying for a service that possibly no one in the company comprehends.

Typically, when a SaaS company starts out there are very simple, non-binding, fixed SLAs, so there is very little attention paid to this aspect of the business.

As with any aspect of SaaS Service Operations, scalability issues hit you when you least expect them.

As most (all?) SaaS companies do not start with Service Level Management software, by the time it becomes a burden they will have many dozens, or hundreds of such SLAs. The effort of converting them to an automated system is daunting.

Therefore, you can start structuring your existing and future SLAs into a simple Excel spreadsheet or database, so that they are easily accessible and comparable.

A typical set of SLAs could be stored in a table such as the one below.
The values for the various SLOs in the table were automatically populated from the definitions in the pre-defined Platinum and Gold tables (which state the default values for these SLAs). They may be overridden by specific values, following negotiations for a particular customer.

Cust.	Calia	…	Google
Cust ID	123	…	213
SLA	Gold	…	Platinum
Uptime	99.9	…	99.99
Response time	under 6 sec	…	under 4 sec
Support Response time	2 hours	…	30 min
Support Avail.	12x6	…	24x7
Major Outage Resolution	1 hour	…	30 min
Partial outage resolution	4 hours	…	2 hours
Minor Outage Resolution	12 hours	…	6 hours
Maint. Notification	10 days	…	2 weeks
FTP	12 hrs	…	6 hours
Outage Notif.	Email, 1 hour	…	Email + call, 30 min
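The same idea of tier defaults plus negotiated overrides can be sketched in a few lines of code. This is only an illustration: the field names, the `CustomerSLA` class and the "ExampleCo" customer with its override are all made up, while the Gold and Platinum values are copied from the table above.

```python
from dataclasses import dataclass, field

# Tier defaults; values copied from the Gold and Platinum columns above.
TIER_DEFAULTS = {
    "Gold": {
        "uptime_percent": 99.9,
        "support_response_minutes": 120,
        "major_outage_resolution_minutes": 60,
    },
    "Platinum": {
        "uptime_percent": 99.99,
        "support_response_minutes": 30,
        "major_outage_resolution_minutes": 30,
    },
}


@dataclass
class CustomerSLA:
    customer_id: int
    name: str
    tier: str
    overrides: dict = field(default_factory=dict)  # negotiated exceptions

    def slo(self, key: str):
        """Return the negotiated value if there is one, else the tier default."""
        if key in self.overrides:
            return self.overrides[key]
        return TIER_DEFAULTS[self.tier][key]


def uptime_breached(sla: CustomerSLA, measured_uptime_percent: float) -> bool:
    """True when measured uptime falls below the customer's commitment."""
    return measured_uptime_percent < sla.slo("uptime_percent")


customers = [
    CustomerSLA(123, "Calia", "Gold"),
    CustomerSLA(213, "Google", "Platinum"),
    # Hypothetical customer and override, invented for this example.
    CustomerSLA(999, "ExampleCo", "Gold", {"uptime_percent": 99.95}),
]

measured = 99.93  # hypothetical measured uptime for the month
affected = [c.name for c in customers if uptime_breached(c, measured)]
print(affected)
```

With everything in one structure, the question "which customers did this month's outage breach?" becomes a one-line filter instead of a hunt through PDFs.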

In the book I will elaborate on the structures and the tools and how to automate the compensation computations.

Thursday, November 12, 2009

SLA Management for SaaS

“God does not ask about our ability, but our availability.” (Source unknown)

(Yet another chapter in the book - keep the feedback coming!)

As the second ‘S’ of SaaS indicates, the on-demand company is all about providing a service and therefore one would expect Service Level Agreements to be well defined and understood in this industry, but the facts tell another story. Few SaaS companies pay much attention to their SLAs, few really invest in them and most customers are quite clueless about them as well.

SLAs are tricky. Every SaaS provider is supposed to adhere to its service level commitments but on the whole, it is a document that most providers tend to keep out of the limelight and out of the conversation with customers. Judging from my experience, many SaaS companies use a single, non-binding, standard SLA for all customers, keeping their commitments and consequences to a minimum.

An SLA, as its name suggests, is an agreement between the service provider and the consumers, consisting of sections regarding the various commitments to service levels that will be matched or exceeded.
Each section is defined as a Service Level Objective (SLO).

A typical SaaS SLA should have the following SLOs:

Service Availability – define the availability of the service represented in percentage (e.g. 99.95% uptime)
System Response Time – define response time of various transactions represented in seconds. (e.g. login should not take more than 9 seconds)
Customer Service Response Time – a response on customer enquiries should take no more than an allotted time for various services (e.g. enabling a service for a new group should take less than two business days)
Customer Service Availability – hours of availability of customer service represented in an ‘hours per day by days per week’ notation. (e.g. 11X5 for regular customers, 24X7 for platinum customers)
Service Outage Resolution Time – the time it takes to restore a service after an outage has been reported. Represented in minutes and hours (e.g. 30 minutes for a full system outage)
Failover Window For Disaster Recovery - how long will it take to restore the service in a disaster recovery site, if disaster disables the main datacenter.
Reclaiming Customer Data – a commitment to transfer all (agreed) data in an agreed format in case the customer leaves the service.
Maintenance Notification – the advance notice that the provider will give customers of planned service outages, represented in days. (e.g. a planned downtime that will take more than one hour requires 10 business days notification)
Proactive Service Outage Notification - the time it takes for the provider to inform the customer that there are service issues, represented in minutes.
RFO (Reason for Outage) – a report to customers following a service outage explaining the circumstance, the incident and steps taken to remedy the problem. (For more information see the chapter on Incident Management). Some customers require an RFO automatically; in some SLAs it is written that an RFO will be generated only following a specific customer request. Usually the company commits to three business days following the service disruption.

Note the emphasis on should when referring to the SLOs of the document. The SLA provided by most on-demand companies consists of two or three paragraphs at most, regarding uptime, customer service availability and perhaps another one of the items above.
Many providers have additional services such as daily reports, daily data aggregations, or FTP services. Each one of these services merits an SLO that should be part of the document.

Some SLOs override others. In the example of a service outage, the Availability SLO takes precedence over the Response Time SLO, as you would not expect the performance of the system to be up to par when the system is down. On the other hand, this will kick start other SLOs such as Outage Notification, Resolution Time and Support Response Time.

Customer Expectations
Not all SaaS companies are created equal. They will vary by maturity, by the vertical they are serving, by the company size they cater for and, of course, by the type of application.
Some applications are core and some are peripheral. Some applications are used around the clock, like metering or call centers and the customers have zero tolerance for downtime. Other applications are rarely used outside of office hours, (e.g. payroll, talent management) and if the system is down, the price is a handful of irritated end-users that will need to take a coffee break earlier than they planned.
Larger customers tend to have more rigorous demands while lower paying customers will usually be more tolerant of the system’s performance and support availability.
Therefore, your SLA should reflect the relative position of your service along the following three vectors:

Customer size (reflecting subscription [potential] size)
Core vs. periphery
Downtime tolerance

So if you are providing a mission critical application to a large customer, whose downtime will cost the customer real dollars, your SLA should be taken very seriously.

Service Level Breaches and Penalties
We have seen the promises that come with the SLAs, but many of these agreements fail to state the consequences to the provider of not meeting the terms.
Each SLO should also define the penalties for breaching the service level commitment.
Penalties are typically specified as a prorated credit for the following month’s subscription fees.
From the customers’ point of view, the penalties should not be flat rated but increase as the service deteriorates, so that the second outage will carry a heavier penalty than the first outage. It is rare that customers insist on this point but those that do will need to negotiate these terms separately.

There is typically a maximum. It is unusual that accumulated penalties will top the monthly subscription costs. There is a catch here. As an extreme example, if your service was down for the duration of the whole month, the customer will be exempt from paying a full month’s service fee – but this is ridiculous of course. The damage to your customers is typically orders of magnitude higher than the subscription costs.
Many SaaS customers commit up front to a year or more of service, for a reduced subscription price. A good SLA will include a section that allows the customer to breach the extended commitment if the provider failed to adhere to the service levels for, say, three consecutive months.
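The penalty mechanics described in this section (a prorated credit, a heavier weight for repeat outages, and a cap at the monthly fee) can be sketched as follows. The 720-hour month, the escalation step of 0.5 and the function names are assumptions chosen for illustration, not terms from any real contract.

```python
def prorated_credit(monthly_fee: float, downtime_hours: float,
                    hours_in_month: float = 720.0) -> float:
    """Flat credit: the share of the monthly fee matching the downtime."""
    return monthly_fee * downtime_hours / hours_in_month


def escalating_credit(monthly_fee: float, outage_hours: list,
                      escalation_step: float = 0.5,
                      hours_in_month: float = 720.0) -> float:
    """Weight each successive outage more heavily, capped at the monthly fee."""
    total = 0.0
    for n, hours in enumerate(outage_hours):
        weight = 1.0 + escalation_step * n  # 1.0, 1.5, 2.0, ... (hypothetical schedule)
        total += weight * prorated_credit(monthly_fee, hours, hours_in_month)
    return min(total, monthly_fee)


print(prorated_credit(7200.0, 1.0))               # one hour down on a 7,200 fee
print(escalating_credit(7200.0, [1.0, 1.0]))      # two one-hour outages
print(escalating_credit(7200.0, [720.0, 720.0]))  # hits the cap
```

Note how small the flat credit is: one hour of downtime on a 7,200 monthly fee is worth 10, which shows why a purely prorated clause carries no real penalty.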

The next chapter will outline what all of this means to the Service Operations group and why you should care about issues that initially seem to be in the domain of Sales, Legal and Finance.

Sunday, November 08, 2009

Inter-department Communications

(Yet another chapter in my upcoming book on SaaS Service Operations - Your feedback has been great so far; thanx and keep it coming.)

"The Problem with Communication is the illusion that it has been accomplished" - George Bernard Shaw

While it is true of any institution, communication between the various silos of the organization is particularly vital for the successful operations of a SaaS company.

The reasons are that things happen much faster in an on-demand company, customers are in constant contact with the company and expectations are high for a fast turnaround.

At a product company, when bad things happen to the application, nine times out of ten, the software company doesn’t even know about it, and the customer’s IT deals with it. The end user is rarely in touch with the product provider. The product salespeople tend to ‘shoot and forget’ once the commission has been paid. If things go bad, the customer can mostly blame itself for not deploying or maintaining the software correctly or for not doing its due diligence.

Multiple channel interaction
At a service company, on the other hand, a typical customer will interact through multiple channels continuously. The CIO may have a direct line to the SaaS CEO. The IT department may be in touch with professional services, and managers of the service on the customer end could be speaking with the Program Management group. Members of the Operations group will inevitably be in touch with supervisors or IT managers on the customer side, and Sales will have developed personal relationships with managers on the customer’s side, as they nurture the relationship to expand the sales in-house. And, of course, the end users might be in daily contact with Customer Support.

Customers, naturally, will be irritated when things aren’t going smoothly regarding any one of multiple scenarios. It may concern a delayed service initialization, an undelivered bug fix, an incomplete customization, an unsatisfactory report, or (ouch) a service outage. Part of the allure of on-demand service is a much faster turn around time in every aspect. The customers believe it and expect it.
Imagine the customer’s frustration when they call any one of their contacts within the company to inquire about unresolved issues, and that person has no idea what they are talking about.

Disconnect between the groups
Typically, a SaaS company will be using a CRM that serves Sales and Customer Service. In many organizations the Sales view is radically different from the Support view and information available to one is not available to the other.
It is rare that other members have access to the CRM. Operations, Engineering, Professional Services and Program Management keep their own records in different systems for various reasons and are not trained in using a CRM.
Not surprisingly, the different silos do not have much knowledge of what each department is doing, and I have seen continuous tension between various groups and quite a lot of finger pointing when bad things happen.
It is also typical to see a startup company, where everybody occupies a single open space office, yet where so little communication takes place between the groups and political affiliations begin to form.
Resources are always limited and the demands are constantly growing; how does one prioritize the tasks and attention to a particular customer?

Service Outage and Communication
To illustrate through an acute, but none too rare, example: many a time I have experienced a service outage that, for obvious reasons, took everybody’s focus and energy. A couple of offices down the hall sat the Sales team and across the continent were various regional Sales reps. They were not informed of the outage since they play no role in detecting, classifying or resolving the issue, and all those that knew about it were busy trying to fix it, or taking customer calls. Often the customers, especially the senior members who have established a close relationship with the sales reps, would call Sales or Program Management immediately asking for updates. The uninitiated sales rep would answer that they are not aware of any outage and perhaps the problem is local to the customer (this would usually trigger a nasty remark about the incompetence of the provider). The experienced sales rep would mutter something in embarrassment and then storm over to the Ops group demanding an explanation why, once again, Sales was not notified of the outage. Not only does the company look bad, but it also raises unnecessary tension between the groups.
(This issue will be addressed in the chapter on Incident Management)

Recurring Mandated Meetings
Inter-department communication is the answer. If the managers of the different departments talk to each other on a regular and formal basis, issues can be addressed before they get out of control, plans can be communicated and a deeper understanding of the challenges of each department can develop.
Since Operations is at the center of it all at the end of the day, and since Operations will take the blame for whatever incident occurs, the VP of Operations should initiate these meetings. This initiative and the meetings will also serve as an important PR tool for the service operations group.
Following are the inter-department sessions that should be standard in a SaaS organization to improve communication and visibility and to help prioritize tasks and address issues before they boil over.

Name: Daily Operations Sync
Frequency: Daily (15-20 min)
Suggested Time: Late afternoon
Participants: Operations, Support, Program Mgmt
Agenda: Burning issues, Service outages, Planned maintenance, Delayed deliveries, Staffing

Name: Customer Success
Frequency: Weekly
Suggested Time: Monday
Participants: Sales, Program Mgmt, Support, Operations, Professional Services, R&D
Agenda: Customer Success Score sheet, Updates, Delays, Priorities. Address Red and Orange flags

Name: Operations-Engineering Sync
Frequency: Bi-Weekly
Suggested Time: Anytime
Participants: Operations, Engineering, QA
Agenda: Requirements, Releases, Known issues, Bugs, Dev/staging environment

Name: Company Fridays
Frequency: Bi-Weekly
Suggested Time: Friday Afternoon
Participants: All employees + food & beer
Agenda: Announcements, updates and department presentations

Name: SPOF Analysis
Frequency: Quarterly
Suggested Time: Anytime
Participants: Operations, Engineering, QA, Product, Support
Agenda: Single Point of Failure Analysis

(In the book, these meetings will be discussed in more detail)

I cannot emphasize enough the importance of these meetings. Not only do they facilitate the smooth operations of the company, but they also foster better relations between the company’s groups.

Thursday, August 20, 2009

Discipline (or lack thereof) and Operational Fatigue

“Half of life is luck; the other half is discipline - and that’s the important half, for without discipline you wouldn’t know what to do with luck”- Carl Zuckmayer

Creative and nonconformists
SaaS companies are mostly composed of a group of highly capable software engineers. These techies are, by nature, creative, imaginative, out-of-the-box engineers, inventing new ideas or new ways of achieving better results. They tend to adopt the latest and greatest technologies and are always looking forward to the next best thing.
Naturally, these engineers are nonconformists and not inclined to follow rules or to stick to routine.
Almost always, they do not come from an enterprise IT environment, where rules and regulations are stricter and operational practices are followed almost religiously.
With the nascent state of SaaS, if the engineers have prior experience, it would mostly come from on-premise, product companies that emphasize features, versatility and usability.
They rarely had to deal with customers, and bugs that were found were handled according to their priority to be fixed in the next release (which could be months away).
Therefore, typical SaaS engineers lack the necessary discipline to run a 24X7 service, and are usually hostile to restrictions imposed on them.

What, me worry?
The lack of discipline manifests itself mainly in Change Management and consequently in Asset Management and, then, consequently in Incident Management.
This refers to what changes are allowed to be done when (‘hey, just to let you guys know, I installed the new patch during lunch break’), how they are approved and communicated (‘yeah, no prob, I tested the code on my laptop – it is foolproof, just a small change in the parsing engine’), and how they are recorded and rolled back if necessary (‘don’t worry, I keep all changes in a dedicated notepad on my machine’).
There usually are no rules about touching production. Typically, every engineer has full sudo access to all servers in the data center, using a single super-user login, so that activities cannot be traced to any specific person.
One-offs can be installed on a particular server and not be documented. Months later when a new version is installed or a server replaced, things fail to work and it may take hours for someone to remember that a special component is not functioning any more.
Lack of a fully functional staging environment may cause an engineer to ‘temporarily test’ some feature on a production machine that either causes service disruption or is forgotten until the fan turns brown.

Operational Fatigue
Operational Fatigue is a term I coined after years in the trenches, of waking up at 3:00 AM to deal with the same problem that hit us three weeks ago; of the stress of dealing with an incident at peak time when Management is hysterical, when Sales are complaining, when Support is overwhelmed with frustrated customers; of making the calls to the high profile customers, explaining, apologizing, promising; of having to explain to the Board why we lost so many customers this quarter.
It gets to you. You discover new gray hair and develop a fear of answering the phone.

The point is – it is avoidable. Instilling the practices and discipline can make life so much easier and allow the ops team to plan and improve instead of fighting fires all the time.

Educating the young
Like toddlers, engineers crave guidance and discipline, but as most parents would testify, they will make every attempt to break the rules and stretch the envelope to test the boundaries of their environment. Experienced parents will tell you that young children feel much more secure when they know the rules and when the rules are being enforced. It has been my experience that when I introduce a new set of regulations, such as in Change Management, there is always an initial push-back, mumbling about bureaucracy and attempts to circumvent the rules in the beginning. But I have always seen a quick adoption of the new regulations, followed by a realization that life would be so much better if we only stick to the rules – these guys are smart, you know. Many a disaster was avoided by playing the game by the new rules and I found out how quickly the engineers embraced the discipline and started devising ways to improve on and automate the processes.

Just do it!
I recently participated in a round table hosted by HP on the subject of Change Management. Most of the participants were from large IT shops and were talking about adapting to new Change Management processes in terms of six to twelve months. I was astonished. I concede that my background has been with much smaller groups, and I had the full backing of the executive management, but twelve months? Jeez!

The process in my experience was:
· Prepare the documents, templates and work-flows.
· Make a compelling PowerPoint presentation.
· Present to the Engineering, Ops and Support groups.
· Emphasize the consequences of not following the practice (genitalia hanging at high altitude)
And Voila - It works! A few weeks later you have a spiritual following of admirers, because the fruits of the labor are so obvious in a very short time.

Thursday, August 13, 2009

Transparency in SaaS Service Operations

“Life is filigree work. What is written clearly is not worth much, it's the transparency that counts.” - Louis-Ferdinand Celine

Companies like to boast about their transparency, but in practice, information dissemination is highly controlled. At an on-demand company, hiding the backstage operations seems like a smart thing to do. As long as you are servicing the customer, and as long as the customers do not complain, why should you wash your dirty laundry in public?
So what about SLAs? The guiding principle seems to be ‘Don’t worry about them if your customers do not demand them’. And even when they do, there are SLAs and then there are SLAs. There are so many ways to interpret these elusive numbers (assuming you even know the real ones) that most companies will portray better results than those that reflect reality.

Varying degrees
There are different modes of Transparency communications: from the nonexistent to the reactive, the proactive and full disclosure.

The reactive type is the common case where there are service disruptions and customers call in to complain. In this case you will determine how much information you would like to divulge. This could be done with a customer call, an RFO (Reason for Outage) that is sent to particular customers or a message on the corporate site.

A proactive approach would have a Service Status Page depicting the current service availability of the various production systems.

A full disclosure mode will provide customers with a historical view of production systems availability and response time, such as Salesforce’s Trust or SAManage’s Status Page.

Advantages of Transparency
My experience has been that the more transparent you are with your customers, the better relationship you will foster with them and the more forgiving they will be when things turn sour. And things do turn sour; it is unavoidable.
Your customers are not dumb (in general, that is – I can relate many amusing stories of individuals that should not have been awarded fourth grade graduation, but that is another story). The people on the other end generally understand that you are dealing with a complex environment with many factors that are not always under your control. They will be willing to accept that scheisse happens, but they also must know that you are ready to accept responsibility and learn from these events. There should be a closure process for each event including Incident Recording, Post Mortem and RFO communication (more on that in Incident Management).
Of course, nothing beats a good, reliable, available and responsive service. If you are not able to provide that, you will end up losing your customers regardless of how much camouflage and finger pointing are used to cover the smell.

How transparent should you be?
I am not advocating that you have to run out and tell the guys every time you messed up or that you should bombard the customers with a technical exposition as part of the RFO document.
Striking the balance is an art that comes with practice and common sense. If an incident occurred that did not disrupt services, you must undergo the full Incident life-cycle practice to ensure that lessons are learned and the incident will not repeat. But you do not necessarily have to go and boast about it.
As for the RFO, in my days I have been asked to put my signature on many customer facing documents that had a bland, general, canned message that meant nothing to the reader. (“service was lost due to a system failure”). I realized that customers will not trust the messaging and choose to either ignore it while snorting in disgust or have a techie call in and start grilling the poor customer service rep for technical details which would be hard to provide.
I have also seen RFOs that contained multiple pages and read like a PhD dissertation in electronic engineering. I do not know who approved these RFOs and if the purpose was to wear down the suffering reader so that further RFOs will never be requested.

Company Culture
And finally, keep in mind that if the company’s culture tolerates half-truths and spins when facing the customer, you run the risk of it percolating through the company’s internal activities and reports. Don’t you expect your employees to be truthful, accountable and not shy away from reporting mistakes, even if it makes them look not too great? Your customers have to expect your company to do the same. And, if the results of truthful reporting will cost you a customer then something was probably wrong with the relationship to begin with, and the customer may have been looking for an excuse to break away.

Wednesday, May 20, 2009

Questions that SaaS executives must be able to answer - KPIs that matter.

“There is much pleasure to be gained from useless knowledge” (Bertrand Russell)

It has been my experience that SaaS executives have trouble answering the most basic questions about their service operations, and mind you, this is what the business is all about.

Again and again, I keep coming back to the conclusion that the dire state of SaaS Service Operations is due to the fact that on-demand companies are built on the first ‘S’ (Software) and not the second 'S' (Service).

SaaS entrepreneurs are, in general, bright, creative, out-of-the-box thinkers. They are software developers and have no clue about IT practices and disciplines.

The age old premise “if you can't measure it you can't manage it” somehow escapes SaaS companies across the globe, until it becomes a huge problem.

Have you gone through the numbing process of presenting a specific customer with their real SLA adherence? I have. On average, it would take me a few hours of going through multiple sources of data to come up with (sometimes) accurate data.

Following are a number of questions (an incomplete list) that every SaaS executive should be able to answer in her sleep, or at least with a click of a button.

1. Availability management

What are your real uptime numbers?
What do the trailing twelve months (TTM) look like?
Are we better than we were six months ago?
How many outages have you had in the last M months?
What is the breakdown, based on severity?
What is the breakdown, based on downtime causes?
How many service disruption incidents were repeated?
How quickly do you recover from outages?
How many days have gone by without a critical, major outage?

2. SLA Management

How does your availability match up to your customer commitments?
Which customers were affected most (even if they do not complain)?

3. Change Management

How often are changes made to the production environment?
What is the breakdown of changes by category?
What percent of changes did you have to roll back?

4. Asset management

What is the status of your inventory? What box is located where?
What function or customer would be impacted by a loss of a certain box?
When do your support/software contracts expire and what might it affect?

5. Cost Management

What are the actual costs of the operations?
How is the budget allocated among the various components?
How much does each new (N) customer(s) cost?
Are we getting the full value from our supply chain?

6. Churn Management

How many customers have you lost in the past 6, 12, 24 months?
Is your customer retention improving over time?
What percent is your customer churn out of your customer base?
What is the average retention time of your customers?
What is the breakdown, based on reasons for churn?
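Several of the questions above can be answered from nothing more than an outage log and a churn log. The sketch below assumes a very simple record layout and uses invented sample data; the function names are hypothetical and the point is only to show how little is needed to get started.

```python
from collections import Counter
from datetime import datetime

# Hypothetical outage log: (start, end, severity, cause). Sample data is invented.
OUTAGES = [
    (datetime(2009, 1, 10, 3, 0), datetime(2009, 1, 10, 3, 45), "major", "database"),
    (datetime(2009, 2, 2, 14, 0), datetime(2009, 2, 2, 14, 20), "minor", "network"),
    (datetime(2009, 4, 18, 9, 30), datetime(2009, 4, 18, 11, 30), "major", "database"),
]
# Hypothetical churn log: one stated reason per lost customer.
CHURN_REASONS = ["outages", "price", "outages"]


def downtime_minutes(outages, window_start, window_end):
    """Total outage minutes that fall inside the reporting window."""
    total = 0.0
    for start, end, _severity, _cause in outages:
        overlap = min(end, window_end) - max(start, window_start)
        if overlap.total_seconds() > 0:
            total += overlap.total_seconds() / 60.0
    return total


def uptime_percent(outages, window_start, window_end):
    """Real uptime over the window, as a percentage."""
    window = (window_end - window_start).total_seconds() / 60.0
    return 100.0 * (1.0 - downtime_minutes(outages, window_start, window_end) / window)


def mean_recovery_minutes(outages):
    """How quickly do we recover, on average?"""
    durations = [(end - start).total_seconds() / 60.0 for start, end, _s, _c in outages]
    return sum(durations) / len(durations)


def repeated_causes(outages):
    """Causes that show up in more than one incident."""
    counts = Counter(cause for _start, _end, _sev, cause in outages)
    return sorted(cause for cause, n in counts.items() if n > 1)


def churn_percent(customers_lost, customer_base):
    """Customers lost as a percentage of the customer base."""
    return 100.0 * customers_lost / customer_base


WINDOW_START, WINDOW_END = datetime(2009, 1, 1), datetime(2009, 5, 1)
print(f"uptime: {uptime_percent(OUTAGES, WINDOW_START, WINDOW_END):.3f}%")
print(f"mean recovery: {mean_recovery_minutes(OUTAGES):.1f} min")
print("by severity:", dict(Counter(sev for _s, _e, sev, _c in OUTAGES)))
print("repeated causes:", repeated_causes(OUTAGES))
print(f"churn: {churn_percent(len(CHURN_REASONS), 60):.1f}%")
print("churn reasons:", Counter(CHURN_REASONS).most_common())
```

Once incidents and lost customers are recorded in a consistent shape, each KPI is a few lines of code rather than a few hours of digging.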

I am well aware of the fact that there are no integrated solutions for the SMB supporting a database for these crucial KPIs, but every company should have some form of repository capturing at least some of the data and an easy way of extracting it.

The important issue here is that SaaS companies should be aware of these KPIs and start asking these questions, even if they do not yet have all the answers.

Saturday, January 24, 2009

Of Dinosaurs and Men – Why Traditional ISVs Will Fail On SaaS

Dinosaurs were an extremely successful model. They roamed the earth and multiplied and were the indisputable rulers of this planet for well over a hundred million years.

They were not very fast, necessarily, nor even nice to their customers (some of them are reputed to have actually eaten their customers), but they were successful because they had spent millennia adapting to the existing environment, perfecting their model to make the most out of it.

Then an unexpected event occurred (some scientists believe a meteor hit the earth creating a nuclear winter while others claim it was fast, cheap internet and tightening IT budgets) and soon all dinosaurs went extinct. Well, not all. Two groups were not annihilated. There were those alligators that stayed in their swamp (niche) and are doing pretty well, thank you, still today. The other group consisted of small reptiles that were driven to grow wings since the larger crawlers ate up all the easily available resources. These birds were lucky to be able to adapt quickly to the changing environment and survive the downfall of their relatives.

The fast moving, warm blooded mammals were better equipped to deal with the brave new world and many have grown to become true behemoths.

In a previous post I revealed the fact that I have mostly stopped advising the traditional on-premise, enterprise, perpetual, software vendor on the transition to on-demand, subscription model, i.e. SaaS.

This is not because I do not believe that it is a smart move, or that the ISVs would not benefit from the transition. Far from it! I envision a world, not too far in the future, where on-premise software would be the exception, not the rule, and even that exception would point to a dwindling model that would survive in niche markets only (swamps) .

My experience, which is supported by many famous (SAP and Avaya for starters) and less famous companies, has been that most traditional, on-premise, enterprise ISVs will fail miserably in the transition to SaaS. I have advised companies that started out with great enthusiasm that dwindled to a silent death. They simply do not have the DNA for it.

I am talking about the right STUFF that is inherently lacking in established enterprise ISVs that will allow them to make the successful transition. This is not a comment about these companies’ value or success. It is usually inversely proportionate. The more successful the company is, the more entrenched it is likely to be in doing things the ‘right way’ – right, as far as the traditional model dictates.

These ISVs have a product view, not a service view. Their emphasis is on features, not serviceability. There is a lot of push back from every silo in the organization against change in general, and the SaaS change in particular. It requires a paradigm shift in the organization, and the bigger and more established that organization is, the more difficult it is to bring about that change. (See Impact on the ISV Organization July 02, for a detailed account)

Until a couple of years ago, one could say that most ISVs just don’t get it. But that is no longer the case.

Many traditional ISVs saw their market share being cannibalized by these fast moving SaaS companies. Many heard their customers ask about an on-demand offering and many understand that it is vital that they have a “me too” offering. One cannot ignore the changes in the market and shrug it off as a fad. SaaS used to be a way to work around IT; now CIOs are building on-demand strategies for their business and even starting to use on-demand tools in IT.

So, there is a much deeper understanding of the need to offer an on-demand service, but very few ISVs understand that it means a total commitment from the executive level and down.

Not that it is impossible. I have worked with a company whose board made the decision to go Services. They replaced the CEO, who in turn replaced all the senior staff, save the VP engineering. The new VP Sales brought in a fresh new sales force. Then they went through the process of rewriting most of the application from scratch. This process took about a year. They are now a successful SaaS vendor, but they got as close to re-encoding their DNA as possible.

And, of course, there are known successful enterprises such as Oracle on demand, HP SaaS (former Mercury Managed Services) and others that have successfully launched their on-demand services, but they are the exception to the rule.

Dinosaurs were magnificent creatures and it is sad that we don't have them around any longer (except on isolated islands in the Pacific), but their only fault was that they were too successful for the 'old world' model. I wonder how many software alligators will still be around a decade from now.

Thursday, August 07, 2008

Your Typical SaaS Operations

Personal note: For those avid fan(s) who have wondered why I disappeared for such a long time: I took a long, forced vacation, having been sucked into what is not commonly known as the SaaS Operations Black Hole. I went under the radar as VP Operations and Services in a SaaS company, and although it did not leave me time to write blogs or visit the bathroom, I have collected an arsenal of good stuff from hands-on experience, both good and bad, which I intend to share, time allowing.
(One of the reasons I changed my vocation as a consultant to on-premise companies who were contemplating going on-demand, was my realization that it was mostly a futile attempt. It is the equivalent of turning slow moving, cold-blooded dinosaurs into fast, warm-blooded mammals without the benefit of a few hundred million years of evolution. But more on that in a future blog.)

Very well, back to the subject at hand. I have been visiting, talking to, sharing with and advising many SaaS startups in the SF Bay Area in the past year, and a clear (actually murky) state of affairs seems to emerge.

Company Profile:

Name: YTSC (Your Typical SaaS Company)

Age: 3-4 years in the making.

Staff: between 20-40 people, possibly a small dev team offshore in Southeast Asia or one of the former Soviet Union republics.

Technology: Being relatively fresh, YTSC technology is multi-tenant, customer-centric, with (hopefully) an automatic customer on-boarding mechanism, and they surely have an integration with Salesforce.com (or perhaps NetSuite, SugarCRM, MSDynamics, etc.). Some fancy configuration capabilities should be built into the product and Web Services integration options are available.

Platform: most probably a LAMP shop. Let’s start with all the free stuff and hope to reach profitability before loading the heavy guns. And, hey, we’re big advocates of open-source.

Sales force Compensation: Hmm, we read all the papers, attended the webinars. We think we’re getting it right, but why does the Sales department feel like Grand Central Station?

Customers: A lot of mom & pop shops, a bunch of Web 2.0 companies with flamboyant logos, a number of departmental customers with big names that we flash on our web site.

Profitability: Surely by next year.

YTSC is now poised for accelerated growth. The customers seem to like the service and the price, and it looks like the numbers will grow rapidly; at least this is what YTSC’s newly acquired VP of Sales has projected.

So how is YTSC prepared for this rapid growth? Do they have the People, Practices and Programs (P-cube) in place? Are they ready to scale from dozens of customers to hundreds and, hopefully, thousands?

My guess is NO. Let me think about that for a moment… Naw.

So why does it look unpromising? Being typical, YTSC Operations has the following traits:

Operations is under the auspices of Engineering. There is no VP of Operations; there is no Operations group. A Sys Admin is managing the production servers and probably doing office IT on the side.

The CTO is responsible for uptime, availability and performance. Does the CTO have an Operations background? I'll bet my lunch money that he doesn't. Is there a Staging platform? Probably not. Can the engineers log into production servers and modify configurations? Yeah. Actually we just fixed that nasty bug during lunch break.

There is no application-level monitoring in place, or trend analysis.

Is Customer Churn being tracked and analyzed? (What? What was that?)

There is no 24X7 support, although YTSC claims it is a 24X7 shop.
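For the churn question above, here is a minimal sketch of what "tracking" could mean in its simplest form. It assumes the common definition of monthly churn as customers lost during the month divided by customers at the start of the month; the `MonthlyCounts` record and all the numbers are made up for illustration and are not taken from any real company.

```python
from dataclasses import dataclass


@dataclass
class MonthlyCounts:
    """Customer counts for one month (hypothetical record layout)."""
    month: str
    customers_at_start: int
    customers_lost: int


def churn_rate(counts: MonthlyCounts) -> float:
    """Fraction of the customers active at the start of the month who left during it."""
    if counts.customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return counts.customers_lost / counts.customers_at_start


# Illustrative, made-up history; a real shop would pull this from billing or CRM.
history = [
    MonthlyCounts("2008-05", 200, 4),
    MonthlyCounts("2008-06", 210, 6),
    MonthlyCounts("2008-07", 220, 11),
]

for m in history:
    print(f"{m.month}: {churn_rate(m):.1%}")
```

Even a table this crude, kept month after month, is enough to show whether churn is creeping up before anyone has to ask.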

Are the following crucial practices defined and followed?

Change management – over 60% of downtime is caused by well-intentioned modifications to the platform. Is there a proper process in place? Is there an RFC (Request For Change) form and procedure? A change committee?

Incident Management – are Support, Operations and Engineering aligned in a well rehearsed routine; roles and responsibilities defined? Is there an Incident management system in place? How about a knowledgebase?

Configuration Management – are hundreds of moving parts accounted for? Are they linked into the Change Management process – actually, we don’t have a Change Management practice.

Availability Management – how do you analyze unavailability? How do you “budget” downtime? Do you know where to invest your next Dollar to ensure optimal availability? It should be all tied into an Incident management system. But wait, we don’t have one.

Release Management – how, when, how often, naming conventions. How does it tie into Change Management and Configuration Management?

SLA Management – Are we providing what we promised? Are we tracking the effect of incidents on customers? Are we compensating them according to our contractual commitments? Is it tied into our (hosted) CRM solution? Hard to do without an Incident Management system.

Are we any better than we were last month, last year? Can anybody tell?
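To make the "budget" idea in the practices above concrete, here is a minimal sketch under stated assumptions. The 99.9% target, the 30-day month, the incident causes and the minutes are all invented for illustration, and the function names are hypothetical. A real shop would read these figures from its Incident Management system rather than from a hard-coded list.

```python
from collections import defaultdict

MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_target, period_days=30):
    """Minutes of downtime the target allows over the period (e.g. target 0.999)."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    return period_days * MINUTES_PER_DAY * (1.0 - availability_target)


def availability(downtime_minutes, period_days=30):
    """Measured availability for the period, given total downtime in minutes."""
    return 1.0 - downtime_minutes / (period_days * MINUTES_PER_DAY)


def downtime_by_cause(incidents):
    """Sum downtime minutes per cause from (cause, minutes) pairs."""
    totals = defaultdict(float)
    for cause, minutes in incidents:
        totals[cause] += minutes
    return dict(totals)


# Made-up incident log for one 30-day month: (cause, minutes of downtime).
incidents = [
    ("change", 25.0),
    ("hardware", 10.0),
    ("change", 12.0),
]

budget = downtime_budget_minutes(0.999)
used = sum(minutes for _, minutes in incidents)
print(f"budget: {budget:.1f} min, used: {used:.1f} min")
print(f"measured availability: {availability(used):.4%}")
for cause, minutes in sorted(downtime_by_cause(incidents).items()):
    print(f"{cause}: {minutes / used:.0%} of downtime")
```

Running the same calculation every month gives a comparable number over time, and the breakdown by cause is one rough way to decide where the next Dollar should go.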

No doubt, parts of these practices have been in place with less fancy names. Otherwise YTSC would not have survived this far. But Excel and Notepad will not suffice for a large scale operation.

Most companies understand that (or maybe that is wishful thinking), but when having to choose between investing the next Dollar in great features that customers have been begging for, or in that ugly, boring, misunderstood, 800-pound gorilla, they will opt for the former. Pay now or pay later.

I will cover some of these practices in future posts, Google willing.

Thursday, December 07, 2006

The Central Role of Operations

"Computers are useless. They can only give you answers." (Pablo Picasso)

The Operations group is an odd duck for the traditional, on-premise, enterprise ISV. Those ISVs that are transitioning to the SaaS model are typically not familiar with this group, its role and perhaps its reason for being, and in some cases you might find Operations reporting to the CFO as a ‘cost center’.

But in a SaaS shop, the Ops group is the hub of all activity. Its crucial and main job, of course, is to ‘keep the lights on’ and do that in a highly available, quality performance fashion. Maintaining a scalable, fail proof service is a task that the Ops group should, in time, perfect to the notion of ‘auto-pilot’, implementing the Automate and Delegate principles (see Reducing SaaS Operational Costs).
But that is not where the job ends; indeed it is only the beginning.

In some early stage SaaS operations (either a pure SaaS player or an ISV in transition), R&D and IT provide that function. IT is usually incapable of running, scaling and maintaining the application; its tool set, capacity and pace are so removed from an application level, 24X7 operation. R&D is in shock and awe: “you mean we have to use the damn product!!??” – they are usually the least capable of understanding how the application should work or the value to their customers.
Whereas R&D used to have dozens, hundreds or thousands of customers, Operations is now the only customer (or in a hybrid solution, the largest). Any and all feedback to product marketing would come from the Ops group. It must develop a keen understanding of the application, not just the infrastructure supporting it, and it has to be in constant contact with its customers – the SaaS consumers – to gather feedback, compile it in an orderly and prioritized manner and be able to communicate it to R&D.
Since an Operations Support System (OSS) will be lacking in most early SaaS implementations, the Ops group will be the one presenting the technical solutions, either through building its own tools, buying those apps or through cooperation with Engineering to provide the solutions. In any case, Operations will be the authority on the architectural needs, security, storage, the OSS, service-ready features and the application in general. Therefore Operations should be highly involved in the product roadmap.

In some organizations, ownership of the customer success may be in a separate silo that does not include Operations. But one must keep in mind that the Ops group works on a daily basis with the Network Operations Center (NOC) which doubles as the customer support center. Even if Ops and Customer support are not part of the same organization (which I believe they should be) the daily interaction between the groups means that Ops owns the customer success in many ways and deals with the customer directly as a Tier 2 support.

Operations needs an in-depth understanding of its infra and application performance issues and of principles of performance testing/monitoring. They need to work closely with the QA group to test and resolve load issues. In each rollout of a new version, Ops has the ownership of the project, the dates, process definition and should work in conjunction with R&D, QA, Marketing and customer support.

If the Service offered through this model allows for project based business, Operations will tend to be involved in defining offerings, help with the pricing and participate in scoping projects.
And based on my own experience, the Ops team, as the owner of the application and service, worked closely with a team of Expert Service engineers who provided the end consumers with domain-level expertise to drive more value from the application. (See IPaaS) Ops engineers also participated in user forums with the customers to provide best practices and tips n' tricks.

"OK! Fine!" I hear you shouting from the back rows "We are convinced of the central role of Operations. So what?"
So just keep in mind when you are about to launch this endeavor that you will need to assemble a good team of professionals to play in this game. Not just seasoned systems engineers, a network manager and a good DBA, but operations engineers who ideally have an engineering background and who are innovative, customer oriented and business savvy. Nothing short of the Fantastic Four.