How to finesse a data center service-level agreement

Doing your homework before you sign a service-level agreement will pay off in the event of an emergency.

Economic necessity is pushing more companies to loosen the reins and put their applications, equipment and trust in the hands of a data center facility provider. Now more than ever, outside services are part of the fabric of your company. How do you protect yourself in these arrangements?

As part of the master agreement, service-level agreements (SLAs) outline the scope and cement the terms of your deal, but not all SLAs are created equal. Reading the fine print and scouting out the terrain will help you hammer out a contract that will ensure you receive the level of service your business needs.

Peter Sclafani has negotiated SLAs for dozens of clients, ranging from IPv6 Software as a Service (SaaS) to colocation. As chief information officer at network automation firm 6connect, Sclafani has been on both sides of the SLA process and has reviewed hundreds of contracts in his role as either a customer or a vendor.

Question: The “master agreement” contains the service-level agreement. As the customer, where do you typically look to make changes?

Peter Sclafani: With a service-level agreement, I usually take a step back and look at the documents that are attached to it. Make sure that you have all of the referenced documents available for review. Your typical SLA, as a component of the master agreement, is going to have the same essential sections:

1) A definition of the product/service covered by the SLA. This section is usually off-limits with the provider since it is just defining the product or service. It’s not a good sign if you can't agree on this.

2) A definition of what “available” means and the service parameters surrounding it.

3) A description of the vendor’s response and escalation process when SLA conditions are not met.

4) A definition of the remediation process when the SLA conditions are not met.

You can probably imagine that sections two, three and four are where the edits typically happen. These are the details that are likely to benefit the provider by default. For example, with a data center, a network that drops packets is bad and may make your application unusable, but if it's within the established parameters of “availability,” you don't have solid ground for a remuneration claim.

Unfortunately, the complexity of the SLA will vary widely depending on the particular services being contracted. In a data center agreement, a typical SLA could cover everything from environmental conditions such as temperature and humidity, to network connectivity measures such as latency or uptime, to managed/cloud services involving servers, virtual machines, backup services and so on. With all of these possibilities to consider, the real issue with an SLA is going to be the vendor's flexibility during legal review. Assume there will be several rounds of feedback and changes, and balance the contract value against your legal fees.

Also remember the cardinal rule of any SLA when it comes to remuneration: You will not get any more than you put in. If there is an outage or other loss of availability under the SLA, you will never be entitled to “damages.” Remuneration will never exceed what you are paying for the particular service on a monthly basis.
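
The cap can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name, the pro-rated formula, the multiplier and the dollar figures are assumptions, not terms from any particular contract. It shows that however generous the advertised credit rate, the payout stops at the monthly fee.

```python
def service_credit(monthly_fee, downtime_minutes, credit_multiplier=1.0,
                   minutes_in_month=30 * 24 * 60):
    """Return the credit owed for an outage, capped at the monthly fee.

    Hypothetical formula: pro-rate the monthly fee by the downtime,
    scale it by an advertised multiplier, then apply the cap.
    """
    if monthly_fee < 0 or downtime_minutes < 0:
        raise ValueError("fee and downtime must be non-negative")
    prorated = monthly_fee * downtime_minutes / minutes_in_month
    uncapped = prorated * credit_multiplier
    # The cardinal rule: never more than you paid for the service that month.
    return min(uncapped, monthly_fee)


# Hypothetical numbers: a 6-hour outage on a $2,000/month service.
print(round(service_credit(2000.0, 360), 2))                    # 16.67
# The same outage under an advertised 10,000x credit rate hits the cap.
print(service_credit(2000.0, 360, credit_multiplier=10_000))    # 2000.0
```

Running the numbers this way before negotiating makes it easier to compare vendors on what a credit is actually worth, rather than on the headline percentage.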

Question: What environmental variables in a service-level agreement are the most open for discussion and negotiation?

Sclafani: The two most common negotiating areas in a service-level agreement center on the definition of availability, or uptime, and the credit process.

The uptime definition is where you specify which environmental variables are covered for the product or service. With data centers, this would probably start with temperature and humidity and could easily get into network latency and application-specific performance.

The negotiation of uptime will normally cover the actual metric and the interval, such as temperature and time. For a colocation SLA, the conversation would focus on the acceptable temperature range and the time interval that you could be “out of range” before it counts as an event.
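
To see how the metric and the interval interact, here is a minimal sketch of how a client might count temperature events from its own sensor log. The 18 to 27 degree Celsius range, the 15-minute grace interval and the function name are assumptions chosen for illustration; the real values are whatever the contract says.

```python
from datetime import datetime, timedelta


def out_of_range_events(readings, low_c, high_c, grace):
    """Return (start, end) spans where temperature stayed out of range
    for longer than the grace interval.

    readings: list of (timestamp, temperature_c) tuples sorted by time.
    A span ends at the first in-range reading, or at the last reading.
    """
    events = []
    span_start = None
    last_ts = None
    for ts, temp in readings:
        in_range = low_c <= temp <= high_c
        if not in_range and span_start is None:
            span_start = ts                      # excursion begins
        elif in_range and span_start is not None:
            if ts - span_start > grace:
                events.append((span_start, ts))  # long enough to count
            span_start = None                    # excursion over
        last_ts = ts
    # Handle an excursion that is still open when the data ends.
    if span_start is not None and last_ts - span_start > grace:
        events.append((span_start, last_ts))
    return events


# Hypothetical sensor log: one reading every 5 minutes.
t0 = datetime(2024, 1, 1, 0, 0)
temps = [22, 28, 29, 23, 28, 29, 30, 29, 28, 24]
readings = [(t0 + timedelta(minutes=5 * i), t) for i, t in enumerate(temps)]

events = out_of_range_events(readings, low_c=18, high_c=27,
                             grace=timedelta(minutes=15))
print(len(events))  # 1
```

The first excursion lasts only 10 minutes and is ignored, while the second lasts 25 minutes and counts. Shortening or lengthening the grace interval changes how many events the same data produces, which is why the interval is worth negotiating.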

It’s OK to negotiate, but it’s also important to be honest with your vendor about the gear you plan to use. Suppose that you commit to a certain electrical load and the vendor agrees to maintain a certain temperature threshold. If you install your gear in the cabinet in a way that creates hot spots, the vendor may not be responsible for an outage that you, the client, caused.

Also consider that your actions as a client may cause an SLA event. For example, suppose you commit to an electrical load of 6 kW per cabinet. If you push the electrical circuits above 80% utilization, that might be categorized as an “out of SLA” event on the client’s part and trigger actions by the vendor. We have seen contracts that discuss client notification and levying of fines for time the circuit is over the established threshold. We have also seen contracts where the vendor holds the right to unplug hardware to reduce the power to within the 80% threshold!
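
A client can estimate its own exposure to this kind of clause from power readings. The sketch below assumes, purely for illustration, that the 6 kW commitment is also the circuit capacity (so the 80% line sits at 4.8 kW) and that readings arrive every five minutes. The function name and sample values are hypothetical.

```python
def minutes_over_threshold(load_samples_kw, circuit_capacity_kw,
                           threshold=0.80, sample_minutes=5):
    """Estimate minutes spent above the utilization threshold.

    load_samples_kw: power readings in kW taken at a fixed interval.
    Each sample over the limit is counted as one full interval.
    """
    limit_kw = circuit_capacity_kw * threshold
    over = sum(1 for kw in load_samples_kw if kw > limit_kw)
    return over * sample_minutes


# Hypothetical half hour of readings for one cabinet.
samples = [4.2, 4.7, 4.9, 5.1, 4.75, 4.6]
print(minutes_over_threshold(samples, circuit_capacity_kw=6.0))  # 10
```

If the contract levies fines by time over the threshold, this is the number the vendor will be tracking, so it is worth tracking it yourself.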

Another point of contention is the method vendors use to calculate credits for remuneration. Make sure you understand their process and terms so there are no unpleasant surprises, such as a cap on credits. Sometimes these terms can be negotiated but success will depend largely on the particular vendor.

Question: What areas should a potential client educate themselves about prior to a service-level agreement negotiation? How are maintenance windows and other downtime situations handled?

Sclafani: Your negotiating leverage comes from knowing competing service-level agreements and shopping around between vendors. An SLA will always look reasonable to the vendor that wrote it, but vendors will generally work with you to make the terms palatable.

Don't fall into marketing traps. For example, some vendors may make an outlandish claim like a 1,000,000% service credit for failing to meet portions of the SLA, but the limit is what you pay for the service. Don't expect to be taken seriously if you bring that up in an SLA edit. But a vendor's 100% uptime guarantee – with minimal credits if downtime occurs – is something you can negotiate.

The most effective way to negotiate an SLA is to understand the areas the vendor controls. For example, suppose a vendor is subleasing data center space, and your request is out of scope with their landlord. It’s not reasonable to expect the vendor to agree to terms that would open up their business to liability or other issues out of their purview.

You may be able to negotiate changes with the providers’ services. For example, if their network management is outsourced or through a third party and your request is reasonable, your vendor may actually be able to work with their service provider for an SLA modification.

When you consider the impact of third parties on a provider’s SLA, be aware of the maintenance windows. The necessity of planned downtime can be abused to cover for an outage and ensure the vendor doesn't have to issue credits. For example, what if a network engineer botches a router configuration? Some providers may dub this a maintenance window to skirt their liability. If your SLA excludes maintenance windows, you may have no recourse, but you still have downtime. Ideally, an SLA applies to all realized downtime, regardless of any maintenance windows, emergency or otherwise.
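
The effect of a maintenance exclusion is easy to quantify. The following sketch compares the availability a vendor would report with the exclusion against the availability the client actually experienced. The outage durations, the 30-day month and the function name are assumptions made up for this example.

```python
def availability_pct(total_minutes, outages, exclude_maintenance):
    """Percent availability over a period.

    outages: list of (minutes, labelled_as_maintenance) tuples.
    If exclude_maintenance is True, outages the vendor labels as
    maintenance do not count against availability.
    """
    counted = sum(
        minutes for minutes, is_maintenance in outages
        if not (exclude_maintenance and is_maintenance)
    )
    return 100.0 * (total_minutes - counted) / total_minutes


month = 30 * 24 * 60  # 43,200 minutes in a 30-day month

# Hypothetical month: a 20-minute failure plus a 90-minute botched router
# change that the vendor relabels as emergency maintenance.
outages = [(20, False), (90, True)]

print(round(availability_pct(month, outages, exclude_maintenance=True), 3))   # 99.954
print(round(availability_pct(month, outages, exclude_maintenance=False), 3))  # 99.745
```

With the exclusion, the reported figure stays near 99.95%. Without it, the same month comes in near 99.75%, which may be the difference between a credit and no credit.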

Be aware of what SLA breaches do to the integrity of the contract. There should be a clause that allows you to get out of a provider’s contract without termination penalties if there are too many outages within a certain timeframe or a single extended outage beyond an established duration. For example, if the vendor has three outages in a month, you should be able to take your business elsewhere. A single sustained outage should trigger a similar termination option.
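
Both exit triggers can be checked mechanically against an outage log. In this sketch the three-outage count comes from the example above, while the 30-day window, the four-hour limit for a single outage and the function name are assumptions. A real clause would define its own numbers.

```python
from datetime import datetime, timedelta


def may_terminate(outages, max_outages=3, window=timedelta(days=30),
                  max_single=timedelta(hours=4)):
    """Return True if the outage history meets either exit trigger.

    outages: list of (start, end) datetime pairs.
    Trigger 1: any single outage lasting at least max_single.
    Trigger 2: max_outages or more outages starting within one window.
    """
    if any(end - start >= max_single for start, end in outages):
        return True
    starts = sorted(start for start, _ in outages)
    # Check every run of max_outages consecutive start times.
    for i in range(len(starts) - max_outages + 1):
        if starts[i + max_outages - 1] - starts[i] <= window:
            return True
    return False


# Hypothetical log: three short outages within three weeks.
d = datetime(2024, 3, 1)
short = timedelta(minutes=10)
log = [(d, d + short),
       (d + timedelta(days=7), d + timedelta(days=7) + short),
       (d + timedelta(days=20), d + timedelta(days=20) + short)]
print(may_terminate(log))  # True
```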

Question: How do you track and monitor SLA metrics such as support, problem escalation, response time and support staffing terms?

Sclafani: These areas are pretty important, and they highlight the potential shortcomings that surface when something goes wrong. Depending on the services the service-level agreement covers and the redundancy available from a provider, you may have several different thresholds of support and escalation. For example, if you’re engaging a fully redundant 2N facility, you may be comfortable with an "on call" facility contact who is available 24/7 but is only onsite during normal business hours. If your hardware is at an N+1 facility, then you may feel better having a facility contact onsite 24/7.

Is there a portal where you can see the same SLA metrics information as the vendor? Is there a ticketing system or dashboard where you can see the status of your services? Understanding the support and escalation process is crucial to making sure your expectations as a client are realistic, and that you are aware of what is expected of you. For example, you may have an SLA with a colocation facility that stipulates temperature, but the vendor managing the facility may not give access to temperature data. You would have to deploy your own monitoring and alerts system to verify potential issues. This may not be practical or even possible.

Next, understand how to deal with outage credits. When an SLA is invoked during an outage, the "outage clock" — and thus your potential credit — may not start until certain conditions are met. Many times, the vendor requires submission of a trouble ticket and that is the timekeeper. Other vendors may require you to call their support or even have a minimum outage threshold before SLA credits kick in. These are all key areas to identify before you sign on the dotted line. To even respond to an outage, you will probably have to do some monitoring yourself to verify your vendor's reports. This can happen with outages, temperature fluctuations or even 95th percentile billing. Mistakes do happen!
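
The gap between actual downtime and creditable downtime can be sketched as follows. This assumes a ticket-based clock and a 15-minute minimum threshold. Both rules, the timestamps and the function name are hypothetical and will differ from vendor to vendor.

```python
from datetime import datetime, timedelta


def creditable_minutes(outage_start, outage_end, ticket_opened=None,
                       minimum=timedelta(minutes=15)):
    """Minutes that count toward a credit under a ticket-based clock.

    The clock starts at whichever is later: the outage itself or the
    trouble ticket. No ticket, or less than `minimum` on the clock,
    means nothing is creditable.
    """
    if ticket_opened is None or ticket_opened >= outage_end:
        return 0
    clock_start = max(outage_start, ticket_opened)
    counted = outage_end - clock_start
    if counted < minimum:
        return 0
    return int(counted.total_seconds() // 60)


# Hypothetical 90-minute outage from 02:00 to 03:30.
start = datetime(2024, 5, 1, 2, 0)
end = datetime(2024, 5, 1, 3, 30)

# Ticket filed 40 minutes in: only 50 of the 90 minutes count.
print(creditable_minutes(start, end, ticket_opened=datetime(2024, 5, 1, 2, 40)))  # 50
# Ticket filed 10 minutes before recovery: below the minimum, nothing counts.
print(creditable_minutes(start, end, ticket_opened=datetime(2024, 5, 1, 3, 20)))  # 0
```

Under rules like these, the time it takes to notice an outage and file the ticket comes straight out of the credit, which is one more argument for doing your own monitoring.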

The SLA may include a clause to guarantee some level of response by the vendor to address an issue. For example, a vendor might have a trouble ticket response SLA of one hour and may charge an additional amount for a faster response. When tracking responses, it is also important to verify how this time is calculated and tracked. Some vendors use the auto responder of the ticketing system as their "first response." It’s a neat little trick that absolves them of that SLA requirement. Ensure you understand how their support structure is put together so your expectations match what you are paying for.
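
When auditing response times from a ticket export, it helps to separate automated replies from human ones. The sketch below assumes each reply carries a flag marking it as automated. That flag, the timestamps and the function name are hypothetical, since ticketing systems expose this information in different ways.

```python
from datetime import datetime, timedelta


def first_human_response(opened, replies):
    """Time from ticket open to the first reply that is not automated.

    replies: list of (timestamp, is_automated) tuples in any order.
    Returns a timedelta, or None if no person has replied yet.
    """
    human = [ts for ts, is_automated in replies
             if not is_automated and ts >= opened]
    if not human:
        return None
    return min(human) - opened


# Hypothetical ticket: auto responder after 5 seconds, a person after 80 minutes.
opened = datetime(2024, 6, 3, 9, 0, 0)
replies = [(datetime(2024, 6, 3, 9, 0, 5), True),
           (datetime(2024, 6, 3, 10, 20, 0), False)]

elapsed = first_human_response(opened, replies)
print(elapsed)                       # 1:20:00
print(elapsed <= timedelta(hours=1)) # False
```

Counted from the auto responder, this ticket meets a one-hour response SLA with 59 minutes to spare. Counted from the first human reply, it misses by 20 minutes.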

Staffing is usually glossed over and the focus is put on responsiveness. While I appreciate responsiveness and talking to someone during an incident, I would much rather get someone on the phone with operational experience to help me from the minute I call. When looking at providers, I always like to get an idea of the background of the support staff.

I generally get two responses. The prospective provider may be transparent and share some background information about who will work with my hardware. I can ask them questions and get a feel for how their past experience relates to the services I am signing up for. Or I’ll encounter a complete black hole. The provider will tell me “everyone is super experienced” but offer no further details. When pressed for examples of operational experience, I may get no response, or a response that shows the support team is inexperienced and that most issues will be escalated up multiple tiers.

Look at your budget and determine the relative importance of provider support. If you have the talent in-house to deal with it, the lower experience level of a provider’s personnel may not be an issue. If you are trying to offload work from your team or add a skill set, it’s important to take a closer look at the provider’s staffing.

Question: What kind of due diligence can a potential client perform to test a provider’s escalation process before committing to a service-level agreement?

Sclafani: Since committing to a service-level agreement is an important part of the client/vendor relationship, there is no better test than “the call.” You know what problems your team has had in the past, so call their support line as if you were a customer, both during regular support hours and after hours. This is the most telling way to see who picks up and what their escalation process is, and it gives you at least a mild simulation of an actual incident. It's also a good opportunity to talk with the support team and get a feel for what they will be like.

My other recommendation is to talk to current customers about their experiences, as well as to customers who are no longer being served. If you see a pattern of customer neglect or service failures, even the most favorable SLA won't matter, since you should be looking for another service provider.
