Data center metrics and standards guide
Plenty of service-level agreements (SLAs) promise the world because there's little to no accountability.
When internal IT departments accept service-level agreement metrics, the penalty for failure is usually a slap on the wrist -- if the SLA ever comes up. It makes no sense to fire the staff managing the whole data center over a slow application response time. Businesses could outsource the application to a service provider, but the problem of setting and enforcing SLA metrics remains.
Most businesses have a mix of servers, storage, networking and as-a-service outsourcing partners supporting their technology platform, and each requires SLA accountability.
Are SLAs a means of punishing the IT team or outsourcing partners after a negative event? If so, how does this help the business? For example, if the SLA is between the organization and an external service provider, you might include financial penalties for negative events. However, if lack of access to a service prevented the organization from completing business transactions, how do you calculate the business' losses? It's unlikely that any external provider would pay anywhere near that cost in an SLA penalty. And how would you hold an internal team accountable when a financial ding hardly makes sense?
An SLA should ensure service levels are attainable. The SLA should cover how things worked in the past, how they work now and how things are likely to go in the future.
Predictive monitoring is key to a successful SLA. Take an SLA metric that defines the acceptable speed of a service's response to end users. In this example, the SLA states that a mean response time of 500 milliseconds or less, measured over each hourly period, is acceptable.
When the application starts, the hourly average comes in at 300 ms, well below the SLA's 500 ms limit. However, the response time slowly increases as the business continues using the application: 350 ms, 405 ms, then 470 ms. If you only compare each hourly mean against the limit, IT operations still appear to be well within acceptable parameters. But if you monitor the system predictively, you will see that the trend is a problem before response times violate the SLA.
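The trend in this example can be checked with a few lines of code. The following is a minimal sketch, not a production monitoring tool: it fits a straight line through the hourly means by least squares and estimates how long until that line crosses the 500 ms limit. The function names and the choice of a linear fit are assumptions made for illustration; real response-time growth may not be linear.

```python
SLA_LIMIT_MS = 500.0  # hourly mean limit taken from the example SLA


def linear_trend(values):
    """Least-squares slope and intercept for samples taken at x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def hours_until_breach(hourly_means, limit=SLA_LIMIT_MS):
    """Estimate hours after the latest sample until the fitted line reaches the limit.

    Returns 0.0 if the latest sample already breaches the limit, and None if
    the trend is flat or improving (no breach predicted by this simple model).
    """
    if hourly_means[-1] >= limit:
        return 0.0
    slope, intercept = linear_trend(hourly_means)
    if slope <= 0:
        return None
    last_x = len(hourly_means) - 1
    crossing_x = (limit - intercept) / slope
    return max(0.0, crossing_x - last_x)


if __name__ == "__main__":
    means = [300, 350, 405, 470]  # hourly means from the example
    print(all(m < SLA_LIMIT_MS for m in means))   # True: no breach recorded yet
    print(f"{hours_until_breach(means):.1f}")     # 0.6: breach predicted within the next hour
```

Every hourly mean is still inside the SLA, yet the fitted line reaches 500 ms well before the next hourly period ends. That is the signal to act on.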
Fix the problem
Track the trend and take action before response times violate the SLA. A long-term fix must be applied, whether that means tuning code or scaling hardware.
Identify the cause of the problem. Is the slowing response time due to underlying problems with the application or due to an increase in users? If the application is underperforming, identify the root cause -- a memory leak, for example -- and plan a program-level fix. If the problem is crowding from additional users, plan to scale up data center resources for the application.
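One rough first check on the cause is to see whether response time moves together with the number of users. The sketch below computes a Pearson correlation between the two series. The user counts, the 0.8 cutoff and the function names are hypothetical values chosen for illustration, and a strong correlation is only a hint, not proof of cause; it tells you where to look first.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least two")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no trend to correlate with
    return sxy / math.sqrt(sxx * syy)


def likely_cause(user_counts, response_ms, cutoff=0.8):
    """Classify the slowdown using an assumed, illustrative correlation cutoff."""
    if pearson(user_counts, response_ms) >= cutoff:
        return "load growth"        # response time rises with user count
    return "application issue"      # slowdown not explained by user count


if __name__ == "__main__":
    response = [300, 350, 405, 470]       # hourly means from the example
    growing_users = [100, 120, 145, 170]  # hypothetical: user base is growing
    steady_users = [100, 101, 99, 100]    # hypothetical: user base is flat
    print(likely_cause(growing_users, response))  # load growth
    print(likely_cause(steady_users, response))   # application issue
```

If the user count is flat while response time climbs, something like a memory leak becomes the more plausible suspect. If the two rise together, scaling resources is the more likely fix.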
With digital growth and changing workloads, the service provider must advise the business on what is needed to stay within the service level, or whether the expected service level is unrealistic. For example, if a glut of users is slowing down the application, a larger virtual machine might fix the problem, but it will cost more money. If the IT budget won't allow for more resources, negotiate cost alongside service level: Would it be okay to spend less, but allow the service level to drift out to a mean of 550 ms, for example?
It's complicated in-house
Negotiating an internal SLA is more complex. Heterogeneous environments of owned data centers mixed with colocation and cloud-based services mean the "master" SLA needs to refer to agreements with the IT department and with external service providers. Take the same approach with each subsidiary SLA; the aim is to create a flexible relationship with the provider to avoid any downtime or surprises in performance or costs. Don't think in terms of penalties.
SLAs should be living agreements, not documents that get forgotten. Review SLAs regularly to ensure the existing metrics are still fit for purpose. The business has to be involved with these reviews -- a response time of 500 ms agreed upon three months ago may be too slow today because a competitor entered the market. With SLA reviews, the business will know what investments could bring the target response time down to 350 ms or 400 ms, instead of simply setting an unreasonable SLA with unenforceable penalties.
Only discuss penalties when the SLA has failed. At this point, the relationship between the two parties will have failed as well. Have a business-oriented Plan B: Who takes over for the people responsible for the failure of the SLA itself? Can you truly fire the whole IT department? This should be a last resort.