Service Level Agreements (SLAs) are a hot topic in any organization. Many organizations are rolling out SLAs in an attempt to match IT infrastructure availability with the business objectives of the organization.
There is more to an SLA than just identifying a specific performance threshold, however. SLAs can also define support options and compensation for service levels not met. This article will discuss the Top 10 things any data center should consider when implementing SLAs.
Top 10 aspects of successful SLAs in the data center
- Fundamentals: The basic approach to SLA implementation revolves around these core areas: processes, technology, and people. Don't focus solely on the technology when implementing SLAs. In many cases it's not the network, hardware, or software that causes an outage. Assess your processes and your staff. Education and process validation are just as important in outage prevention as having N+1 architecture. If your staff isn't trained to support a high availability cluster you've deployed, outages could be even longer than they would be without the cluster.
- Measurable: An SLA must be quantifiable. It must be specific about what is being measured (for example, "monthly network latency will not exceed 95 milliseconds round-trip"), and the metric must be one that your data center is able to capture. For example, setting a 99.9% SLA on the availability of a Web app and using only ping to determine whether a server is up or down does not give you a measurable SLA. The Web server could be functioning appropriately while the application itself has stopped responding. Be sure you have the tools you need to accurately measure the SLAs you are setting.
- Manageable: Set SLAs only in areas where a service is critical. Setting too many SLAs means your IT department has to spend more of its time sorting out data, analyzing trends, and responding to outages in non-critical service areas instead of working on higher-priority tasks.
- Achievable: Make sure your goals are reasonable for your infrastructure. If your network is at 90% utilization, you may not be able to guarantee certain response times. You might need to set expectations and educate end users and other stakeholders about what level of availability is reasonable to achieve with the infrastructure you currently have in place.
- Define how performance is measured: Ensuring everyone understands how the metrics are gathered and calculated is important. Always set parameters to show what is being measured and when. For example: "Server performance will not exceed an average of 80% CPU utilization between 8am and 5pm Monday-Friday."
- Appropriate: Ensure SLAs are appropriate for your organization. The SLAs for a global conglomerate can and should be different from the SLAs for a government agency or a web-based retailer.
- Review schedule: For every SLA that is set, define a corresponding review schedule for that specific service area within your regular capacity planning and review sessions. Use the metrics you gather from this SLA to help shape your future expansion or service enhancement plans.
- Remedy Plan: Define how you will respond to outages. Will your users be entitled to an explanation of all outages, or just specific critical outages? What, if any, service credits will be paid to your customers, internal or external, for outages? How will future outages be prevented?
- In line with IT processes: Ensure that the areas where you are setting SLAs match up with the processes you have in place. A mismatch between the two could result in a situation where the staff on site is unable to remedy the situation within the mandated recovery window. For example, don't set an SLA stating that a specific application will be back in service within a 2-hour recovery window if the onsite support agreement for your server hardware has a 4-hour response time.
- Exceptions: Define the situations under which SLAs will not be honored. These can include such things as natural disasters, holiday schedules, third-party vendor service failures, end-user misconduct, scheduled maintenance windows, or core business hours.
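To make the "Measurable" and "In line with IT processes" points more concrete, here is a minimal Python sketch of the arithmetic behind them. It is only an illustration under stated assumptions: the function names, the 30-day measurement period, and the idea of a list of pass/fail application-level health checks are all hypothetical and not taken from any particular monitoring product.

```python
from typing import Sequence

MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(target_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime an availability target allows in one period.

    The 30-day default period is an assumption; use whatever period
    your SLA actually names.
    """
    return (1 - target_pct / 100) * period_days * MINUTES_PER_DAY


def measured_availability_pct(check_results: Sequence[bool]) -> float:
    """Percentage of application-level health checks that succeeded.

    Each entry is assumed to be the outcome of a check against the
    application itself (not just a ping of the server).
    """
    if not check_results:
        raise ValueError("no health-check samples collected")
    return 100 * sum(check_results) / len(check_results)


def recovery_window_is_feasible(recovery_window_hours: float,
                                vendor_response_hours: float) -> bool:
    """True only if the vendor is due on site before the window closes."""
    return vendor_response_hours < recovery_window_hours


if __name__ == "__main__":
    # A 99.9% target leaves roughly 43 minutes of downtime in 30 days.
    print(round(downtime_budget_minutes(99.9), 1))

    # Hypothetical sample: 999 successful checks and 1 failed check.
    checks = [True] * 999 + [False]
    print(measured_availability_pct(checks))

    # The article's example: a 2-hour recovery window cannot be met
    # when the hardware support agreement has a 4-hour response time.
    print(recovery_window_is_feasible(2, 4))
```

Turning the percentage into a downtime budget helps when judging whether a target is achievable. Running the feasibility check against each support contract is one simple way to catch a mismatch between an SLA and the processes behind it before the SLA is published.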