Your enterprise's old data center has reached the end of the road, and the whole kit and caboodle is moving to a colocation provider. What should you be looking for in the data center, and just how much uptime comes from within?
A lot of the work measuring data center reliability has been done for you. The Uptime Institute's data center Tier levels describe the overall availability that a facility's technical design should provide.
There are four Uptime Tiers. Each Tier must meet or exceed the capabilities of the previous Tier. Tier I is the simplest and least highly available, and Tier IV is the most complex and most available.
Tier I: Single non-redundant power distribution paths serve IT equipment with non-redundant capacity components, leading to an availability target of 99.671% uptime. Capacity components include items such as uninterruptible power supplies, cooling systems and auxiliary generators. Any capacity component failure will result in downtime for a Tier I data center, as will scheduled maintenance.
Tier II: A redundant site infrastructure with redundant capacity components leads to an availability target of 99.741% uptime. When a capacity component fails, operators can manually switch over to a redundant item with a short period of downtime. Scheduled maintenance still requires downtime.
Tier III: Multiple independent distribution paths serve IT equipment; there are at least dual power supplies for all IT equipment and the availability target is 99.982% uptime. Planned maintenance can be carried out without downtime. However, a capacity component failure still requires manual switching to a redundant component, which will result in downtime.
Tier IV: All cooling equipment is dual-powered and a completely fault-tolerant architecture leads to an availability target of 99.995% uptime. Planned maintenance and capacity component outages trigger automated switching to redundant components. Downtime should not occur.
In most cases, costs reflect Tiering -- Tier I should be the cheapest, and Tier IV should be the most expensive. But a well-implemented, well-run Tier III or IV facility could have costs that are comparable to a badly run lower-Tier facility.
Watch out for colocation vendors who say their facility is Tier III- or Tier IV-"compliant"; this is meaningless. Quocirca has even seen instances of facility owners saying they are Tier III+ or Tier 3.5. If they want to use the Tier nomenclature, then they should have become certified by the institute.
Your role in uptime
These Uptime Tier levels reflect availability targets for the facility -- not necessarily for the IT equipment inside. Organizations must ensure that the architecture of the servers, storage and networking equipment, along with external network connectivity, provides similar or greater levels of redundancy so that the whole platform meets the business's needs.
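To see why the IT platform matters as much as the building, here is a rough sketch of how availability compounds across layers that must all be up at once. It assumes the layers fail independently, which is a simplification, and the IT platform and network figures are hypothetical values chosen for illustration. Only the 99.982% facility figure comes from the Tier III target above.

```python
def combined_availability(*layers):
    """Multiply the availabilities (fractions between 0 and 1) of layers
    that must all be working for the service to be up.
    Assumes the layers fail independently of one another."""
    result = 1.0
    for availability in layers:
        result *= availability
    return result


facility = 0.99982      # Tier III facility target
it_platform = 0.999     # hypothetical figure for servers and storage
network = 0.9995        # hypothetical figure for external connectivity

overall = combined_availability(facility, it_platform, network)
print(f"Overall platform availability: {overall:.3%}")
```

Under these assumptions, the overall figure is always lower than the weakest layer, so a high-Tier facility cannot make up for a fragile IT architecture.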
The uptime percentages may look close together, but the differences are large in practice: at its 99.671% target, a Tier I facility allows about 29 hours of downtime per year, whereas a Tier IV facility at 99.995% allows less than 30 minutes.
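The arithmetic behind those figures is simple enough to check for yourself. The sketch below converts each Tier's availability target, as quoted earlier, into allowable downtime per year. It assumes a 365-day year, and the function and variable names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored

# Availability targets quoted for each Tier, in percent
TIER_AVAILABILITY = {
    "Tier I": 99.671,
    "Tier II": 99.741,
    "Tier III": 99.982,
    "Tier IV": 99.995,
}


def annual_downtime_hours(availability_percent):
    """Hours per year that a facility may be down at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for tier, percent in TIER_AVAILABILITY.items():
    hours = annual_downtime_hours(percent)
    print(f"{tier}: {hours:.1f} hours ({hours * 60:.0f} minutes) per year")
```

The same function can be pointed at any availability figure a provider quotes in a service-level agreement.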
The majority of Tier III and IV facilities have individual, internal targets of no unplanned downtime; discuss this when interviewing possible outsourcing providers or when designing your own facility.
Although it's tempting to look at the Uptime Tiers as a range of "worst-to-best" facilities, your business requirements must drive the need. Consider a sub-office with a central data center for most of its critical needs and a small on-site server room for non-critical workloads. A Tier III data center may be overly expensive for its needs, while a Tier I or Tier II facility would be highly cost-effective. Tier I and Tier II facilities are not generally suitable for mission-critical workloads, unless they must be used and a plan is in place to manage how the business works during downtime.
Ideally, house critical workloads in Tier III and IV data centers. Tier III facilities still require a solid set of procedures around capacity component failures, and these plans must be tested regularly. Even with Tier IV, don't assume everything will always go according to plan. Simple single-redundancy architecture (each capacity component backed up by one more) can still result in disruptions if more than one capacity component fails.
Ask how rapidly the data center operator replaces failed components to ensure redundancy is restored quickly. Are replacement components held in inventory, or is a supplier contracted to get a replacement on-site and installed within a set timeframe? For a Tier IV facility, this should be measured in hours, not days.
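A back-of-the-envelope model shows why the replacement window matters. The sketch below estimates the chance that the surviving component in a redundant pair also fails before the first one is replaced. It assumes failures occur at a constant rate (an exponential model), and the 50,000-hour mean time between failures is a hypothetical figure, not vendor data.

```python
import math


def second_failure_risk(mtbf_hours, restore_hours):
    """Probability that the remaining component fails during the window
    in which redundancy is lost, assuming a constant failure rate."""
    return 1 - math.exp(-restore_hours / mtbf_hours)


MTBF_HOURS = 50_000  # hypothetical mean time between failures

for label, window_hours in [("4-hour replacement", 4), ("3-day replacement", 72)]:
    risk = second_failure_risk(MTBF_HOURS, window_hours)
    print(f"{label}: {risk:.4%} chance of a second failure")
```

In this simplified model the risk grows almost in proportion to the length of the window, which is why a contracted replacement time measured in hours rather than days is worth asking for.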
While the Uptime Institute's facility Tiers provide a good basis for what is required to create a data center facility with requisite levels of availability around the capacity components, the group does not provide reference designs. The Tiers do not cover areas such as raised vs. solid floors, in-row vs. hot/cold aisle cooling, and so on.
About the author:
Clive Longbottom is the co-founder and service director of IT research and analysis firm Quocirca, based in the U.K. Longbottom has more than 15 years of experience in the field. With a background in chemical engineering, he's worked on automation, control of hazardous substances, document management and knowledge management projects.