This is the first part of our two-part series on data center power issues. The second part covers addressing data center power failure and its root causes.
The loss of data center power, even momentarily, is equivalent to a human heart attack. While organizations must make choices about data center design (i.e. Tier levels 1-4) and the level of redundancy, no one ever really expects to lose power. But power outages occur in both small server rooms with a single uninterruptible power supply (UPS) and in large multi-megawatt sites that were supposedly “fully redundant.” Here are some obvious and not so obvious reasons for outages and ways to mitigate potential power path issues.
The Power Path: Cost vs. acceptable risk
Data center design criteria are usually driven by cost versus “acceptable” risk factors. The details of the Uptime Institute’s Tier 1-4 requirements won’t be explained here, since that would involve delving into more topics than just power, but the most basic power path is the simple “N” design (or Tier 1), which has no redundancy. Each component in the power path is a single point of failure (SPOF).
The power path starts at the utility and its high-voltage transmission lines. It then goes into a substation or local transformer and is handed off to the customer. The power is fed through the facility's main electrical panel and any subsequent subpanels, each containing a circuit breaker or fuse.
Assuming there is a generator, the power must next pass through an automatic transfer switch (ATS) and on to the main power panel that feeds the UPS. If the site is so equipped, the power will also pass through a maintenance bypass panel (MBP). The main power panel also feeds other equipment, such as the cooling system components. From there, the power usually goes from the UPS (or MBP) to a distribution wall panel or floor-level power distribution unit (PDU) that contains branch-level circuit breakers for each circuit.
Then, the power typically goes to a receptacle in or near the cabinet and from there to a rack-type power strip, which sometimes also has a circuit breaker or fuse. Finally, it’s on to the IT equipment's power cord, typically detachable, which is plugged into the rack power strip.
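To see why a chain of single points of failure matters, the power path described above can be modeled as a series system in which every component must work. The sketch below is only illustrative: the component list is simplified, the availability figures are hypothetical placeholders rather than measured or published values, and failures are assumed to be independent.

```python
from math import prod

# Hypothetical per-component availabilities (fraction of time in service).
# These numbers are illustrative assumptions, not vendor or field data.
POWER_PATH = [
    ("utility feed", 0.9990),
    ("main panel breaker", 0.9999),
    ("automatic transfer switch", 0.9998),
    ("UPS", 0.9995),
    ("PDU branch breaker", 0.9999),
    ("rack power strip", 0.9999),
    ("equipment power cord", 0.9999),
]


def series_availability(components):
    """Availability of a chain in which every component must work.

    Assumes independent failures, so the individual values multiply.
    """
    return prod(availability for _, availability in components)


def weakest_link(components):
    """Return the (name, availability) pair with the lowest availability."""
    return min(components, key=lambda item: item[1])


if __name__ == "__main__":
    total = series_availability(POWER_PATH)
    name, value = weakest_link(POWER_PATH)
    print(f"End-to-end availability: {total:.4%}")
    print(f"Weakest component: {name} ({value:.4%})")
```

Because the values multiply, the end-to-end figure is always lower than that of the weakest single component, which is why every added device in an N path is another SPOF.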
Given how many single points of failure sit in that chain, it is surprising that these sites do not have power problems more often. Yet they are fairly reliable, and many smaller N-based data centers run for years without mishap.
Improving availability: N+1 power path
Certain strategies for reducing the vulnerabilities of a SPOF-laden N power system don’t require moving to a fully redundant Tier 4 type System + System (S+S) configuration, which consists of two complete, independent sets of everything in the power chain.
The next level after “N” is “N+1,” which (in the power path) typically refers to the UPS and generator arrays in larger sites. In N+1, typically three or more UPS units connect in parallel to share the load. The N+1 design also allows the UPS array to support the load even if a single UPS fails.
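The N+1 idea reduces to a simple capacity check: with one unit out of service, the remaining units must still carry the load. The following is a minimal sketch, assuming identical UPS units that share the load evenly; the function names and the kW figures in the usage lines are hypothetical.

```python
import math


def survives_single_failure(unit_kw, unit_count, load_kw):
    """True if the array still carries the load with one unit out of service."""
    if unit_count < 1:
        raise ValueError("unit_count must be at least 1")
    return (unit_count - 1) * unit_kw >= load_kw


def units_needed_n_plus_1(unit_kw, load_kw):
    """Smallest number of identical units that gives N+1 for the given load."""
    if unit_kw <= 0:
        raise ValueError("unit_kw must be positive")
    n = math.ceil(load_kw / unit_kw)  # units needed just to carry the load (N)
    return n + 1                      # plus one redundant unit


if __name__ == "__main__":
    # Hypothetical array: three 100 kW units.
    print(survives_single_failure(100, 3, 180))  # two units remain for 180 kW
    print(survives_single_failure(100, 3, 250))  # two units cannot carry 250 kW
    print(units_needed_n_plus_1(100, 180))
```

The check also shows how load growth can quietly turn an N+1 array into a plain N array: nothing fails, but the array no longer tolerates the loss of a unit.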
There can also be redundant A-B power distribution downstream of the UPS, in which separate A and B feeds to the cabinets are both supplied from a single UPS or an N+1 UPS array. This improves distribution redundancy without the cost of fully redundant 2N UPS systems.
Finally, in a completely redundant 2N A-B system, each power path is independently capable of sustaining the entire critical load. The 2N A-B is typical of Tier 4 designs. Everything is duplicated: two separate utility feeds, two sets of generators and related ATS gear. Each N+1 UPS array is autonomous from the other and feeds separate A-B power distribution to each rack -- a true S+S power system.
Theoretically, this design should ensure that if any portion of either side fails, the other can continue to support the critical load. However, with Tier 3 and 4 designs, several points of intersection between the two systems allow each side to transfer power to the other side for maintenance and allow continued operation of the critical load. These common tie points represent a potential point of failure: the systems are no longer totally autonomous.
If a problem occurs during a power transfer, whether from human error or electrical equipment failure, it can cause a total outage of the site. Ultimately, the complexity and added switchgear could reduce overall reliability instead of improving it.
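The effect of a shared tie point can be illustrated with basic availability arithmetic. The sketch below treats the two sides as independent parallel paths and the tie/transfer gear as a single element in series with both; all of the numbers are hypothetical, and real systems rarely have fully independent failures.

```python
def parallel(a, b):
    """Availability of two independent paths when either one is sufficient."""
    return 1 - (1 - a) * (1 - b)


def series(*parts):
    """Availability of elements that must all work (independent failures)."""
    result = 1.0
    for part in parts:
        result *= part
    return result


# Hypothetical values for illustration only.
SIDE = 0.999   # availability of one complete power path (A or B)
TIE = 0.9999   # chance the shared tie/transfer gear does not drop the site

ideal_2n = parallel(SIDE, SIDE)      # two fully autonomous sides
with_tie = series(ideal_2n, TIE)     # same design with a common tie point

if __name__ == "__main__":
    print(f"Single side:        {SIDE:.6f}")
    print(f"Ideal 2N:           {ideal_2n:.6f}")
    print(f"2N with shared tie: {with_tie:.6f}")
```

Under these assumptions the design with the tie point can never be more available than the tie gear itself, so a shared element that is less reliable than the duplicated paths becomes the limiting factor for the whole site.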
Common failure points: Rack power strip and branch circuit breakers
The most common power-related problem at the micro level, usually occurring at the rack, is an overload of the rack power strip circuit breaker or of the branch circuit breaker for the receptacle that feeds the power strip. Unless there is a metered power strip, branch circuit current monitoring or a manual measurement done at the panel, it’s impossible to keep track of the current being drawn on an existing branch circuit. The result is equivalent to playing Russian roulette whenever a new piece of IT equipment is plugged in. And even if you don’t immediately trip a breaker, it’s still possible that the circuit is near (or at) capacity. If the equipment increases its utilization under heavy computing loads, the power draw will also increase, causing the circuit breaker to open from an overload.
As mentioned, there is no guarantee of reliability, even with a 2N redundant power path. One area often misunderstood by IT personnel is the circuit loading rules for redundant A-B power at the rack level. The only way to ensure redundancy is to ensure that the combined current of the two feeds (A+B) does not exceed 80% of the rated value of an individual circuit.
For example, a 20 amp branch circuit should only be loaded to 16A. Typically, this means that in cabinets filled with equipment with dual power supplies, each circuit should not be loaded past 40% of the circuit rating (i.e. each 20 amp branch circuit should only be loaded to 8A). If those numbers are exceeded and one side is lost for any reason, whether from a local overload or an upstream failure, the entire load shifts to the surviving circuit and overloads it, causing a cascade failure that takes the rack down.
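The loading rule can be written as a small check. This is a minimal sketch, assuming dual-corded equipment whose full load shifts to the surviving feed when one side is lost; the function name and the returned field names are hypothetical.

```python
CONTINUOUS_LOAD_FACTOR = 0.80  # the 80% limit described in the text


def ab_feed_status(rating_amps, a_amps, b_amps):
    """Check whether an A-B circuit pair can survive the loss of either feed.

    rating_amps: breaker rating of each individual branch circuit
    a_amps, b_amps: measured current on the A and B feeds
    """
    limit = rating_amps * CONTINUOUS_LOAD_FACTOR
    combined = a_amps + b_amps  # load the surviving feed would have to carry
    return {
        "limit_amps": limit,
        "combined_amps": combined,
        "redundant": combined <= limit,
        "headroom_amps": limit - combined,
    }


if __name__ == "__main__":
    # The 20 amp example from the text: 8A per side is the ceiling.
    print(ab_feed_status(20, 8, 8))
    # 9A per side looks harmless on each feed but breaks redundancy.
    print(ab_feed_status(20, 9, 9))
```

Note that in the second case neither feed is anywhere near its own breaker rating, which is exactly why the loss of redundancy goes unnoticed without measurement.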
Moreover, with the advent of higher density power requirements of new 1U and blade servers, it’s more common to use three-phase power for each rack. The same rules apply to each of the three phases -- the overload of even one will trip out a three-phase breaker. Therefore, be aware that if any one of the phases exceeds the 40% level, there can be a loss of redundancy and a resultant cascade failure. The only way to avoid this is through real-time monitoring of every branch circuit, with threshold alerts that warn of potential overloads.
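A per-phase threshold alert of the kind described above might look like the following sketch. The phase names, the readings and the 90% warning threshold are assumptions made for illustration; a real deployment would take readings from metered power strips or branch circuit monitoring.

```python
def phase_alerts(rating_amps, feed_a, feed_b, warn_fraction=0.9):
    """Return (phase, severity, combined_amps) for phases at risk.

    feed_a / feed_b: mapping of phase name -> measured amps on that feed.
    Both mappings are assumed to contain the same phase names.
    warn_fraction: hypothetical early-warning level, as a fraction of the
    80% limit.
    """
    limit = rating_amps * 0.80  # combined A+B ceiling per phase
    alerts = []
    for phase in sorted(feed_a):
        combined = feed_a[phase] + feed_b[phase]
        if combined > limit:
            alerts.append((phase, "CRITICAL", combined))
        elif combined >= warn_fraction * limit:
            alerts.append((phase, "WARNING", combined))
    return alerts


if __name__ == "__main__":
    # Hypothetical readings for a 30 amp three-phase A-B pair.
    feed_a = {"L1": 9, "L2": 11, "L3": 13}
    feed_b = {"L1": 9, "L2": 11, "L3": 13}
    for phase, severity, amps in phase_alerts(30, feed_a, feed_b):
        print(f"{severity}: phase {phase} combined load {amps}A")
```

Checking each phase separately matters because an unbalanced rack can be well within limits on two phases while the third is already past the point where a feed failure would trip the breaker.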
ABOUT THE AUTHOR: Julius Neudorfer has been CTO and a founding principal of NAAT since its inception in 1987. He has designed and managed communications and data systems projects for both commercial clients and government customers.