Addressing data center power failure and its root causes

Data center downtime is most often a result of human error. But with proper capacity planning and battery selection, among other steps, a data center power failure can be averted.

This is the second part of our two-part series on data center power issues. The first part covered data center power path issues and common failure points.

The most common macro-level data center power failure stems from the battery supporting the uninterruptible power supply (UPS), especially in smaller sites that have a single UPS with a single string of batteries. Some of the new modular UPSes also have modular batteries (N+1 or more) to reduce exposure to a single point of failure (SPOF).

One common improvement in data center battery reliability involves feeding the UPS with two or more parallel strings of batteries. However, if one of the strings has a bad cell, it could cause the other string(s) to fail as well. And even with multiple parallel battery strings, it’s difficult to know their real condition without a cell monitoring system and/or regularly scheduled load tests. Without them, you won’t discover the problem until the moment of truth -- a loss of utility power, when the batteries must support the critical load.

Batteries need maintenance, testing and replacement more than any other power-related component. But unless there is an allotted budget for this work, it is often deferred or ignored.

To save money, organizations may decide to replace only a weak string of batteries. It’s OK to replace strings of batteries at different times as long as they aren’t directly connected in parallel on a common UPS’s DC bus; paralleling strings of different ages on a shared bus can cause problems.

The ideal design has independent strings that are each connected to separate UPS modules that form an N+1 UPS array. When configured like this, the failure of any one string will not affect another. For example, four 750 kVA UPS modules, each with its own battery string, give the system 3,000 kVA of installed capacity, but an N+1 rating of 2,250 kVA.

Even if a battery string fails, it wouldn’t affect the other strings, as the battery strings are not tied together. Ultimately, if you have two such autonomous UPS systems that feed separate A-B power distribution to each rack, your critical load is still rated at only 2,250 kVA, but with no apparent SPOF.
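
The N+1 arithmetic in the example above can be checked with a short sketch. The function name and its parameters are illustrative, and it assumes all modules in the array are identical.

```python
def n_plus_1_rating(module_kva, module_count):
    """Return (installed_kva, n_plus_1_kva) for an array of identical UPS modules.

    The N+1 rating is what the array can still carry after losing any one module.
    """
    if module_count < 2:
        raise ValueError("an N+1 array needs at least two modules")
    installed = module_kva * module_count
    rated = module_kva * (module_count - 1)
    return installed, rated


# The article's example: four 750 kVA modules.
installed, rated = n_plus_1_rating(750, 4)
print(installed, rated)  # 3000 2250
```

Adding a second autonomous system for the B feed doubles the installed capacity to 6,000 kVA but leaves the rated critical load at 2,250 kVA, because either system must be able to carry the whole load on its own.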

The upside is that theoretically, you should have no data center power downtime, unless there is a human error. The downside is that you have poor UPS efficiency. Under normal conditions, each UPS operates at the bottom of the load curve (37% or less). At low loads, most older UPSes operate at less than 75% efficiency, yet many sites accept these low efficiency levels because replacing a functional UPS is disruptive and expensive. Some newer UPSes can provide 85% to 90% efficiency down to a 30% load level. Under low-load conditions, some new modular UPSes can also scale down the number of active modules to improve efficiency.
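
To see where a figure of about 37% comes from, here is a rough sketch of the per-module load. It assumes the configuration described earlier (two autonomous systems of four 750 kVA modules each), a critical load equal to the full 2,250 kVA rating, and a load shared evenly across both feeds and all modules. The function name is illustrative.

```python
def module_load_fraction(critical_load_kva, module_kva, modules_per_system, systems=2):
    """Fraction of its capacity at which each UPS module runs in normal operation.

    Assumes the load is shared evenly across all systems and all modules.
    """
    total_installed = module_kva * modules_per_system * systems
    return critical_load_kva / total_installed


# 2,250 kVA spread over 2 systems x 4 modules x 750 kVA = 6,000 kVA installed.
print(f"{module_load_fraction(2250, 750, 4):.1%}")  # 37.5%
```

Any site running below its full rated load will sit even lower on the curve, which is why the efficiency at low load levels matters so much for this kind of design.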

Energy efficiency vs. redundancy
In today’s PUE-driven green data center environment, it’s difficult to speak about power without mentioning energy efficiency. The most energy-efficient system is the simplest “N,” with no redundancy, while the least efficient is the Tier 4 type that has the most redundancy. However, most companies will trade off energy efficiency to avoid a data center power failure -- it comes down to which type of "green" is more important.

In the long term, power requirements are rising across the board. Hopefully, the use of more energy-efficient IT equipment will help slow the increasing power demands, but capacity or expansion planning is still warranted. It’s irresponsible to ignore this issue until your site is at maximum capacity, as it takes 18 to 24 months to build a new data center and typically six to 12 months for a major capacity upgrade (assuming the expansion wasn’t modular or pre-designed).
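
One rough way to judge when planning has to start is to compare the remaining runway with the build lead time. The sketch below assumes steady compound annual growth, which real sites rarely follow exactly, and the load, capacity and growth figures are hypothetical. Only the 24-month lead time comes from the range quoted above.

```python
import math


def months_until_full(current_kw, capacity_kw, annual_growth):
    """Months until the load reaches capacity, assuming compound annual growth."""
    if current_kw >= capacity_kw:
        return 0.0
    if annual_growth <= 0:
        return math.inf  # load is flat or shrinking; capacity is never reached
    years = math.log(capacity_kw / current_kw) / math.log(1 + annual_growth)
    return years * 12


# Hypothetical site: 600 kW of load today, 1,000 kW of capacity, 15% annual growth.
runway = months_until_full(current_kw=600, capacity_kw=1000, annual_growth=0.15)
lead_time_months = 24  # upper end of the new-build range quoted above

print(f"Runway: {runway:.0f} months")
print("Start planning now" if runway <= lead_time_months else
      f"Planning must start within {runway - lead_time_months:.0f} months")
```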

Capacity design issues
One of the first questions asked when designing a data center is often, “What is the design target capacity over the expected lifetime of the data center?” For an enterprise data center, the design is usually based on the present capacity of the existing site or sites and on future growth plans, including the number of racks and the expected power requirements.

For a colocation site, the design is based on the needs of the customer base. In the past, a new site was usually sized and built for maximum power. As a result, new sites were nearly empty when they were first commissioned and occupied. Now, because power requirements keep growing, many new data centers are built modularly to lower initial costs. Building in a modular fashion also avoids underutilizing the UPS and cooling systems, which lets them operate more efficiently and allows the site to grow as needed.

The bottom line on data center power failure
Ultimately, most statistics show that human error, not hardware failure, is the root cause of data center power failure. Some data center power issues are predictable and avoidable, and if a problem occurs, everyone involved wants to know how and why it happened. In most cases, power issues can be traced back to the original design limitations, the level of redundancy chosen and restricted budgets.

ABOUT THE AUTHOR: Julius Neudorfer has been CTO and a founding principal of NAAT since its inception in 1987. He has designed and managed communications and data systems projects for both commercial clients and government customers.
