United, Delta and Southwest Airlines, along with a host of other well-known companies, have recently suffered major data center outages. Their highly publicized shutdowns have added yet another worry to the IT executive's list.
Many of these data center crashes were reportedly caused by electrical failures, which doesn't come as a big surprise. According to the Uptime Institute, engine generator systems are considered the primary data center power source, and local utility power is an economic alternative. Utility power disruptions, however, "are not considered a failure, but rather an expected operational condition for which the site must be prepared."
In other words, these power disruptions are likely to happen in the majority of enterprise data centers. For CIOs who have worried about this kind of thing for their whole careers, this might be an opportunity to fund some needed improvements. But be aware: simply adding redundancy is not, in itself, the answer.
The challenge of mission-critical data center power design
The greatest vulnerabilities in enterprise data centers are hidden flaws and installation errors. There is a world of difference between simply duplicating equipment and true mission-critical design. However, it's a painstaking process to examine data center power design for potential points of failure. Consider hiring a highly qualified, independent specialist to do this task for your organization.
You can continuously review new or renovated facilities through design and installation, but it's another matter to remedy vulnerabilities in an existing facility while it is in service. The correction work itself can expose the operation to failures. But even if you don't undertake risky corrections, know where the potential for failure lies to minimize the risk of a data center outage.
The false security of backup power
One of the best-documented power-related outages in history happened at 365 Main in San Francisco. The company had redundant uninterruptible power supply (UPS) systems and generators to meet its customers' expectations of constant availability. But, on July 24, 2007, Murphy's Law paid an unwelcome visit.
First, there was a power failure. The data center's UPS maintained power until the generators started. But, soon after, the generators shut down one by one, causing a data center outage that affected a long list of the company's high-profile customers for hours.
Although the data center had a solid power system design, data center operators hadn't exposed the issue -- firmware in the generator control -- through commissioning tests. Rather than test repeated failures and generator restarts under load, administrators relied on the false security of backup power and redundancy.
Many modern UPS systems can signal servers to start a controlled shutdown when battery life has dropped below a preset threshold. While not ideal, it's far better to implement this capability than to experience a hard crash.
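The threshold logic described above can be illustrated with a minimal sketch. It is not tied to any real UPS product or monitoring API: the `UpsStatus` fields, the 300-second default threshold and the server priority numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class UpsStatus:
    """Hypothetical snapshot of UPS state; field names are illustrative."""
    on_battery: bool          # True once input power has been lost
    runtime_remaining_s: int  # estimated seconds of battery left


def should_start_shutdown(status: UpsStatus, threshold_s: int = 300) -> bool:
    """True when the UPS is on battery and runtime is at or below the threshold."""
    return status.on_battery and status.runtime_remaining_s <= threshold_s


def shutdown_order(servers: dict[str, int]) -> list[str]:
    """Order servers for shutdown, least critical (lowest number) first."""
    return sorted(servers, key=lambda name: servers[name])


if __name__ == "__main__":
    # Assumed example values: on battery with four minutes of runtime left.
    status = UpsStatus(on_battery=True, runtime_remaining_s=240)
    servers = {"db-primary": 3, "batch-worker": 1, "web-frontend": 2}
    if should_start_shutdown(status):
        for name in shutdown_order(servers):
            print(f"would shut down {name}")
```

Shutting down the least critical systems first is one possible design choice; a real deployment would set the threshold from how long the servers actually need to stop cleanly.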
If you can fix a vulnerability, make a detailed plan for how you will do it, as well as how you would handle the potential failures that the remediation process could cause. For example, if an admin sets off a fire alarm, there should be someone on hand who can deal with the condition and avoid the dump of a gas fire protection system and an automatic shutdown. And, if the plan is to turn off the fire alarm during the work, notify the facility, security and fire departments, and make sure someone stands by with a portable extinguisher. If there is potential for a cooling failure, plan to initiate selective shutdowns to reduce the heat load and place portable air conditioners as a precaution.
Minimize data center outage risks with commissioning
Even if a data center power design is perfect, there could still be errors that admins can only identify through commissioning. The commissioning agent not only looks at the correctness of the installation and verifies the proper settings and adjustments, but also attempts to break your system. To complete a test, an agent uses a set of scripts, runs infrastructure systems under simulated conditions and shuts down various elements as if they have failed.
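As a paper exercise only, the idea of shutting down elements "as if they have failed" can be sketched as a toy capacity check: take each power unit offline in turn and see whether the remaining units still cover the load. The unit names and kilowatt figures below are hypothetical, and the check says nothing about control firmware, transfer timing or installation errors, which is why real commissioning tests under load are still needed.

```python
def surviving_capacity_kw(units: dict[str, float], failed: set[str]) -> float:
    """Total capacity of the units that have not been marked as failed."""
    return sum(kw for name, kw in units.items() if name not in failed)


def single_failure_gaps(units: dict[str, float], load_kw: float) -> list[str]:
    """Names of units whose loss, on its own, would leave the load uncovered."""
    return [
        name
        for name in units
        if surviving_capacity_kw(units, {name}) < load_kw
    ]


if __name__ == "__main__":
    # Assumed example: three 500 kW generators carrying a 900 kW load.
    generators = {"gen-a": 500.0, "gen-b": 500.0, "gen-c": 500.0}
    gaps = single_failure_gaps(generators, load_kw=900.0)
    print("single points of failure:", gaps or "none found")
```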
The commissioning process also includes a total power shutdown under load, and might introduce additional failures in individual pieces of equipment, depending on the level of availability used for design intent. The process should also identify unclear markings and unprotected or hard-to-reach critical controls, such as an emergency power off button without a protective cover and alarm.
For a new facility, begin commissioning in the design development stage. If you use an independent commissioning agent, make sure the agent identifies and remedies the majority of the potential flaws before you complete the project design. This not only reduces the chance of a data center outage, but also avoids the potential for massive change order costs.
In existing data centers, it is too risky to do multiple shutdowns to look for problems, which means that full commissioning is impractical. In this case, consider a data center audit, which involves a combination of design review and on-site measurement, testing and inspection of critical systems. While it won't expose every potential condition, it can uncover the vast majority of vulnerabilities and provide a path to remediation when practical.