Editor's note: In part one of this two-part article on data center disaster recovery -- which is based on real-world experiences -- an administrator woke up at 3 a.m. to manage an electrical fire, a blaring alarm and two exploded buses in his data center. Through this disaster recovery report, find out what went wrong and how the disaster could've been avoided.
It took a full day and half the night to get IT operations back in business at our DR site, and that was only for our highest-priority systems. With a portable air conditioner, a temporary line and a small uninterruptible power supply, we were able to restore the phones. It would take weeks to replace the burned-up guts of the big switchboard, but we also had to know what went wrong so it wouldn't happen again.
Below are the six failure points we discovered, and then noted in our disaster recovery report.
#1: Air conditioners
While we had extra air conditioners, most of them were powered from one switchboard. Only the two redundant units and the uninterruptible power supply (UPS) room unit were on a different power source -- an idea that the designer thought was logical, but that actually negated the redundancy we had paid for. The trip current on the main circuit breaker hadn't been set correctly, and the engineers and contractors had not coordinated the breakers. So when one air conditioner developed a problem, the main breaker tripped instead of the single branch breaker, and we lost 80% of the cooling. An infrared scan had been done on the switchboard, but with only some of the air conditioners running. Without a full load, the bus didn't seriously overheat, so the loose connection that ultimately exploded wasn't revealed in testing.
The second switchboard was in the same electrical cabinet as the first one -- another decision that was made to meet the budget -- so the two power buses were right next to each other. When one exploded, it destroyed the one next to it, and we lost everything.
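The 80% cooling loss described above is the kind of exposure a simple inventory check can surface before an incident. The following is a minimal sketch, not a tool we used: the unit names, switchboard names and the `worst_case_cooling_loss` function are hypothetical, and it assumes every unit contributes equal cooling capacity.

```python
from collections import Counter


def worst_case_cooling_loss(unit_sources):
    """Find the power source whose failure removes the most cooling units.

    unit_sources maps a cooling unit name to the power source feeding it.
    Returns (source_name, fraction_of_units_lost).
    """
    counts = Counter(unit_sources.values())
    source, units_lost = counts.most_common(1)[0]
    return source, units_lost / len(unit_sources)


# Hypothetical inventory: names are made up for illustration.
# Eight units share one switchboard; the two redundant units are on another.
units = {f"AC-{n}": "switchboard-A" for n in range(1, 9)}
units.update({"AC-9": "switchboard-B", "AC-10": "switchboard-B"})

source, fraction = worst_case_cooling_loss(units)
print(f"Losing {source} removes {fraction:.0%} of cooling units")
```

A result far above what the redundant units can cover suggests that the redundancy exists on paper only, which is what happened to us.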
#2: Data center design
Another item we examined in our disaster recovery report was our data center design. Since our generator was for the whole building, the transfer switch was in the basement, ahead of the switchboard. Because incoming power to the building never failed, the transfer switch had nothing to sense, and the destroyed switchboard would have stopped us anyway. With a shared generator, we should've had multiple automatic transfer switches with the data center on its own switch. That way, if the power were to go out in the data center and the rest of the building was not affected, the generator would start and the data center would receive emergency power.
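The difference between one building-wide transfer switch and a dedicated data center switch can be sketched in a few lines. This is an illustrative model only, assuming each switch senses nothing but its own normal feed; the `TransferSwitch` class, the `evaluate` function and the switch names are hypothetical and do not represent any vendor's control logic.

```python
from dataclasses import dataclass


@dataclass
class TransferSwitch:
    name: str
    normal_source_ok: bool  # is voltage present on this switch's own normal feed?

    def wants_generator(self) -> bool:
        # A switch calls for the generator only when its own feed is lost.
        return not self.normal_source_ok


def evaluate(switches):
    """Return (generator_running, {switch name: source in use})."""
    generator_running = any(s.wants_generator() for s in switches)
    sources = {
        s.name: "generator" if s.wants_generator() else "utility"
        for s in switches
    }
    return generator_running, sources


# One switch at the building entrance: utility power is fine there,
# so a fault further downstream never starts the generator.
print(evaluate([TransferSwitch("building_main", True)]))

# Separate switches: the data center feed is lost, the rest of the building is fine.
print(evaluate([
    TransferSwitch("building", True),
    TransferSwitch("data_center", False),
]))
```

In the second case only the data center switch transfers, so the generator starts for the load that needs it while the rest of the building stays on utility power.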
We objected to the electrical room being accessible from the data center because we didn't want electricians coming through our compute area. We were ignored. With the electrical room air conditioner still running, and with the data center units shut down, the electrical room was at positive pressure. When the door opened, heat and smoke from the explosion poured out.
#3: Smoke detector issues
Our early warning smoke detector picked it up instantly, but it also controlled the gas fire suppression system, which wasn't set correctly. So instead of just sounding an alert, it triggered a gas dump as soon as it sensed smoke. The smoke particles also contaminated the filters of all the equipment that was still running. The only good news was that the air conditioner in the electrical room was on the same circuit as the two redundant units, so it kept running. Without cooling, the UPS would have quickly overheated and shut down before the computer room. The UPS is supposed to go into bypass and maintain street power to the computers, but testing found the bypass wasn't wired correctly. With only one air conditioner, we were vulnerable in two ways.
Our UPS could do an orderly server shutdown through the network, but we never hooked that up because of other priorities. We also learned we didn't really need that Emergency Power Off button, since we had no raised floor and weren't using containment. The engineers specified the most dangerous button in the industry "because every data center has one," but didn't include any cover or protection to prevent premature use.
#5: DCIM alerts
When I asked that the data center infrastructure management (DCIM) tool alert me to only major alarms, the limit was based on ASHRAE's allowable temperature, which was higher than our data center's actual parameters for cooling temperature. Since our cooling was set at the previously recommended temperature -- much colder than it should have been -- the failure came well before the alert, costing valuable disaster mitigation time.
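To illustrate how much warning time a threshold choice can cost, here is a small sketch that compares an alert set at an allowable upper limit with one set a few degrees above the room's actual setpoint. All numbers are illustrative assumptions (a 20 C setpoint, a 1 C per minute rise after cooling is lost, a 32 C upper limit); none are taken from our incident or quoted from ASHRAE, and the function names are hypothetical rather than part of any DCIM product.

```python
def first_alert_minute(readings_c, threshold_c):
    """Return the minute (list index) of the first reading at or above the threshold."""
    for minute, temp_c in enumerate(readings_c):
        if temp_c >= threshold_c:
            return minute
    return None  # threshold never reached


def setpoint_threshold(setpoint_c, margin_c, allowable_max_c):
    """Alert a fixed margin above the setpoint, but never above the allowable limit."""
    return min(setpoint_c + margin_c, allowable_max_c)


# Assumed scenario: room held at 20 C, rising 1 C per minute once cooling fails.
readings = [20.0 + minute for minute in range(15)]
allowable_max = 32.0  # illustrative upper limit

tight = setpoint_threshold(20.0, 3.0, allowable_max)
minute_allowable = first_alert_minute(readings, allowable_max)
minute_setpoint = first_alert_minute(readings, tight)

print(f"Alert at allowable limit: minute {minute_allowable}")
print(f"Alert at setpoint + margin: minute {minute_setpoint}")
print(f"Warning time gained: {minute_allowable - minute_setpoint} minutes")
```

Under these assumptions the setpoint-based alert fires nine minutes earlier. The colder the room runs relative to the alert limit, the larger that gap becomes.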
DCIM also should have shown that eight of our 10 air conditioners failed and what had caused it, but we didn't purchase the mechanical equipment module for the DCIM system, and therefore weren't alerted about cooling unit failures. This was also noted in our disaster recovery report.
#6: Lack of training and certification
We certainly needed more DCIM training. The GUI was complex and provided so much detailed data that it was difficult to navigate. We tried to revise the GUI so we could see the big picture more easily, but it was not that configurable.
IT should have been included in the selection of this important system, and should have tested it the same way we benchmark other software before it's purchased.
We were definitely not Tier III, and a real certification would have revealed all these vulnerabilities. Our company cut too many corners in contracting our backup and DR site, but the failure to develop and test a real plan was all mine. As part of the disaster recovery report, we took a long, hard look at our DR site contract, and were able to make improvements. We also got help building a DR plan, which we now test twice a year by actually transferring operations.