
A disaster recovery report uncovers the errors that led to chaos

In this disaster recovery report, discover the failure points that led to catastrophe and learn how to make decisions that will lead to data center success.

Editor's note: In part one of this two-part article on data center disaster recovery -- which is based on real-world experiences -- an administrator woke up at 3 a.m. to manage an electrical fire, a blaring alarm and two exploded buses in his data center. Through this disaster recovery report, find out what went wrong and how the disaster could've been avoided.

It took a full day and half the night to get IT operations back in business at our DR site, and that was only for our highest-priority systems. With a portable air conditioner, a temporary power line and a small uninterruptible power supply, we were able to restore the phones. It would take weeks to replace the burned-up guts of the big switchboard, but we also had to know what went wrong so it wouldn't happen again.

Below are the six failure points we discovered, and then noted in our disaster recovery report.

#1: Air conditioners

While we had extra air conditioners, most of them were powered from one switchboard. Only the two redundant units and the uninterruptible power supply (UPS) room unit were on a different power source -- an arrangement the designer thought was logical, but that actually negated the redundancy we had paid for. The trip current on the main circuit breaker hadn't been set correctly, and the engineers and contractors had not coordinated the breakers. So when one air conditioner developed a problem, the main breaker tripped instead of the single branch breaker, and we lost 80% of our cooling. An infrared scan had been done on the switchboard, but with only some of the air conditioners running. Without a full load, the bus didn't seriously overheat, so the loose connection that ultimately exploded wasn't revealed in testing.

The second switchboard was in the same electrical cabinet as the first one -- another decision that was made to meet the budget -- so the two power buses were right next to each other. When one exploded, it destroyed the one next to it, and we lost everything.

#2: Data center design

Another item we examined in our disaster recovery report was our data center design. Since our generator served the whole building, the transfer switch was in the basement, ahead of the switchboard. It never sensed a power failure because incoming utility power was fine -- the outage was downstream -- though the destroyed switchboard would have stopped us anyway. With a shared generator, we should have had multiple automatic transfer switches, with the data center on its own switch. That way, if power went out in the data center while the rest of the building was unaffected, the generator would start and the data center would receive emergency power.

We objected to the electrical room being accessible from the data center because we didn't want electricians coming through our compute area. We were ignored. With the electrical room air conditioner still running, and with the data center units shut down, the electrical room was at positive pressure. When the door opened, heat and smoke from the explosion poured out.

#3: Smoke detector issues

Our early warning smoke detector picked up the smoke instantly, but it also controlled the gas fire suppression system, which wasn't configured correctly. So instead of just sounding an alert, it dumped the gas the moment it sensed smoke. The smoke particles also contaminated the filters of all the equipment that was still running. The only good news was that the air conditioner in the electrical room was on the same circuit as the two redundant units, so it kept running. Without cooling, the UPS would have quickly overheated and shut down before the computer room did. The UPS should then have gone into bypass and maintained street power to the computers, but testing found the bypass wasn't wired correctly. With only one air conditioner, we were vulnerable in two ways.

#4: Prioritization

Our UPS could do an orderly server shutdown through the network, but we never hooked that up because of other priorities. We also learned we didn't really need that Emergency Power Off button, since we had no raised floor and weren't using containment. The engineers specified the most dangerous button in the industry "because every data center has one," but didn't include any cover or protection to prevent accidental use.
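The network-triggered shutdown we never hooked up is simple to sketch. Here is a minimal Python outline of the decision logic; the battery thresholds and the `systemctl poweroff` command are illustrative assumptions, not details from the incident -- real values depend on the UPS's runtime at load:

```python
import subprocess

# Illustrative thresholds -- tune to your UPS's actual runtime at load.
LOW_BATTERY_PCT = 30        # begin orderly shutdown at this charge level
CRITICAL_RUNTIME_SEC = 300  # or when estimated runtime drops below 5 minutes

def should_shut_down(on_battery: bool, charge_pct: int, runtime_sec: int) -> bool:
    """Decide whether to begin an orderly shutdown.

    Only act when the UPS is actually on battery; a low charge reading
    while on utility power is not an emergency.
    """
    if not on_battery:
        return False
    return charge_pct <= LOW_BATTERY_PCT or runtime_sec <= CRITICAL_RUNTIME_SEC

def orderly_shutdown() -> None:
    # Flush filesystems and power off cleanly instead of dropping power.
    subprocess.run(["systemctl", "poweroff"], check=False)
```

In practice a monitoring daemon (Network UPS Tools' `upsmon` is the common choice) polls the UPS and applies rules like these, but even a cron-driven script beats what we had, which was nothing.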


#5: DCIM alerts

When I asked that the data center infrastructure management (DCIM) tool alert me only to major alarms, the alarm limit was set at ASHRAE's allowable temperature, which was well above our data center's actual cooling setpoint. Since our cooling was set at the previously recommended temperature -- much colder than it should have been -- the failure came well before the alert, costing valuable disaster mitigation time.
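The fix is to pin the alarm limit to the room's own setpoint rather than the standard's allowable ceiling. A minimal sketch, where the 5°F margin is an illustrative assumption and the ASHRAE figure should be checked against the current thermal guidelines for your equipment class:

```python
# Temperatures in °F. ASHRAE's "allowable" ceiling for class A1 equipment
# is roughly 89.6°F (32°C) inlet; the "recommended" envelope tops out
# around 80.6°F (27°C). Verify against the current guidelines.
ASHRAE_ALLOWABLE_MAX_F = 89.6

def alert_threshold(site_setpoint_f: float, margin_f: float = 5.0) -> float:
    """Alarm a fixed margin above the room's actual setpoint,
    never waiting for the standard's allowable maximum."""
    return min(site_setpoint_f + margin_f, ASHRAE_ALLOWABLE_MAX_F)

def should_alert(measured_f: float, site_setpoint_f: float) -> bool:
    return measured_f >= alert_threshold(site_setpoint_f)
```

With a 68°F setpoint this alarms at 73°F, hours of thermal headroom before the room ever approaches the allowable limit the DCIM tool had been configured to wait for.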

DCIM also should have shown that eight of our 10 air conditioners had failed and what caused the failures, but we hadn't purchased the mechanical equipment module for the DCIM system, so we weren't alerted about cooling unit failures. This, too, was noted in our disaster recovery report.

#6: Lack of training and certification

We certainly needed more DCIM training. The GUI was complex and provided so much detailed data that it was difficult to navigate; we tried to revise it so we could see the big picture more easily, but it was not that configurable.

IT should have been included in the selection of this important system, and should have tested it the way we benchmark other software before it's purchased.

We were definitely not Tier III, and a real certification would have revealed all these vulnerabilities. Our company cut too many corners in contracting our backup and DR site, but the failure to develop and test a real plan was all mine. As part of the disaster recovery report, we took a long, hard look at our DR site contract, and were able to make improvements. We also got help building a DR plan, which we now test twice a year by actually transferring operations.

Next Steps

Cost is more important than efficiency in disaster recovery

Challenges of ensuring key data isn't lost in a disaster

Top data center operational availability risks

Dig Deeper on Data center design and facilities

Join the conversation



What do you look for when assessing the effectiveness of your DR plan?
So where's the part where we talk about who got fired? Did the IT guys take the fall, or the finance guys? 

My apologies for not replying to this sooner.  It has just come to my attention.

First, this is a “story”.  I was asked to write it to highlight the many things that can go wrong when designs, installations, operations, and management take place with insufficient knowledge, and/or out-of-sync with each other.  Hopefully, no single installation incorporates all these problems and mistakes at once, but I’ve seen some that come close.  You’d apparently be surprised, but there are some amazingly bad situations out there.  This is also why I’ve taught the on-line course in Data Center Facilities Management for Marist College the last nine years.  IT people, who today have so much responsibility for managing a complex physical infrastructure in addition to the demanding computing systems, need a lot more infrastructure knowledge to ensure that their operations run smoothly and efficiently in all respects.  Unfortunately, the majority of design engineers and facilities professionals are out-of-date, by years if not by decades, when it comes to data center power, cooling and other infrastructure systems.  Combine that with “territorialism” and poor budget decisions by “bean counter” mentality managements, and the results really can be truly disastrous. 

As stated in the forward to the article, every situation in the story has actually occurred in either data centers I have audited, data centers I have re-designed, or data centers my students operate.  (Personal experiences related to the class is part of the inter-active learning experience.)  If you don’t believe these kinds of “disasters waiting to happen” are all too common, just look at the colo and cloud hosting site failures reported virtually weekly.  To quote the bible, “I am legion, for we are many”.  Many enterprise data centers are even more vulnerable than the colo’s because they don’t have the leverage of hosting “high profile customers” to get what they need.  Further, their failures aren’t widely reported since major internet providers aren’t publicly affected.  We’re incredibly dependent on our IT networks and systems, but most of them have grown like Topsy, and the rush to provide services and keep up with demand has resulted in shortcuts more often than we might like to admit. 

So who is this POV guy?  Hopefully, not any one single person.  But I can assure you that he or she is a composite of many people out there who, through either ignorance, territorial restriction between Facilities and Operations, or lack of needed resources, could easily experience any one or combination of the problems included in this story.

I would also refer you to my article “Avoiding Single Point of Failure Design Flaws.” It describes a litany of actual fatal data center design errors made by well-meaning engineers, all also taken from first-hand experience. The examples in that article are not “composites.” They are 100% real and unvarnished. I can virtually guarantee that there are many installations out there with these or very similar design mistakes that were not caught in advance through a peer review or certification process. I would also be willing to bet that many of the flaws related in the “story” exist in any number of operations today.