Author's note: This story is based on true events. Every detail comes from an actual occurrence that my students or I have witnessed first-hand.
It was 3 a.m. and my smartphone gave me an alert. I had been getting alarms ten times a night since we installed the new data center infrastructure management (DCIM) system, but none had proven serious. This time, the temperature in our main data center was within the American Society of Heating, Refrigerating and Air-Conditioning Engineers' (ASHRAE) allowable range -- but over the company's own operating limit, and rising.
Finance had set our new data center budget before anyone had established criteria or designs, and we were constantly compromising our disaster recovery policies to stay within it. I had insisted on extra air conditioners and redundancy in our modular uninterruptible power supply (UPS) system. Despite everything, the designers had assured us we met Uptime Institute Tier III requirements, and that there was no reason to spend money we didn't have to get the facility certified.
I called security. They were getting the same alarm, but no one was available to check it out. After awakening a facilities manager who said he would get someone there, I got dressed and headed to our building.
Powerless and under pressure
An hour later, I walked into a data center that felt like the Sahara. Lights were flashing everywhere, server fans were at full speed and all but two of our 10 air conditioners were dead. Some servers were already shutting themselves down. The disaster recovery policies I thought we had put in place were beginning to crumble.
The DCIM display was confusing, and the graphical user interface made little sense past the first menu. A table of numbers showed that the temperature had been climbing for several hours. Why hadn't I gotten an alert earlier? I found an electrical diagram that looked like hieroglyphics, but I could tell it was for our UPS systems. I knew where to find the panels for our server cabinets, but had no idea about the mechanical controls. There were electrical panels on the walls, but the labels made no sense -- "LBTA-3" could have been anything -- and the panel doors were locked anyway.
Once the facilities worker arrived, he confirmed what I already knew: There was no power to most of our units. He checked the breakers he could locate and found nothing wrong, but we couldn't go any further without an electrician. This required another call to the facilities manager, and another wait for the electrician to arrive.
One by one, I shut down servers to avoid catastrophic crashes. Soon the electrician arrived, and he knew where the electrical panels were -- in a room behind locked doors that we weren't able to access without his special key. He opened the door, and it was cool inside. This was also the UPS room, and its dedicated air conditioner was running. A single air conditioner meant our redundant UPS was vulnerable to non-redundant cooling.
Things heat up
Once the electrician reset the tripped main breaker, the air conditioners started coming back to life -- but not for long. Flames crept through the small cracks around the panels of the electrical box. Our aspirating smoke detection system was supposed to give us early warning so we could resolve a problem before the main fire protection system dumped gas. It had quickly picked up the smoke drifting into the data center, and ear-splitting alarms were going off. But instead of an early warning, the main system was already starting its countdown to gas release. There was no fire inside the data center itself, so I hit the override button, but that only restarted the countdown. Firemen appeared at the door. The problem was with the air conditioner power, not the UPS or server power, but they immediately reached for the big red emergency power off (EPO) button. I yelled, but they hit it anyway. A few seconds later the gas dumped. The electrician headed for the basement to cut main power to the room, and the firemen poured foam on the burning box.
A cold reception at the DR site
When our overseas offices called wondering why they couldn't reach our office phones, I assured them that, under our disaster recovery policies, calls would be routed to our DR site. However, although we had contracted for the site, we had never actually transferred operations -- we hadn't moved our IT infrastructure, either physically or virtually, to the DR site. When I called the DR provider to declare an emergency, they informed me that the site wasn't maintained hot and ready to go. We had been doing daily data backups to the DR data center, but it was going to take time to get our user operations transferred. And we were going to need our own staff there to do it.
Back in the electrical room, the fire was out, the power was shut down and we were working under emergency lighting. As the electrician removed panels from the switchboard, he discovered that one bus had exploded and taken out the second bus, too. My only option was to get our IT services back in business at the DR site -- and to reevaluate our disaster recovery plan.
Industry studies have estimated that up to 75% of data center failures involve human error, which means we can learn from the experiences of others -- including the incident described above.
You've learned about this data center catastrophe, but what about the mistakes that led to it? In the second part of this two-part series, we'll take you through the disaster recovery report that uncovers the errors -- and how you can prevent data center chaos from unfolding.