Someone walks into your data center. On the way through, they open the door of a big PDU and throw the main breaker, or go up to an air conditioner and turn it off. Calmly, you stroll over to the affected unit, restore it and go about your daily business. Is this how you would react if something like this occurred in your data center?
If not, you're unprepared for the kinds of things that can happen any day, at any time. Worse yet, you're unable to properly maintain your facility so as to minimize the chance of unplanned failures. It might be that something simply died. It might be that someone was meddling where they shouldn't. If you take testing and preparedness seriously, it might even be a consultant or compliance officer carrying out a random verification of sustainability. (Yes, that really is done in some places.) The cause is immaterial. It's the effect that counts.
All equipment fails. It's not a matter of "if," but "when." And all of your infrastructure is doomed to early -- and even to recurrent -- failure if it can't be easily and regularly maintained. Service contracts are an important and very necessary first step. When properly written, they guarantee not only timely repair when something goes AWOL, but a regular inspection and preventive maintenance schedule as well. Equipment needs to be accessible for this to occur -- proper service clearances, not blocked by stored hardware, legal distances in front of energized parts, etc. But if there's anything that can't be shut down for hours or perhaps even days without jeopardizing your operation, there's no way to properly maintain it, with or without a service contract, and you're walking a very thin tightrope without a net.
The principle here is something we call "concurrent maintainability." Very simply, it means that anything in your data center can be shut down and kept down for a period of time, without directly affecting ongoing processing. Obviously, during this period you will lose all or part of your redundancy in some part of your installation, so maintenance shutdowns bring with them a certain level of exposure. But if everything has been well maintained, the statistical chance of a simultaneous failure practically drops off the charts. Further, to actually bring down a facility that is designed and installed for concurrent maintainability most likely takes not two, but at least three sequential events. Pilots are taught the same thing about the accidents that end in unintentional contact with the ground. One or two things won't do it. It always takes at least three failures and/or mistakes.
In the case of a data center, a catastrophe might require:
- the initial maintenance shutdown,
- a second UPS failure, major power failure or the like, and
- failure of the bypass, generators or whatever third level of protection is built into the system being maintained.
And, in most cases, a human error will contribute to the sequence. (See my first blog article at the bottom of this page.)
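To put rough numbers on that three-event sequence, here is a minimal sketch. The probabilities are hypothetical placeholders, not measured figures, and the function name `outage_probability` is invented for illustration. It also assumes the events are independent, which is optimistic: human error tends to link them.

```python
# Hypothetical, illustrative probabilities; none of these figures come from real equipment data.
P_SECOND_FAILURE = 0.01  # a second UPS or utility failure occurs during the maintenance window
P_BACKUP_FAILURE = 0.02  # the bypass or generators fail when they are called on


def outage_probability(p_second: float, p_third: float) -> float:
    """Chance that a planned shutdown escalates to an outage.

    The maintenance shutdown itself is deliberate (probability 1), so the
    result is the product of the two remaining events, assumed independent.
    """
    return p_second * p_third


p = outage_probability(P_SECOND_FAILURE, P_BACKUP_FAILURE)
print(f"Chance of an outage per maintenance window: {p:.4%}")
```

With these made-up inputs the answer works out to roughly 1 in 5,000 maintenance windows. Plug in figures from your own service history, and treat the result as a lower bound whenever a single mistake could trigger more than one of the events.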
Not every data center justifies total "Tier 4" redundancy, but there are very few businesses today that can survive a very long outage in their processing. If the thought of a single-item shutdown gives you nightmares, perhaps you should be showing this article to your management and suggesting that some investment in a little more robustness might be a worthwhile business decision.
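One way to find out which single-item shutdowns deserve the nightmares is to list every load, the alternative paths that feed it, and the components on each path, then ask what goes dark when any one component is taken down. The sketch below does that for a made-up topology. The component names, load names and the function `single_points_of_failure` are all hypothetical, and a real facility would need cooling and controls modeled as well as power.

```python
# Hypothetical topology: each load maps to its alternative supply paths,
# and each path is the set of components that must all be running.
supply_paths = {
    "rack_row_1": [{"ups_a", "pdu_a"}, {"ups_b", "pdu_b"}],
    "rack_row_2": [{"ups_a", "pdu_c"}],
}


def single_points_of_failure(paths_by_load):
    """Return {component: [loads that go dark if that component alone is shut down]}."""
    components = set().union(
        *(path for paths in paths_by_load.values() for path in paths)
    )
    report = {}
    for component in sorted(components):
        # A load goes dark only if every one of its paths needs this component.
        dark = [
            load
            for load, paths in paths_by_load.items()
            if all(component in path for path in paths)
        ]
        if dark:
            report[component] = dark
    return report


for component, loads in single_points_of_failure(supply_paths).items():
    print(f"Shutting down {component} drops: {', '.join(loads)}")
```

In this example the first row survives any single shutdown because it has two fully separate paths, while the second row depends on one UPS and one PDU. An empty report is what a concurrently maintainable design should produce.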