The complete answer to this question requires a book, and there are several on the market. We can make the following general suggestions for major items to be considered in order to minimize infrastructure down time. However, keep in mind that a real "disaster," by definition, means down time, and "disaster recovery" needs to be a well-planned, orderly, and regularly practiced process to get back in business. Avoiding down time means minimizing the effects of controllable problems to avoid an actual disaster.
- Have thorough, consistent, easily understood documentation on every device in your data center,
both in print and attached to each machine. (If the data center is down, you can't pull it up
electronically.) List what is running, who the users are, who is responsible for it, and how to
shut it down and bring it back up.
- Have a well thought-out "load shedding" plan. Consider color-coding the machine tags for low,
medium, or high priority so that it is instantly clear what gets shut off first and what gets
brought back up first. Notify users in writing, in advance, that their service may be abruptly
terminated in the event of a problem; this conserves power and UPS run time for the most critical
applications. Likewise notify those whose service will be restored first. Settling priorities in
advance avoids wasting phone time explaining them during an outage.
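The color-coded priority tags can also be mirrored in software so the shed and restore sequences are unambiguous. A minimal sketch (the inventory, hostnames, and priority labels are hypothetical, not from the original answer) that orders machines for shutdown, lowest priority first, and for restart in the reverse order:

```python
# Hypothetical inventory: each machine carries the same low/medium/high
# tag as the color-coded label on the physical hardware.
MACHINES = [
    {"host": "payroll-db", "priority": "high"},
    {"host": "test-bench", "priority": "low"},
    {"host": "mail-relay", "priority": "medium"},
]

# Lower rank sheds first; higher rank is restored first.
RANK = {"low": 0, "medium": 1, "high": 2}

def shed_order(machines):
    """Hosts in the order they should be shut off (low priority first)."""
    return [m["host"] for m in sorted(machines, key=lambda m: RANK[m["priority"]])]

def restore_order(machines):
    """Hosts in the order they should be brought back up (high priority first)."""
    return list(reversed(shed_order(machines)))

print(shed_order(MACHINES))     # low-priority hosts listed first
print(restore_order(MACHINES))  # high-priority hosts listed first
```

Keeping the ordering in one place like this means the printed runbook and any shutdown scripting can't drift apart.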
- Make sure you have good battery monitoring capability on your UPS system. Batteries are the
most likely things to fail under stress, and you need to be alerted in advance if one is going bad.
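Battery monitoring is easy to automate if your UPS is already visible to a tool such as Network UPS Tools. A sketch of an alert check that parses `upsc`-style `key: value` output; the thresholds and the UPS name are illustrative assumptions, not vendor recommendations:

```python
# Illustrative thresholds -- tune these to your UPS vendor's guidance.
MIN_CHARGE_PCT = 80
MIN_RUNTIME_SEC = 600

def parse_upsc(output):
    """Parse `upsc`-style 'key: value' lines into a dict."""
    vars = {}
    for line in output.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            vars[key.strip()] = value.strip()
    return vars

def battery_warnings(vars):
    """Return warning strings for a failing or stressed battery."""
    warnings = []
    if float(vars.get("battery.charge", 100)) < MIN_CHARGE_PCT:
        warnings.append("battery charge below %d%%" % MIN_CHARGE_PCT)
    if float(vars.get("battery.runtime", 1e9)) < MIN_RUNTIME_SEC:
        warnings.append("runtime below %d seconds" % MIN_RUNTIME_SEC)
    if "RB" in vars.get("ups.status", "").split():
        warnings.append("UPS reports replace-battery flag")
    return warnings

# In production you would feed this live data, e.g. the stdout of
# subprocess.run(["upsc", "rack-ups"], ...) where "rack-ups" is the
# (hypothetical) UPS name configured in NUT. Here, canned sample output:
SAMPLE = "battery.charge: 65\nbattery.runtime: 300\nups.status: OL RB\n"
for w in battery_warnings(parse_upsc(SAMPLE)):
    print("WARNING:", w)
```

Run from cron or your monitoring system, a check like this surfaces a degrading battery before an outage forces it to carry real load.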
- Make sure your air conditioning can be either maintained or quickly brought back on-line in the
event of power interruption. High-density blade servers may shut down within minutes, or even
seconds, if cooling is lost. Large chillers may have a restart delay of ten minutes or more, even
if your generators get going right away. If your system uses chilled water or glycol, a storage
tank may be a good idea so cooling can be resumed with pumps and fans before the chillers get back
on-line.
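A rough sizing check for such a tank is simple arithmetic: the stored thermal capacity (volume times water's specific heat times the usable temperature rise) divided by the heat load gives the ride-through time. A sketch with purely illustrative numbers; verify any real sizing against your actual plant design:

```python
# Illustrative plant figures -- replace with your own measured values.
TANK_VOLUME_L = 10000   # liters of stored chilled water
DELTA_T_C = 8.0         # usable temperature rise, degrees C
HEAT_LOAD_KW = 300.0    # heat rejected to the chilled-water loop

def ride_through_minutes(volume_l, delta_t_c, load_kw):
    """Minutes of cooling the tank supplies while chillers restart.
    Water stores roughly 4.186 kJ per liter per degree C."""
    energy_kj = volume_l * 4.186 * delta_t_c
    return energy_kj / load_kw / 60.0

print(round(ride_through_minutes(TANK_VOLUME_L, DELTA_T_C, HEAT_LOAD_KW), 1))  # → 18.6
```

With these example numbers the tank covers about 18 minutes, comfortably more than a ten-minute chiller restart delay, which is the margin the bullet above is after.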
- Make sure there is a locked-in scheme for sequentially transferring loads to the generator, so
it doesn't become unstable or die when it gets hit with too much load at once. Let computer
hardware run on the UPS until air conditioning is back in operation, then transfer the UPS load.
Hopefully, the UPS is designed to "walk in" its load, since battery recharge will require
additional power. And hopefully it is input filtered so the generator isn't subjected to high
harmonic distortion.
- Make sure every part of your data center is designed for concurrent maintainability; that is,
so that any air conditioner, pump, pipe loop, valve, chiller, UPS, power circuit, automatic
transfer switch, etc. can be taken off-line for maintenance without disrupting operations or
putting you at risk. This will not only enable thorough preventive maintenance, but will also make
it possible for you to bring everything that has not been affected by a disaster back on-line,
bypassing anything that has failed or been damaged.
- Test, test, test. Test and re-test every piece of equipment, every piece of the plan, and every procedure, and do it under conditions as "real" as possible without seriously risking your operation. If you're afraid to randomly pull a plug, and people run in circles if you do, then you do not have a workable recovery plan. Practice makes perfect.
This was first published in July 2005