Data center disaster recovery is a vast topic, with far more considerations than we have room for here. Which facilities are critical, for example: cooling, UPS power, access control, connectivity?
We can, however, make some general suggestions for the major items to consider in order to minimize infrastructure downtime. But keep in mind that, by definition, a real "disaster" means downtime. Thus, "disaster recovery" needs to be a well-planned, orderly, and regularly practiced process for getting back in business. Avoiding downtime means minimizing the effects of controllable problems so they never become an actual disaster.
- Documentation is key: Have thorough, consistent, easily understood documentation on every device in your data center. Keep printed documentation attached to each machine; if the data center is down, you can't pull it up electronically. List what is running, who the users are, who is responsible for it, and how to contact them.
- Have a well thought-out "load shedding" plan: Consider color-coding machine tags for low, medium, or high priority. This makes it instantly clear what gets shut off first and what gets brought back up first. Notify users in writing that, in the event of a problem, their service may be abruptly terminated, and tell them whose service will be restored first. This conserves power and UPS run time for the most critical applications.
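The priority tags described above can be mirrored in software. The sketch below is purely illustrative: the inventory, host names, and tag values are invented for the example, and a real list would come from your documented machine records.

```python
# Hypothetical illustration of a load-shedding order based on priority tags.
# The inventory below is invented for the example; a real list would come
# from your documented machine records.

PRIORITY_ORDER = {"low": 0, "medium": 1, "high": 2}

inventory = [
    {"host": "batch-01", "priority": "low"},
    {"host": "mail-01", "priority": "medium"},
    {"host": "trading-db", "priority": "high"},
    {"host": "reports-02", "priority": "low"},
]

def shed_order(machines):
    """Lowest priority first: these are shut off first to conserve UPS run time."""
    return sorted(machines, key=lambda m: PRIORITY_ORDER[m["priority"]])

def restore_order(machines):
    """Highest priority first: these are brought back up first."""
    return sorted(machines, key=lambda m: PRIORITY_ORDER[m["priority"]], reverse=True)

if __name__ == "__main__":
    print([m["host"] for m in shed_order(inventory)])      # low-priority hosts first
    print([m["host"] for m in restore_order(inventory)])   # high-priority hosts first
```

Keeping the shutdown and restoration sequences in one place, derived from the same tags, prevents the two lists from drifting apart.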
- Have good battery monitoring capability on your UPS system: Batteries are the components most likely to fail under stress, and you need to know if one is failing.
- Cooling: Can your air conditioning be maintained or quickly brought back on-line during a power interruption? Remember that high-density servers may shut down within minutes, or even seconds, if cooling is lost. Large chillers may have a restart delay of ten minutes or more, even if your generators get going right away. If your system uses chilled water or glycol, have a storage tank so cooling can be resumed with pumps and fans before the chillers are back in operation.
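A rough sense of the storage tank size follows from simple thermal arithmetic: the tank must absorb the full heat load for the chiller restart delay within an allowable temperature rise. The figures below are illustrative assumptions, not design recommendations.

```python
# Back-of-the-envelope sizing for a chilled-water storage tank that must
# carry the cooling load while the chillers restart. All input figures are
# illustrative assumptions, not recommendations.

def tank_volume_m3(load_kw, ride_through_s, delta_t_c,
                   cp_kj_per_kg_k=4.186, density_kg_per_m3=1000.0):
    """Water volume needed to absorb `load_kw` for `ride_through_s` seconds
    with an allowable temperature rise of `delta_t_c` degrees C."""
    energy_kj = load_kw * ride_through_s          # 1 kW = 1 kJ/s
    mass_kg = energy_kj / (cp_kj_per_kg_k * delta_t_c)
    return mass_kg / density_kg_per_m3

# Example: an assumed 500 kW heat load, a 10-minute chiller restart delay,
# and a 5 degree C allowable rise in the stored water.
volume = tank_volume_m3(load_kw=500, ride_through_s=600, delta_t_c=5)
print(f"{volume:.1f} m^3")  # roughly 14.3 m^3
```

Even a crude estimate like this makes the point that the tank is a major piece of equipment that must be planned for, not bolted on later.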
- Have a locked-in scheme for sequentially transferring loads to the generator: This will help prevent the generator from becoming unstable or dying when it gets hit with too much load at once. Let computer hardware run on the UPS until air conditioning is back in operation, then transfer the UPS load. Make sure that your UPS is designed to "walk-in" its load, since battery recharge will require additional power. Ideally, it is input-filtered so the generator isn't subjected to high harmonic loads.
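A locked-in transfer scheme can be sanity-checked on paper: each step must fit the generator's step-load capability, and the running total must stay within its capacity. The sketch below does exactly that check; the capacities, step limit, and step sizes are invented for illustration.

```python
# Sketch of a sanity check for a sequential ("locked-in") load transfer
# plan: each step's added load must fit the generator's step-load limit,
# and the running total must stay within its capacity. All numbers are
# invented for illustration.

GENSET_CAPACITY_KW = 800
STEP_LIMIT_KW = 300   # assumed maximum load the generator can accept in one step

# Planned transfer sequence: air conditioning first, UPS (with its
# battery-recharge overhead) last, per the advice above.
steps = [
    ("chilled-water pumps and fans", 150),
    ("chillers", 250),
    ("UPS load plus battery recharge", 300),
]

def check_plan(steps, capacity_kw, step_limit_kw):
    total = 0
    for name, kw in steps:
        if kw > step_limit_kw:
            return False, f"step '{name}' ({kw} kW) exceeds step limit"
        total += kw
        if total > capacity_kw:
            return False, f"total {total} kW exceeds generator capacity"
    return True, f"plan OK, final load {total} kW"

ok, message = check_plan(steps, GENSET_CAPACITY_KW, STEP_LIMIT_KW)
print(message)  # prints "plan OK, final load 700 kW"
```

Running the check whenever equipment is added or the sequence changes keeps the plan honest as the data center grows.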
- Concurrent maintainability: Make sure every part of your data center can be taken off-line for maintenance without disrupting operations or putting you at risk. Every part: air conditioner, pump, pipe loop, valve, chiller, UPS, power circuit, automatic transfer switch, etc. In addition to enabling preventive maintenance, this will make it possible for you to bring everything that has not been affected by a disaster back on-line, bypassing anything that has failed or been damaged.
- Test, test, test: Test and re-test every piece of equipment. Test and re-test every piece of the plan and every procedure. Do it under conditions as "real" as possible without seriously risking your operation.
If you're afraid to randomly pull a plug, and people run in circles when you do, then you do not have a workable recovery plan. Practice makes perfect.
ABOUT THE AUTHOR: Robert McFarlane is a pioneer in the field of building cabling design. He has been asked to speak at countless seminars on building infrastructure for electronic communications, evolving technologies and the requirements of trading floor and data center design. Mr. McFarlane served for twelve years as President of Interport Financial, Inc., a firm specializing in designs for financial trading floors and critical data centers.