Just having a data center disaster recovery plan doesn't automatically mean that you'll hit your recovery time objectives (RTO); nor does it make your IT organization impervious to downtime.
Disaster recovery properly implemented means keeping plans up-to-date and testing them often. The following anecdotes describe some of the issues plaguing data center disaster recovery efforts in IT organizations.
Keeping plans current staves off recovery failure
A colleague recently relayed a personal experience from a few years ago, when he was working for a large backup/restore technology provider. One of the provider's customers lost its email server to a hardware failure that included losing the disks. The customer stood up a new server and proceeded to restore the system from tape backup. The restored system failed to come up, reporting that the email database was corrupt. The customer restored the email database again and again, but to no avail.
They began to panic as time marched on well past their recovery time objective (RTO) and the business was seriously hurt. Frantically, they called their backup/restore provider's technical support line for help. After the standard tech-support troubleshooting steps had tested the customer's patience, the provider decided to send an expert on site to help with the recovery.
The customer was convinced that the backup was corrupted or the recovery process was corrupting the data. The tech-support team, having completed its troubleshooting, was convinced that the data on tape was good.
So my colleague was dispatched on site. By now nearly two days had passed and the company was really frantic. He determined that the data on tape was good and that the restored data matched that on tape. The obvious conclusion would be that the backup process had corrupted the data from the disk. But my colleague continued digging into the problem.
To make a long story short, the new server had been installed with later operating system and email server patches than the dead server had been running. When the new server attempted to mount the email database, it wasn't expecting an older database structure, and it reported corruption. My colleague simply backed off a couple of patches and the email server came up without a hitch.
This story illustrates an area that many IT organizations overlook when building disaster recovery plans. They build out a plan, but fail to keep it up-to-date. Obviously, our email disaster customer needed to adhere to ITIL best practices and keep disaster plans up-to-date with operating system revisions and patch details to avoid the disaster they experienced.
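The kind of check that would have caught the patch mismatch can be sketched in a few lines. This is a minimal illustration, not a description of any particular tool: the function name config_drift, the setting names and the version strings are all hypothetical, and it assumes the DR plan records the failed server's configuration as simple key/value pairs.

```python
def config_drift(recorded, rebuilt):
    """Return {setting: (recorded_value, rebuilt_value)} for every mismatch.

    A setting missing on one side shows up with None for that side.
    """
    keys = set(recorded) | set(rebuilt)
    return {
        key: (recorded.get(key), rebuilt.get(key))
        for key in sorted(keys)
        if recorded.get(key) != rebuilt.get(key)
    }


# Hypothetical values for illustration only; none come from the story above.
recorded = {
    "os_version": "5.2",
    "os_patch_level": "SP1",
    "mail_server_patch": "6.5.2",
}
rebuilt = {
    "os_version": "5.2",
    "os_patch_level": "SP2",
    "mail_server_patch": "6.5.4",
}

drift = config_drift(recorded, rebuilt)
for setting, (was, now) in drift.items():
    print(f"{setting}: DR plan records {was!r}, rebuilt server has {now!r}")
```

Running a comparison like this before restoring data points at a configuration difference first, rather than at the backup.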
Keep RTO in check by testing plans often
But it's not just a problem for smaller companies with one email server and direct-attached disks. I talked to a large Fortune 500 company that had followed all the disaster planning processes: building its business impact analysis (BIA), determining its RTOs, and implementing and testing its solution to its satisfaction. The battery of disaster recovery (DR) tests the company had performed back in 2002 had proven the solution to be good and to work well within the RTO.
Business continued to grow at a breathtaking pace, and because the DR plan worked well, live testing gradually gave way to more pressing business needs. And unlike our email disaster friends, they kept rigorous documentation of their configurations thanks to a well-designed configuration management policy.
An upper management change in 2006 brought on a new CIO. Reviewing the company's DR plan, the new CIO asked when the live system had last been tested. After some ums and ahs, the shocking truth came to light: the end of 2002 was the last live full test; "but it worked flawlessly back then." The newbie CIO demanded a test.
While the test worked, the RTO was far from being met. Data growth between 2002 and 2006 had been so great that the fastest tape system in existence could not meet their RTO. Their solution was to asynchronously replicate their data to SAN storage and systems colocated over 400 miles away in another state. They implemented a colocated hot DR site and relegated tape to being an archive medium.
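The arithmetic behind that failed test is simple enough to run long before a live test. The sketch below assumes restore time is dominated by raw throughput plus a fixed overhead for tasks such as mounting tapes and rebuilding servers; the function names and every figure in the example are hypothetical, not numbers from the company described above.

```python
def estimated_restore_hours(data_gb, throughput_mb_per_s, fixed_overhead_hours=0.0):
    """Estimate hours needed to restore data_gb at a sustained throughput.

    Uses decimal units (1 GB = 1000 MB). Real restores are usually slower
    than the rated throughput, so treat the result as a best case.
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    transfer_seconds = data_gb * 1000 / throughput_mb_per_s
    return fixed_overhead_hours + transfer_seconds / 3600


def meets_rto(data_gb, throughput_mb_per_s, rto_hours, fixed_overhead_hours=0.0):
    """Return True if the best-case restore estimate fits within the RTO."""
    estimate = estimated_restore_hours(data_gb, throughput_mb_per_s, fixed_overhead_hours)
    return estimate <= rto_hours


# Hypothetical example: the same restore method, four times the data.
for data_gb in (2000, 8000):
    hours = estimated_restore_hours(data_gb, throughput_mb_per_s=100)
    verdict = "meets" if meets_rto(data_gb, 100, rto_hours=8) else "misses"
    print(f"{data_gb} GB at 100 MB/s: about {hours:.1f} hours, {verdict} an 8-hour RTO")
```

A best-case estimate that already exceeds the RTO is a strong signal that the restore method itself needs to change.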
Checklist for staying ahead of DR curve
These lessons from the industry teach us a few things about DR:
- Integrate DR into configuration management processes
Many companies have treated DR as an afterthought, a bolt-on, and even a separately budgeted item. If your company falls into this category, you need to stop now. DR must be an integrated process and never an afterthought. As an afterthought, it is too easy to cut DR practices in times of stress, either from trying to keep up with rapid growth during good times or from cutting expenses in bad times. If you have a separate budget line item for DR, that needs to disappear. The DR budget should be included in the core project costs and never added afterward.
- Test disaster recovery plans often
DR testing of systems should occur at least once a year. Configuration management change events should include DR testing as a part of standard acceptance testing of any update or configuration change made to the system. Configuration changes that require some amount of staff training should include DR training as well. Refresher DR training should be an annual occurrence and should be tracked as a human resources training requirement.
- Watch your RTO trends
Tracking recovery times against the RTO from test to test is a great way of forecasting whether current methods and technologies are keeping up with growth and change. Not only does data growth lengthen recovery times and even high-availability failover times, but increased system interdependencies or modularization can also increase the recovery time of a business system.
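As a rough illustration of the last checklist item, the sketch below fits a straight line to the recovery times measured in successive DR tests and estimates when that line crosses the RTO. The function names and the sample figures are hypothetical, and the linear trend is itself an assumption: data growth is often faster than linear, so a forecast like this tends to be optimistic.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two matching data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("test dates must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def year_rto_exceeded(test_years, recovery_hours, rto_hours):
    """Estimate when the fitted trend reaches the RTO.

    Returns None when measured recovery times are flat or improving.
    """
    slope, intercept = fit_line(test_years, recovery_hours)
    if slope <= 0:
        return None
    return (rto_hours - intercept) / slope


# Hypothetical annual DR test results: (year, measured recovery time in hours).
years = [2003, 2004, 2005, 2006]
hours = [4.0, 5.0, 6.0, 7.0]

crossing = year_rto_exceeded(years, hours, rto_hours=8.0)
if crossing is None:
    print("Recovery times are not trending upward.")
else:
    print(f"Trend reaches the 8-hour RTO around {crossing:.1f}")
```

Even a crude trend like this turns each annual test into a data point for planning, rather than a one-off pass or fail.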
ABOUT THE AUTHOR: Richard Jones is the VP and Service Director for Data Center Strategies at the Burton Group. He has over 22 years of experience in software engineering, engineering management, project management and product management in the power supply and networking software industry.