In this tip, Richard Jones makes the case for integrating disaster recovery into your configuration management processes, testing often and tracking recovery time objectives closely.
It's easy for disaster recovery (DR) plans to get stale. Companies are constantly in flux -- trying to keep up with rapid growth during good times or cutting expenses in bad times. If your DR plan is an afterthought, it will not stay current.
A colleague of mine told me about an experience he had a few years ago while working for a large backup/restore technology provider. A customer had lost its email server to a hardware failure that also took out the disks. The customer stood up a new server and proceeded to restore the system from tape backup. The restored system failed to come up, reporting that the email database was corrupt. They restored the email database again and again, to no avail. They began to panic as time marched on well past their recovery time objective (RTO) and the business was seriously getting hurt. Frantically, they called their backup/restore provider's technical support line for help. After the standard tech-support troubleshooting steps had tested the customer's patience, the provider decided to send an expert on site to help with the recovery. The customer was convinced that the backup was corrupted or that the recovery process was corrupting the data. The tech-support team, having completed its troubleshooting, was convinced that the data on tape was good.
So my colleague was dispatched on site. By now, nearly two days had passed and the company was really frantic. He determined that the data on tape was good and that the restored data matched that on tape. The obvious conclusion was that the backup process had corrupted the data from the disk. But my colleague continued to dig into the problem. To make a long story short, the new server had been installed with later operating system and email server patches than the dead server had been running. When the new server attempted to mount the email database, it did not expect the older database structure and reported corruption. My colleague simply backed off a couple of patches and the email server came up without a hitch.
This story illustrates an area that many IT organizations overlook when building disaster recovery plans. They build out a plan but fail to keep it up to date. Our email disaster customer needed to adhere to ITIL best practices and keep its disaster plans current with operating system revisions and patch details; doing so would have avoided the disaster it experienced.
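The patch mismatch in this story is the kind of thing a simple comparison against recorded configuration data can catch before a restore is attempted. The following is a minimal sketch, assuming the configuration record and the replacement server's inventory are both available as plain name-to-version mappings. The function name `find_drift`, the component names and the version strings are all hypothetical and do not come from any particular product.

```python
def find_drift(baseline, candidate):
    """Return components whose versions differ between two inventories.

    baseline  -- versions recorded for the original (failed) server
    candidate -- versions found on the replacement server
    Both are dicts mapping a component name to a version string.
    The result maps each differing component to (expected, actual);
    a missing component shows up as None on the side where it is absent.
    """
    drift = {}
    for component in sorted(set(baseline) | set(candidate)):
        expected = baseline.get(component)
        actual = candidate.get(component)
        if expected != actual:
            drift[component] = (expected, actual)
    return drift


# Hypothetical inventories, for illustration only.
recorded = {"os_patch_level": "SP1", "mail_server_patch": "2"}
replacement = {"os_patch_level": "SP2", "mail_server_patch": "4"}

for name, (expected, actual) in find_drift(recorded, replacement).items():
    print(f"{name}: recorded {expected}, replacement has {actual}")
```

A check like this is only as good as the recorded baseline, which is why the configuration record has to be updated with every patch.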
But it's not just a problem for smaller companies with one email server and direct-attached disks. I talked to a large Fortune 500 company that had followed all the disaster planning processes: building a business impact analysis (BIA), determining RTOs, and implementing and testing the solution to its satisfaction. The battery of DR tests it had performed back in 2002 had proven that the solution was good and worked well within the RTO. Business continued to grow at a breathtaking pace, and because the DR plan worked well, live testing gradually gave way to more pressing business needs. Unlike our email disaster friends, the company kept rigorous documentation of its configurations thanks to a well-designed configuration management policy. The company felt it was covered.
An upper management change in 2006 brought in a new CIO. Reviewing the company's DR plan, the new CIO asked when the live system had last been tested. After some ums and ahs, the shocking truth came to light: The end of 2002 was the last live full test, "but it worked flawlessly back then." The newbie CIO demanded a test. While the test worked, the RTO was far from being met. Data growth between 2002 and 2006 had been so great that the fastest tape system in existence could not meet the company's RTO. The solution was to asynchronously replicate its data to SAN storage and systems co-located over 400 miles away in another state. The company implemented a co-located hot DR site and relegated tape to being an archive medium.
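The arithmetic behind this kind of surprise is simple: restore time is roughly data volume divided by sustained restore throughput, so data growth can outrun the RTO even when nothing else changes. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and ignores tape mounts, verification and application recovery, so it gives a lower bound only. The function names and all figures are hypothetical and are not taken from the company described above.

```python
def estimate_restore_hours(data_tb, throughput_mb_per_s):
    """Rough lower bound on restore time, in hours.

    data_tb             -- amount of data to restore, in terabytes
    throughput_mb_per_s -- sustained restore rate, in megabytes per second
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (data_tb * 1e12) / (throughput_mb_per_s * 1e6)
    return seconds / 3600


def meets_rto(data_tb, throughput_mb_per_s, rto_hours):
    """True if the estimated restore time fits within the RTO."""
    return estimate_restore_hours(data_tb, throughput_mb_per_s) <= rto_hours


# Hypothetical figures: the same restore rate against growing data volumes.
for data_tb in (2, 10):
    hours = estimate_restore_hours(data_tb, 100)
    verdict = "within" if meets_rto(data_tb, 100, 24) else "over"
    print(f"{data_tb} TB at 100 MB/s: about {hours:.1f} hours, {verdict} a 24-hour RTO")
```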
These lessons from the industry teach us a few things about DR.
- Integrate DR into configuration management processes
Many companies treat DR as an afterthought, a bolt-on and even a separately budgeted item. If your company falls into this category, you need to stop now. DR must be an integrated process and never an afterthought. As an afterthought, it is too easy to cut DR practices in times of stress, either from trying to keep up with rapid growth during good times or from cutting expenses in bad times. If you have a separate budget line item for DR, that needs to disappear. The DR budget should be included in the core project costs, never added afterward.
- Test DR often
Configuration management change events should include DR testing as a part of standard acceptance testing of any update or configuration change made to the system. Configuration changes that require some amount of staff training should include DR training as well. Refresher DR training should be an annual occurrence and should be tracked as a Human Resources training requirement.
- Watch your RTO trends
DR testing of systems should occur at least once a year. Tracking the RTO trends from test to test over time is a great way of forecasting whether current methods and technologies are keeping up with growth and change. Data growth lengthens recovery times and even high-availability failover times, and increased system interdependencies and/or modularization can also increase the RTO of a business system.
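One simple way to track the trend is to fit a straight line through the recovery times measured in successive tests and see where that line crosses the RTO. This is a minimal sketch, assuming recovery time grows roughly linearly from test to test, which real systems may not do. The function names and the sample measurements are hypothetical.

```python
def fit_line(points):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two test results")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("test results must have distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_breach(points, rto_hours):
    """Estimate the x value (e.g. test number) at which the fitted
    recovery time reaches the RTO. Returns None if the trend is flat
    or improving, since the line then never crosses the RTO."""
    slope, intercept = fit_line(points)
    if slope <= 0:
        return None
    return (rto_hours - intercept) / slope


# Hypothetical results: (annual test number, measured recovery hours).
results = [(1, 4.0), (2, 6.0), (3, 8.0)]
print(forecast_breach(results, rto_hours=12.0))
```

With these sample figures the fitted line reaches a 12-hour RTO at test 5, which would give about two years of warning to change methods or technology.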