- whether you recover from disk or tape,
- where you recover to,
- and the size of your recovery infrastructure and staff.
Too often, the time to recover doesn't meet the objective because of "overhead" time. The following are examples of overhead time:
- the selection of available staff and determining DR recovery teams,
- actual declaration of the disaster and getting to the recovery site,
- and the general massive undertaking and overall chaos involved in initiating a recovery from a disaster event.
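To make the effect of overhead time concrete, here is a minimal Python sketch rather than a planning tool. Every duration and every name in it (`overhead_hours`, `technical_recovery_hours`, `meets_rto`) is a hypothetical value chosen for illustration, not a figure from a real recovery.

```python
# All durations are hypothetical, in hours, and exist only to illustrate the idea.
overhead_hours = {
    "assemble_recovery_teams": 2.0,
    "declare_disaster_and_travel": 4.0,
    "initial_coordination": 1.5,
}
technical_recovery_hours = 6.0  # restoring data and bringing up infrastructure
rto_hours = 12.0                # objective agreed with the business


def total_recovery_hours(overhead, technical):
    """Elapsed hours from the disaster event to recovered infrastructure."""
    return sum(overhead.values()) + technical


def meets_rto(overhead, technical, rto):
    """True if overhead plus technical recovery fits within the RTO."""
    return total_recovery_hours(overhead, technical) <= rto


total = total_recovery_hours(overhead_hours, technical_recovery_hours)
met = meets_rto(overhead_hours, technical_recovery_hours, rto_hours)
print(f"Total: {total} h, RTO: {rto_hours} h, met: {met}")
```

With these made-up numbers, the technical recovery alone (6 hours) fits comfortably inside the 12-hour RTO, but the total of 13.5 hours does not.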
RPO-Data and RTO-Data
There are a couple of other indicators that you may want to implement, called RPO-Data and RTO-Data. The trailing "-Data" refers to the point at which the recovered data is made available back to the application, and the time at which that happens. This is important because the end users and owners of your critical applications only understand (and pay for) the RPO and RTO that describe a usable application: an understood, acceptable amount of data loss, recovered within a specific amount of time.
Your disparate IT teams, such as storage, server and network, all understand their specific roles in recovering from a disaster, which is why you test. However, once the infrastructure is recovered with the associated application data, there are typically many other tasks required to make the application usable and presentable to end users. Consider, for example, what the DBAs need to do to the databases and what the application and software teams need to do to validate functionality. Thus, having these "-Data" metrics in place will help you test and checkpoint the recovery process to ensure that the real RTOs and RPOs are met successfully.
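One way to picture the difference between an RTO-Data checkpoint and the RTO that end users care about is the short Python sketch below. It is only an illustration under stated assumptions: the timestamps, the checkpoint names and both objectives are hypothetical, not measurements from a real test.

```python
from datetime import datetime, timedelta

# Hypothetical timeline from a DR test; every value is invented for illustration.
disaster_declared = datetime(2024, 1, 1, 8, 0)
checkpoints = {
    "data_available_to_application": datetime(2024, 1, 1, 14, 0),    # RTO-Data
    "databases_validated_by_dbas": datetime(2024, 1, 1, 16, 30),
    "application_usable_by_end_users": datetime(2024, 1, 1, 19, 0),  # RTO
}
rto_data = timedelta(hours=8)   # objective for data being back with the application
rto = timedelta(hours=10)       # objective for a usable application


def elapsed(checkpoint):
    """Time from disaster declaration to the named checkpoint."""
    return checkpoints[checkpoint] - disaster_declared


rto_data_met = elapsed("data_available_to_application") <= rto_data
rto_met = elapsed("application_usable_by_end_users") <= rto

for name in checkpoints:
    print(f"{name}: {elapsed(name)}")
print(f"RTO-Data met: {rto_data_met}, RTO met: {rto_met}")
```

In this invented timeline, the infrastructure teams hit their checkpoint with time to spare, yet the application still misses its RTO because of the database and validation work that follows.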
Finally, don't forget how parallelism can also affect your RTO. I worked with a client who ran quarterly disaster recovery tests, each focusing on a handful of mission-critical applications. More often than not, both the RTOs and RPOs were met successfully during these tests. But what wasn't considered (and was uncovered only during an actual disaster recovery event) was the massive parallelism involved in recovering all of those mission-critical applications at the same time.
While RPOs were still met, there simply weren't enough servers, storage or staff at the DR site to recover all of the mission-critical applications at the same time. Thus, the RTOs were basically discarded in favor of best efforts and ad hoc executive prioritization of which applications to recover first.
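The parallelism problem can be sketched with a very simple simulation. The Python below assumes, purely for illustration, that the DR site can run a fixed number of recoveries at once (a hypothetical "slot" standing in for servers, storage and staff) and that applications are recovered in priority order. The application names, durations and RTO are invented.

```python
import heapq


def completion_times(apps, slots):
    """Simulate recovering apps in priority order with limited parallel slots.

    apps:  list of (name, recovery_hours), highest priority first.
    slots: how many recoveries the DR site can run at once (hypothetical
           stand-in for available servers, storage and staff).
    Returns {name: hours from start until that app is recovered}.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    free_at = [0.0] * slots  # time at which each slot next becomes free
    heapq.heapify(free_at)
    done = {}
    for name, hours in apps:
        start = heapq.heappop(free_at)  # earliest available slot
        finish = start + hours
        done[name] = finish
        heapq.heappush(free_at, finish)
    return done


# Hypothetical applications, each needing 4 hours, with a 6-hour RTO.
apps = [("billing", 4.0), ("orders", 4.0), ("email", 4.0),
        ("crm", 4.0), ("warehouse", 4.0)]
rto_hours = 6.0

tested_alone = completion_times(apps[:2], slots=2)  # a quarterly test subset
all_at_once = completion_times(apps, slots=2)       # a real disaster
missed = [name for name, t in all_at_once.items() if t > rto_hours]
print("Missed RTO in a full recovery:", missed)
```

In this toy model, a test that covers only two applications passes, while recovering all five with the same two slots pushes the later applications well past the RTO. A real site would need a far richer model, but even a rough calculation like this can flag the gap before a disaster does.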
So the next time you're reviewing your RTOs and RPOs for disaster recovery, give a little more thought to what you can actually accomplish and what other metrics, tracking and approaches could help you to better achieve or even exceed those objectives successfully.
ABOUT THE AUTHOR: Bill Peldzus is Director of Data Center Services at GlassHouse Technologies Inc. He has more than 20 years' experience working in technical positions and often serves as a content expert in storage networking presentations. This was first published in September 2007