Recovery time objectives (RTOs) and recovery point objectives (RPOs) are perhaps the most important metrics when architecting a disaster recovery solution. An RTO is the target amount of time within which you must recover from a disaster event, and an RPO is the maximum amount of data, measured in time, that you can afford to lose from that same event. These two business-driven metrics will set the stage for:
- whether you recover from disk or tape,
- where you recover to,
- and the size of your recovery infrastructure and staff.
There are several RTO and RPO intricacies that are important to be aware of. First, as the "o" in both stands for "objective," it is, by definition, a target. If an RPO is four hours, then the architecture must ensure data loss of four hours or less. Therefore, when testing or actually recovering from a disaster, you should track and document actual thresholds achieved, including recovery point and recovery time.
Too often, the time to recover doesn't meet the objective due to "overhead" time. The following are examples of overhead time:
- the selection of available staff and determining DR recovery teams,
- actual declaration of the disaster and getting to the recovery site,
- and the general massive undertaking and overall chaos involved in initiating a recovery from a disaster event.
By tracking and documenting actual versus objective – especially during testing – you will understand what is being accomplished in a given period of time. Ultimately, you will be able to defend future investments and hone your recovery methodologies and processes to better meet or exceed those objectives.
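One way to record actual versus objective is sketched below in Python. It is only an illustration: the `RecoveryResult` class, its field names and the sample timestamps are assumptions made for this example, not part of any particular DR tool. It measures recovery time from the start of the outage, so the "overhead" before declaration counts against the RTO.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryResult:
    """Timestamps captured during one DR test or real event (hypothetical fields)."""
    app: str
    rto: timedelta               # recovery time objective
    rpo: timedelta               # recovery point objective
    disaster_at: datetime        # when the outage began
    declared_at: datetime        # when the disaster was formally declared
    recovered_at: datetime       # when the application was back in service
    last_good_copy_at: datetime  # timestamp of the newest data that was restored

    @property
    def overhead(self) -> timedelta:
        # Time lost before recovery work could formally start.
        return self.declared_at - self.disaster_at

    @property
    def actual_recovery_time(self) -> timedelta:
        return self.recovered_at - self.disaster_at

    @property
    def actual_recovery_point(self) -> timedelta:
        # Data written after the last good copy is lost.
        return self.disaster_at - self.last_good_copy_at

    def report(self) -> dict:
        return {
            "app": self.app,
            "overhead": self.overhead,
            "recovery_time": self.actual_recovery_time,
            "rto_met": self.actual_recovery_time <= self.rto,
            "recovery_point": self.actual_recovery_point,
            "rpo_met": self.actual_recovery_point <= self.rpo,
        }


# Illustrative numbers only: an 8-hour RTO and a 4-hour RPO.
result = RecoveryResult(
    app="order-entry",
    rto=timedelta(hours=8),
    rpo=timedelta(hours=4),
    disaster_at=datetime(2007, 9, 1, 2, 0),
    declared_at=datetime(2007, 9, 1, 3, 30),
    recovered_at=datetime(2007, 9, 1, 11, 0),
    last_good_copy_at=datetime(2007, 8, 31, 23, 0),
)
report = result.report()
print(report["rto_met"], report["rpo_met"])  # False True
```

In this made-up run the RPO is met (three hours of data lost against a four-hour objective), but the RTO is missed by an hour, and 90 minutes of the nine-hour recovery were overhead.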
RPO-Data and RTO-Data
There are a couple of other indicators that you may want to implement, called RPO-Data and RTO-Data. The trailing "-Data" refers to the point at which the recovered data is made available back to the application, and the time at which that happens. This is important because the end users and owners of your critical applications only understand (and pay for) the RPO and RTO in terms of a usable application, with an understood, acceptable amount of data loss, within a specific amount of time.
Your disparate IT teams, such as storage, server and network, all understand their specific roles in recovering from a disaster, which is why you test. However, once the infrastructure is recovered with the associated application data, there are typically many other tasks that are required to make the application usable and presentable to end users. Consider, for example, what the DBAs need to do to the databases and what the application and software teams need to do to validate functionality. Thus, having these "-Data" metrics in place will help you test and checkpoint the recovery process to ensure that the real RTOs and RPOs are met successfully.
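As a rough sketch of that checkpointing idea, the Python below records when each stage of a recovery was reached and how long each stage took. The `checkpoint_timeline` function, the checkpoint names and the times are all hypothetical; treating "data available to the application" as the RTO-Data checkpoint is one reading of the metric described above.

```python
from datetime import datetime, timedelta


def checkpoint_timeline(disaster_at, checkpoints):
    """Return (name, elapsed since disaster, time spent on this step) per checkpoint.

    checkpoints is a list of (name, reached_at) pairs in the order reached.
    """
    timeline = []
    previous = disaster_at
    for name, reached_at in checkpoints:
        timeline.append((name, reached_at - disaster_at, reached_at - previous))
        previous = reached_at
    return timeline


# Hypothetical test run; checkpoint names and times are illustrative only.
disaster_at = datetime(2007, 9, 1, 2, 0)
checkpoints = [
    ("infrastructure restored", datetime(2007, 9, 1, 7, 0)),
    ("data available to application (RTO-Data)", datetime(2007, 9, 1, 9, 0)),
    ("databases recovered by DBAs", datetime(2007, 9, 1, 10, 30)),
    ("application validated, usable by end users (RTO)", datetime(2007, 9, 1, 12, 0)),
]

for name, elapsed, step in checkpoint_timeline(disaster_at, checkpoints):
    print(f"{name}: reached after {elapsed}, step took {step}")
```

In this example the data is back after seven hours, but end users cannot work for another three, which is the gap the "-Data" metrics are meant to expose.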
Finally, don't forget how parallelism can also affect your RTO. I worked with a client who ran quarterly disaster recovery tests – each focusing on a handful of mission-critical applications. More often than not, both the RTOs and RPOs were met successfully during these tests. But what wasn't considered (and was only uncovered during an actual disaster recovery event) was the massive parallelism in recovering all of those mission-critical applications at the same time.
While RPOs were still met, there simply weren't enough servers, storage or staff at the DR site to recover all of the mission-critical applications at the same time. Thus, the RTOs were basically discarded, replaced by best efforts and ad hoc executive prioritization of application recovery.
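A back-of-the-envelope model can expose this problem before a real event does. The Python sketch below is a deliberate simplification, not a description of the client's environment: it assumes the only constraint is the number of recovery teams that can work at once, that each application needs one team for a fixed number of hours, and that applications are started in priority order. The function name and the sample figures are hypothetical.

```python
import heapq


def simulate_parallel_recovery(apps, teams):
    """Estimate when each application finishes if all are recovered at once.

    apps:  list of (name, hours_to_recover, rto_hours), in priority order.
    teams: number of recovery teams that can work at the same time.
    Returns {name: (finish_hour, rto_met)}.
    """
    if teams < 1:
        raise ValueError("at least one recovery team is required")
    free_at = [0.0] * teams  # hour at which each team next becomes free
    heapq.heapify(free_at)
    results = {}
    for name, hours, rto in apps:
        start = heapq.heappop(free_at)  # earliest available team
        finish = start + hours
        heapq.heappush(free_at, finish)
        results[name] = (finish, finish <= rto)
    return results


# Four applications, each taking 4 hours to recover against a 6-hour RTO.
apps = [("billing", 4, 6), ("orders", 4, 6), ("email", 4, 6), ("erp", 4, 6)]

# Tested one at a time, every application meets its RTO.
print(simulate_parallel_recovery(apps[:1], teams=2))
# {'billing': (4.0, True)}

# Recovered together with only two teams, the last two miss it.
for name, (finish, met) in simulate_parallel_recovery(apps, teams=2).items():
    print(name, finish, met)
```

With two teams, "billing" and "orders" finish at hour 4 while "email" and "erp" finish at hour 8 and miss the six-hour objective. A real model would also have to account for servers and storage, but even this crude version shows why isolated tests can pass while a full recovery fails.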
So the next time you're reviewing your RTOs and RPOs for disaster recovery, give a little more thought to what you can actually accomplish and what other metrics, tracking and approaches could help you to better achieve or even exceed those objectives successfully.
About the author:
Bill Peldzus is Director of Data Center Services at GlassHouse Technologies Inc. He has more than 20 years' experience working in technical positions and often serves as a content expert in storage networking presentations.
This was first published in September 2007