In a recent review of the last several large disaster recovery (DR) projects I've been involved in, one trend is becoming more and more apparent: most customers don't have their current data center in an acceptable state of operational recovery readiness.
This situation becomes even more obvious with plans that pursue aggressive recovery time and/or recovery point objectives. By operational recovery readiness, I mean specifically the ability to recover from a non-disaster event, such as database corruption, a virus or an accidental file deletion.
If you've performed a business impact analysis (BIA), there should be documented recovery time objectives (RTOs) and recovery point objectives (RPOs) that are agreed upon, per application, and are designated for both operational and disaster recovery. Often, the operational objectives are more aggressive, as those are the types of events you will be recovering from 99% of the time – those daily mishaps that are at the crux of why you do backups in the first place.
A specific example of this was uncovered during initial meetings on a DR architecture engagement for a large Fortune 100 company. They'd just completed a BIA and had documented three different recovery classes for DR, two of them with RPOs of less than 24 hours. The BIA was disaster-focused only, with no consideration of day-to-day operational recovery from non-DR events. When we analyzed operational recovery, we found that the sole process in place was nightly backups to tape, and that the average nightly backup success rate was less than 75%.
We called another meeting to have a serious conversation about operational versus disaster recovery and pointed out that they were embarking on a project that would result in better recovery from an all-out disaster than from a file corruption or deletion.
Why? Nightly backups, by default, provide an RPO of 24 hours. If a mission-critical database or file is deleted before the nightly backup runs and a restore is requested, the restore will come from the previous night's backup, so the recovered data will be up to 24 hours old. Yet that same mission-critical file or database was associated with a disaster RPO of four hours. So if that application owner wanted better recovery, it would curiously make sense to declare a disaster and incur a data loss of four hours or less, rather than 24 hours or less from the nightly backup.
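The arithmetic behind that comparison can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function name `worst_case_data_loss` and its `consecutive_failures` parameter are hypothetical, and the model simply assumes that every missed backup pushes the last good restore point back by one full backup interval.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         consecutive_failures: int = 0) -> timedelta:
    """Upper bound on data loss when restoring from the last good periodic backup.

    Assumption (not from the article): each failed backup adds one full
    interval to the age of the newest usable restore point.
    """
    if consecutive_failures < 0:
        raise ValueError("consecutive_failures must be zero or greater")
    return backup_interval * (consecutive_failures + 1)


nightly = timedelta(hours=24)   # nightly backups to tape
dr_rpo = timedelta(hours=4)     # disaster RPO documented for the same application

operational_loss = worst_case_data_loss(nightly)
loss_after_one_miss = worst_case_data_loss(nightly, consecutive_failures=1)

print("Operational worst case:", operational_loss)
print("After one failed nightly backup:", loss_after_one_miss)
print("Operational recovery is weaker than DR:", operational_loss > dr_rpo)
```

Under this simple model, a single failed nightly job doubles the exposure to 48 hours, which is why a backup success rate below 75% matters as much as the backup schedule itself.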
Adding to the complexity of the problem, those same application servers were not deployed in a high-availability cluster. Lose a memory board or CPU, and the application server crashes and the application is down for some period of time. If the DR RTO for that application was determined to be four hours, wouldn't it make sense to deploy an operational recovery architecture that would ensure virtually no downtime in the likely event of a hardware failure?
The bottom line is that you must look at what your operational RTOs and RPOs are for recovery from the most likely events before embarking on a DR initiative, especially with aggressive recovery classes. Understand what your current operational recovery classes are, and the supporting architectures and processes behind them. There are many options currently available to support aggressive operational RTOs and RPOs, including continuous data protection, disk-based snapshots and mirrors, virtual tape libraries, and data deduplication. These technology solutions are viable and successfully deployed in many large data centers.
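One way to run that check before a DR initiative starts is to list each application's operational and disaster objectives side by side and flag any case where operational recovery is the weaker of the two. The sketch below is only an illustration: the class names, the `readiness_gaps` function, the application names and every figure in the sample inventory are hypothetical and would come from your own BIA in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objectives:
    rto_hours: float  # recovery time objective
    rpo_hours: float  # recovery point objective


@dataclass(frozen=True)
class Application:
    name: str
    operational: Objectives  # recovery from non-disaster events
    disaster: Objectives     # recovery from a declared disaster


def readiness_gaps(apps):
    """Return (application, objective) pairs where operational recovery
    is less aggressive than disaster recovery."""
    gaps = []
    for app in apps:
        if app.operational.rpo_hours > app.disaster.rpo_hours:
            gaps.append((app.name, "RPO"))
        if app.operational.rto_hours > app.disaster.rto_hours:
            gaps.append((app.name, "RTO"))
    return gaps


# Hypothetical inventory; all names and numbers are made up for illustration.
inventory = [
    Application("orders-db", Objectives(8, 24), Objectives(4, 4)),
    Application("file-share", Objectives(2, 1), Objectives(24, 24)),
]

for name, objective in readiness_gaps(inventory):
    print(f"{name}: operational {objective} is weaker than the DR {objective}")
```

Any application that appears in the output is one where declaring a disaster would give a better outcome than a routine restore, which is the situation to fix first.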
So get your data center operational recovery plan in order today, and your DR project will have a much more solid foundation on which to build a full recovery schema for those worst types of events: full-blown disasters.

ABOUT THE AUTHOR: Bill Peldzus is Director of Data Center Services at GlassHouse Technologies Inc. He has more than 20 years' experience working in technical positions and often serves as a content expert in storage networking presentations.