Most organizations have an IT disaster recovery (DR) policy that was written years ago and no longer fits its purpose. A business continuity policy is the next step in protecting enterprise workloads against downtime.
Relentless improvements in IT have created a much more resilient environment -- one that focuses on business continuity (BC) instead of DR.
So how do you define business continuity vs. disaster recovery? Business continuity plans cope with the failure of any aspect of an IT platform in a manner that lets the organization continue working. A disaster recovery plan's aim is to get the organization's IT resources back up and working after processes stop. The DR policy is essentially a safety net for when the BC policy fails; DR resuscitates workloads. Both BC and DR are critical for organizations, complementing and feeding back to each other.
N+1 vs. N+M
N+1 is having enough pieces of equipment plus one extra to handle any single failure. N+M is having more than one extra item, so workloads stay up even with multiple failures. The larger "M" is, the more capability to deal with failure -- but the cost goes up as well.
If you have 10 servers, N+1 (11 servers) is probably good enough for continuity. If you have 10,000 servers, a single spare is unlikely to give enough headroom, because the chance of several servers being down at the same time grows with the size of the estate; an M of 10 (N+10) would be a better target.
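To put rough numbers on that comparison, the sketch below estimates the chance that more servers are down at once than there are spares. It is an illustration only: the function name `prob_capacity_shortfall` is hypothetical, the per-server downtime probability of 0.0001 is an assumed figure rather than a measured one, and failures are assumed to be independent, which real outages (shared power, shared firmware) often are not.

```python
from math import comb


def prob_capacity_shortfall(n_required: int, m_spare: int, p_down: float) -> float:
    """Probability that more than m_spare of the n_required + m_spare servers
    are down at the same moment, assuming independent failures."""
    total = n_required + m_spare
    # Binomial probability that at most m_spare servers are down,
    # i.e. the spares are enough to cover every failed server.
    p_covered = sum(
        comb(total, k) * p_down**k * (1 - p_down) ** (total - k)
        for k in range(m_spare + 1)
    )
    return 1.0 - p_covered


# Assumed figure: each server is down 0.01% of the time.
P_DOWN = 0.0001

small_n1 = prob_capacity_shortfall(10, 1, P_DOWN)        # 10 servers, N+1
large_n1 = prob_capacity_shortfall(10_000, 1, P_DOWN)    # 10,000 servers, N+1
large_n10 = prob_capacity_shortfall(10_000, 10, P_DOWN)  # 10,000 servers, N+10

print(f"10 servers, N+1:      {small_n1:.2e}")
print(f"10,000 servers, N+1:  {large_n1:.2e}")
print(f"10,000 servers, N+10: {large_n10:.2e}")
```

Under these assumptions, N+1 leaves the small estate with a negligible shortfall risk but leaves the large estate short of capacity a substantial fraction of the time, while N+10 brings the large estate back to a very small risk. Changing the assumed downtime probability changes the value of M needed, which is why the target should come from measured failure rates rather than a rule of thumb.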
In the past, only the richest organizations with the strongest need for continuous operation could afford the costs and complexities of putting in place the technical capabilities for business continuity, which depends on a highly available platform. Today, the correct use of N+M (multiple extras) equipment, alongside well-architected and well-implemented virtualization, cloud and data mirroring between servers, will give the majority of organizations some level of business continuity in most cases.
Certainly a full BC-capable IT platform is not a low-cost endeavor. The organization must balance its own risk profile against the costs involved to decide how far the BC approach goes. At some point, BC implementation will get too expensive for the business to fund.
This is where disaster recovery enters the picture. If the business has agreed that the IT platform must be able to survive a failure of any single item of equipment in the data center, for example, then it must fund an N+1 architecture at the IT equipment level. The IT team now has one more server, storage system and network path per system than needed for regular operation. Suppose, however, that the data center facility itself is based on monolithic technologies, making the cost of implementing an N+1 architecture around the uninterruptible power supply (UPS), cooling system and auxiliary generators too high. In this situation, the DR policy has to cover what's needed if any of those items fail, as well as what happens if N+1 is not sufficient.
IT teams and business leaders must first agree on two figures: the maximum acceptable time to get back to a specified level of functionality -- the recovery time objective (RTO) -- and the maximum amount of data, measured back in time from the failure, that the business can accept losing -- the recovery point objective (RPO). Don't leave this decision just in IT's hands; business leaders must be involved and fully understand what the RTO and RPO mean.
Best-case and worst-case business continuity policy options
The RPO defines how much data has to be accepted as lost. This could have a knock-on effect on how the business views its business continuity investment. For example, in an N+1 architecture, a single item's failure has no direct impact on the business, as there is still enough capacity to keep everything running. Should a second item fail, the workload or workloads on that equipment will slow, or possibly stop working.
With a slower response time, the recovery target is to regain the full speed of response within a stated RTO. The RTO generally includes how long it takes to obtain, install and implement replacement equipment. If the RTO is short, the DR plan should require either a certain amount of spare inventory at the data center or same-day replacement delivery from suppliers, particularly for large monolithic items such as a UPS. The DR policy should then include all the steps for installing and implementing the new equipment, along with acceptable timeframes for these steps.
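As a rough way to test a plan against its RTO, the sketch below adds up the time allowed for each recovery step and compares the total with the agreed objective. The step names, the durations and the eight-hour RTO are all hypothetical values chosen for illustration; real figures would come from supplier contracts and rehearsed recovery runs.

```python
from datetime import timedelta


def recovery_time(steps: dict[str, timedelta]) -> timedelta:
    """Total elapsed time if the recovery steps run one after another."""
    return sum(steps.values(), timedelta())


def meets_rto(steps: dict[str, timedelta], rto: timedelta) -> bool:
    """True when the summed step durations fit inside the agreed RTO."""
    return recovery_time(steps) <= rto


# Hypothetical RTO agreed with the business.
rto = timedelta(hours=8)

# Option A: a spare unit is already held at the data center.
with_spare = {
    "obtain": timedelta(minutes=30),
    "install": timedelta(hours=2),
    "implement": timedelta(hours=3),
}

# Option B: the supplier delivers a replacement later the same day.
with_delivery = {
    "obtain": timedelta(hours=8),
    "install": timedelta(hours=2),
    "implement": timedelta(hours=3),
}

for name, plan in (("on-site spare", with_spare), ("same-day delivery", with_delivery)):
    print(name, recovery_time(plan), "meets RTO" if meets_rto(plan, rto) else "misses RTO")
```

With these assumed durations, the on-site spare recovers in five and a half hours and fits the RTO, while same-day delivery takes 13 hours and does not. The same check can be repeated for each large item, such as a UPS, to decide where spare inventory is worth the cost.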
If an equipment failure causes the workload to stop functioning, then the RPO defines how much data will be lost, measured over specified periods: per hour or per quarter hour; in high-transaction systems, loss could be measured per minute or even per second. The effect on the RTO depends on the business's view of how many "chunks" of data loss are acceptable. The DR plan must show, in measurable terms, how to meet the RPO within the constraints of the business-defined RTO. If it is physically impossible to balance RPO and RTO, go back to negotiations with the business leaders. Options include investing in a BC strategy for this workload or data center, or relaxing RPO expectations so that you can decide on a reasonable RTO.
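One simple way to quantify the RPO side is to treat the data written since the last recoverable copy as lost. The sketch below applies that idea; the function name `data_loss_window`, the copy schedules, the failure time and the 15-minute RPO are hypothetical, and the calculation assumes that every listed copy can actually be restored.

```python
from datetime import datetime, timedelta


def data_loss_window(copy_times: list[datetime], failure_time: datetime) -> timedelta:
    """Time between the last recoverable copy and the failure.
    Everything written in this window is assumed to be lost."""
    earlier = [t for t in copy_times if t <= failure_time]
    if not earlier:
        raise ValueError("no recoverable copy exists before the failure")
    return failure_time - max(earlier)


# Hypothetical values for illustration.
rpo = timedelta(minutes=15)
start = datetime(2014, 2, 3, 9, 0)
failure = datetime(2014, 2, 3, 9, 52)

# Copies taken every quarter hour versus every hour.
quarter_hourly = [start + timedelta(minutes=15 * i) for i in range(5)]
hourly = [start + timedelta(hours=i) for i in range(2)]

for name, schedule in (("quarter-hourly", quarter_hourly), ("hourly", hourly)):
    loss = data_loss_window(schedule, failure)
    print(name, loss, "within RPO" if loss <= rpo else "exceeds RPO")
```

In this example the quarter-hourly schedule loses seven minutes of data and stays within the RPO, while the hourly schedule loses 52 minutes and does not. In the worst case the loss approaches the full interval between copies, so the copy interval generally needs to be no longer than the RPO.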
This was first published in February 2014