Most organizations have a strong focus on backup and restore policies. If something goes wrong with the IT environment, the goal is to have systems in place that can get you back to a known point in the fastest possible time.
The aim of any backup and restore strategy is to recover to a point in time as close as possible to the moment of failure, and to do so as quickly as possible. In technical terms, this means minimizing both the recovery point objective (RPO) -- how much data you can afford to lose -- and the recovery time objective (RTO) -- how long recovery takes. With snapshots and virtual machines, the downtime is often hours or even minutes.
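As an illustration, the two objectives are just time deltas on an incident timeline. The figures below are hypothetical, purely to show how the data-loss window and the downtime window are measured:

```python
from datetime import datetime, timedelta

# Hypothetical incident timeline, for illustration only.
last_backup = datetime(2016, 3, 1, 2, 0)    # last successful backup
failure = datetime(2016, 3, 1, 9, 30)       # moment of failure
recovered = datetime(2016, 3, 1, 11, 0)     # service restored

rpo_actual = failure - last_backup   # data lost: everything since the backup
rto_actual = recovered - failure     # downtime: how long recovery took

print(f"Data-loss window (RPO): {rpo_actual}")   # 7:30:00
print(f"Downtime (RTO): {rto_actual}")           # 1:30:00
```

Shrinking the first delta means backing up (or snapshotting) more often; shrinking the second means faster recovery mechanics, such as spinning up a VM image rather than rebuilding a physical server.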
The problem remains that between the failure and the completed recovery, you are incapable of carrying out business. The IT platform is either down or busy recovering. Until the data center is up and running again, the business is haemorrhaging: IT is causing it to fail.
Companies with the need for high availability and deep pockets have looked in the past at providing business continuity, whether with N+1 redundancy of IT components via clustering and virtualization, or even full mirroring of a live environment to a remote data center. While mirroring means a fully functional data center within minutes of a catastrophic failure, the problem has been cost. The investment is more than double that of a single data center: not only an extra data center and equipment, but also all the software and tools required to monitor and maintain the systems, identify when a problem occurs and manage the switchover. Few businesses can justify this expense.
Times are changing though, and IT service continuity -- or something close to it -- is within the reach of most organizations.
The new IT service continuity plan
Your existing IT platform likely consists of a mix of single applications running on single or clustered physical platforms alongside virtualized systems and possibly a private cloud or two. You have VMs on the virtualized infrastructure and within the cloud, and potentially have containers on your roadmap. Container technology includes Docker, CoreOS's Rocket, Microsoft's Azure Drawbridge for Windows Server and Canonical's LXD Linux containers.
Start on your IT continuity plan by creating an asset database of the enterprise's applications. For most organizations, continuity doesn't mean mirroring all the same applications with the same user experience as the primary infrastructure. Instead, the business needs to be able to continue with core processes until the main data center is back on line.
A mission critical application running on a physical server must continue operating despite an outage, but it may not need to be replicated as a physical system. Running the app as a virtual machine allows IT to spin up the image rapidly when needed and provide a good-enough user experience as a stop-gap measure. A workload that is not deemed mission critical, for example a payroll or purchasing program, may be disregarded during outages.
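The triage described above can be captured directly in the asset database: record each application's platform and criticality, and derive the continuity plan from it. A minimal sketch, with hypothetical application names and tiers chosen purely for illustration:

```python
# Hypothetical asset register entries; app names and tiers are illustrative.
assets = [
    {"app": "order-processing", "platform": "physical", "mission_critical": True},
    {"app": "customer-portal",  "platform": "vm",       "mission_critical": True},
    {"app": "payroll",          "platform": "vm",       "mission_critical": False},
    {"app": "purchasing",       "platform": "cloud",    "mission_critical": False},
]

# Only mission-critical workloads need a continuity image (e.g. a VM
# spun up as a stop-gap); the rest wait until the main data center
# is back online.
continuity_plan = [a["app"] for a in assets if a["mission_critical"]]
print(continuity_plan)  # ['order-processing', 'customer-portal']
```

The point of keeping this as structured data rather than a document is that the continuity list stays current as applications are added or retired.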
Evaluate tools to manage transfers of workloads from the prime IT platform to the service continuity one. There are vendors such as Vision Solutions, whose Double-Take portfolio offers high availability and business continuity capabilities to move workloads from one environment to another.
Several products package and provision applications or containers from one environment to another, from vendors such as StackIQ Inc., Platform9 Systems Inc., Verilume and Electric Cloud (a vendor more focused on application release automation, but with capabilities for tying packaging and provisioning tools together in a highly controlled and auditable manner). These tools do not need a hot target environment; they can provision to bare-metal, virtual or cloud environments dynamically.
By incorporating cloud and virtualization into a recovery plan, an organization does not need to pay for a mirrored data center -- it doesn't even need to pay for resources that it isn't using. It pays for the opportunity to use the platform as and when required for IT service continuity -- this expense should be within the reach of most organizations.
How to deal with data in an outage
This still leaves the knotty problem of data during and after an outage. You cannot easily package up data in the same way as an application. An application is a relatively static entity, whereas data is highly dynamic. You can use backup and restore, but you hit the same RPO and RTO problems as a standard backup and restore strategy.
Database virtualization is a better approach. Vendors such as Delphix Corp. have tools that can make copies of databases quickly and can be used locally with little resource hit. The same technology can be used over a distance: An initial copy of the database is taken, and from there on, only deltas are copied across. On any failure in the prime environment, a live copy of data is available on that remote site. This does require a hot resource on the target site, but allows for very high levels of service continuity.
Once the prime site is back up and running again, the copy and prime databases resynchronize, so no data is lost in the meantime.
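The initial-copy-plus-deltas pattern can be sketched in simplified form. This is an illustration of the general technique, not any vendor's actual mechanism; the databases are modeled as plain dictionaries and deletions are ignored for brevity:

```python
# Simplified sketch of initial-copy-plus-deltas replication.
# Illustrative only: real tools work at the block or transaction-log
# level and handle deletions, which this toy version does not.

def full_copy(source: dict) -> dict:
    """Initial copy: replicate the whole database once."""
    return dict(source)

def compute_delta(source: dict, replica: dict) -> dict:
    """Only rows that changed (or appeared) since the last sync."""
    return {k: v for k, v in source.items() if replica.get(k) != v}

def apply_delta(replica: dict, delta: dict) -> None:
    """Merge the changed rows into the remote copy."""
    replica.update(delta)

primary = {"row1": "a", "row2": "b"}
remote = full_copy(primary)          # one-off bulk transfer

primary["row2"] = "b2"               # writes continue on the primary
primary["row3"] = "c"

delta = compute_delta(primary, remote)
apply_delta(remote, delta)           # only the deltas cross the wire
print(remote == primary)  # True
```

Because each sync ships only the delta, the remote copy can be kept within minutes of the primary without the bandwidth cost of repeated full copies.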
About the author:
Clive Longbottom is the co-founder and service director of IT research and analysis firm Quocirca, based in the U.K. Longbottom has more than 15 years of experience in the field. With a background in chemical engineering, he's worked on automation, control of hazardous substances, document management and knowledge management projects.