Virtualization provides an extraordinary level of IT flexibility, and data centers can use it to migrate workloads between servers–even between entire data centers–with little (if any) user disruption. Consequently, virtualization has taken on an important role in disaster recovery preparedness. But disaster recovery doesn’t just happen; it requires careful planning and regular testing to ensure that organizations and staff can execute a recovery when bad events occur.
This month, the SearchDataCenter.com Advisory Board members shared their insights on disaster recovery planning in a virtualized environment. We asked how virtualization changed disaster recovery preparations, tools and procedures. What problems or challenges does virtualization add to data center disaster recovery? What are the effects on staff and their development?
Bill Kleyman, virtualization architect, MTM Technologies Inc.
The concept of virtualization and storing full servers on a storage area network (SAN) has revolutionized the idea of disaster recovery. In the past, disaster recovery had a lot to do with mirroring physical boxes, offsite backup and making sure that data centers could handle power loss or other environmental emergencies. With virtualization, IT engineers realized the potential for more successful disaster recovery planning; virtual platforms are easier to migrate, back up and recover. Almost every major data center has some sort of virtualization solution today, so it’s foolish not to change a disaster recovery plan to better accommodate virtualization technologies.
With a physical server, you would have to mirror or somehow replicate the workload to another physical box offsite. Sometimes (not always) this worked well, but it could still be clunky. With virtualization, it's usually far easier to simply spin up an image from a recent snapshot than it is to rebuild a physical box. Failover tools now allow entire workloads to be migrated from one physical host to another based solely on a hardware failure in the virtualization pool. This is often automated, so user interaction is kept to a minimum. Since virtual machines (VMs) are now stored on SANs, they can be replicated to the cloud or even mirrored through SAN-to-SAN technologies. All of this was much harder to do in a strictly physical environment.
But this comfort can get dangerous–complacency with disaster recovery planning is never a good thing. Virtualized disaster recovery works smoothly, but engineers still need to keep a constant watch on their environment. Know your technologies and how best to use the tools that come with them. Workloads can be migrated live, and end users probably won't even notice. Even so, exercise caution and make sure all of your virtualization platforms are updated and working properly.
Disaster recovery plans–physical or virtual–should always be tested. Depending on the size of the environment, some organizations run a test once a week, and some run it once a month. There are also a myriad of tests that can be run. For example, to test physical hardware failure inside of a pool, IT engineers may "pull the plug" on one of their physical boxes to make sure their VMs load balance to the next available server automatically. This tests network capabilities, failover and VM health as they automatically migrate to a physical host. Another testing method would be to verify that all snapshots being stored offsite are viable. Simply spinning up a VM from a snapshot to verify data integrity doesn't take long, but it helps engineers ensure the health of their environment.
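The snapshot-verification test described above can be automated. The sketch below is a minimal, hypothetical harness: the restore command and health check are stand-ins you would replace with your hypervisor's CLI or API calls and an application-level probe (none of these names come from any specific product).

```python
import subprocess
import time

def verify_snapshot_restore(restore_cmd, health_check, timeout_s=300, poll_s=10):
    """Restore a VM from its latest snapshot, then poll until it reports healthy.

    restore_cmd  -- shell command that performs the restore (hypothetical;
                    wrap your hypervisor's tooling here).
    health_check -- callable returning True once the restored VM is usable,
                    e.g. an HTTP probe or a service-port check.
    Returns True if the VM became healthy before the timeout, else False.
    """
    # Fail loudly if the restore itself errors out.
    subprocess.run(restore_cmd, shell=True, check=True)

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if health_check():
            return True
        time.sleep(poll_s)
    return False
```

Run on a schedule, a harness like this turns "spin up a VM from a snapshot to verify data integrity" from an occasional chore into a routine, logged check.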
Physical and virtual disaster recovery planning are different. In a virtualized environment, it's essential for the staff to be very familiar with their hypervisor platform–whether it's VMware or XenServer, engineers need to know how to work with and troubleshoot it quickly. Staff should be trained on all disaster recovery feature sets that come with their software, and they should test their environment regularly. The best way to learn is to get your hands dirty.
Bill Bradford, senior systems administrator, SUNHELP.org
Virtualization is a large part of our infrastructure and also makes disaster recovery planning easier. VMs are mirrored to other geographical locations, and critical systems can be brought up in those other locations in case of emergency or disaster.
I see virtualization making fewer problems for disaster recovery. Instead of needing racks and racks full of hardware, you just need a beefy cluster for VMware (or your virtualization platform of choice) and enough storage to keep your VMs on. What used to require a few racks of equipment can now theoretically be carried by hand on a large Serial Advanced Technology Attachment hard drive.
Testing and preparedness are important. Just as you would verify that everything can be brought back up on physical hardware, the process of restoring copies of the virtual environment on alternate hardware should be tested at least yearly; twice a year or even quarterly is better. Little problems like CPU revision feature lists and slight differences between generations of hardware can jump up and bite you at the worst times–like when trying to bring up critical systems after a disaster. It is better to find out about these problems (and any needed workarounds) during a test run, rather than with your boss breathing down your neck.
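The CPU-revision gotcha above is easy to screen for ahead of time. A minimal sketch, assuming you have pulled the feature-flag lists from each host (on Linux, the `flags` line of /proc/cpuinfo; the example sets below are purely illustrative):

```python
def missing_cpu_features(source_flags, target_flags):
    """Return CPU features the production host exposes that the DR host lacks.

    A VM relying on any of these features may refuse to start, or crash,
    when restored on the older target hardware.
    """
    return sorted(set(source_flags) - set(target_flags))

# Illustrative only: an older DR host missing a newer instruction set.
prod = {"sse4_2", "avx", "avx2", "aes"}
dr = {"sse4_2", "avx", "aes"}
print(missing_cpu_features(prod, dr))  # -> ['avx2']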
Make sure that all staff members are familiar with the basic operation of VMs and the ways they differ from physical boxes. The flexibility that makes VMs great for a lot of things can also be a hindrance when tuning and tweaking. Habits that work great on physical hardware don't always translate to a virtualized environment.
Michael Cote, analyst, RedMonk
In theory, the testing plans are the same, but there are different technologies to test. Disaster recovery is all about coming back up as fast as possible after some unexpected, unavoidable bad event. Virtualization technologies may make this task easier by auto-replicating across data centers in near real time, but you'll still need to test the same things–look at everything that might go wrong. You'll probably have a lot more networking and storage tests to do to ensure proper operation during disasters, because those infrastructure components are critical to successful virtualization-driven disaster recovery.
The biggest problem that I see is believing that everything is taken care of just because your sales guy told you disaster recovery planning would be done. You still need to test it routinely in realistic simulations, if not the real thing. For example, Netflix's Chaos Monkey approach looks mighty attractive for surfacing failures before they surprise you.
You really need to test disaster recovery in action, under normal usage loads and abnormal ones, to make sure your networking and storage infrastructure can take it. Depending on how much data you'll be recovering, you could be shuffling huge amounts of data around. It's one thing to test this on an empty network, but you'll want to see what it's like when the rest of the company is madly pressing reload to bring their email and Facebook pages back up during a disaster.
Easier doesn’t mean better
Ultimately, the flexibility of virtualization can be a major asset for disaster recovery. In fact, according to our State of the Data Center: 2011 report, 43% of IT professionals who use virtualization are applying the technology to disaster recovery. But it’s important to match this versatility with proper management techniques and a staff that’s trained and knowledgeable about the environment. Otherwise, virtualization can make a difficult recovery even more convoluted.