I’ve heard some customers say they can use a stretched VMware High-Availability cluster as part of their disaster recovery plans. When they say this, they are often forgetting that VMware High Availability (HA) was never intended as a disaster recovery (DR) solution.
For VMware HA to work and for its technical requirements to be met, you would need a single network spanning both sites (what's referred to as stretched VLANs) and shared storage. For a stretched VMware HA cluster to function, the physical distance between the two sites would have to be so small that the recovery site would probably not qualify as a true DR location. If the recovery site isn't far enough away, it might be caught up in the same disaster itself.
Finally, you would find that such a configuration would be at odds with VMware’s DR product called Site Recovery Manager (SRM). By bending and twisting a technology beyond its original design, you could unwittingly cause further complications when it comes to adopting other technologies from the same vendor.
Some customers have said that they don't need a DR solution like VMware's SRM because, in the event of a disaster, they can live migrate their virtual machines (VMs) with vMotion. When you scratch the surface of these "solutions," you quickly find the two sites are yards apart, so the second could never qualify as a DR site.
Additionally, they tend to forget that even though so-called long-distance migration might be viable for planned downtime, it's of no use in a true disaster: a live migration requires the source host to be up and running, which is precisely what you cannot count on when disaster strikes.
You will often hear organizations say that customer database services have zero tolerance for downtime and can suffer no data loss whatsoever. It is common that when these demands are unpacked individually, the actual availability requirements fall somewhat short of these imagined constraints. For this reason, availability discussions often feel like security discussions, where somewhat unreasonable customers demand 100% security even though everyone agrees that no such level exists without some kind of trade-off against functionality. The same goes for availability: customers often demand something they don't actually need, and then incur costs they could easily have avoided.
When faced with this almost ideological requirement for zero downtime, I frequently ask customers what systems they had previously and what outages, if any, they have experienced. We then talk about what the impact of the outages was and how they managed to continue operating. I also ask what the overall impact was on business operations, both during and after the event in question.
In most cases, I have been able to get customers to concede that an absolute zero-downtime approach might not be needed. Instead, a more reasonable "nines" approach might be appropriate, with five-nines availability equaling 99.999% uptime, or roughly five minutes of downtime per year. The idea is that you can achieve 99.999% uptime, assuming you have selected the right technology in the first place.
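To make the "nines" discussion concrete, it helps to translate an availability percentage into an annual downtime budget. The following is a minimal sketch (the helper function name is my own, not part of any product):

```python
def downtime_per_year(availability_pct):
    """Return the allowed downtime per year, in minutes, for a given
    availability percentage (e.g. 99.999 for five nines)."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes in a non-leap year
    return minutes_per_year * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime allows {downtime_per_year(pct):.1f} minutes of downtime/year")
```

Running this shows how steep the curve is: three nines permits over eight hours of downtime a year, while five nines leaves barely five minutes, which is why the extra nines cost so much to deliver.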
It's useful for those working within virtual data centers to acknowledge frankly that 100% availability is not always a goal worth striving for. The reality is that a lower level of availability may suffice, because there are always maintenance windows to factor in for tasks such as patch management, where a planned outage may be mandatory. The bottom line is that you can't measure whether an application or service meets its availability target if you haven't set an appropriate value on what you are trying to deliver in the first place.
In the case of service availability, it's quite clear that solutions such as VMware's Site Recovery Manager, VMware High Availability and VMware Fault Tolerance (FT) would not deliver, because none of these technologies is service-aware. Their design remit is to deliver more availability to VMs, or to protect your host or site from an outage, not to protect the services running inside the guest.
This isn't a criticism but rather an acknowledgement that not all virtualization vendors offer availability all the way up the stack into the guest operating system (OS) and the services it provides. Most virtualization vendors balk at delivering availability to services inside the OS. In fact, they make OS neutrality a badge of honor, so as not to artificially reduce the adoption of virtualization by promoting or enforcing one OS over another within the container that is the VM. Using virtualization availability tools alone to deliver five-nines availability is another example of using the wrong tool for the job.
The alternatives really fall into three camps:
- The provider of the service inside the OS will have its own availability solution. This is not very common, but some independent software vendors (ISVs) do allow customers to scale out their solutions for both capacity and availability.
- The OS vendor offers an availability tool that is free of charge.
- You purchase a separate system that installs an in-guest agent and offers better availability than the first two options. This third option invariably involves an additional software purchase, additional configuration and a certain amount of multi-vendor coordination. For some time, people have wanted to escape this third alternative to save money and reduce complexity.
Every time a virtualization vendor releases new software, the community asks whether the new functionality "kills" this kind of application clustering. Given the extreme availability constraints that some services invariably introduce, and some virtualization vendors' reluctance to intrude into the guest OS level of the stack, it's unlikely that this Holy Grail will ever be found. This is in marked contrast to the backup arena, where VM backup has progressed in leaps and bounds in recent years: nowadays a backup taken at the virtualization layer is as good as one taken by an agent installed in the guest OS.
Third parties in the availability space
At best, it seems more likely that the virtualization vendor will expose application programming interfaces (APIs) into the VM, enabling third parties to integrate more closely with the virtualization layer and offer availability for the OS's services. A good example is the recent release of VMware's vSphere 4.1, where VMware High Availability now has an option called application monitoring. This could allow third parties in the availability space, such as DoubleTake and NeverFail, to use this new VMware High Availability functionality.
Availability tools for VMs are also getting plenty of attention lately, as the number of VMs that house business-critical apps grows daily. Most virtualization vendors now have some type of clustering solution that will restart VMs on other physical nodes in the event of a critical host failure. For example, VMware has its HA solution, Microsoft has retrofitted its Clustering Service to support the restart of VMs, and Citrix has its own clustering service in XenServer HA. Frequently, these technologies fail because minimum requirements have not been met and the configuration has not been properly tested. You can break these requirements down into a simple checklist:
- VMs stored on shared storage, be it a VMware VM file system or network file system volume, a Microsoft Cluster Shared Volume or the shared storage required by Citrix XenServer HA.
- VM failover onto hosts that present the same networks.
- Network redundancy to both the VM network and the availability services “heartbeat” network.
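The checklist above lends itself to an automated pre-flight check. The sketch below assumes a hypothetical inventory format (dicts describing VMs and hosts); in practice you would populate these structures from your platform's management API rather than by hand:

```python
def check_ha_readiness(vms, hosts):
    """Return a list of human-readable problems that would break HA failover.

    vms:   list of dicts like {"name", "datastore", "networks"}
    hosts: list of dicts like {"name", "datastores", "networks", "heartbeat_uplinks"}
    (a hypothetical inventory shape, not a real vendor API)
    """
    problems = []
    # Every host must see the same shared datastores and carry the same
    # VM networks, or restarts on a surviving host will fail.
    common_datastores = set.intersection(*(set(h["datastores"]) for h in hosts))
    common_networks = set.intersection(*(set(h["networks"]) for h in hosts))
    for vm in vms:
        if vm["datastore"] not in common_datastores:
            problems.append(f"{vm['name']}: stored on non-shared datastore {vm['datastore']}")
        for net in vm["networks"]:
            if net not in common_networks:
                problems.append(f"{vm['name']}: network {net} missing on some hosts")
    # The availability "heartbeat" network needs redundant uplinks on each host.
    for h in hosts:
        if h.get("heartbeat_uplinks", 0) < 2:
            problems.append(f"{h['name']}: no redundancy on the heartbeat network")
    return problems
```

Run against your inventory before enabling HA, and again after any host or network change: an empty list means the checklist holds; anything else is a failover waiting to go wrong.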
Frequently, customers do not meet these rather simple requirements. The result is false positives, where failovers happen when they shouldn't, often referred to as the "split brain" phenomenon.
Alternatively, when a failure actually happens, the failover process fails for one or more VMs because a member of the team has stored a VM on local storage or plugged it into a network that doesn't exist on the host where the VM is restarted. This seems to happen regardless of best practice and change management controls. The reason is that the level of daily access required for steps such as creating a new VM could never be put through conventional change management routines; it would simply be too bureaucratic to be of any use to the business. Instead, the way to escape these problems is to remove the human element and reduce the number of manual steps operators must complete before they can get their VMs. This means a level of automation and orchestration that the previous generation of virtualization professionals did not reach.
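Removing the human element can be as simple as a provisioning wrapper that refuses placements which would later break failover, so operators never have to remember the rules. This is a sketch only: the datastore and network whitelists and the `backend_create` callable are assumptions standing in for your platform's real API.

```python
# Assumptions: these sets would be discovered from the cluster, not hard-coded.
SHARED_DATASTORES = {"shared1", "shared2"}   # storage visible to every host
CLUSTER_NETWORKS = {"prod", "dmz"}           # networks present on every host

def provision_vm(name, datastore, networks, backend_create):
    """Validate placement against HA rules, then delegate to the real
    creation call (backend_create is a stand-in for a vendor API call)."""
    if datastore not in SHARED_DATASTORES:
        raise ValueError(f"{name}: {datastore} is not shared storage")
    missing = set(networks) - CLUSTER_NETWORKS
    if missing:
        raise ValueError(f"{name}: networks {sorted(missing)} not on all hosts")
    return backend_create(name, datastore, networks)
```

The design point is that the policy lives in code rather than in a change management document, so a VM on local storage or a nonexistent network simply cannot be created in the first place.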
As for testing any availability solution, I'm a firm believer in a hard pull-the-plug test before going into production and at intervals during the life of the installation. Most in-guest availability services allow you to do this non-intrusively, which mitigates the risk of the test itself causing an outage. But with the HA systems from many virtualization vendors, such soft testing has yet to be developed, leaving many customers to find out at the worst possible time whether their configurations actually work.
Mike Laverick is a professional instructor with 17 years' experience in technologies such as Novell, Windows and Citrix. Involved with the VMware community since 2003, Laverick is a VMware forum moderator and member of the London VMware User Group Steering Committee. He is also the owner and author of the virtualization website and blog RTFM Education, where he publishes free guides and utilities aimed at VMware ESX/Virtual Center users.