Preventing virtual resource oversubscription with capacity management

Identify the symptoms of resource oversubscription, and learn capacity management strategies to prevent it.

In today’s data centers, organizations are packing more virtual machines (VMs) onto fewer physical servers. That’s great for server consolidation, but it could lead to oversubscription of CPU cycles, memory and other computing resources. In this tip, I’ll identify the symptoms of resource oversubscription and offer capacity management suggestions that can limit or prevent it.

Server and desktop consolidation raises concerns that concentrations of workloads bring with them inevitable risk of downtime and the potential of oversaturation of the core computing resources—memory, processor, network and disk. 

In an ideal world, a hypervisor should consume these resources as fully as possible, while leaving headroom to accommodate growth in the VMs as well as any unexpected surges in workload. Expressed simply, a hypervisor consuming only 1% of memory, CPU, network or disk is underutilized. Likewise, a hypervisor running at 99% of memory, processor, network or disk is likely to provide poor performance and be a major bottleneck in any clustered environment.
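The two extremes described above can be sketched as a simple classification. This is a minimal illustration, not a tuning rule: the function name and the 20% and 80% thresholds are assumptions chosen for the example, and the right values depend on your own environment.

```python
def classify_utilization(used_pct, low=20.0, high=80.0):
    """Classify one host resource reading (percent used).

    The low/high thresholds are illustrative assumptions, not vendor guidance.
    """
    if not 0.0 <= used_pct <= 100.0:
        raise ValueError("used_pct must be between 0 and 100")
    if used_pct < low:
        return "underutilized"
    if used_pct > high:
        return "saturated"
    return "balanced"


# Hypothetical readings for the four core resources on one host.
host = {"memory": 99.0, "cpu": 45.0, "network": 1.0, "disk": 60.0}
report = {name: classify_utilization(pct) for name, pct in host.items()}
print(report)
```

The gap between the two thresholds is the headroom: the wider it is, the more room VMs have to grow before the host is flagged.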

Striking a happy balance with capacity management
It’s the administrator’s job to strike a happy balance between these extremes through capacity management. It’s as much an art as a science, and the administrators who are successful know their environments and are able to work with application owners to resolve any performance problems.

VMs don’t typically cause most performance problems. It’s badly configured applications that are usually the culprits. The virtualization layer always takes the heat on this because application owners will say that the service worked brilliantly on a physical server. Unless the VM was created by a physical-to-virtual (P2V) process, it is likely that the way the application was configured does not mirror the physical system. That said, most customers paradoxically experience improvements in performance because virtualization projects often bring in new and improved hardware at the server and storage layers.

Memory is king
The biggest single constraining resource in virtualization is memory. This is the resource that environments run out of before they run out of CPU cycles or bandwidth to the network or storage array.

To avoid this, begin by “right-sizing” your VMs relative to the demands of your application. This means resisting the demands of application owners who request VMs with the same specification as the physical server.

A dose of reality is needed here. It’s totally unrealistic to think a Tier 1 application, such as Microsoft Exchange, SQL or Oracle, will sit happily with just the allocation needed to run a 64-bit operating system. In general, most environments are risk-averse, and administrators have a tendency to over-spec the VMs in hopes that they will not experience any blowback from disgruntled application owners.

This type of approach to resource allocation should be avoided at all costs. VMs that go beyond spec cost the environment in wasted resources that could have been allocated to more deserving VMs. It also systematically and unnecessarily degrades the performance of features and wastes resources elsewhere.

Wasted resources
Another area to review is any system that has been converted to a VM through the process of P2V. More often than not, IT folks choose not to downgrade the memory allocation, leaving the VMs with the same allocation as the original physical machine. This can contribute to a massive waste in resources and a capacity management problem.

Remember why you virtualized in the first place? You had many physical systems that were using only 10% to 20% of their resources, taking up space and power in the data center.
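One way to approach right-sizing, for new VMs and P2V conversions alike, is to start from observed peak usage rather than from the physical server’s specification. The sketch below assumes you already have a peak memory figure for the workload; the function name, the 25% headroom and the 256 MB rounding step are illustrative assumptions.

```python
import math


def right_size_mb(peak_used_mb, headroom=0.25, step_mb=256):
    """Suggest a VM memory allocation from observed peak usage.

    headroom and step_mb are illustrative assumptions; tune them per workload.
    """
    if peak_used_mb <= 0:
        raise ValueError("peak_used_mb must be positive")
    target = peak_used_mb * (1 + headroom)
    # Round up to the next allocation step so the VM is never under target.
    return int(math.ceil(target / step_mb) * step_mb)


# A converted server had 16 GB installed but peaked at roughly 3 GB in use.
physical_mb = 16 * 1024
suggested_mb = right_size_mb(3000)
print(suggested_mb, "MB suggested instead of", physical_mb, "MB")
```

Treat the result as a starting point for a conversation with the application owner, not as a final allocation.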

If you think you are experiencing memory problems, check the following areas:

  • Does the VM have enough memory allocated?
  • Has the physical server run out of memory?
  • Is there swap activity taking place inside the guest operating system and at the hypervisor level?
  • Are there unusually high statistics in the hypervisor’s memory management systems?
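The four checks above can be sketched as a small triage function. The field names, units and thresholds here are hypothetical; map them onto whatever counters your hypervisor and guest tools actually expose. The last field stands for memory the hypervisor has reclaimed from the VM (for example, through ballooning).

```python
from dataclasses import dataclass


@dataclass
class MemorySample:
    """One set of readings for a VM and its host (hypothetical field names)."""
    vm_allocated_mb: float
    vm_demand_mb: float
    host_total_mb: float
    host_used_mb: float
    guest_swap_kbps: float
    host_swap_kbps: float
    reclaimed_mb: float


def memory_findings(s, host_full_pct=95.0, reclaim_limit_mb=0.0):
    """Return the checklist items that a sample trips (thresholds are assumptions)."""
    findings = []
    if s.vm_demand_mb > s.vm_allocated_mb:
        findings.append("VM allocation is smaller than its demand")
    if 100.0 * s.host_used_mb / s.host_total_mb >= host_full_pct:
        findings.append("physical server is close to running out of memory")
    if s.guest_swap_kbps > 0 or s.host_swap_kbps > 0:
        findings.append("swap activity in the guest or at the hypervisor")
    if s.reclaimed_mb > reclaim_limit_mb:
        findings.append("hypervisor is reclaiming memory from the VM")
    return findings


sample = MemorySample(4096, 5000, 65536, 64225, 0.0, 120.0, 512.0)
for finding in memory_findings(sample):
    print(finding)
```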

CPU bottlenecks
In some environments CPU bottlenecks can happen. In these situations, the memory payload of the guest operating system and application is small, but the service is carrying out a high volume of transactions per second.

Despite all the fancy footwork that modern hypervisors bring to the table, it is still the case that a VM with one virtual CPU (vCPU) can send threads to only a single CPU core. In this respect, raw performance is the number of cycles per second that core can provide.

Bear in mind that with virtualization it is still unlikely that a VM with one vCPU would gain exclusive access to a core. It’s more likely that the VM will have to share the CPU with other VMs. 

Without proper capacity management, if this sharing is allowed to continue unchecked, it is possible that the CPU could become saturated with requests. In this scenario, CPU contention takes place. Fundamentally, if a VM needs more CPU cycles than the core can provide, the only way to deliver CPU cycles greater than 100% is to configure the VM with two or more vCPUs to deliver true symmetric multi-processing.

Before you do that, though, you need to decide if the CPU is the constraining resource and confirm that the application within the guest operating system is multi-threaded. There’s little point in giving the VM two vCPUs if threads are executed only on CPU0 while CPU1 is still there twiddling its thumbs.

As time goes on, you might find yourself running on the more modern Intel Nehalem architecture. Studies have shown that a single Intel Nehalem core can actually outperform an SMP-enabled system using the older CPU types. To identify CPU bottlenecks, investigate the following areas:

  • Using your hypervisor, identify if any VMs are using 100% of the CPU. Avoid using the guest operating system tools in the VM, and look for performance data delivered by the hypervisor—it will be more accurate.
  • Look for high %Ready values in VMware ESX because this is an indication that your VM would like more CPU cycles but isn’t receiving them.
  • Look for high co-stop values because this can show excessive use of SMP in your VMs.

That last tip needs some explanation. Sometimes administrators go overboard in giving every VM more than one vCPU as a standard—even when it’s not entirely necessary. If you use virtual SMP excessively, you can give the hypervisor more work than it’s expecting. It has to work harder to schedule multiple vCPUs across multiple cores inside the CPU socket.

This can actually increase the very contention you were trying to avoid in the first place. So strictly control the use of virtual SMP, especially on the modern CPU architectures where its benefits may be limited.
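The ready and co-stop checks discussed in this section can be sketched as follows. The example assumes both counters are available as milliseconds of waiting per sample interval; the 20-second interval, the function names and the 10% and 3% limits are all assumptions for illustration, so check how your own hypervisor reports these values before relying on them.

```python
def pct_of_interval(wait_ms, interval_ms):
    """Express time spent waiting as a percentage of the sample interval."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return 100.0 * wait_ms / interval_ms


def cpu_findings(vcpus, ready_ms, costop_ms, interval_ms=20000,
                 ready_limit_pct=10.0, costop_limit_pct=3.0):
    """Flag CPU contention symptoms for one VM (limits are assumptions)."""
    findings = []
    if pct_of_interval(ready_ms, interval_ms) > ready_limit_pct:
        findings.append("high ready time: the VM wants cycles it is not getting")
    # Co-stop only matters for VMs with more than one vCPU.
    if vcpus > 1 and pct_of_interval(costop_ms, interval_ms) > costop_limit_pct:
        findings.append("high co-stop: consider reducing the number of vCPUs")
    return findings


# A hypothetical 4-vCPU VM that waited 3 s ready and 1 s co-stopped in 20 s.
for finding in cpu_findings(vcpus=4, ready_ms=3000, costop_ms=1000):
    print(finding)
```

A VM that trips only the co-stop check is a candidate for fewer vCPUs, not more.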

Network bottlenecks
It’s a common misconception that with many VMs sharing the same physical networks, network bandwidth would become scarce. As with CPU resources, most networks work in a non-linear way so that many systems can co-exist on the same network without treading on each other’s toes.

In most environments you will struggle to see VMs saturate even a 1 Gbps interface. It’s common that these interfaces are teamed together for fault-tolerance and load balancing. So if it’s unlikely in a 1 Gbps environment, it’s even less so when you have bundles of network interface cards.

The reality is that your network bottlenecks are more likely to be seen during the process commonly called live migration, in which VMs with large memory allocations are moved from one physical server to another. Again, right-sizing your VM’s memory allocation is critical in reducing the performance hit that live migration brings. It’s also important to dedicate gigabit-and-above network interfaces to this ancillary process.
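A rough back-of-the-envelope calculation shows why memory size matters for live migration. The sketch below is a lower-bound estimate only: it assumes a single pass over the VM’s memory and ignores pages that change during the copy and must be sent again. The function name and the 80% link efficiency figure are assumptions.

```python
def migration_seconds(memory_gb, link_gbps, efficiency=0.8):
    """Estimate the time to copy a VM's memory once over a network link.

    Assumes decimal units (1 GB = 8e9 bits) and a single pass; real
    migrations re-send pages that change during the copy, so this is a floor.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    bits_to_move = memory_gb * 8 * 1e9
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_move / usable_bits_per_second


for memory_gb in (4, 16, 64):
    seconds = migration_seconds(memory_gb, link_gbps=1)
    print(f"{memory_gb} GB over 1 Gbps: about {seconds:.0f} s")
```

Under these assumptions, quadrupling the memory allocation quadruples the copy time, which is one more argument for right-sizing.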

Storage bottlenecks
Storage oversubscription is usually caused by simple capacity management administrative errors. What’s surprising are the I/O demands that virtual desktop projects can sometimes impose at the storage layer.

It’s not uncommon to see spikes in storage I/O and CPU load caused by “boot storms” and anti-virus scanning. Storage vendors can help by allowing customers to purchase caching modules that add solid-state storage to alleviate the I/O chokepoints in virtual desktop infrastructure (VDI).

Adequate capacity management planning needs to occur at an early stage so the costs of scaling up a VDI solution are exposed early. The tipping point for the use of such caching technologies appears to be 500 to 600 VMs.

Below this point, simply distributing the virtual desktops across various arrays, logical unit numbers (LUNs) and spindles appears to be enough. Beyond the 500 to 600 VM range, businesses should seriously begin considering solid-state solutions as a way of taking the disk spindle out of the equation.

This caching approach can also alleviate some of the IOPS load generated by routine VDI tasks such as creating and destroying virtual desktops as users log out or join the system.

For server-based VMs that are disk I/O bound, many simple tasks can be used to boost performance. Although it’s perfectly fine for 10 or 20 VMs to occupy the same volume/LUN, it is reasonable to dedicate a volume/LUN to a specific disk I/O bound application, or merely to reduce the ratio of VMs to a datastore. Either step increases the disk IOPS available to each of the competing VMs and shortens the disk queues to storage. Other optimization techniques include the following:

  • When possible, use the hypervisor’s paravirtualized SCSI controller inside the VM.
  • Distribute the virtual disks of the VM across multiple volume/LUNs to ensure virtual disks do not compete against each other for I/O.
  • Adopt permissions on datastores to prevent rogue administrators from creating VMs in the wrong location by sorting storage according to the amount of free space without considering the IOPS needs of the VM.
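The effect of lowering the ratio of VMs to a datastore can be shown with simple arithmetic. The sketch below assumes an even split of a datastore’s IOPS among its VMs, which real workloads rarely follow; the function names and the 2,000 IOPS figure are hypothetical.

```python
def iops_per_vm(datastore_iops, vm_count):
    """Average IOPS available to each VM, assuming an even split."""
    if vm_count < 1:
        raise ValueError("vm_count must be at least 1")
    return datastore_iops / vm_count


def max_vms_for_target(datastore_iops, target_iops_per_vm):
    """Largest VM count that still leaves each VM its target IOPS."""
    if target_iops_per_vm <= 0:
        raise ValueError("target_iops_per_vm must be positive")
    return int(datastore_iops // target_iops_per_vm)


# A hypothetical volume/LUN that sustains 2,000 IOPS.
datastore_iops = 2000
for vm_count in (20, 10, 1):
    share = iops_per_vm(datastore_iops, vm_count)
    print(f"{vm_count} VMs: {share:.0f} IOPS each")
print("VMs that fit at 250 IOPS each:", max_vms_for_target(datastore_iops, 250))
```

Placing VMs by IOPS budget in this way, rather than by free space alone, addresses the placement mistake described in the last bullet.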

Although it’s true that higher consolidation ratios increase the potential for saturation of core resources, it’s not inevitable. Today’s hardware and software are keeping up with these increased resource demands.

In the world of virtualization, memory continues to be the key constraint. But there are plenty of capacity management steps an administrator can take to tweak performance and avoid VM contention before it gets out of hand.

Eventually the constraint on consolidation ratios may be more about businesses feeling anxious about putting too many eggs in one basket. In this respect it could be that an availability gap is opening. Although hardware and hypervisors increase resource capabilities, the risks of very high consolidation ratios remain despite the widespread use of hypervisor clustering, fault tolerance and in-guest service protection tools.

About the expert: Mike Laverick is a professional instructor with 17 years of experience in technologies such as Novell, Windows and Citrix. Involved with the VMware community since 2003, Laverick is a VMware forum moderator and member of the London VMware User Group Steering Committee. He is also the owner and author of the virtualization website and blog RTFM Education, where he publishes free guides and utilities aimed at VMware ESX/Virtual Center users.

