IT pros say the new memory management features and virtualization capabilities in the latest x86 processors are welcome, but the complexity involved can be challenging.
From chip manufacturers to operating system vendors to virtualization software providers, everyone offers features to optimize memory capacity. These features become crucial in a virtualized data center, where x86 server utilization has increased and more apps battle for the same memory resources.
One issue is that a single IT staffer covers a lot of jobs, so it's increasingly difficult to track data center issues as they arise.
"Not only am I the VMware administrator, but I'm also the network engineer, systems engineer, SAN administrator, and I share a security administrator role," said Kendrick Coleman, a VMware administrator at a nonprofit in the Louisville area. "That's a lot of different hats to wear, and I try to spend at least part of my day reading articles and blogs that focus on all aspects of my job."
Memory management features pose problems
Coleman recently ran into a problem with a memory management feature. In x86 processors, the trend is to build memory management features into the silicon. Intel's Xeon 5500, or Nehalem series, has them, as do AMD Opteron processors using Rapid Virtualization Indexing, also called RVI. But some IT pros running VMware ESX on this hardware have encountered what at first glance appears to be a memory allocation problem.
Coleman's company runs VMware ESX 3.5 on two IBM x3650 M2 rackmount servers. The servers run on Intel Xeon 5530 quad-core, 2.4 GHz Nehalem chips, with 48 GB of RAM and thus far host about 11 virtual machines.
Coleman reported some "weird memory issues when running ESX on Nehalem." In one instance, a VM's task manager reported using only 530 MB of RAM, even though it had been allocated more than 2.5 GB of RAM from the host.
It was "as if some of our VMs just aren't doing TPS as they should," Coleman said, referring to Transparent Page Sharing, a long-standing VMware ESX memory deduplication feature.
Coleman's case isn't unusual, as evidenced by an extensive thread on VMware's community site, and does, in fact, relate to TPS, said Eric Horschman, VMware product marketing director.
How VM memory issues arise
The ESX hypervisor inspects memory pages loaded by guest operating systems running in virtual machines. When it finds identical memory pages, it saves a copy of the page and then creates a pointer to it for the virtual machines. Duplicate memory pages are common in highly virtualized environments; if someone runs 20 copies of Windows Server on the same physical machine, there's a good chance there are identical memory pages.
By using TPS, IT pros can save host memory and cram more virtual machines into the physical box.
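The page-sharing idea above can be sketched in a few lines of Python. This is an illustrative model only, not VMware's implementation: real TPS hashes pages to find share candidates, verifies them bit for bit, and remaps them copy-on-write, none of which this toy reproduces. All function and variable names here are invented for the example.

```python
# Toy model of transparent-page-sharing-style deduplication.
# Identical guest pages are stored once; each (vm, page) keeps a pointer.
import hashlib

PAGE_SIZE = 4096  # the small x86 page size at which TPS shares pages

def dedupe_pages(vm_pages):
    """Collapse identical pages across guests into single shared copies.

    vm_pages: dict of vm_name -> list of page contents (bytes).
    Returns (shared_store, pointers, pages_saved).
    """
    shared_store = {}   # content digest -> the single stored copy
    pointers = {}       # (vm, page_index) -> content digest
    total = 0
    for vm, pages in vm_pages.items():
        for i, page in enumerate(pages):
            total += 1
            digest = hashlib.sha256(page).hexdigest()
            # First occurrence stores the page; duplicates just point to it
            shared_store.setdefault(digest, page)
            pointers[(vm, i)] = digest
    pages_saved = total - len(shared_store)
    return shared_store, pointers, pages_saved

# Twenty near-identical guests: a shared zero page, a shared OS code
# page, and one page of unique data per guest.
zero = bytes(PAGE_SIZE)
code = b"\x90" * PAGE_SIZE
vms = {f"win{n}": [zero, code, bytes([n]) * PAGE_SIZE] for n in range(1, 21)}
store, ptrs, saved = dedupe_pages(vms)
print(f"{len(ptrs)} guest pages backed by {len(store)} physical pages; {saved} saved")
```

Running the sketch, the 60 guest pages collapse to 22 physical copies, which is why duplicate-heavy hosts (20 copies of the same Windows Server) benefit so much.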
When installed on older processors, ESX performs several memory management functions in software. On chips with built-in memory management features, that work is offloaded to the hardware, freeing ESX for other tasks. ESX can also use larger memory pages, which boost application performance, particularly for database apps such as Oracle and SQL Server. (If an application doesn't have to access as many memory pages, it can run faster.)
But if a server isn't running enough virtual machines to consume all the system's memory -- a state often called being "undercommitted" -- and ESX is using larger memory pages, TPS won't be as effective because there are fewer memory pages it can dedupe, resulting in Coleman's problem.
But Horschman said that as a server's memory gets closer to being fully committed, ESX switches back to smaller memory pages so that TPS can dedupe more effectively and end users can then pack more VMs into the host.
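The trade-off Horschman describes can be condensed into a simple policy sketch. This is a hypothetical illustration of the behavior, not VMware's actual logic; the function name and the 0.9 commitment threshold are assumptions made for the example.

```python
# Illustrative policy: large pages while host memory is undercommitted,
# small pages as commitment rises so TPS has pages it can dedupe.

SMALL_PAGE_KB = 4        # 4 KB small pages, the unit TPS can share
LARGE_PAGE_KB = 2048     # 2 MB large pages favored for performance

def choose_page_size(allocated_mb, host_mb, threshold=0.9):
    """Pick a backing page size for new guest memory mappings.

    allocated_mb / host_mb approximates how committed host memory is;
    the default threshold is an assumed value for illustration.
    """
    commitment = allocated_mb / host_mb
    if commitment < threshold:
        # Undercommitted: large pages mean fewer address translations and
        # faster apps, but few identical 2 MB pages exist, so TPS saves little.
        return LARGE_PAGE_KB
    # Near full commitment: back memory with small pages so TPS can
    # collapse duplicates and more VMs fit on the host.
    return SMALL_PAGE_KB

print(choose_page_size(allocated_mb=24_000, host_mb=48_000))  # 2048
print(choose_page_size(allocated_mb=46_000, host_mb=48_000))  # 4
```

Under this model, a 48 GB host like Coleman's with only 11 VMs stays in large-page mode, which is exactly the regime where TPS appears idle and the reported memory numbers look wrong.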
That explanation makes sense to Coleman, who conceded that he encountered no application performance issues, just apparent memory-reporting anomalies. But for many IT pros, the perception is still a problem. Horschman said Update 1 to ESX version 4 corrects the misleading reporting, but Coleman was running ESX 3.5 at the time.
"It's just a side effect that was never really brought out to the open and shared among Nehalem users," Coleman said. He later added that "perhaps the consumer is to blame for purchasing a product and not knowing the pros and cons. But we can't ask Intel to demo a CPU, so how are we to know?"
An 'issue' or a bug?
Matthew Doak, a systems engineer at the Grand Rapids, Mich.-based healthcare management company ProCare Systems Inc., confronted similar TPS issues in vSphere 4 and was displeased with VMware's response.
"I think they initially underestimated the effect the bug was having in terms of generating false alerts in vCenter and even causing [Distributed Resource Scheduler] to move VMs around because it thought memory was overutilized on the Nehalem hosts," he said. ProCare circumvented the problem by following suggestions on a VMware community forum until VMware issued the patch in Update 1.
Doak advised that anyone who is engaged in a virtualization project should pay close attention to compatibility lists to make sure the hardware is supported. Doak experienced repeated host crashes after upgrading to vSphere 4. He eventually discovered that this "ugly experience" was because a host had mismatched CPUs.
"Despite running fine for over a year with ESX 3.5, upgrading to 4 caused not only that host to crash repeatedly but also other hosts that had machines VMotioned to them from that host to crash with a purple screen of death," he said. "It was pretty much a nightmare."

Whether these problems result from bugs or from a lack of administrator training, they can have a chilling effect on virtualization. Place Properties, an Atlanta-based property developer, purchased three Nehalem-based Dell R710 servers in anticipation of ESX 4, said Nathan Raper, a systems support manager. He too experienced ESX's memory-reporting problems, which caused false alarms and, more important, undermined the executive committee's confidence in VMware as the company tried to execute its proof of concept.
Despite these problems, IT pros like Raper haven't asked for fewer virtualization and memory management features. They just want better compatibility, along with better documentation and support for the problems that may arise.
"While there is a level of complexity being added to the IT industry in regards to new virtualization technologies, at the end of the day really it's getting simpler, not harder," he said. "For someone like me, I'm put in a place where I have to decide on a virtualization platform and subsequently learn it. But after that, it's actually granting a large amount of tools that are making our lives simpler."
Mark Fontecchio can be reached at email@example.com.