IT pros called in on big data projects are finding that the typical approach doesn’t play nice on enterprise-grade virtualized infrastructure.
Brace yourself for big data. If it hasn’t already hit your data center, it will soon, putting new demands on IT infrastructure and operations.
Big data analytics are used by sites like eHarmony to bring couples together, by retailers to predict customers’ buying behavior, and even by healthcare organizations to predict a person’s lifespan and future ailments. It’s a little bit Big Brother, but it’s also revolutionizing the way computing is used to interpret and influence human behavior.
Big data isn’t just data growth, nor is it a single technology; rather, it’s a set of processes and technologies that can crunch through substantial data sets quickly to make complex, often real-time decisions.
“It really is the future of what health care should be, using predictive analytics to improve treatment,” said Michael Passe, storage architect for Beth Israel Deaconess Medical Center (BIDMC) based in Boston. “It could be really big anywhere you might want a kind of crystal ball.”
Sounds good, but IT professionals involved with big data initiatives may find that the new plans contradict the last decade’s worth of virtualization and consolidation in the data center.
An influx of commodity systems
Generally, big data analytics require an infrastructure that spreads storage and compute power over many nodes, in order to deliver near-instantaneous results to complex queries.
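To make the scale-out idea concrete, here is a minimal sketch of the scatter-and-gather pattern behind this kind of infrastructure. It is an illustration only, not Hadoop code: worker threads stand in for cluster nodes, and the function names (`partition`, `local_count`, `distributed_word_count`) are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_nodes):
    """Split records into num_nodes shards, round-robin (assumed placement policy)."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def local_count(shard):
    """Work done on one 'node': count words in its own shard only."""
    counts = Counter()
    for line in shard:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(records, num_nodes=4):
    """Scatter shards to workers, then merge the partial results."""
    shards = partition(records, num_nodes)
    # Threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_count, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

Each worker touches only its own shard, which is why adding nodes adds both storage and compute capacity at the same time.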
The most commonly used platform for big data analytics is the open-source Apache Hadoop, which uses the Hadoop Distributed File System (HDFS) to manage storage. Distributed NoSQL databases, such as Cassandra, are also commonly associated with big data projects.
These are relatively new technologies, and as such, come with some maturity problems.
For example, HDFS does not natively incorporate certain tenets of storage design that have become gospel to storage managers over the years: archive, backup, snapshot and high availability, said John Webster, senior partner for the Evaluator Group, based in Boulder, Colo.
“Experienced Hadoop users tend to work for social media companies, and they’re coming at this with the idea that storage is dumb disk, where you throw in a node and pound I/O against it,” Webster said. “All the storage intelligence developed over the last two decades, it’s like it doesn’t exist.”
And regulatory compliance? “Forget it,” Webster said. “There’s no way to lock down a file.”
Furthermore, Hadoop is most commonly deployed on a cluster of physical servers in which the storage network and compute network are one and the same, often leaving enterprise storage and infrastructure pros with another separate, physical infrastructure to manage.
At Mazda North America, headquartered in Irvine, Calif., the servers are 90% virtualized, and infrastructure architect Barry Blakeley is working to push that ratio higher.
In the meantime, however, at least one of Mazda’s business units is considering big data projects using QlikView or SAP’s BizObjects, or some combination of the two, much of which requires physical servers with direct-attached local storage.
“I’m trying to virtualize, and here we are putting in physical servers,” Blakeley said.
That translates to management headaches. Separate environments and silos of data mean “a lot of dashboards, and there are so few of us it becomes unwieldy to manage it all on separate devices,” he added.
Into the big data fold
Projects are afoot to counteract this trend, most notably VMware Inc.’s Project Serengeti, which rolled out version 1.0 at the end of December. Serengeti, like Hadoop itself, is an open source project released under the Apache license.
The purpose of the project is to produce a freely downloadable offering that “enables rapid deployment of standardized Apache Hadoop clusters on an existent virtual platform, using spare machine cycles, with no need to purchase additional hardware or software,” according to a VMware blog post.
A virtualized Hadoop cluster can also take advantage of VMware’s native high availability and fault tolerance capabilities to protect critical components such as the HDFS NameNode, which keeps track of all the files in the file system and is a single point of failure. Hadoop does not yet natively offer high availability for the NameNode, another fingernails-on-a-chalkboard omission for enterprise infrastructure admins, particularly failure-conscious storage pros.
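The single-point-of-failure concern can be shown with a toy model. This is a sketch under stated assumptions, not the HDFS API: the `NameNode` class, `read_file` function and the dictionary layout of the data nodes are all hypothetical, and real HDFS adds block replication and other machinery omitted here.

```python
class NameNode:
    """Toy metadata service: maps each file path to (data_node, block_id) pairs."""

    def __init__(self):
        self.block_map = {}

    def add_file(self, path, locations):
        self.block_map[path] = list(locations)

    def locate(self, path):
        return self.block_map[path]


def read_file(name_node, data_nodes, path):
    """Reassemble a file by asking the metadata service where its blocks live."""
    if name_node is None:
        # The blocks still sit on the data nodes, but nothing knows where.
        raise RuntimeError("NameNode unavailable: block locations unknown")
    return b"".join(
        data_nodes[node][block] for node, block in name_node.locate(path)
    )
```

In this model the data survives the loss of the metadata service but cannot be read, which is the failure mode that high-availability protection is meant to cover.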
Other vendors, like Symantec Corp. and Red Hat Inc., propose replacing HDFS with their own scale-out file systems: Clustered File System and the Gluster File System, respectively. These more mature file systems offer capabilities like snapshots and high availability.
At least one centralized storage vendor, EMC Corp., claims native integration with HDFS that solves its high-availability challenges, through its Isilon scale-out network-attached storage (NAS) system. Incorporating HDFS into Isilon gives Hadoop users built-in data protection, greater storage efficiency and better performance than physical clusters built on direct-attached storage (DAS), while also eliminating single points of failure, EMC claims.
BIDMC uses Isilon storage to explore big data analytics for use in its clinical practice, since the hospital has already purchased Isilon hardware for other purposes.
“I want to use the infrastructure because it’s not a Radio Shack science kit; it’s purpose-built to do this kind of thing and it does it very well,” Passe said. “Why would you want some generic thing with its own disks and a higher failure rate if you’ve already got Isilon in place?”
That’s the plan, at least. At the moment, however, as the clinical practice experiments with Microsoft’s SQL and Hadoop integration, called HDInsight, the software is still running on a separate physical cluster. Nor is centralizing storage the only issue with integrating big data into BIDMC’s IT practice, Passe said—Microsoft hasn’t fully integrated Active Directory with HDInsight yet, something the hospital is waiting for before proceeding.
“We’re just starting to figure out how to use it and what makes sense for us, and then trying to figure out how we best posture ourselves from an infrastructure standpoint to support it,” Passe said.
The trouble with virtualizing Hadoop
Still, some analysts say virtualization-centric solutions to the big data infrastructure problem pose their own challenges.
Virtualized Hadoop may work as advertised, but in terms of licensing and system costs, enterprises may find it’s still cheaper to go with commodity, scale-out direct-attached storage for big data projects.
Virtualization management also isn’t ideally suited to managing virtualized big data clusters yet, according to Jeff Boles, senior analyst for the Taneja Group, based in Hopkinton, Mass.
“We’ll see some convergence with virtualization vendors fighting their way back with solutions that allow you to virtualize all this stuff, but you still don’t necessarily want to mix that into your main infrastructure pool,” Boles said.
Meanwhile, “purists will say replacing the file system or using something like Isilon is too expensive. Using scale-out storage separate from Hadoop nodes can also add another network to the cluster, increasing complexity,” Webster said.
As a result, some companies are considering external public clouds as an alternative to rolling out a separate infrastructure for big data within a data center, sidestepping the split-infrastructure problem altogether. That approach has the added bonus of being able to share data sets and analytical results with business or research partners if necessary. Cloud service providers such as Medio Systems Inc. and Amazon Web Services have been offering such big data services for years.
But doing big data analytics in the cloud can also raise some of the same compliance and governance challenges enterprises are already dealing with when it comes to Infrastructure as a Service options, analysts say.
And sidestepping an internal infrastructure may also mean sidestepping IT altogether, resulting in “shadow IT” deployed on public cloud vendors’ infrastructures that IT doesn’t know about, said Webster.
Even if the public cloud is used with the blessing of IT, “Whose data is it?” Webster asked. “And if data is covered under compliance of one sort or another, is the service provider going to cover you?”
To be continued
Companies like Intel Corp. predict that the scale-out infrastructures associated with big data and the centralized virtual infrastructures popular over the last decade will eventually converge into what’s becoming commonly known as the software-defined data center.
“More and more companies are realizing there’s a lot of value in the data they have that they’re not taking advantage of,” said Christie Rice, marketing director for Intel Corp.’s storage division. “In time it will become a necessary thing if a lot of companies want to be able to stay in business and if they want to be able to expand the business.”
Rice predicts that in the long run, the software-defined data center will commoditize hardware so that any friction between centralized storage systems and scale-out DAS becomes irrelevant—software, whether for compute, networking or storage, could allow servers’ workloads to change on demand.
“We also see solid-state drives being used more and more, particularly as you’re talking about real-time analytics—being able to get data in and out of the storage media faster becomes more important,” Rice said.
For now, big data projects remain confined to a small niche of the enterprise, maybe 3% to 5% of companies, estimated Taneja Group’s Boles. However, he expects that number to double in the next year and a half to two years, with an eventual “trickle-down effect” from the largest Web and enterprise entities to small and medium enterprises.
“We’re more serious about analytics than ever before and it’s easier to deploy an analytics solution than ever before,” he said. “That makes it practical for a whole new range of companies.”