Big data isn’t just for the big boys anymore. Advanced data analytics is beginning to trickle down into enterprise and commercial accounts that have far less data than Web 2.0 titans.
“Big data” began with Facebook’s inbox search engine and Google’s MapReduce, but big data analytics has become a strategy that many types of companies use to grow their business. Retail giant Target recently made headlines for its controversial big data analytics strategy, which it reportedly uses to predict consumer buying habits and win new customers.
For IT pros, big data analytics often means implementing vastly horizontally scalable farms of relatively small, low-power components, which runs contrary to the themes of convergence and consolidation prevalent in the virtualization era.
Big data infrastructure in the enterprise: A tangled web
CareCore National, a health care benefits management firm with data centers in South Carolina and Colorado, relies on data analytics to create very specific comparisons of current medical patients with their peers in the name of making better medical decisions.
To support this, the company has built a converged infrastructure based on the Virtual Computing Environment (VCE) coalition’s Vblock, an amalgamation of products from VMware Inc., Cisco Systems Inc. and EMC Corp.
Some of CareCore’s big data calculations draw from an 86-terabyte (TB) Symmetrix V-Max disk array that’s part of the Vblock, but another 96 TB reservoir of data is spread across two 48 TB EMC GreenPlum Distributed Computation Appliances (DCAs), which must be managed separately.
“The DCAs now live outside of my [chosen] infrastructure. That’s a level of frustration to be sure,” said CareCore CTO Bill Moore. “We’ve [also] had to work really hard to extend our 10 Gigabit Ethernet fabric into these DCAs.”
CareCore brings data out of those repositories using Apache Hadoop along with VMware’s vFabric GemFire, an in-memory data management system. The company processes them through analysis and visualization interfaces including Alpine Miner and a tool from Tableau Software.
What is big data?
Big data isn’t just data growth, nor is it a single technology. Distributed analytics platforms such as Hadoop and distributed NoSQL databases such as Cassandra are commonly associated with big data, but at heart, it’s a set of processes and technologies that can crunch through substantial data sets quickly to make complex, often real-time decisions.
In today’s medical world, it’s not difficult to predict when a patient has an 87% chance of suffering a heart attack in the next 12 months, Moore said. What CareCore wants is a set of treatment plans a doctor can select based on the specific patient’s symptoms and history as well as an “ad-hoc cohort” or group of patients pulled from CareCore’s records that match the patient.
To do this, complex queries issued at the virtual layer of the infrastructure have to be dragged off of the Cisco Unified Computing System (UCS) compute chassis, sent out to a Cisco Nexus core switch, down through Nexus 5000 edge switches, then to Nexus 2000 top-of-rack switches that hook into the GreenPlum DCA racks, into the DCA servers, and then back again.
Moore hopes to have this functionality within the Vblock fabric someday, he said.
Even within Vblock, GemFire in-memory data management requires a cordoned-off set of UCS blade servers, each with 96 GB of available memory. A single GemFire workload can eat up half that memory, meaning CareCore has to set very specific VMware Distributed Resource Scheduler (DRS) rules to make sure too many workloads aren’t trying to share the same blade.
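The arithmetic behind those DRS rules can be sketched in a few lines of Python. This is only an illustration, not CareCore’s actual configuration: the 96 GB blade size comes from the figures above, the 48 GB workload footprint assumes the worst case of “half that memory,” and the function names and the simple fill-in-order placement are hypothetical.

```python
from typing import List

# Illustrative only: how many worst-case GemFire workloads fit on one blade.
BLADE_MEMORY_GB = 96      # per-blade memory cited in the article
WORKLOAD_MEMORY_GB = 48   # assumption: worst case, half of a blade's memory


def max_workloads_per_blade(blade_gb: int, workload_gb: int) -> int:
    """Return how many workloads of a given size fit in one blade's memory."""
    return blade_gb // workload_gb


def place_workloads(num_workloads: int, num_blades: int) -> List[int]:
    """Hypothetical placement: fill blades in order, never exceeding capacity.

    Returns the number of workloads assigned to each blade and raises
    ValueError if the blades cannot hold every workload.
    """
    per_blade = max_workloads_per_blade(BLADE_MEMORY_GB, WORKLOAD_MEMORY_GB)
    if num_workloads > per_blade * num_blades:
        raise ValueError("not enough blade memory for all workloads")
    placement = []
    remaining = num_workloads
    for _ in range(num_blades):
        assigned = min(per_blade, remaining)
        placement.append(assigned)
        remaining -= assigned
    return placement
```

Under these assumptions, no more than two such workloads can share a blade, which is the kind of ceiling a placement rule would have to enforce.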
In addition to heavy-duty memory and a Vblock infrastructure, the CareCore big data process takes for granted a complete 10 Gigabit Ethernet (10GbE) network throughout the enterprise.
“If I had bare iron servers that were running [Gigabit Ethernet traffic] or even some subset of 10 [GbE] and needed to get these very large query sets moved around the data center, I’d be talking about a lot of plumbing in between to make these things work in [short] time frames,” Moore said.
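A rough calculation shows why link speed matters for moving large query sets. The sketch below assumes an ideal link with no protocol overhead or contention, and the 500 GB query set is a hypothetical figure chosen for illustration; it does not come from CareCore.

```python
def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Ideal time to move size_gb gigabytes over a link_gbps link.

    Ignores protocol overhead, contention and storage speed.
    """
    size_gigabits = size_gb * 8  # 8 bits per byte
    return size_gigabits / link_gbps


QUERY_SET_GB = 500  # hypothetical query-set size, for illustration only

for name, gbps in (("Gigabit Ethernet", 1.0), ("10GbE", 10.0)):
    minutes = transfer_seconds(QUERY_SET_GB, gbps) / 60
    print(f"{name}: about {minutes:.1f} minutes")
```

Real-world throughput would be lower on both links, but the tenfold gap between them holds.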
The big data trickle-down effect: eHarmony
Web-based businesses have had to rethink their data center infrastructure to support big data as it moves away from its origins in big search engines and enormous social networks.
For example, online dating hub eHarmony uses Hadoop in its matchmaking process. Its data set is relatively small at 64 terabytes, but the company added a radically new server infrastructure to accommodate it.
Up until June of last year, eHarmony had farmed out its Hadoop operations to Amazon.com’s Elastic MapReduce cloud service but decided to bring them back into its internal data center when monthly subscription fees grew too high.
But before it could bring Hadoop in-house, eHarmony had to address the power and cooling its CPU-intensive workloads would require: a farm of 256 dual-core servers. Using x86 Intel Xeon servers could have sent power and cooling costs through the roof.
Instead, eHarmony turned to microservers from startup SeaMicro. The SeaMicro design packs 512 cores into a single 10U appliance that draws 3.5 kilowatts of power for the entire Hadoop environment.
“Normally we run a dual-socket Xeon processor server. On 3,000 watts we’re able to get a maximum of 80 cores,” said Ram Reddy, VP of technology operations for eHarmony. “So in that sense you’re looking at about five times the efficiency.”
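Reddy’s estimate can be checked against the figures quoted in this article. The sketch below compares core density per kilowatt only; it assumes the 3.5 kilowatt and 3,000 watt figures are directly comparable and says nothing about per-core performance, which may differ between the two platforms. The function name is hypothetical.

```python
def cores_per_kilowatt(cores: int, watts: float) -> float:
    """Return the number of CPU cores delivered per kilowatt of power draw."""
    return cores / (watts / 1000)


# Figures quoted in the article.
seamicro = cores_per_kilowatt(512, 3500)  # 512 cores at 3.5 kilowatts
xeon = cores_per_kilowatt(80, 3000)       # 80 cores at 3,000 watts

ratio = seamicro / xeon
print(f"SeaMicro: {seamicro:.1f} cores/kW")
print(f"Xeon:     {xeon:.1f} cores/kW")
print(f"Ratio:    {ratio:.1f}x")
```

The ratio works out to roughly 5.5, in line with the “about five times” figure in the quote.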
That doesn’t mean it was easy to go with a relatively unknown company for a core aspect of the business, and eHarmony didn’t let go of Amazon Elastic MapReduce in one fell swoop.
“For a while we were running both, just until SeaMicro proved itself and was proven to be reliable and performant,” Reddy said.
Ancestry.com and Wikimedia: more big data shakeups
Elsewhere, the Web-based genealogy depot Ancestry.com has found big data analytics requires a change from data storage as usual. It stores about 5 petabytes – or 5,000 terabytes – of data, much of it unstructured data, such as historical documents and images, on an EMC Isilon Network Attached Storage (NAS) farm. The data is mirrored to a secondary disaster recovery site.
But when the company began a big data project based on Hadoop a few months ago to analyze how users were moving through the site, it stood up a separate HP server cluster with just under 100 TB of “cheap and deep” direct-attached storage, according to Travis Smith, senior manager for Ancestry.com’s storage systems.
The Wikimedia Foundation, proprietor of Wikipedia.org, is also eyeing a data center transition with a forthcoming big data analytics project. It’s preparing to test NoSQL technology for the project, which is designed to drive up the number of article authors and editors on the site, a figure that has declined from about 100,000 a month to 90,000 a month.
“We know that we have more male contributors and editors than female, and we have to figure out, why is that the case and how do we make it more appealing to female editors?” said CT Woo, director of technical operations for Wikimedia.
Wikipedia is mostly text-based and, with images and other multimedia assets included, the entire site comprises about 18 TB of data accessed by about 700 physical servers.
The traffic Wikipedia gets makes it one of the five biggest Web properties in the world, and analyzing that traffic according to a number of variables will require at least another rack of servers to be added to an existing data center in Tampa, Fla.
The problem? That data center is running out of power and cooling space, meaning Wikimedia will probably have to migrate its big data operation to a new data center in Ashburn, Va. For a small staff, it will be a big undertaking, according to Woo.
“When you do a data center migration you want to do it right,” he said. “It creates more work for us as a result. It’s time-consuming.”
Beth Pariseau is a senior news writer for SearchServerVirtualization.com and SearchDataCenter.com.