Modern Infrastructure Editor-in-Chief
Published: 17 Sep 2015
Ask 10 different companies about what infrastructure they need to run their big data workloads and you'll get 10 very different answers. There are few rules, and even fewer best practices.
Big data analytics can be a real drain on infrastructure, both in resources and expertise. As the name implies, the data sets that big data analytics tools work against can be, well, large, requiring significant amounts of compute, storage and network resources to meet performance goals. The toolsets, meanwhile, are not well understood by mainstream IT organizations and were often developed by hyperscale companies, without the same level of concern for security and high availability that enterprises demand. Add in uncertainty regarding big data ROI, and it's a miracle businesses are doing big data at all.
Still, among organizations that have dabbled in running big data clusters on Hadoop, Spark and the like, a few themes about the technical and business challenges of big data infrastructure have emerged.
Big data, big questions
A large telecommunications provider is building a new digital service that will launch later this year, and plans to use Hadoop to analyze content, usage and monetization (advertising) data generated by the service. But because this service is brand new, it's hard to know what kind of big data infrastructure to put in place, said the vice president of technology responsible for the build-out.
"It's impossible to do any kind of capacity planning on a product that hasn't launched yet," he said.
Indeed, that early-stage quality is common to most big data initiatives. "The nature of most big data deployments is much more nascent than I thought it would be," said Andrew Warfield, CTO at Coho Data, a provider of scale-out storage infrastructure.
But that doesn't mean organizations shouldn't pay a lot of attention to big data initiatives. Even if an organization only dabbles in big data, "it runs the big risk of this stuff becoming important," Warfield said, so it pays to think about infrastructure up front.
For the telecommunications provider, that meant taking an incremental approach. It used software from BlueData Software to run big data clusters on top of commodity hardware that can access data from existing storage systems.
Data here, there and everywhere
If data is born in the cloud, it makes sense to analyze it there. If data is all on-premises, supporting infrastructure should be there too. But data that is scattered all over the place complicates the infrastructure equation.
The telecommunications provider's service will use data from both the cloud and on-premises. It's important for any big data solution to support both, for compliance reasons and to save time and network bandwidth. "Replicating production data is tough," the VP said. "We want to allow all instances to point to a single source."
In other cases, the information that data scientists want to analyze is available, but they can't use it because it resides on storage infrastructure that their big data compute farm cannot access, said Coho's Warfield. One solution is storage hardware that exposes data via protocols such as Hadoop Distributed File System, or with a RESTful API.
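As a rough illustration of what REST-style access to file data can look like, the sketch below builds request URLs in the style of WebHDFS, the HTTP interface that Hadoop provides for HDFS. Nothing here comes from the vendors in this article: the host name, port and file path are hypothetical placeholders, and a given storage product may expose a different API altogether.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS-style URL: http://<host>:<port>/webhdfs/v1/<path>?op=<OP>.

    host, port and path are placeholders supplied by the caller;
    they are assumptions for illustration, not values from the article.
    """
    if not path.startswith("/"):
        raise ValueError("path must be absolute, e.g. /data/usage.log")
    # The operation name (OPEN, LISTSTATUS, ...) travels as a query parameter.
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical endpoint and file; an HTTP GET on this URL would read the file.
read_url = webhdfs_url("namenode.example.com", 50070,
                       "/data/usage/2015-09-17.log", "OPEN")
# Listing a directory uses the same URL shape with a different operation.
list_url = webhdfs_url("namenode.example.com", 50070,
                       "/data/usage", "LISTSTATUS")
print(read_url)
print(list_url)
```

The point of the sketch is only that an analytics job needs nothing more than an HTTP client to reach the data, which is what makes this kind of interface attractive when the data sits on storage that the compute farm cannot mount directly.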
Look out for latency
The time it takes to move data from the storage array to the compute farm is a performance killer for a certain subset of big data analytics. What if you could avoid that latency by leaving the data where it is, and bring the application to it, rather than schlep the data across a network to the compute farm?
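A back-of-the-envelope calculation shows why moving the data can dominate. The sketch below estimates how long it takes to copy a data set across a network link. The data set size, link speed and efficiency factor are illustrative assumptions, not figures from any of the companies quoted here.

```python
def transfer_seconds(dataset_bytes, link_bits_per_second, efficiency=0.7):
    """Estimate the time to copy a data set over a network link.

    efficiency is an assumed fraction of the nominal link rate that is
    actually usable after protocol overhead and contention.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    usable_bits_per_second = link_bits_per_second * efficiency
    return dataset_bytes * 8 / usable_bits_per_second


TB = 10**12    # bytes in a (decimal) terabyte
GBIT = 10**9   # bits per second in one gigabit per second

# Assumed example: a 10 TB data set over a 10 Gbit/s link at 70% efficiency.
seconds = transfer_seconds(10 * TB, 10 * GBIT)
print(f"{seconds / 3600:.1f} hours just to move the data")
```

Under those assumptions the copy alone takes about 3.2 hours before any analysis starts, which is the delay that running the application next to the data is meant to avoid.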
The notion of bringing compute to the data isn't really new, but there is a new twist: Docker. Coho Data, for instance, did a proof of concept along with Intel at a large financial services company to run Hadoop workloads directly on its compute nodes, packaged in the form of Docker containers.
The idea behind running Docker containers directly on the storage array is to run ad hoc analytics closer to the data, without having to move data over the network, and take advantage of any available compute resources. "The platform has always been CPU-heavy relative to other storage platforms," Warfield said. "All the more so when you put flash into the system. The question then becomes, 'How do I get more value out of this resource?'"
Running Dockerized applications directly on a storage array is interesting, but the workload needs to be carefully evaluated to see if it's a good fit, said Bubba Hines, a vice president at Signature Tech Studios, which offers a document management service for the construction industry. The service is built on top of Amazon Web Services and uses dedicated storage as a service from Zadara Storage. The firm recently began evaluating the new Zadara Container Service, in which Dockerized apps run directly on the storage array, with direct access to local drives. According to Hines, there are several plausible use cases, such as running a containerized version of its disaster recovery service on the storage array to continually monitor for changes in customer data, or running jobs that modify or verify primary storage data.
But it wouldn't make sense to use the Zadara Container Service for all of its data processing needs. Signature Tech Studios' bread and butter is performing data transformations on construction blueprints, which it has already largely Dockerized. But "we're probably not going to move all those [Docker containers] into the [Zadara] Container Service because the size and scale just doesn't make sense," said Hines. "We have to look for workloads where we can really benefit from low latency."
Alex Barrett is editor in chief of Modern Infrastructure. Contact her at firstname.lastname@example.org.