- Alex Barrett, Modern Infrastructure Editor-in-Chief
Data location is the determining factor of many big data analytics decisions, including what products enterprises should use for storage and processing.
"Wherever you create data, it tends to stay there, because it's such a pain to move it," said Charles Zedlewski, vice president of products at Cloudera, a data management company with software based on Apache Hadoop.
It should come as no surprise, therefore, that traditional storage array vendors take a keen interest in third-platform workloads. These include applications centered on mobile devices, cloud services, social technologies and big data.
EMC Corp. is practically synonymous with network-attached storage (NAS) and storage networks, those shared file and block arrays that are the traditional home for so much enterprise data. The company is working hard to align itself with the object storage and Hadoop Distributed File System (HDFS) that go along with analytics workloads.
"EMC's strategy is to deliver a hybrid of those two worlds, storage that can deliver in a bimodal fashion," said Sam Grocott, EMC senior vice president of product management and marketing. For instance, EMC Isilon scale-out NAS arrays are being enhanced to support HDFS, while its Elastic Cloud Storage, or ECS, supports both object storage and HDFS.
But having data that lives on-premises doesn't necessarily preclude processing it in the cloud. IntraLinks, a New York-based cloud provider of secure enterprise file-sharing services, has traditionally required users to upload their data to its cloud platform, from which IntraLinks creates virtual deal rooms used by business partners engaged in due diligence around a potential merger. More recently, IntraLinks added the option for its customers to deploy a "content node" appliance on-premises, for those organizations that can't, or don't want to, store their documents in IntraLinks' cloud, said Todd Partridge, director of strategy and product marketing.
"We provide secure file sharing and collaboration for content that is highly regulated," Partridge explained. With its distributed content node architecture, IntraLinks can manage and manipulate data in its system, even if it is physically stored on the customer's premises. "This creates a data storage location spectrum, where the spectrum ranges from pure public cloud to pure on-premises."
Separating compute and data storage in different data centers can be logistically challenging and more expensive, since moving that data around incurs additional bandwidth charges. But here, too, things are changing. Moving data is no longer the obstacle it used to be.
Marc Clark, director for cloud strategy and deployment at Teradata, sold network bandwidth earlier in his career. He said that as recently as 2010, a 45-Mb connection ran about $3,000 per month. These days, assuming that there's fiber "in the ground," that same $3,000 will get you a 10-Gb connection, he said.
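To put Clark's figures in perspective, the short Python sketch below works through the arithmetic. It assumes that "45-Mb" and "10-Gb" mean megabits and gigabits per second, that units are decimal, and that the link runs at full line rate with no protocol overhead; the function names and the 1 TB sample data set are illustrative choices, not anything quoted by Teradata.

```python
def cost_per_mbps(monthly_cost_usd: float, link_mbps: float) -> float:
    """Monthly price of one megabit per second of capacity."""
    return monthly_cost_usd / link_mbps


def transfer_hours(data_tb: float, link_mbps: float) -> float:
    """Hours needed to move data_tb terabytes over a link of link_mbps.

    Assumes decimal units (1 TB = 1e12 bytes) and a fully utilized link,
    which is an idealization; real transfers will be slower.
    """
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    # Figures quoted in the article: $3,000 per month for either link.
    for label, mbps in (("45 Mbps (2010)", 45), ("10 Gbps (today)", 10_000)):
        print(
            f"{label}: ${cost_per_mbps(3000, mbps):.2f} per Mbps per month, "
            f"{transfer_hours(1, mbps):.2f} hours to move 1 TB"
        )
```

Under those assumptions, the price per megabit per second falls from roughly $67 to about 30 cents, and moving a hypothetical 1 TB data set shrinks from about two days to well under an hour.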
All these factors will combine to make cloud-based analytics much more prevalent. Even companies that built their businesses around on-premises data centers are now actively promoting cloud offerings.
Teradata, for instance, once exclusively focused on the largest companies. It now has a cloud-based version of its data warehouse tools. "Our focus is still the Global 2000, but we want to broaden [it] and reach down to the midmarket," Clark said.
Working with a data center partner, Teradata offers its customers dedicated cloud compute units, with production data sets of up to 70 TB and a roadmap for up to 100 TB. Unlike other cloud-based data warehousing services, such as AWS Redshift, the Teradata service runs on dedicated bare-metal hardware, because "virtualization and multi-tenancy are the enemies of [massively parallel processing]," Clark said. "You need more compute power than you can get from virtual hardware."
Still, taking advantage of cloud analytics can help organizations jumpstart their big data practices in a way that they otherwise couldn't.
Altiscale, a Hadoop as a Service provider, was founded in 2012 by former Yahoo employees who had built out that company's internal Hadoop service. It caters to the midmarket, which may have a "serious business use case," but not the ability to recruit a team of people to run a Hadoop cluster with the needed performance and at the right price, said David Chaiken, Altiscale's CTO. "By deploying as a virtual private cloud, [customers] can securely connect to our Hadoop infrastructure, and it looks like a piece of their own data center."
And some organizations report cloud performance to be adequate for their big data systems' needs.
Tubular Labs provides a video marketing platform to analyze viewing habits of users across more than 30 social media platforms, including YouTube, Facebook and Twitter. As a startup, the company runs Cloudera Impala real-time analytics on the largest AWS instances it can find that support solid-state drive (SSD) storage.
"When our customers have a question, we have an answer," said David Koblas, Tubular Labs' CTO. "The thing that makes a difference when doing that kind of real-time data processing is SSD."
Koblas admits that the company might be able to get slightly better performance from dedicated, non-virtualized hardware. But from his perspective at a startup, it isn't worth it. "It comes down to whether or not you want a data center," he said. "The very thought of having a data center is a minimum of $250,000 [a] year -- two staffers to be able to drive over to the data center if a disk drive fails or something."
Performance aside, cloud and on-premises analytics still differ in other ways, said Cloudera's Zedlewski. Ultimately, the goal is for the entire experience to be the same.
The company recently finished enhancing its AWS service to take advantage of the underlying elasticity of the product. "You want it to be just as elastic as the public cloud: Turn on Hadoop for an hour, so you can get that by-the-drink consumption," he said.
Likewise, Cloudera is working on enhancing its blob (object) storage capabilities. "When it's on-premises, it's just us -- there's nothing between us and the blob storage," Zedlewski said. But in the cloud, the application has to be "adept at slurping data out of the blob storage."
When this work is done, these enhancements will result in a marked shift toward cloud deployments and away from on-premises big data servers and storage, Zedlewski predicts. In terms of percentage of data under management and/or revenues, the mix of public cloud will be relatively modest for his firm, gaining ground to perhaps 70/30 in favor of on-premises, rather than 90/10 today. But in terms of sheer quantity of deployments, "We expect it to shift aggressively to the cloud."
ALEX BARRETT is editor in chief of TechTarget's Modern Infrastructure.