Disaggregation unlocks big data infrastructure efficiency

The server-storage mismatch is at the heart of management challenges with big data architecture and can create inefficiencies for hardware running Hadoop and Spark. A different take on disaggregation can help.

Hyperscalers and enterprise IT shops both can benefit from a disaggregated strategy for their data centers, but what's good for the goose isn't always good for the gander.

Most conversations about disaggregated systems focus on scale-up architecture and target hyperscale cloud service providers that want more flexible systems and fewer underutilized resources. But the idea also appeals to enterprises seeking to increase the efficiency of the sea of commodity servers and storage that runs big data infrastructure, such as Hadoop clusters, and other workloads.

At AppNexus Inc., a digital advertising platform, scale-out infrastructure was a cheap way to run Hadoop -- it got the job done even if it wasn't all that efficient, explained Timothy Smith, senior vice president of technical operations at the New York-based company. A downside was having to statically configure six, 12 or 24 drives per server. "You make your best guess at deployment time and you really don't get the opportunity to optimize it," he said.

On top of that, servers and storage are on different refresh cycles -- a three-year lifetime for servers and five years for storage -- so drives housed inside the servers were replaced sooner than he wanted.

Disaggregation was the answer, Smith determined. Separating the disks from the servers solved several challenges: Storage can be bought to remain online for three to five years, and the storage attached to each server can be resized dynamically. When workloads change, the disk-to-server ratio can be rebalanced to match.
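The rebalancing Smith describes can be pictured as servers drawing drives from a shared pool instead of being fixed at six, 12 or 24 drives at deployment time. The following is a minimal sketch under stated assumptions, not any vendor's implementation: the `DrivePool` class, its `resize` method and the node and disk names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class DrivePool:
    """Hypothetical shared pool of network-attached drives (illustrative only)."""
    free: Set[str]
    assigned: Dict[str, Set[str]] = field(default_factory=dict)

    def resize(self, server: str, target: int) -> int:
        """Grow or shrink the number of drives bound to a server."""
        drives = self.assigned.setdefault(server, set())
        if target - len(drives) > len(self.free):
            raise RuntimeError("not enough free drives in the pool")
        # Shrinking returns drives to the pool for other servers to use.
        while len(drives) > target:
            self.free.add(drives.pop())
        # Growing takes drives from the pool.
        while len(drives) < target:
            drives.add(self.free.pop())
        return len(drives)


# 24 pooled drives, initially split evenly between two servers.
pool = DrivePool(free={f"disk{i:02d}" for i in range(24)})
pool.resize("node-a", 12)
pool.resize("node-b", 12)

# The workload shifts: node-a gives up six drives and node-b takes them.
pool.resize("node-a", 6)
pool.resize("node-b", 18)
```

In a statically configured server, the same shift would mean physically moving drives or living with the original guess.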

Not your father's disaggregation

Most conversations about disaggregation focus on scale-up architecture, which is "absolutely on the right track," according to John Webster, senior partner at analyst firm Evaluator Group in Boulder, Colo. Engineered systems or appliances, such as Oracle BDA, Teradata and Hewlett Packard Enterprise Converged Systems, pitch the value of an integrated stack.

AppNexus, however, represents a different need for enterprise IT shops: managing a commodity, scale-out architecture. Smith evaluated alternatives to Hadoop Distributed File System (HDFS), including Embedded Transport Acceleration (ETA) over Ethernet, which at the time was available from only one company and neither worked nor scaled well, he said. InfiniBand and storage area networks were both too expensive, and network-attached storage with Network File System and Server Message Block was not viable.

"Unless you are Google and have [Google File System], there is a hole in the market," Smith said. "We just didn't see a good solution for disaggregating the disk from the servers."

The big data architecture that most enterprises would like to achieve is possible only for the hyperscale companies, but some see a different path: a scale-out, web-scale architecture for big data infrastructure using Hadoop and Spark, built with software-defined flexibility on cheap commodity hardware rather than purpose-built products.

"We saw this platform moving into the enterprise as the underlying application platform for big data and IoT," said Gene Banman, CEO at DriveScale, which came out of stealth last week with $15 million in Series A funding and a hardware-software combination that disaggregates storage and servers for big data infrastructure in a scale-out environment. The goal is to make efficient use of scale-out architecture on commodity hardware, using 10-gigabit Ethernet to compose the resources in a software-defined way.

The DriveScale adapter turns industry-standard JBOD (just a bunch of disks) enclosures into Ethernet-connected devices, creating independent resource pools that can be manipulated with the company's software. The $6,000 adapter offers 80 gigabits per second of aggregate throughput in a 1U device with four adapter slots; each slot has two 12-gig SAS connections for JBODs and two 10-gig Ethernet connections to the network fabric.
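The 80 figure lines up with the network side of the device: four slots times two 10-gig connections. A quick back-of-the-envelope check follows, assuming "12-gig" and "10-gig" are raw per-connection line rates in gigabits per second and ignoring protocol overhead. The constant and function names are illustrative and not taken from DriveScale.

```python
# Port counts as described in the article; rates are assumed to be
# raw line rates in gigabits per second (Gb/s).
ADAPTER_SLOTS = 4
NETWORK_PORTS_PER_SLOT = 2
NETWORK_GBPS_PER_PORT = 10
SAS_PORTS_PER_SLOT = 2
SAS_GBPS_PER_PORT = 12


def aggregate_gbps(slots: int, ports_per_slot: int, gbps_per_port: int) -> int:
    """Sum of raw line rates across every port in the chassis."""
    return slots * ports_per_slot * gbps_per_port


network_total = aggregate_gbps(ADAPTER_SLOTS, NETWORK_PORTS_PER_SLOT, NETWORK_GBPS_PER_PORT)
sas_total = aggregate_gbps(ADAPTER_SLOTS, SAS_PORTS_PER_SLOT, SAS_GBPS_PER_PORT)

# End-to-end throughput cannot exceed the slower side of the device.
ceiling = min(network_total, sas_total)

print(f"network side: {network_total} Gb/s")  # 80
print(f"SAS side:     {sas_total} Gb/s")      # 96
print(f"ceiling:      {ceiling} Gb/s")        # 80
```

Under these assumptions the network links, not the SAS links, set the upper bound.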

The hardware-software combination from DriveScale could be "pretty useful" for big data architecture, breaking apart the usual 1:1 scale-out of storage embedded in commodity servers for Hadoop or Spark, where the workload may often have underused capacity, according to Nik Rouda, senior analyst at Enterprise Strategy Group in Milford, Mass.

"Using software to provide what's needed brings more efficiency to the overall environments," he said. "Their hardware seems to enable the rack to work this way, divvying up resources on demand, putting data where it makes sense."

Separating compute and shared storage for big data and analytics has been offered by other, mostly proprietary, products, such as EMC Isilon, but they don't have the same software-defined flexibility and don't make use of commodity hardware, Rouda said.

By disaggregating compute from storage, the DriveScale controller and software go after one of the most significant problems with Hadoop: its reliance on servers and storage scaling together, which is "not particularly efficient," according to Webster, the analyst at Evaluator Group.

The most significant roadblock for DriveScale may not be the technology but the fact that Hadoop has various users across different departments within an enterprise, and not all of them understand its infrastructure challenges. Sometimes centralized IT manages Hadoop clusters, but other times IT is not involved and the clusters are in the hands of data scientists.

An enterprise IT team that is on the road to a software-defined data center will understand the management challenges of big data architecture and a Hadoop cluster better than anyone else, Webster said.

Robert Gates covers data centers, data center strategies, server technologies, converged and hyper-converged infrastructure and open source operating systems for SearchDataCenter. Follow him on Twitter @RBGatesTT.
