Published: 16 Mar 2015
Unrelenting data growth has spawned new scalable storage designs.
We've all read the storage reports about overwhelming data growth. It's certainly a big and growing challenge that deserves attention, but I'll skip the part where I scare you into thinking we're about to be buried under a deluge of data. We tend to store about as much data as we can, no matter how much data there might be. There has always been more data than we could keep. That's why even the earliest data center storage systems implemented quotas, archives and data summarization.
The new challenge is mining business value effectively from the huge amount of newly useful data, with even more arriving fast in every area of IT storage: block, file, object and big data. If you want to stay competitive, you'll likely have to tackle some data storage scaling projects soon. Newer approaches to large-scale storage can help.
Scaling storage out into space
The first thing to consider is the difference between scale-up and scale-out approaches. Traditional storage systems are based on the scale-up principle, in which you incrementally grow storage capacity by simply adding more disks under a relatively fixed number of storage controllers (or a small cluster of storage controllers, with one to four high-availability pairs being common). If you exceed the system's capacity (or performance drops off), you add another system alongside it.
Scale-up storage approaches are still relevant, especially in flash-first and high-end hybrid platforms, where latency and IOPS performance are important. A large amount of dense flash can serve millions of IOPS from a small footprint. Still, larger capacity scale-up deployments can create difficult challenges -- rolling out multiple scale-up systems tends to fragment the storage space, creates a management burden and requires uneven CapEx investment.
In response, many scalable storage designs have taken a scale-out approach. In scale-out designs, capacity and performance throughput grow incrementally by adding more storage nodes to a networked system cluster. Scale-up designs are often interpreted as having limited vertical growth, whereas scale-out designs imply relatively unconstrained horizontal growth. Each node can usually service client I/O requests, and depending on how data is spread and replicated internally, each node may access any data in the cluster. Because a single cluster can grow to very large scale, system management remains unified (as does the namespace in most cases). This gives scale-out designs a smoother CapEx growth path and a more linear overall performance curve.
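To make the idea of spreading and replicating data across nodes concrete, here is a minimal Python sketch. It is not how any particular product works: the function names, the hash-modulo placement rule and the per-node figures are all assumptions chosen for illustration.

```python
import hashlib


def place(key: str, nodes: list[str], replicas: int = 2) -> list[str]:
    """Pick `replicas` distinct nodes for a key (hypothetical placement rule).

    The key is hashed to a starting position in the node list, and the
    copies go to that node and the ones that follow it, wrapping around.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    count = min(replicas, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


def cluster_totals(node_count: int, tb_per_node: float, mbps_per_node: float):
    """Idealized scale-out model: capacity and throughput grow with node count."""
    return node_count * tb_per_node, node_count * mbps_per_node


nodes = ["node-a", "node-b", "node-c", "node-d"]  # illustrative node names
print(place("invoice-2015-03.pdf", nodes))        # two distinct nodes
print(cluster_totals(len(nodes), 100, 500))       # (400, 2000)
```

A plain hash-modulo rule like this one moves most keys to different nodes whenever the node count changes, which is why production systems generally rely on consistent hashing or a similar scheme instead. The linear totals are likewise a best case, since real clusters lose some throughput to network and replication overhead.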
Millions of files, trillions of objects
Another trend that helps address storage scalability is a shift from hierarchical file systems towards object storage. File systems were built primarily to provide a human-centric way of navigating smartly through and around large numbers of files. But the way many file systems are implemented builds in natural constraints on scalability. File systems require a live metadata database to manage and track file locations, security, read/write locking and navigation information (e.g., when you list the contents of a directory).
This limits most file systems to the millions-of-files range. There are some scalable file storage designs like NetApp's clustered Data ONTAP based on Write Anywhere File Layout and EMC Isilon OneFS, both of which support clustered approaches and scale to serve many service-provider scenarios. But in today's vast cloud-building world, we see object storage as the number one scalable solution.
Object storage takes a different design approach than file or raw block storage. By essentially limiting I/O to just storing and retrieving whole blobs (i.e., whatever size binary large object you want to store as an object) in a flat namespace, it can readily scale out to billions and even trillions of objects. Obviously an object can be a file, but it is really any arbitrary set of raw data bits.
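The whole-blob, flat-namespace model can be sketched in a few lines of Python. This is a toy in-memory illustration under stated assumptions, not any vendor's API: the `FlatObjectStore` class and its `put` and `get` methods are hypothetical names.

```python
import uuid


class FlatObjectStore:
    """Toy in-memory object store: whole blobs in, whole blobs out."""

    def __init__(self) -> None:
        # One flat key -> blob mapping; there are no directories to walk.
        self._objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store a whole blob and return the key the caller must keep."""
        key = uuid.uuid4().hex
        self._objects[key] = bytes(data)
        return key

    def get(self, key: str) -> bytes:
        """Return the whole blob stored under `key`."""
        return self._objects[key]


store = FlatObjectStore()
key = store.put(b"any arbitrary set of raw data bits")
print(store.get(key))
```

There is deliberately no operation for listing a directory or updating part of an object. Leaving those out is what removes the shared metadata bottleneck, at the price of making the caller responsible for remembering each key.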
Some object storage systems use erasure coding for data protection, which is basically RAID for distributed objects. In most cases, however, data protection is achieved via outright replication, which lessens the cost of storage nodes but consumes more raw capacity for the same amount of stored data.
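The capacity penalty is easy to put numbers on. The sketch below compares usable capacity under replication and under erasure coding; the 1,200 TB raw figure, the three-copy policy and the 8+4 fragment layout are assumptions picked for round numbers, not recommendations.

```python
def replication_usable(raw_tb: float, copies: int) -> float:
    """Usable capacity when every object is stored `copies` times."""
    return raw_tb / copies


def erasure_usable(raw_tb: float, data_fragments: int, parity_fragments: int) -> float:
    """Usable capacity when each object is split into data plus parity fragments."""
    return raw_tb * data_fragments / (data_fragments + parity_fragments)


raw_tb = 1200  # assumed raw capacity across the cluster
print(replication_usable(raw_tb, copies=3))   # 400.0 TB usable
print(erasure_usable(raw_tb, 8, 4))           # 800.0 TB usable
```

In this example three-way replication survives the loss of two copies and leaves a third of the raw capacity usable, while the 8+4 layout survives the loss of any four fragments and leaves two thirds usable. The trade-off is that erasure coding costs more compute and network traffic to write and rebuild data.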
There are other drawbacks to object storage: client applications must keep track of their stored objects' unique storage keys, they can't edit files in place on the storage system, and humans aren't able to navigate directly around the namespace. But for applications, especially Web-oriented ones, object storage provides great natural alignment.
Distributed object storage with built-in replication can also act like a content delivery network. Object storage provides a natural data layer for massive, global, distributed, multi-tenant storage services, and is therefore often associated with cloud building.
What exactly is an object storage system? There are object systems that are internally built over file systems, and file systems supporting object storage APIs. In the cloud, there are massively scalable distributed file services built over object storage (e.g., Dropbox on Amazon Web Services S3). We even see object storage used for block-based I/O when supported by fast native object stores (e.g., DDN WOS). But the takeaway here is that object storage is a fundamental part of the answer for the largest growing storage requirements.
Here we should mention Apache Hadoop and its Hadoop Distributed File System (HDFS). Hadoop is designed for storing and processing big data on scale-out clusters. Still, Hadoop data is usually found in large files. When it comes to number of files, HDFS can only realistically track about 10 million in its memory-constrained name-node controllers.
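A rough calculation shows why the file count, rather than raw capacity, becomes the limit. The sketch assumes a commonly cited rule of thumb of roughly 150 bytes of name-node memory per namespace object (file or block); that figure and the one-block-per-file default are assumptions for illustration, not measurements of any cluster.

```python
BYTES_PER_NAMESPACE_OBJECT = 150  # assumed rule of thumb, not a measured value


def namenode_heap_gb(files: int, blocks_per_file: int = 1) -> float:
    """Estimate name-node memory for file and block entries, in gigabytes."""
    objects = files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_NAMESPACE_OBJECT / 1e9


print(namenode_heap_gb(10_000_000))      # 3.0 GB for 10 million one-block files
print(namenode_heap_gb(1_000_000_000))   # 300.0 GB for a billion such files
```

Under these assumptions 10 million small files fit comfortably in memory, but a billion would need hundreds of gigabytes held by a single name node, which is far from the trillions of objects an object store targets.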
While a big data lake might serve an enterprise's large databases, unstructured text repositories and accumulating logs, it's not designed to deliver storage services for trillions of files or objects.
There are some hybrid big data solutions like MapR and IBM InfoSphere BigInsights using GPFS that can provide far more capable storage services under Hadoop than native HDFS. And recently, BlueData Software rolled out a Hadoop virtualization solution, which essentially makes big data processing possible over whatever storage arrays the data is already sitting in, turning the idea of the data lake inside-out.
Cloud-hosted storage is increasingly popular, particularly the use of elastic cloud storage as a colder storage tier backing up on-premises data. You see this in the popularity of Microsoft's StorSimple and NetApp's SteelStore, which is based on its Whitewater acquisition from Riverbed. And most object store use cases, hosted and on-premises, naturally look like cloud service offerings.
But what about live block storage? Zadara Storage offers a way for you to pay for block storage as an elastic subscription. The company delivers the storage using a hybrid of pre-positioned, on-premises (your data center) scale-out appliances and virtual cloud hosted arrays.
On the high end, IBM's high-powered GPFS has also been rolled into an elastic cloud storage system: scale-out, multi-protocol and suitable for high-performance workloads and cloud-building. The ECS Appliance is EMC's latest generation of scale-out cloud-building storage, and Red Hat Storage based on Gluster is another good option.
Mike Matchett is a senior analyst and consultant at Taneja Group.