Small World Big Data
Published: 18 May 2017
With the rise of software-defined storage, in which storage services are implemented as a software layer, the whole idea of data storage is being reimagined. And as compute increasingly converges with storage, the distinction between a storage platform and a data-processing platform continues to erode.
Storage takes new forms
Let's look at a few of the ways that storage is driving into new territory:
- Now in containers! Almost all new storage operating systems, at least under the hood, are being written as containerized applications. In fact, we've heard rumors that some traditional storage systems are being converted to containerized form. This has several important implications, including better handling of massive scale-out, increased availability, friendlier cloud deployment and easier support for converging computation within the storage layer.
- Merged and converged. Hyper-convergence bakes software-defined storage into convenient, modular appliance units of infrastructure. Hyper-converged infrastructure products, such as those from Hewlett Packard Enterprise's SimpliVity line and from Nutanix, can greatly reduce storage overhead and help build hybrid clouds. We also see innovative approaches merging storage and compute in new ways, using server-side flash (e.g., Datrium), rack-scale infrastructure pooling (e.g., Drivescale) or even integrating ARM processors on each disk drive (e.g., Igneous).
- Bigger is better. If the rise of big data has taught us anything, it's that keeping more data around is a prerequisite for having the opportunity to mine value from that data. Big data distributions today combine Hadoop and Spark ecosystems, various flavors of databases and scale-out system management into increasingly general-purpose data-processing platforms, all powered by underlying big data storage tools (e.g., Hadoop Distributed File System, Kudu, Alluxio).
- Always faster. If big is good, big and fast are even better. We are seeing new kinds of automatically tiered and cached big data storage and data access layer products designed around creating integrated data pipelines. Many of these tools are really converged big data platforms built for analyzing big and streaming data at internet of things (IoT) scales.
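The automatic tiering and caching idea behind these faster platforms can be sketched in a few lines. The class below is a hypothetical illustration, not any vendor's implementation: a small, fast hot tier absorbs recent reads and writes, and least-recently-used blocks are demoted to a larger, slower cold tier.

```python
from collections import OrderedDict

class TieredStore:
    """Toy two-tier store: a small, fast hot tier backed by a large cold tier."""

    def __init__(self, hot_capacity=2):
        self.hot = OrderedDict()   # fast tier, kept in LRU order
        self.cold = {}             # slow, capacious tier
        self.hot_capacity = hot_capacity

    def put(self, key, value):
        self._promote(key, value)

    def get(self, key):
        if key in self.hot:                # hot hit: refresh recency
            self.hot.move_to_end(key)
            return self.hot[key]
        value = self.cold.pop(key)         # cold hit: promote on read
        self._promote(key, value)
        return value

    def _promote(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:   # demote LRU block to cold
            old_key, old_value = self.hot.popitem(last=False)
            self.cold[old_key] = old_value

store = TieredStore(hot_capacity=2)
store.put("a", 1); store.put("b", 2); store.put("c", 3)
print("a" in store.cold)  # True -- "a" was least recently used and got demoted
```

Real products layer this pattern across NVMe, DRAM grids and disk rather than two Python dictionaries, but the policy question is the same: which data earns a spot in the fast tier.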
The changing fundamentals
Powering many of these examples are interesting shifts in underlying technical capabilities.
New data-processing platforms are handling more metadata per unit of data than ever before. More metadata enables new, highly efficient capabilities, such as unifying object, file and block services; providing virtually infinite snapshots; automating file archiving; and optimizing storage for VMs and applications.
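The metadata trick behind "infinite" snapshots can be illustrated with a copy-on-write sketch. This is a hypothetical toy, not any product's code: a snapshot copies only the block map, so it costs metadata rather than data, and blocks are shared until one is overwritten.

```python
class SnapshotVolume:
    """Toy copy-on-write volume: snapshots copy only metadata (the block map)."""

    def __init__(self):
        self.blocks = {}      # block_id -> data (shared, never overwritten)
        self.block_map = {}   # logical address -> block_id
        self.next_id = 0

    def write(self, addr, data):
        # Copy-on-write: allocate a fresh block instead of overwriting in place.
        self.blocks[self.next_id] = data
        self.block_map[addr] = self.next_id
        self.next_id += 1

    def read(self, addr, block_map=None):
        bm = block_map if block_map is not None else self.block_map
        return self.blocks[bm[addr]]

    def snapshot(self):
        # A snapshot is just a copy of the map -- O(metadata), not O(data).
        return dict(self.block_map)

vol = SnapshotVolume()
vol.write(0, "v1")
snap = vol.snapshot()
vol.write(0, "v2")
print(vol.read(0))        # "v2" -- the live volume sees the new write
print(vol.read(0, snap))  # "v1" -- the snapshot still sees the old block
```

Because a snapshot is only a map copy, a system tracking rich metadata can keep taking them essentially without limit, which is exactly the capability described above.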
Then there's data locality. Keeping compute and data local to each other optimizes performance and efficiency. Ten years ago, Hadoop demonstrated how to tackle big data by mapping compute out to partitioned local data chunks, albeit still in a batch query way. Today, we see new kinds of data-processing platform designs that aim to preserve or increase data locality even on real-time streaming data. Whether leveraging dense server-side nonvolatile memory express, distributed in-memory grids, hosting user compute functions right in the storage layer or pushing compute out to the edge of IoT networks, data locality is king.
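The Hadoop-style "move compute to the data" idea can be sketched as a scheduler that prefers to run a task on a node already holding the data chunk. The function and names below are hypothetical, for illustration only.

```python
def schedule(task_chunk, chunk_locations, node_load):
    """Pick a node for a task, preferring nodes that hold the chunk locally."""
    local_nodes = chunk_locations.get(task_chunk, [])
    if local_nodes:
        # Data-local: run on the least-loaded replica holder; no network copy.
        return min(local_nodes, key=lambda n: node_load[n])
    # Fallback: any node, but the chunk must be shipped over the network.
    return min(node_load, key=node_load.get)

chunk_locations = {"chunk-7": ["node-a", "node-c"]}   # replica placement
node_load = {"node-a": 3, "node-b": 0, "node-c": 1}
print(schedule("chunk-7", chunk_locations, node_load))  # node-c: local and lightly loaded
print(schedule("chunk-9", chunk_locations, node_load))  # node-b: no replica anywhere, pick least loaded
```

Newer platforms apply the same preference in reverse as well, pushing user functions into the storage layer or out to IoT edge nodes instead of shipping data to a central compute tier.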
Today's technologies are blurring the lines between persistent infrastructure storage, back-end databases and active application data sets. We now have many NoSQL -- and NewSQL -- variants offering different architectural approaches to scale and consistency. We have streaming services that process data in pipelines and message queues. We have active data lakes, online archives and analytical databases merging with more traditional operational SQL approaches into a unified data-processing platform. It's increasingly difficult to split data persistence out as a separate stage or phase of data lifecycle management.
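The pipeline pattern mentioned above can be sketched with Python generators as a hypothetical stand-in for a real streaming framework: each stage consumes from the previous one, so records flow straight through without first landing in a separate persistence phase.

```python
def source(records):
    """Stage 1: emit raw records (stands in for a message queue consumer)."""
    yield from records

def parse(stream):
    """Stage 2: turn raw CSV lines into (sensor, reading) pairs."""
    for line in stream:
        sensor, value = line.split(",")
        yield sensor, float(value)

def threshold(stream, limit):
    """Stage 3: keep only readings above the alert threshold."""
    for sensor, value in stream:
        if value > limit:
            yield sensor, value

raw = ["s1,20.5", "s2,99.1", "s1,35.0"]
pipeline = threshold(parse(source(raw)), limit=30.0)
print(list(pipeline))  # [('s2', 99.1), ('s1', 35.0)]
```

In production this role is played by systems such as Kafka or Spark Streaming, but the structural point stands: storage, transport and computation interleave in one flow.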
And the long-awaited, almost mythical, hybrid cloud storage vision is finally taking shape. The seemingly simple idea of storage that automatically tiers, or caches, data from elastic cloud storage through on-premises secondary storage to data-local primary storage has been difficult to achieve, given all the heterogeneous, multivendor puzzle pieces involved.
But we all want seamless storage that automatically and inherently spans on-premises infrastructure and public cloud, letting us easily use and move our data where and when we need it. New vendors are emerging that integrate secondary storage with third-party public clouds (e.g., Igneous), but Oracle is perhaps leading the pack with the ZFS appliance, which closely integrates with the public Oracle Cloud.
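The hybrid tiering goal reduces to a placement policy. The sketch below is a hypothetical illustration with made-up thresholds, not any vendor's logic: each data set lands on primary, on-premises secondary or cloud storage based on how recently it was accessed.

```python
def place(age_days, hot_days=1, warm_days=30):
    """Map a data set's last-access age to a storage tier (thresholds illustrative)."""
    if age_days <= hot_days:
        return "primary"     # data-local, fastest
    if age_days <= warm_days:
        return "secondary"   # on-premises, cheaper
    return "cloud"           # elastic, cheapest

for age in (0, 7, 90):
    print(age, "->", place(age))  # 0 -> primary, 7 -> secondary, 90 -> cloud
```

The hard part in practice is not the policy itself but running it transparently across heterogeneous, multivendor tiers, which is precisely why the vision has taken so long to materialize.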
So what's next?
Some IT pundits have called storage a dying business, but that's not quite right. With more and faster data every day, every single storage concern -- from data protection and disaster recovery to global high-performance access and always-on availability -- only demands more deliberate, storage-focused expertise.
On the other hand, many IT folks might prefer a world without much change, as they already have their hands full maintaining the status quo. Still, most of these new approaches are likely to make a lot of today's low-level, manual tasks obsolete, freeing up time and resources to better focus on adding business value.
With so many trends driving storage into new territories, you will just have to be a bit more flexible in your definition of storage. As in everything IT, agility is ultimately the key.