The internet of things involves placing sensors on everything from cars to refrigerators to humans and transmitting the data they collect over an internet connection to a central repository for storage. Once there, that information becomes raw material for big data: the large-scale analysis of all of it.
Big data, however, extends far beyond just the internet of things (IoT). Big data projects can analyze data from traditional or modern databases and even unstructured data. Big data can also correlate the seemingly unrelated information that sensors collect with information in traditional databases to improve organizational efficiency. For example, a shipping company may use sensors in its vehicles to direct drivers along routes that improve delivery efficiency and reduce fuel costs.
The benefits of a big data or IoT project can lead to enhanced productivity, better health or simply a more enjoyable life. As users become more comfortable with the concept, and technology allows for the less obtrusive installation of more devices, the amount of data organizations gather increases exponentially. The challenge is to store this data, which is notably different in both type and quantity from traditional storage data.
Storage demands for a big data, IoT project
From a storage perspective, IoT and big data are similar, but they have different demands. The storage response for an IoT project is dependent on the use case. For sensors, an IoT storage system needs to handle rapid input from potentially millions of sensors simultaneously. Because the data these sensors produce is often tiny, the target storage system needs to store what might amount to trillions of small files without impairing performance.
But an IoT project can also include surveillance images from cameras or drones. This data type is typically a continuous stream, so its storage is dependent on high bandwidth and the ability to store fewer but much larger, high-capacity files than the sensor use case. What makes the challenges even more daunting is that it is not uncommon for an organization to require storage for both IoT use cases.
From a big data perspective, the storage system needs to have access to all, or at least most, of the data that the IoT project creates. You can also use the big data project to analyze existing databases and other unstructured data, as well as to correlate the disparate data sets.
By far, the most common foundation for big data is Hadoop. A Hadoop cluster stores data across its processing servers in the Hadoop Distributed File System (HDFS), and the scheduler assigns each analytics job to the least busy node in the cluster, ideally one that holds a local copy of the data the job needs to analyze. This data locality eliminates the need for an expensive network infrastructure and enables the use of low-cost, server-class storage instead of expensive, shared enterprise-level storage.
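The locality-aware scheduling described above can be sketched in a few lines. This is an illustrative toy, not the real YARN/HDFS scheduler; the function and data-structure names are hypothetical.

```python
# Toy sketch of Hadoop-style data-locality scheduling (illustrative only;
# names and structures are hypothetical, not the real YARN/HDFS API).

def assign_task(block_replicas, node_load):
    """Pick the least busy node that holds a replica of the data block;
    fall back to the least busy node overall (which forces a remote read)."""
    local_nodes = [n for n in block_replicas if n in node_load]
    candidates = local_nodes or list(node_load)
    return min(candidates, key=lambda n: node_load[n])

# Block B1 is replicated on nodes n1 and n3; n3 is the least busy of those,
# so the job runs where the data already lives -- no network transfer.
replicas = ["n1", "n3"]
load = {"n1": 5, "n2": 0, "n3": 2}
print(assign_task(replicas, load))  # -> n3
```

Note that n2 is idle but holds no replica, so the scheduler prefers n3; this trade-off between node availability and data locality is exactly the tension discussed later in the article.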
The data footprint and storage I/O requirements of IoT and big data differ from those of the traditional data center application. First, IoT data is typically a continuous feed. Data sizes can vary from minuscule to enormous, and the number of files to store can reach into the trillions. This makes it easy to create large amounts of data quickly, and, as a result, there is constant demand for capacity growth.
And that growth must scale quickly and in ways that aren't disruptive. Storage systems for an IoT project also need to scale cost-effectively so that an organization can store petabytes of data for a long time. That requires low administration costs and burdens. Most IT staff simply cannot manage a dozen storage systems from six different vendors. IT professionals need to drive their storage hardware requirements to one to three storage systems that cover Tier 1 and Tier 2 applications, as well as the immense amount of unstructured data that IoT and big data create.
Finding the answers to your IoT project challenges
IoT and big data create a number of challenges for IT professionals. IoT has two different file storage needs, and most organizations will eventually need both. The first requires rapid, random ingestion of trillions of small files. The second requires high-bandwidth streaming of far fewer, but much larger, files. It is extremely rare for a single storage system to provide both capabilities; most are tuned either for handling trillions of small files or for streaming large files.
Big data projects bring another set of challenges. First, much -- if not all -- of the data from the IoT project must be transferred to the Hadoop cluster for analysis. Second, the Hadoop cluster must have access to the traditional data in the business, such as databases and user data. In addition, there are challenges with HDFS itself. A single master node assigns analytics jobs and stores all of the metadata for the cluster; if that node goes down, the entire cluster may fail.
There is also the challenge of Hadoop's local storage design. Hadoop protects data by replicating copies between nodes, and most organizations accept the default of three-way replication. From a capacity perspective, that multiplies the storage requirement by a factor of three -- on top of the data already residing on the IoT storage systems.
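The capacity math behind that multiplication is worth making explicit. The sketch below is back-of-envelope only; real clusters need additional headroom for temporary and shuffle space, which it ignores.

```python
# Back-of-envelope capacity math for HDFS three-way replication.
# Hedged sketch: ignores temp/shuffle headroom that real clusters need.

def raw_capacity_needed(dataset_tb, replication=3, keep_iot_copy=True):
    """Raw storage consumed when a dataset is copied into HDFS with
    N-way replication, optionally counting the original IoT copy too."""
    hdfs_tb = dataset_tb * replication
    return hdfs_tb + (dataset_tb if keep_iot_copy else 0)

# A 100 TB IoT data set copied into a default Hadoop cluster:
print(raw_capacity_needed(100))                       # -> 400 (TB total)
print(raw_capacity_needed(100, keep_iot_copy=False))  # -> 300 (TB in HDFS)
```

So a 100 TB data set can consume 400 TB of raw capacity once it is replicated into Hadoop while the source copy still sits on the IoT storage system.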
Another challenge in the Hadoop design is that the most available node in the cluster may not hold a local copy of the data a job needs. In that case, either a less suitable node handles the job, or the data must be transferred across the network to the available node.
The central question then becomes: Can a single storage system solve all of these problems?
The answer depends on the use case. Object storage systems are obvious candidates to be the back-end storage devices for IoT data. Experience shows us that they are more than adequate to support Hadoop environments.
For IoT environments, object storage systems are adept at handling environments with very high object counts. Most object storage systems can also be the back-end storage device for Hadoop environments, either through Amazon Simple Storage Service compatibility or, in some cases, native HDFS support. Providing the Hadoop infrastructure with a shared storage back end adds network latency, but it lessens the burden on the single master control node. It also eliminates the need for 3X replication, because most object storage systems use a parity-based data protection scheme, such as erasure coding.
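The efficiency gain from erasure coding can be quantified. The comparison below assumes a 10 data + 4 parity shard layout, which is only an illustrative geometry; actual schemes vary by product.

```python
# Compare the raw-to-usable storage overhead of three-way replication
# with a typical erasure-coding layout. The 10+4 geometry is an
# illustrative assumption; real products use a variety of schemes.

def replication_overhead(copies=3):
    """Raw TB consumed per usable TB under N-way replication."""
    return float(copies)

def erasure_overhead(data_shards=10, parity_shards=4):
    """Raw TB consumed per usable TB under erasure coding."""
    return (data_shards + parity_shards) / data_shards

print(replication_overhead())  # -> 3.0 (3x raw per usable TB)
print(erasure_overhead())      # -> 1.4 (1.4x raw per usable TB)
```

Under these assumptions, erasure coding stores the same usable data in less than half the raw capacity that three-way replication requires, while still tolerating multiple shard failures.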
The other advantage of using an object storage system is that IoT devices can send data directly to the same storage the Hadoop environment uses. Sharing the data reduces capacity consumption and avoids wasting time waiting for data to transfer between an IoT data storage device and a Hadoop storage device.
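A sketch of what that direct path might look like: a sensor reading serialized and assigned a time-partitioned object key, so analytics jobs can later prune by date prefix. The endpoint, bucket, and key scheme are illustrative assumptions, not any vendor's actual layout.

```python
# Sketch of an IoT sensor posting a reading straight to an S3-compatible
# object store that Hadoop can later read from the same bucket.
# The key scheme, bucket, and endpoint are illustrative assumptions.
import datetime
import json

def object_key(sensor_id, ts):
    """Time-partitioned key so analytics jobs can prune by date prefix."""
    return f"readings/{ts:%Y/%m/%d}/{sensor_id}-{ts:%H%M%S}.json"

reading = {"sensor": "truck-0042", "fuel_pct": 63.5}
ts = datetime.datetime(2018, 6, 1, 12, 30, 15)
key = object_key(reading["sensor"], ts)
body = json.dumps(reading).encode()
print(key)  # -> readings/2018/06/01/truck-0042-123015.json

# Against a real S3-compatible endpoint, the upload would be a single
# PUT, e.g. with boto3 (endpoint and bucket names are hypothetical):
#   s3 = boto3.client("s3", endpoint_url="https://objectstore.example")
#   s3.put_object(Bucket="iot-raw", Key=key, Body=body)
```

Because both the sensors and the analytics cluster address the same bucket, no second copy of the data ever needs to exist.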
The challenge with that design is the data center will likely still need another storage system for its production application environment. The organization may also need to store and process video data from IP cameras and similar IoT devices. If that's the case, some object storage systems may not be appropriate, and tuning others to handle both large and small files effectively at the same time is rarely optimal.
Beyond object storage
The protocols within the data center are starting to blend. Many storage systems on the market can provide a variety of protocol support, including object, network file system (NFS), server message block (SMB), internet small computer system interface (iSCSI) and even Fibre Channel (FC).
Each protocol suits different use cases. For example, FC is ideal for mission-critical databases but is often considered too expensive for Tier 2 and Tier 3 applications. For those lower-priority applications, iSCSI is often the protocol of choice. NFS is excellent for high-performance file sharing and is gaining traction as a storage area for virtual machine images. Even for a big data or IoT project, there are times when NFS is more appropriate than object storage.
Most data centers will have to select at least one storage system to complement their primary storage system. While object storage is capturing a lot of attention, high-performance, cost-effective NFS/SMB systems are making a comeback. These systems scale out the way object storage systems do, often use a similar erasure coding style of data protection and support a wide variety of protocols; in some cases, a single system can fill all of these roles.
Which strategy an organization chooses will depend on what type of IoT and big data they expect to manage, and the scope of the project. Another consideration is the age and suitability of its current storage assets to solve the IoT and big data problems. If the data center's current production storage is supporting high-performance requirements of Tier 1 and Tier 2 applications, adding object storage on the back end may be ideal.
If the performance requirement of the Tier 1 and Tier 2 applications is somewhat more modest, then a single storage infrastructure that delivers all protocols may be of interest. While these more general-purpose systems don't tend to perform as well as focused systems, they often provide more than adequate performance for a typical data center. Plus, they offer the benefit of consolidation onto a single storage system. The result should be lower costs and an increase in operational simplicity.
IoT and big data can change how an organization conducts its business. The insight that the combination can provide allows a company to make significant improvements to the way it creates new products and responds to customers. But these initiatives have a significant impact on an IT infrastructure, especially storage.
IT professionals need a strategy for a big data and IoT project that allows the storage infrastructure to live up to its full potential. The right products are available to meet the challenge, whether that's for high file counts and high capacity or a consolidated storage answer.