
Examine data collection problems with big data, IoT in mind

Networking challenges and data collection problems can occur when IT tries to deploy IoT and big data in the data center. Learn how to move data without compromising security.

In part one of this series, we examined how big data and IoT are creating challenges for enterprise IT teams. Understand the relationship between the two, and what IT needs to know before deploying these strategies.

With big data and the internet of things (IoT), the scale of everything is vast. You'll need to prepare your data center for changes of this sort. Decision-makers will need to approach these big issues with care, especially when it comes to the implications for the network. Consider the points of entry for all that data. Devote time and effort to security measures. And determine whether data from multiple sources will make its way to a single data center.

Let's examine the seemingly simple task of incorporating data into the target database. The first challenge is moving data from closed-system devices onto the general network. As with any data-intensive application, it's important to understand the traffic flow and latency requirements of the app.

While performance isn't the most critical factor, consider the effects at the edge layer. Let's assume data sets are aggregated at the edge before being uploaded to a centralized database. If an organization has multiple locations across a country, it will upload several gigabytes of data from each site into a single data center, where the data is then imported into the distributed data set. You'll want to determine how users not involved in big data efforts will be affected during data collection. And you'll need to agree on which protocol will be used to transfer the data.
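A back-of-the-envelope calculation helps frame the impact on other users during collection windows. The sketch below estimates per-site transfer time while capping how much of the link the bulk upload may consume; the site names, data volumes and link speed are illustrative assumptions, not figures from any real deployment.

```python
# Transfer-time estimate for multi-site uploads into a central data center.
# All sites, volumes and link speeds below are hypothetical examples.

def transfer_time_hours(data_gb: float, link_mbps: float, utilization: float = 0.5) -> float:
    """Time to move data_gb over a link, reserving capacity for other users.

    utilization caps the share of the link the bulk transfer may consume,
    so day-to-day traffic is not starved during the collection window.
    """
    usable_mbps = link_mbps * utilization
    megabits = data_gb * 8 * 1024  # GB -> megabits
    return megabits / usable_mbps / 3600  # seconds -> hours

sites = {"east": 40.0, "central": 25.0, "west": 60.0}  # GB per upload window
for site, gb in sites.items():
    print(f"{site}: {transfer_time_hours(gb, link_mbps=100):.1f} h on a 100 Mb/s link")
```

Numbers like these make the protocol discussion concrete: a transfer that takes hours at 50% utilization argues for scheduled windows or rate-limited transfer tools rather than ad hoc copies.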

Distributed databases

Distributed databases are, true to their name, distributed, and that raises immediate questions. Is a Hadoop cluster local to a single data center or spread across several? Where are the users accessing the data? Are those users in the same logical organization? Are there data sovereignty issues around the data? All of these factors affect the flow of data, and they will also shape any security products selected to support the requirements.

It's the interaction with big data that creates challenges. It can be difficult to put the data close to users, and, in the case of cancer research, for example, the user is a dynamic concept. A group of researchers could decide to analyze data using the tools within the Hadoop ecosystem, which could include in-memory analysis tools such as Spark. Spark nodes are typically placed within the same data center as the Hadoop cluster, so data extractions from Hadoop to Spark are relatively simple problems that network engineers can resolve by provisioning additional bandwidth.

The bigger challenge appears when collaboration or analysis is performed outside the low-latency world of a data center. What happens when researchers need to collaborate on petabytes of data located in a data center that is thousands of miles away? Or what happens when researchers want to leverage the low cost of massive compute provided by cloud providers? For networking teams, extracting large amounts of data for local processing by business partners and cloud data centers is the nightmare of big data.

Data replication

The challenge of big data sharing is what forces the elimination of silos.

In our example, we selected Hadoop as the data source. A Hadoop database lends itself to the distributed nature of this type of data sharing. Copies of the data needed by each stakeholder are placed close to those stakeholders. If the data is static, then it's an exercise in working with the application team to understand data placement and the amount of bandwidth needed to replicate the data across distributed cluster nodes. When the data isn't static, a challenge presents itself.
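Sizing the replication bandwidth mentioned above can also be sketched numerically. The example below computes the sustained throughput needed to copy a daily data delta to each remote cluster site within a maintenance window; the delta size, window length and site names are assumptions for illustration only.

```python
# Sizing sketch: sustained bandwidth needed to replicate a daily data delta
# to remote cluster sites within a fixed window. Values are hypothetical.

def required_mbps(delta_gb: float, window_hours: float) -> float:
    """Minimum sustained throughput to copy delta_gb within window_hours."""
    megabits = delta_gb * 8 * 1024  # GB -> megabits
    return megabits / (window_hours * 3600)

remote_sites = ["dc-eu", "dc-apac"]  # hypothetical replica locations
daily_delta_gb = 500.0               # assumed daily change to the data set
for site in remote_sites:
    print(f"{site}: {required_mbps(daily_delta_gb, window_hours=6):.0f} Mb/s sustained")
```

If the required rate exceeds what the WAN links can sustain, the conversation with the application team shifts to smaller deltas, longer windows or placing fewer full copies.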

If the data is updated, then you encounter the challenge of multipoint database synchronization. And these are exceptionally large data sets. It's important to look into systems that integrate network functionality with storage acceleration.

This is an immature market, and many organizations are still struggling with how to enable this distributed model. Some avoid Hadoop and depend on classic flat-file systems, such as the Network File System (NFS). With NFS-based data sets, network engineers can implement WAN acceleration and caching devices to speed access.

An obstacle that's still difficult to resolve is file locking. So far, the most practical approach for shared access is last-write-wins: when simultaneous writes conflict, the most recent update is accepted. There's also been discussion about extending software-defined WAN capabilities to accelerate access to the Hadoop Distributed File System (HDFS), the storage layer underlying Hadoop applications.

The governance of data access is another concern. Network professionals, alongside storage and application teams, help design security controls for data access. In our example, we are using healthcare data, so beyond regulatory issues, IT teams also need to support the security requirements set by the organization's legal team.

Network teams must participate in the conversation on cross-organizational data sharing. Specifically, business users need to understand the limits of the technology. For example, organizations need to determine which data is allowed to cross the edge of one organization into another and which controls are used to monitor that data. One method is to place a tag on the data and use network data loss prevention devices to filter traffic. Another option is to encrypt the data and control access via key management. Important considerations here will be deciding who's in charge of the governance of these systems and how audits will be conducted.
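The tag-and-filter method described above can be sketched as a simple egress check: records carry a classification tag, and a DLP-style rule decides whether they may cross the organizational boundary. The tag names and allow-list below are illustrative assumptions, not a standard taxonomy.

```python
# Sketch of tag-based egress filtering at an organizational boundary.
# Tag names and the allow-list are hypothetical examples.

ALLOWED_ACROSS_BOUNDARY = {"public", "research-shareable"}

def may_egress(record: dict) -> bool:
    """Permit a record to leave the organization only if its tag is allowed."""
    return record.get("tag") in ALLOWED_ACROSS_BOUNDARY

records = [
    {"id": 1, "tag": "research-shareable"},
    {"id": 2, "tag": "phi-restricted"},  # protected health data stays inside
]
outbound = [r for r in records if may_egress(r)]
print([r["id"] for r in outbound])  # [1]
```

In practice the filtering happens in a network DLP device rather than application code, but the governance questions are the same: who maintains the allow-list, and how is the filter's behavior audited.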

Ultimately, the internet of things and the resulting big data activity are not that different from other applications on the network. Applying traditional data and network principles to data collection, access and collaboration is the best approach.

When designing an IoT and big data strategy, keep in mind the nature of the data. Consider who will access it and from which locations. Understand how data is processed and changed, and take care to understand the security profile of the IoT devices being put to work.
