
Test the water before a data lake implementation

Before rushing into a large-scale data lake implementation, start small and treat the technology as an extension to your existing analytics systems.

Recently, data lakes have started to make waves in the IT industry. Data lakes are data stores combined with an attendant data management system that preserves, and provides analytics on, raw data -- detail that is typically stripped from other analytics environments, such as a data warehouse or data mart, as part of the data cleansing process.

For example, a data warehouse's extract, transform and load preprocessors eliminate the logs that record when a piece of data arrived in, or was inserted into, an operational data store.

But in the industry today, data lakes seem to have at least two definitions. One, which originated from storage companies, is that a data lake is a disk-storage infrastructure that allows for metadata storage. The other, which is primarily marketing-driven, is a lake mixing multiple data stores that aren't typically mixed. By my definition, no vendor sells a full-scale data lake -- rather, people cobble them together using Hadoop and homegrown tools to access the data.

As the initial vendor hype gave way to real-world experimentation, users found that best practices for data marts don't apply to data lakes. To avoid the mistakes of early users, approach a data lake implementation modestly, rather than at large scale.

Here are some best practices that prove useful while working with data lakes.

Remember that data lakes are exploratory

A data lake implementation should allow organizations to extend existing analytics in an ad hoc, exploratory fashion. Grow the data types in the data lake from a core of highly current data -- for example, customer transaction logs -- that current analytics systems won't surface in a timely fashion. Most existing analytics aren't sufficient for a true picture of how your applications behave: data warehouses, "pure" Hadoop and other data management schemes lose important data.

On his blog, James Dixon, CTO at Pentaho Corp., a provider of big data analytics systems, cites an example: systems such as data warehouses don't capture each step a customer takes in the buying process, but the transaction logs do. The design of such a buying process may seem straightforward to the typical data architect, but there can be minutes or even hours of infuriating lag at each step.

By discovering lags in the process, users can start the data lake implementation with customer-facing, buying-related transactions. It's important that these analytics be exploratory and matter to the enterprise's overall analytics effort, because it's unclear what else will be uncovered once users analyze customer log timestamps more thoroughly.
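The lag analysis described above can be sketched in a few lines. This is a minimal, hypothetical example -- the session IDs, step names and 10-minute threshold are illustrative assumptions, not a real log schema -- showing how raw timestamped transaction-log events, the kind a data warehouse's ETL would discard, reveal slow steps in a buying process:

```python
from datetime import datetime

# Hypothetical raw transaction-log records: (session_id, step, timestamp).
# Names and values are illustrative, not a real schema.
events = [
    ("s1", "view_cart",     "2016-03-01 10:00:00"),
    ("s1", "enter_payment", "2016-03-01 10:02:30"),
    ("s1", "confirm_order", "2016-03-01 11:15:00"),
    ("s2", "view_cart",     "2016-03-01 10:05:00"),
    ("s2", "enter_payment", "2016-03-01 10:06:10"),
]

def step_lags(events, threshold_seconds=600):
    """Return (session, from_step, to_step, lag_seconds) for every
    gap between consecutive steps that exceeds the threshold."""
    by_session = {}
    for session, step, ts in events:
        by_session.setdefault(session, []).append(
            (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), step)
        )
    slow = []
    for session, steps in by_session.items():
        steps.sort()  # order each session's events by timestamp
        for (t0, s0), (t1, s1) in zip(steps, steps[1:]):
            lag = (t1 - t0).total_seconds()
            if lag > threshold_seconds:
                slow.append((session, s0, s1, lag))
    return slow

print(step_lags(events))
# Session s1 took over an hour between entering payment and confirming.
```

In a real data lake, the same computation would run over months of retained logs -- the point being that the raw timestamps must survive ingestion for this analysis to be possible at all.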

What's the difference between data marts, lakes and warehouses?

Data marts are smaller variants of data warehouses. A data warehouse stores slightly older data from across the organization for reporting and analytics. Multiple data marts, each typically serving a subsidiary within its own IT environment, are a rough equivalent of a data warehouse. You can have multiple data marts feeding into a data warehouse, or just loosely coupled data marts.

Integration is key for data lake implementation

It's also important to fully integrate the data lake with the rest of your enterprise data architecture, including data governance and master data management. Understand which data types matter to the data warehouse or data marts and whether the data in raw form is correct and consistent. Implement data governance practices to avoid analyzing flawed data.

Data lakes in the long run

Data lakes have potential. But they're likely to be just a fad unless we get a much better idea of what they can deliver long term -- that is, unless their benefits prove much broader than has been concretely shown so far.

Dixon's example of data warehousing's problems when incorporating time sequencing and spacing is only one instance of how today's analytics continue to rely on simple statistics without considering what "bad" data can tell us. Since a data lake implementation can unearth key "gotchas" in analytics, it's worth exploring for any enterprise. In the long run, however, this requires both experimentation and careful balancing of the data lake and your overall information architecture.
