Providing the infrastructure for big data and the newer fast data is not yet a matter of applying cookie-cutter best practices. Both require significant tuning of, or outright changes to, hardware and software infrastructure.
The newer fast data architectures differ significantly from big data architectures and from the tried-and-true online transaction processing tools that fast data supplements. Understanding how big data and fast data change infrastructure requirements will inform your hardware and software choices.
Big data architectures
Big data is about analyzing and gaining deeper insights from much larger pools of data than enterprises typically gathered in the past. Much of the data (e.g., social-media data about customers) is accessible in public clouds. Working with this data, in turn, emphasizes speedy access and deemphasizes consistency, which has led to a wide array of Hadoop big data tools. Thus, the following changes in architecture and emphasis are common:
- Support for in-house software, such as Hadoop and Hive, and scale-out, cloud-enabled hardware, to use as a staging place for social-media or other big data inputs.
- Virtualization and other private-cloud enablement software for existing analytics data architectures.
- Software support for large-scale, deep-dive and ad hoc analytics, plus software tools to allow data scientists to customize for the enterprise's needs.
- Massive expansions of storage capacity, particularly for near-real-time analytics.
Fast data architectures
Fast data is about handling streaming sensor-driven and Internet of Things data in near real time. That means a focus on rapid updates, often loosening the constraint that data be locked against reads until it is written to disk. The enterprise working with this architecture typically applies some initial streaming analytics to the data, using either existing, typically columnar, databases or specially designed Hadoop-associated tools. The following changes in architecture and emphasis are common in this nascent field:
- Database software designed for rapid updates and streaming initial analytics.
- Greatly expanded use of nonvolatile RAM and solid-state drives for fast data storage (e.g., 1 terabyte of main memory and 1 petabyte of SSD).
- Software constraints on response time that resemble those of traditional real-time operating systems.
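As a rough illustration of the first and third points, the following Python sketch keeps a running mean over a sliding time window of sensor readings and checks each update against a response-time budget. It is a minimal sketch under stated assumptions, not how any particular fast data product works: the names `Reading`, `SlidingWindowMean`, `process` and `budget_seconds` are hypothetical, and a real deployment would rely on a streaming database or engine rather than an in-process deque.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class Reading:
    """One sensor event (hypothetical shape, for illustration only)."""
    sensor_id: str
    timestamp: float  # seconds since an arbitrary epoch
    value: float


class SlidingWindowMean:
    """Running mean over readings from the last `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._window = deque()  # Reading objects, oldest first
        self._total = 0.0

    def update(self, reading: Reading) -> float:
        # Add the new reading, then evict everything that has aged out.
        self._window.append(reading)
        self._total += reading.value
        cutoff = reading.timestamp - self.window_seconds
        while self._window and self._window[0].timestamp <= cutoff:
            self._total -= self._window.popleft().value
        return self._total / len(self._window)


def process(reading, window, budget_seconds, clock=time.perf_counter):
    """Update the window and report whether the update met its time budget.

    `budget_seconds` stands in for a real-time-style response constraint;
    `clock` is injectable so the check can be exercised deterministically.
    """
    start = clock()
    mean = window.update(reading)
    elapsed = clock() - start
    return mean, elapsed <= budget_seconds
```

Calling `process(Reading("s1", 0.0, 10.0), SlidingWindowMean(2.0), budget_seconds=0.001)` returns the current mean together with a flag saying whether the update finished within the budget. A production system would also have to decide what to do on a missed deadline, which this sketch leaves out.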
Putting fast and big data together
Fast data is intended to work with big data architectures. Thus, to mesh the two:
- Data is separated on disk between quick-response fast data stores and the less-constrained big data stores.
- The architecture allows big data databases and analytics tools to access the fast data stores.
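The two points above can be sketched in a few lines of Python. This is only an illustrative model under loud assumptions: the `Record` and `TieredStore` names, the `hot_seconds` cutoff and the use of plain lists as stand-ins for a fast (memory or SSD) store and a big data store are all hypothetical, and the article does not prescribe any particular aging policy.

```python
from typing import Iterator, List, NamedTuple


class Record(NamedTuple):
    """One stored event (hypothetical shape, for illustration only)."""
    key: str
    timestamp: float  # seconds since an arbitrary epoch
    value: float


class TieredStore:
    """Keeps recent records in a fast tier and older ones in a big tier."""

    def __init__(self, hot_seconds: float) -> None:
        self.hot_seconds = hot_seconds
        self.fast: List[Record] = []  # stand-in for the quick-response store
        self.big: List[Record] = []   # stand-in for the big data store

    def ingest(self, record: Record) -> None:
        # New data always lands in the fast tier first.
        self.fast.append(record)

    def age_out(self, now: float) -> None:
        # Move records older than the hot window into the big tier.
        cutoff = now - self.hot_seconds
        still_hot: List[Record] = []
        for record in self.fast:
            if record.timestamp > cutoff:
                still_hot.append(record)
            else:
                self.big.append(record)
        self.fast = still_hot

    def scan(self) -> Iterator[Record]:
        # Analytics tools read across both tiers: archived data, then recent.
        yield from self.big
        yield from self.fast
```

An analytics job would call `scan()` and see one continuous record stream, while ingestion only ever touches the fast tier. How often to run `age_out`, and whether to copy rather than move records, are design choices this sketch does not settle.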
This is a very brief overview of typical implementations, and there is a range of choices. Major vendors sell a wide variety of software and hardware to cover all of big data and much of fast data, while groups of open source vendors cover much of the same software territory. Therefore, implementing both fast and big data is often a matter of balancing cost against speed of implementation. Smart buyers can gain competitive advantage by ramping up an effective architecture.
Some small vendors in the fast data field include Redis Labs and GridGain. Larger vendors Oracle and SAP have made plays in both fast and big data architectures. SAP may be the more appropriate vendor for fast data tools. In the hardware space, Intel is keenly interested in fast data. Other traditional big data vendors, like IBM and Dell, which is in the process of acquiring EMC, have not yet expressed as much excitement. Between IBM and EMC, EMC has made more of a splash in flash, so it may become more pertinent to fast data than IBM.
Editor's note: IBM has put forth a blog post detailing its options for fast data.