Published: 20 May 2014
In-memory processing can improve data mining and analysis, and other dynamic data processing uses. When considering in-memory, however, look out for data protection, cost and bottlenecks.
When you need top database speed, in-memory processing provides the ultimate in low latency. But can your organization really make cost-effective use of an in-memory database? It's hard to know whether that investment will pay off in real business value.
And even if the performance boost is justified, is it possible to adequately protect important data kept live in-memory from corruption or loss? Can an in-memory system scale and keep pace with what's likely to be exponential data growth?
There's an ongoing vendor race to address these concerns. Vendors are trying to practically deliver the performance advantages of in-memory processing to a wider IT market as analytics, interactive decision-making and other (near-) real-time use cases become more mainstream.
Memory is the fastest medium
Using memory to accelerate performance of I/O-bound applications is not a new idea; it has always been true that processing data in memory is faster (10 to 1,000 times or more) than waiting on relatively long I/O times to read and write data from slower media -- flash included.
Since the early days of computing, performance-intensive products have allocated memory as data cache. Most databases were designed to internally use as much memory as possible. Some might even remember setting up RAM disks for temporary data on their home PCs back in the MS-DOS days to squeeze more speed out of bottlenecked systems.
Today's in-memory processing takes that concept to the extreme: using active memory (dynamic RAM) to hold the currently running database code and active data structures, and to keep the persistent database itself in memory. These databases forget about making any slow trips off the motherboard to talk to external media and instead optimize their data structures for memory-resident processing.
Historically, both the available memory density per server and the relatively high cost of memory were limiting factors, but today there are technologies expanding the effective application of in-memory processing to larger data sets. These include higher per-server memory architectures, inline/online deduplication and compression that use extra (and relatively cheap) CPU capacity to squeeze more data into memory, and cluster and grid tools that can scale out the total effective in-memory footprint.
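To make the compression trade-off concrete, here is a minimal sketch that spends CPU time to shrink an in-memory data set. It uses Python's standard zlib module purely as a stand-in; the sample records are hypothetical, and real in-memory databases use their own formats and algorithms.

```python
import zlib

# Hypothetical sample: repetitive records, as is common in structured data.
rows = [f"region=EMEA;status=shipped;order={i % 50}" for i in range(10_000)]
raw = "\n".join(rows).encode("utf-8")

# Spend CPU cycles to shrink the in-memory footprint.
packed = zlib.compress(raw, level=6)

# Decompress on access; the round trip must be lossless.
restored = zlib.decompress(packed)

ratio = len(raw) / len(packed)
print(f"raw={len(raw)} bytes, packed={len(packed)} bytes, ratio={ratio:.1f}x")
```

The ratio achieved depends heavily on how repetitive the data is, so results on real data sets will differ from this artificial sample.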
Memory continues to get cheaper and denser. Laptops now come standard with more addressable memory than entire mainframes once had. Today, anyone with a credit card can cheaply rent high-memory servers from cloud providers like Amazon Web Services, which just recently rolled out R3 server instances with an impressive 244 GB RAM, each for $2.80 per hour. Those with deeper pockets can buy rack servers with 4 TB memory from Hewlett-Packard (ProLiant DL980 G7) or 6 TB from Dell (R920), or they can upgrade to an Oracle M6-32 with 32 TB.
In-memory is all the rage for big unstructured data processing as well as for more traditional structured database applications. So-called commodity server clusters with 128 GB of RAM on each node can accommodate in-memory real-time and query tools in Hadoop clusters.
Caching, analysis and transactions -- oh my!
In traditional transactional databases, only hot data records must be held in memory for performance, while cheaper disks are used for low-cost capacity. The database engine has long been tuned for this purpose. Today there may be cases where holding the whole database in memory makes sense. These range from Web-scale, distributed key-value caching stores and condensed analytical databases to real-time "operational intelligence" grids.
In large Web applications, thousands of concurrent users hitting a Web server require the server-side persistence of "session data" each time they interact with a webpage. In these instances, a centralized transactional database can be a big bottleneck. Scalable database products like Memcached and Redis provide fast, in-memory caching of key-value type data.
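As a rough illustration of key-value session caching, the sketch below keeps session data in a plain Python dictionary with a per-key expiry time. The SessionCache class and its method names are hypothetical and are not the Memcached or Redis API; those products add networking, distribution and eviction policies on top of the same basic idea.

```python
import time

class SessionCache:
    """Toy in-memory key-value store with per-key expiry (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (value, expires_at)
        self._clock = clock  # injectable so expiry can be tested

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily evict expired sessions
            return default
        return value

cache = SessionCache()
cache.set("session:42", {"user": "alice", "cart": ["sku-1"]}, ttl_seconds=1800)
print(cache.get("session:42"))
```

Because every lookup is a dictionary access in local memory, the Web tier does not have to query the central transactional database on each page interaction.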
When an entire structured database needs to be repetitively queried -- as in many kinds of data exploration, mining and analysis -- it is beneficial to host the whole database in memory. Columnar analytical databases designed for business intelligence (BI) have optimized data storage formats, although often in some partially compressed state less suitable for high-volume transactional work. In the race to produce faster analytical insights, suitable in-memory options are evolving.
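The difference between row and column layouts can be sketched in a few lines. The example below is only an assumption-laden toy: the order records and the total_by_region helper are hypothetical, and production columnar engines add compression, vectorized execution and much more.

```python
from array import array

# Hypothetical order records in row-oriented form, as a transactional engine might hold them.
rows = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "APAC", "amount": 80.5},
    {"order_id": 3, "region": "EMEA", "amount": 42.25},
]

# Column-oriented form: one contiguous sequence per attribute.
columns = {
    "order_id": array("q", (r["order_id"] for r in rows)),
    "region": [r["region"] for r in rows],
    "amount": array("d", (r["amount"] for r in rows)),
}

def total_by_region(cols, region):
    """Sum amounts for one region by scanning only the two columns involved."""
    return sum(a for reg, a in zip(cols["region"], cols["amount"]) if reg == region)

print(total_by_region(columns, "EMEA"))
```

An analytical query touches only the columns it needs, which is why this layout suits repeated scans, while inserting or updating one complete row means touching every column, which is part of why it is less suited to high-volume transactional work.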
Leading the charge is SAP's HANA, a scale-out in-memory database designed to host critical enterprise resource planning data. It can provide a near-real-time analytical BI gold mine, and it can also be used for other data. Reports that might take hours to run over an original transactional database can be finished in seconds in HANA. It is also becoming increasingly capable of handling transactions directly, further increasing its utility.
Other columnar analytical database vendors, such as Teradata and HP Vertica, strive to use as much memory as possible, for example by caching compressed column data in memory. They also support the analysis of data sets that are still too large for any cost-effective or practical fully in-memory tool. Vertica, for example, has a hybrid in-memory approach that enables fast data loading through memory (eventually staged to disk) with near-real-time access to both on-disk and in-memory data.
Oracle has a full in-memory product for both analytics and fast online transaction processing (OLTP) in its high-end TimesTen/Exadata appliances, and it has added in-memory processing options to its more traditional database software lineup.
Oracle Hybrid Columnar Compression (HCC) is a great example of how to compress transactional data as it ages, while actually accelerating analysis. Last fall, Oracle announced a further evolution: an in-memory processing option that simultaneously keeps data in memory in an HCC format for fast analytics and as complete transactional rows to speed up OLTP.
Recently, Microsoft has started enhancing SQL Server to support more OLTP in memory too. The Hekaton project promises to allow an admin to designate individual database tables to be hosted in memory instead of disk, where they will fully function and interact transparently with regular disk-hosted tables.
Can in-memory data be protected?
One of the first IT concerns with in-memory databases is protecting the data. Dynamic random access memory (DRAM) may be fast, but if you pull the plug, all is lost. And even if it is backed up to persistent media, how quickly can it be restored? After all, if the point of investing in in-memory is its contribution to real-time competitive business value, gross amounts of downtime will be costly.
A common approach is to use local disk or flash solid-state disk as a local backup target to log and persist recent data updates (in case of local outages, reboots, etc.). Unfortunately, that alone is insufficient if the host server goes down hard or is lost altogether.
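The local logging approach described above can be sketched as an append-only log that is replayed on restart. This is a simplified illustration under stated assumptions: the LoggedStore class, the JSON record format and the file name are all hypothetical, and the sketch does not handle torn writes, log compaction or loss of the host itself.

```python
import json
import os
import tempfile

class LoggedStore:
    """Toy in-memory dict that appends each update to a local log file (hypothetical)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        # Rebuild in-memory state from the log after a restart.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # Persist first, then apply the change in memory.
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())  # ask the OS to push the update to the device
        self.data[key] = value

    def close(self):
        self._log.close()

log_path = os.path.join(tempfile.mkdtemp(), "updates.log")
store = LoggedStore(log_path)
store.put("balance:alice", 100)
store.put("balance:alice", 75)
store.close()

# Simulate a reboot: a new instance recovers its state from the local log.
recovered = LoggedStore(log_path)
print(recovered.data)
recovered.close()
```

Note that the log lives on the same machine as the memory it protects, so it covers reboots and local outages only.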
A good example of how to fully protect an in-memory database like SAP HANA is using a recent release of HP's Data Protector and StoreOnce Federated Deduplication. Data Protector picks up HANA data as HANA streams out changes through a third-party application programming interface "pipe." It then targets the backup into StoreOnce globally deduplicated storage.
If there are any issues, HANA can be directly recovered (or migrated) from Data Protector through HANA's interfaces, either locally or at a remote site via a StoreOnce replicated target.
Laying out the grid
Financial houses have long used in-memory applications built on scale-out memory grids to generate operational intelligence in real time on large amounts of very fast streaming data. Grid offerings from Tibco, ScaleOut Software, Pivotal and GridGain go beyond in-memory database features to provide a broader computing platform in which database operations, big data processing and other data-intensive compute tasks can all operate in near real time. If you are building your own lightning-fast in-memory application, you're probably already using a grid-based approach.
Grid platforms aim to ensure that enterprise capabilities for data protection, disaster recovery and high availability are built-in. GridGain, for example, can tolerate multiple node failures, has native data center replication and even supports in-place upgrades with zero downtime.
ScaleOut offers a free license up to a certain size, while GridGain has recently open-sourced a community edition.
The be-all and end-all?
Before buying a big in-memory system for its performance boost, take a deeper look at what performance is actually required and exactly where current bottlenecks occur. Memory is still not as cheap as hard disk or flash, and there may be other ways to get the required performance with less cost, effort and disruption.
If you have a MySQL or MongoDB database, one fast (and cheap) option is to upgrade the underlying database engine from the default to a performance-optimized one, like Tokutek's "fractal" indexing technology. It's certainly worth considering a free upgrade that doesn't involve any disruption to infrastructure or applications.
Still, memory will continue to come down in price and increase in per-server density. Memory configurations may soon include finer grained server memory options, including stunning amounts of processor cache and maybe even non-volatile memory (flash). With more memory options, more data processing will move in-memory.
At the same time, data growth will outpace advances in memory, so there will always be a need to stage and tier data across cheaper media at large scales. A key IT capability -- and opportunity -- will continue to be in optimizing the complex data environment by dynamically directing where data flows and managing which data resides in the fastest media.
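One simple way to picture this kind of tiering is a bounded hot tier in memory that demotes its least recently used entries to a cheaper tier. The sketch below is hypothetical throughout: the TieredStore class is invented for illustration, and an ordinary dictionary stands in for disk or flash.

```python
from collections import OrderedDict

class TieredStore:
    """Toy two-tier store: a bounded hot tier over a larger cold tier (hypothetical)."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # most recently used keys at the end
        self.cold = {}            # stands in for disk or flash in this sketch

    def put(self, key, value):
        self.cold.pop(key, None)  # avoid a stale copy in the cold tier
        self.hot[key] = value
        self.hot.move_to_end(key)
        self._demote_if_needed()

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        value = self.cold.pop(key)  # raises KeyError if the key is unknown
        self.put(key, value)        # promote on access
        return value

    def _demote_if_needed(self):
        while len(self.hot) > self.hot_capacity:
            old_key, old_value = self.hot.popitem(last=False)  # least recently used
            self.cold[old_key] = old_value

store = TieredStore(hot_capacity=2)
for k in ("a", "b", "c"):
    store.put(k, k.upper())
print(list(store.hot), list(store.cold))
```

Least-recently-used demotion is only one possible policy; a real system might weigh access frequency, data age or business priority when deciding what stays in the fastest media.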
But the rise of in-memory scale-out databases for structured data, as well as scale-out big data for unstructured data, also underscores a more fundamental shift in data processing. As data becomes more dynamic, more streaming, more cluster-hosted and more "online," compute tasks will increasingly need to be dynamically routed to relevant live data rather than recalling static data into compute-centric hosts.
In other words, compute will need to flow like data, while data will reside in the server. The management and intelligence required to optimize that future just might be the next bigger thing.
About the author
Mike Matchett is a senior analyst and consultant at Taneja Group.