In July, IBM’s zEnterprise announcements featured Smart Analytics Optimizer (SAO), an appliance that IBM vaguely referred to as “columnar.” As details about SAO have accumulated, so
The details of column-oriented databases
For the typical enterprise, business intelligence-type data management costs too much and frequently underperforms. Data in existing data warehouses grows by 50% each year due to compliance and regulatory requirements and enterprise demands that call for storage of new types of data. A business marked by a high rate of change calls for more frequent analyses performed on fresher, time-critical data. As a result, data storage and workloads are growing faster than hardware-performance and price-performance improvements.
In many cases, columnar database technology -- storing relational data as columns instead of rows -- has proven to be an excellent answer to these problems in the largest of very large database implementations (see, for example, recent Sybase IQ and ParAccel 1 and 3 TB TPC-H results). The reason is that columnar databases, compared with relational, can compress the data and indices, sometimes in order of magnitude, and carry out transactions on the compressed data.
Columnar databases vs. row-oriented databases
How do columnar databases compare to traditional row-oriented relational databases in data warehousing? There seems to be some agreement between columnar advocates and row-oriented advocates that sheds some light on the appropriate places for each -- and for hybrid databases that combine both technologies.
Daniel Abadi has written an excellent post that summarizes the database spectrum. If I understand his taxonomy correctly, row-oriented excels in “random reads,” where a single record with more than two to three fields is accessed (or for single inserts and updates); columnar excels for queries across a large number of records whose columns (or “fields”) contain some commonality that makes for effective compression. The hybrids attempt to draw performance advantages from both technologies. In particular, the fractured-mirror hybrid approach operates in a real-world disaster recovery environment and treats one copy of the data store as row-oriented, the other as column-oriented, sending transactions to either as appropriate.
I would argue that the future belongs to pure columnar databases or fractured-mirror databases rather than row-oriented or other flavors of hybrid. Here is my reasoning: As data stores continue to scale rapidly in size, the amount of storage devoted to main memory and solid-state drives (SSDs) is likewise going up. The disadvantage of columnar in “random reads” is therefore moving to become an advantage as the percentage of compressed columns in main memory/SSDs increases, while the percentage of less-compressed rows kept in main memory/SSDs stays about flat.
The same trends make the pure or fractured-mirror columnar database of increasing interest for operational data stores that combine online transaction processing with lots of updates and continuous querying and reporting. For updates, the competition between columnar and row-oriented is getting closer, as many of these updates involve only one to three fields/columns of a row, while updates are most likely to affect the most recent data and therefore the data increasingly likely to be in main memory/SSD cache. For inserts/deletes, updating in-memory indexes immediately along with “write to disk later” means that the need of columnar for multiple I/Os doesn’t have to be a major disadvantage in many cases. And for online backup, the compressed data of the columnar approach wins hands down.
Columnar databases in today’s data warehouses often result in order-of-magnitude improvements in performance versus a comparable relational database. Columnar technology is now well proven (Sybase IQ has been at it for a decade) and widely available (smaller vendors such as Vertica offer it), and Oracle has added a columnar option to Exadata, its database appliance. And now, IBM basically offers a “columnar appliance” that makes decisions about which data should be columnar or row-oriented and then shunts transactions on columnar data to the appliance.
Columnar databases and the mainframe
At first glance, columnar database technology appears suited only for loosely coupled distributed architectures such as x86 and System p. Certainly every columnar vendor, from Vertica to Sybase IQ to VectorData, will tell you that they are designed from the ground up for x86 or operation on three-tier or grid Unix/Linux architectures. However, closer examination reveals that the mainframe has a surprising history of columnar support. Both VectorData and 1010data cite their ancestor as the APL columnar solution on the mainframe in the 1970s. Sybase IQ’s pedigree includes Computer Corp. of America’s Model 204 inverted-list database, which was columnar but not relational.
Moreover, the distributed-architecture approach is not necessarily as well suited as the mainframe to event-stream processing using a cache of historical data -- an increasing use of database technology. The mainframe has more advanced pipelining for better stream-type parallelism, and the Pervasive DataRush technology shows the performance advantages of this type of parallelism, even for large-scale (2 terabytes/hour according the MalStone B metric) stream-type transaction processing.
Finally -- although I see some disagreement from EMC on this -- the recent acceptance of dedupe technology by users means that historical data is well on the way to existing in highly compressed form. This should mean that the columnar approach of operating on compressed data and decompressing only when the data is sent to the user can be applied to historical and archived data from mainframe data warehouses as well.
To sum up: It is very possible that columnar database technology may, in the next three to 10 years, become superior in performance/scalability to relational in most, if not all, data warehouses, in new event-stream applications, in operational data handling, and in disaster recovery and information lifecycle management -- in just about every use case that matters to a large organization. Moreover, the mainframe as the locus of most large-scale data warehousing is the logical springboard for IBM support of columnar technology. Therefore, assuming that IBM decides to follow this technology wave, it is time for mainframe users to start paying attention to columnar technology.
Paying attention to columnar technology
What does “paying attention” mean? In the first place, Smart Analytics Optimizer offers a nice playpen for determining the applications and workloads that will benefit most from transition to columnar storage. Therefore, users should consider employing SAO not only to juice up specific high-end business intelligence applications, but also to experiment with data-warehousing apps that would seem at first blush to be better suited to relational row-oriented storage.
The other task for mainframe users right now is to get familiar with columnar storage management and disaster recovery. While treating the appliance as a “black box” is the easiest and least costly thing to do, understanding why IBM has made the split between row-oriented and columnar, and how the indexing and caching works, may well pay dividends in managing columnar databases on the distributed/Wintel side of the house, or in dealing with Oracle Hybrid Columnar Compression in its Exadata appliance.
Over the past year, I have become convinced that columnar technology, unlike object and object-relational database technology in the 1990s, will only increase in importance and cost advantages as the years go by. Sooner or later, I believe, the mainframe will reflect this increased importance. If it’s later, no harm will be done by a little anticipatory research; if it’s sooner, then this is a prime opportunity for mainframers to share in the next generation of transactional technology and prove their continuing value to the business as a whole.
ABOUT THE AUTHOR: Wayne Kernochan is president of Infostructure Associates, an affiliate of Valley View Ventures. Infostructure Associates aims to provide thought leadership and sound advice to vendors and users of information technology. This document is the result of Infostructure Associates-sponsored research. Infostructure Associates believes that its findings are objective and represent the best analysis available at the time of publication.
What did you think of this feature? Write to SearchDataCenter.com's Matt Stansberry about your data center concerns at firstname.lastname@example.org.
This was first published in October 2010