IT shops turn to appliances for columnar databases to enhance relational databases and provide better business intelligence and analytics processing. The four main columnar vendors -- Oracle, HP, SAP and IBM -- each have their own take on the technology.
Oracle gives administrator power
Oracle's Exadata uses a hybrid columnar approach in which the row-oriented and columnar approaches are viewed more or less as one database. In this case, Oracle applies columnar storage and indexing to particular disk blocks, giving the administrator more freedom to choose between the two approaches. However, by all accounts, figuring out the best split involves a lot of extra work.
HP targets data warehousing
HP's Vertica technology for columnar databases focuses on the data warehousing use case, with data compression techniques such as inverted list, bit-mapped indexing and hashing. The database can use the column format to zero in faster on the data that the query requires and, because the data is compressed an average of 10 times (according to my last information from Vertica), more data can be crammed into main memory for faster processing. HP claims 10 to 100 times faster performance compared to row-oriented relational.
The basic data in Vertica is stored as projections, also called materialized views. A set of columns in a tuple is stored one relational way; each column also shows up in a projection, but the projection is cross-tuple (one from tuple A, one from tuple B and so on). This gives an alternative, potentially faster way of querying than basic storage and it improves robustness by adding in redundancy. Consider it similar to RAID 5 -- projections can be striped across disks.
Vertica also addresses the trend of adding updates to data-warehousing databases via load windows, to the point where data is pretty up-to-date. To keep write locks from gumming up performance, Vertica splits the storage into a write-optimized column store (WOS; effectively, a cache) and a read-optimized column store (ROS). Periodically, the WOS becomes the ROS. So the write locks for the updates only interfere with reads when there's a mass update. At the same time, such a mass update can re-store whole chunks of the ROS, optimizing storage efficiency. To gain performance via concurrency, the query can be carried out across the ROS and WOS. All this redundancy eliminates logs, which boosts performance further.
SAP integrates in-memory database
SAP's columnar technology starts with Sybase IQ, but HANA is its newer offering that combines columnar and in-memory database technology. While other vendors claim to provide in-memory technology that works in harmony with columnar and row-oriented databases, SAP's differentiation is that HANA is designed to fully integrate the two. The added performance is more than the sum of its in-memory and columnar parts.
IBM goes at register speed
IBM's BLU Acceleration is the newest and perhaps the most exciting columnar offering. BLU Acceleration loads entire pieces of data -- one or multiple columns, in compressed form -- into a register and applies basic operations to it, without needing to decompress it or break it into parts. The usual parallelism between registers via single-instruction-multiple-data-stream techniques and cross-core parallelism adds to the performance advantage. The speed of the transaction is gated by the speed of the register, not by the speed of main memory access. BLU Acceleration is able to do this on any type of data represented in a columnar format.
By doing register-speed processing on columnar data, BLU Acceleration allows a columnar-format storage and database to beat an equivalent row-oriented relational storage and database over most of today's read-only data processing -- i.e., most reporting and analytics.
The IT industry generally believes that row-orientated databases perform better than columnar when more than one or two columns in a row need to be read in a large-scale transaction, because a column-oriented database needs to seek back and forth across a disk for each needed column that is physically stored together in row-oriented storage. However, as emphasis shifts to registers, main memory becomes the performance key -- main memory is flat storage, where columns can be loaded into the processor simultaneously without seeking around the disk. Solid-state disk is really flat and permanent data storage, sometimes with a disk-access veneer attached. The disk veneer may not be needed if everything can be stored in a gigabyte of main memory plus a terabyte of solid-state disk. Columnar will match or exceed row-oriented queries nearly every time.
In today's databases with flat storage, there is little need for indexing -- IBM claims that BLU Acceleration eliminates the complexities of indexing entirely, though it does contain an index that gives each column a unique ID. Gone is the time-consuming optimizer decision-making about which index to use to generate the best performance, and the time-consuming effort to tune and retune the database indexing and storage. While you'll still need a BLU Acceleration database administrator, it reduces administration time dramatically.
IBM's BLU Acceleration technology does not yet have the write-to-disk capabilities required for update-heavy transactional performance. Also, where there are very high-end applications requiring massive disk storage and involving lots of three or more column reads, row-oriented can likely compete. But BLU Acceleration has added column-based paging to increase column upload speeds; to load from disk or disk-veneer SSD storage into main memory, one swaps in a page defined as containing one or multiple columns.
About the author:
Wayne Kernochan is president of Infostructure Associates, an affiliate of Valley View Ventures. This document is the result of Infostructure Associates-sponsored research. Infostructure Associates believes that its findings are objective and represent the best analysis available at the time of publication.