That scene is an appropriate starting point for a discussion of open source data warehousing. There are many perfectly good commercial products available that can be used to analyze data for operational insights. Why would you turn to an open source solution? It's simple. It's because there's a lot of zeroes missing from the check.
Using open source can significantly reduce the costs of your data warehouse – up to 80% less than a comparable commercial data warehouse product would cost. Furthermore, open source can give smaller companies an opportunity to use data warehouses – opportunities they could never have afforded due to the high price points of the commercial offerings.
This article, presented in two parts, will outline two case studies of open source data warehousing – one small company and one large one, and examine the financial benefits of using open source as the foundation of the applications. I think you'll find the examples eye opening.
Publishing the right books
O'Reilly Media Inc. of Sebastopol, Calif., is the publisher of many technical books and probably familiar to most of the readers of this column. What's probably not so well known is the way the publishing industry actually operates. Acquisition editors, who select which books to publish, are actually entrepreneurs in their own right; each editor carries profit-and-loss responsibility for how well his or her books do.
In the past, deciding which books to publish was primarily based on the experience and gut feel of the editor and other individuals involved in the decision making process. In other words, a lot was left up to luck. Some books did well, others not so well, and others, let's just say, ended up in the bargain bin. O'Reilly decided there must be a better way to decide what books to publish and set out to find it.
While it might seem that deciding what books to publish is a black art, there is a source of information available about what books are actually selling. A weekly data feed, called BookScan, identifies how many copies of each technical book have been sold throughout the U.S. and is available to any publisher who chooses to subscribe to it.
O'Reilly decided to take greater advantage of the BookScan data it had by analyzing it for trends, anomalies, and surprises. In short, they decided to create a data warehouse in order to provide concrete data as an input to the acquisition process. To implement their decision, they hired Roger Magoulas, an expert in data warehouses, as director of Research and Analysis. Roger and I occupied adjacent cubes at Pacific Bell's IT department – in the same building as Scott Adams, who based his Dilbert comic strip on his time working for Ma Bell. Roger shared his experience in building the O'Reilly data warehouse with me.
As you might expect, O'Reilly is a strong proponent of open source. So, it was natural that they wanted their data warehouse based on an open source. Beyond that, however, there were concrete economic reasons for the decision. While O'Reilly's books are known around the world, it is actually a pretty small company. It's privately held and, therefore, doesn't announce financial results, but likely it has yearly revenues of less than $50 million. Consequently, it needed an inexpensive analytic solution, which dictated open source.
Constructing a data warehouse is by no means a slam-dunk. Typically, you decide what domain of data you wish to study and design a data structure (often called a data mart) to store it. The data structure, called a star schema, is designed to speed queries and aid drill-down into the underlying data. Of course, populating the star schema requires pulling data from one or more data sources, and, as you might expect, the data is not always in the right format or consistent, so an initial stage called extraction, transformation, and loading (ETL) occurs to push the data into the star schema. Finally, the data isn't any use without the ability to look at it, and so a data representation tool is used to present data for assessment.
So, you might ask, how did O'Reilly construct these three pieces using all open source products?
For the ETL portion of the warehouse, Perl scripts are used to transform the data received from BookScan. Perl's excellent text processing capabilities makes it easy to modify the data to ensure consistent terminology, categorization, and so on. Because Perl is an easy-to-modify scripting language, it supports an interative ETL process, as the data is examined to ensure its cleanliness.
The star schema data mart itself uses MySQL as the data repository. Its ease of use and lack of licensing fee made it easy to set up the repository quickly and begin working with real data. One of the realities of data warehouses is that as you learn more about the way your business works, you want to modify or add new data elements to analyze. Because it was so easy to get started with MySQL, there was less pressure to "get it right" upfront. This enabled O'Reilly to tune the repository for more precise data as early results were analyzed.
Many data warehousing efforts stumble at the output stage. Once analytical data is available, there is often insatiable demand for new reports illustrating different data patterns and relationships. IT groups can't keep up with the demand, end users grow dissatisfied with the system, and the data warehouse falls into disuse. O'Reilly's team building the data warehouse did something very clever. They used Perl's template toolkit to parameterize the possible queries and generated html reports based on the toolkit's output. This makes it very easy to whip up new reports very quickly and ensure that analysts can look at interesting data relationships very quickly.
So, what is the bottom line in terms of O'Reilly's experience with their data warehouse?
In terms of cost, the decision to use all open source products makes perfect sense for them. The software components of the warehouse cost nothing. It runs on an inexpensive Intel box, so the hardware investment was quite low. O'Reilly employs Roger on staff, so the three or so months it took to construct the data warehous wasn't broken out as a separate expense. Many companies, however, would need to hire specialized talent on a consulting basis to create a warehouse for them, which might run as much as $100,000.
Overall, the company was able to create its warehouse for far less than the $250,000 to $300,000 that a data mart the size of O'Reilly's (around 10Gb) would cost. Their data warehouse effort cost around $75,000 versus the $300,000 a commercial version would have cost. So, O'Reilly saved about 75%. More to the point, O'Reilly's never could have afforded the investment required to fund a commercial product-based data warehouse, and, therefore, would never have been able to analyze the dynamics of their business.
So, what's O'Reilly's experience been with their data warehouse? It's still early in the process. The data warehouse has only been up for a few months, but decisions about what books to publish now have more concrete data about sales trends, up-and-coming areas of technology, and what types of books sell well in a given technology area. The folks at O'Reilly are satisfied with what they've gotten so far, according to Roger. They expect their data warehouse to be an important input to all future publishing decisions.
Open source brings data warehousing within reach for small and medium-sized organizations. Start to think about how your company could improve your decision-making and financial results through the use of an open source data warehouse.
Coming up in Part 2: So, open source was fine for a small data warehouse, but can it handle a large one?
About the author: Bernard Golden is CEO of Navica Inc., a systems integrator based in San Carlos, Calif. He is the author of Succeeding with Open Source (Addison-Wesley, August 2004) and the creator of the Open Source Maturity Model, a formalized method of locating, assessing and implementing open source software.