Now, O'Reilly's data warehouse enabled them to better understand marketplace trends – hot topics, seasonality of purchasing, types of books in demand, as well as a number of other factors. Indeed, one of the lessons that Sebastopol, Calif.-based O'Reilly learned is that once you begin to dive into your data warehouse, you discover new factors you want to analyze; therefore, it's important to recognize that a data warehouse is likely to evolve over time.
O'Reilly tracks a universe of less than 50,000 books, their business intelligence system is small potatoes in the world of data warehousing – less than 10Gb. Looking at that company's open source approach to data warehousing, the typical question would be: "Can it scale?" To answer that question, let's look at an open source data warehousing project at the opposite end of the spectrum, size-wise – Sabre Holdings' 1.2 Tb flight leg analysis system.
Sabre Holdings is the parent company of Travelocity, one of the major travel booking Internet sites. The number of transactions that go through Travelocity each day is staggering – about 250 per second, each second of the day. But the number of transactions is far outweighed by the number of queries – over 1,000 per second. Each query requires a huge number of data elements to be considered: airlines, type of flight, class, segments (i.e., flight legs), prices, and so on. In fact, a given query can require searching through 3 billion elements.
Travelocity's goal is to be the fastest, most effective, and cheapest place to purchase an airline ticket on the Internet. To reach that goal, Travelocity constantly assesses how people are using their system – which of the many options presented to a visitor he or she selects. Travelocity wants to understand how people interact with the system so that it can more effectively present information and make it easier for users to find the information they're looking for – and then purchase a ticket, which is where the company makes money.
To help understand user behavior better, Travelocity has built a data warehouse that captures every flight presented to every user, the choices they make to navigate through the screens, and the ultimate flight purchased (if any). With this data, analysts can sift through it and look for patterns, compare navigation paths, pricing responses, and a myriad of other factors.
Given the number of queries and transactions outlined earlier, you might imagine that the data warehouse is big; it is – mighty big. Each month about 100 Gb of Travelocity data is captured for analysis. Sabre keeps a year's worth of data on hand for a total of 1.2Tb, made up of 3.8 billion rows of individual data elements.
Sabre's data warehouse is set up differently from O'Reilly's. O'Reilly implemented a classic three-tier data warehouse: Extraction, Transformation and Loading (ETL) performed via Perl scripts, a star schema data mart based on MySQL, and reports generated via the Perl Template Toolkit. Travelocity's open source data warehouse loads data from a smaller commercial software-based data warehouse. Therefore, the ETL stage has already been performed prior to the data being loaded into the first data warehouse.
Sabre also eschewed a star schema arrangement for the data. Star schemas make it easier to "drill down" into interesting data, but achieve that at the expense of query flexibility, especially the ability to extend queries to new data elements. Since Sabre isn't sure of what analytical connections it will make as it examines the data, a standard table layout has been used. However, Sabre is using MySQL for its data mart, just like O'Reilly.
Sabre's choice for output differs from O'Reilly as well. Rather than using the Template Toolkit, Sabre is using a commercial statistical analysis package from SAS; this is by far the most expensive piece of the warehouse. While Sabre could not reveal their cost for the package, due to competitive reasons, the list price for the SAS products they use could easily run to around $100,000.
Clearly, the key question regarding Sabre's huge MySQL data warehouse is whether the database software can scale to handle such a large collection of data. Mike Benzinger, senior principal of the Sabre Research Group, indicates that there have been no problems at all with the response times of the data warehouse. This makes sense when you consider that MySQL's heritage: serving as the data repository for dynamic websites that deliver lots of data to multitudes of users. This quality is what makes MySQL widely used in heavily-trafficked sites like Yahoo and Slashdot and serves it well as a repository for a large data mart.
With 2003 sales of over $2 billion, Sabre clearly is not too small to afford commercial data warehousing software, like our other example, O'Reilly Media. However, you may have noticed that the travel industry is in poor shape these days. The woes of the airlines are well known, but the problems extend to every segment of the industry. All travel-oriented companies operate on razor-thin margins these days, so even a company of Sabre's size needs to pinch pennies.
Consequently, Sabre was extremely motivated to avoid the $500,000 cost of a 1+ Tb commercial data warehousing product. Its total expense to build its MySQL-based data warehouse ran far less, and here are some of the product prices involved. (These are estimates based on manufacturer list prices; Sabre won't release prices for competitive reasons, and actual prices will vary according to contract negotiations.)
- 4-way Opteron server running Red Hat: $25,000
- RAID Sata disk array: $20,000
- SAS Package: $100,000
The total comes to $145,000. So, the MySQL-based data warehouse is far cheaper -- on the order of 70% less -- than an alternative commercial offering.
While Sabre is much, much larger than O'Reilly, they share a common problem: an inability to afford a data warehouse built on a commercial product. In O'Reilly's case, it's a matter of the company being too small to afford even the minimum price point of the commercial alternatives. For Sabre, budget pressures forced them to build their analytical system on open source.
These two examples highlight one of the huge benefits of open source: making systems affordable for organizations that, for whatever reason, cannot handle the cost of a commercial system.
O'Reilly's example offers a salutary lesson. Don't imagine that you're not big enough to make use of analytical tools. If O'Reilly can apply insights from a data warehouse to their business, so can you.
Sabre offers a different lesson. Open source provides a way to implement important systems even in tough times. Don't overlook creative ways to get the job done. Operational improvements made during tough times will make you company stronger when things turn up, but sitting on your hands because of budget pressures will consign your company to permanent tough times.
About the author: Bernard Golden is CEO of Navica Inc., a systems integrator based in San Carlos, Calif. He is the author of Succeeding with Open Source (Addison-Wesley, August 2004) and the creator of the Open Source Maturity Model, a formalized method of locating, assessing and implementing open source software.