While participating in a panel at last year's Techonomy conference, Google CEO Eric Schmidt cited an astonishing fact, noting that we are creating as much information every two days as was created in the entire history of mankind through 2003. This deluge of information has ushered in a series of technological breakthroughs that allows organizations to grapple with data stores stretching into the hundreds of gigabytes and even petabytes. Google's contributions in this area are particularly noteworthy, including its work on MapReduce, an approach to large-scale distributed data processing of the kind Google employs to tally all occurrences of a keyword or phrase located within its trove of indexed resources (mapping the data), and then return the tally and list of locations to the user (reducing the mapped data into a cohesive result). Mapping and reduction operations can be applied to any task responsible for analyzing large amounts of complex data spread across multiple sources, including pattern recognition, graph analysis, risk assessment, and prediction models.
While Google's MapReduce implementation is proprietary, several open source implementations of the MapReduce concept exist, including Apache Hadoop. In fact, Hadoop has become the de facto solution for distributed data processing, with dozens of global organizations heavily invested in the project from both implementational and developmental standpoints. Among many others, Adobe, Amazon, AOL, Baidu, EBay, Facebook, Hulu, IBM, Last.fm, LinkedIn, Ning, Twitter, and Yahoo! count themselves as users. Adoption isn't limited to Internet heavyweights, with numerous universities, hospitals and research centers identifying themselves as users.
Introducing the Hadoop Project
Like many projects fostered by the Apache Software Foundation (ASF), Hadoop is an umbrella term assigned to the foundation's comprehensive effort to produce "open source software for reliable, scalable, distributed computing.” Currently four subprojects comprise the effort, including:
- Hadoop Common: Hadoop Common forms the core of the Hadoop project, providing the "plumbing" required by the following sibling projects.
- HDFS: The Hadoop Distributed File System (HDFS) is the storage system responsible for replicating and distributing the data throughout the computing cluster.
- MapReduce: MapReduce is the software framework developers use to write the applications which process the data stored throughout the HDFS.
- ZooKeeper: ZooKeeper is responsible for coordinating the configuration data, process synchronization, and other network-related services which all distributed applications require in order to effectively operate.
Therefore while you will indeed download Hadoop in the form of a single archive file, keep in mind that what you are downloading is actually all four subprojects that work in concert to implement the mapping and reduction process.
Experimenting with Hadoop
Despite the complex nature of the problems Hadoop strives to solve, getting started using the project is surprisingly easy. As an example I thought it would be interesting to use Hadoop to perform a word frequency analysis of my book Easy PayPal with PHP. The task will sift through the entire book (approximately 130 pages in length), and produce a grouped list of all words which appear in the book, accompanied by the frequency in which each word appears.
After installing Hadoop, I converted my book from the PDF to a text file using Calibre. The Hadoop wiki also contains similar instructions however the former resource contains slightly updated instructions due to relatively recent changes to the Hadoop configuration process.
Next I copied the book from a temporary location to the Hadoop distributed file system using the following command:
$ ./bin/hadoop dfs -copyFromLocal /tmp/easypaypalwithphp/ easypaypalwithphp
You can confirm the copy was successful using the following command:
$ ./bin/hadoop dfs -ls
drwxr-xr-x - hadoop supergroup 0 2011-01-04 12:48 /user/hadoop/easypaypalwithphp
Next, use the example WordCount script packaged with Hadoop to perform the word frequency analysis:
$ ./bin/hadoop jar hadoop-mapred-examples-0.21.0.jar wordcount \
> easypaypalwithphp easypaypalwithphp-output ...
11/01/04 12:51:38 INFO mapreduce.Job: map 0% reduce 0%
11/01/04 12:51:48 INFO mapreduce.Job: map 100% reduce 0%
11/01/04 12:51:57 INFO mapreduce.Job: map 100% reduce 100%
11/01/04 12:51:59 INFO mapreduce.Job: Job complete: job_201101041237_0002
11/01/04 12:51:59 INFO mapreduce.Job: Counters: 33
Finally, you can review the output contents using the following command:
$ ./bin/hadoop dfs -cat easypaypalwithphp-output/part-r-00000
The example WordCount frequency analysis script is pretty rudimentary, assigning equal weight to every single string in the book text, including code. However it would be fairly trivial to revise the script in order to parse for instance DocBook-formatted files and ignore the code. In any case, consider a situation where you wanted to create a service such as the Google Books Ngram Viewer, which sifts more than 5.2 million books for keyword phrases.
Learn More About Hadoop
More about Hadoop can be found on the following online resources:
- The Hadoop Wiki: The wiki contains dozens of links pointing to tutorials, documentation, user group and conference information, and much more
- Follow Hadoop on Twitter @hadoop: A resource for all of the latest Hadoop news
- Ten Common Hadoop-able Problems: Wondering what sorts of problems could be tackled with Hadoop? Check out this great presentation by Cloudera Chief Scientist Jeff Hammerbacher.
- Hadoop Training Videos: Hadoop training and services provider Cloudera currently offers 13 videos introducing various aspects of Hadoop, MapReduce, and related Hadoop projects.
About the author
Jason Gilmore is founder of the publishing, training, and consulting firm WJGilmore.com. He is the author of several popular books Easy PHP Websites with the Zend Framework, Easy PayPal with PHP, and Beginning PHP and MySQL, Fourth Edition. Follow him on Twitter at @wjgilmore.