History and Advantages of Hadoop MapReduce Programming
MapReduce was first popularized as a programming model in 2004 by Jeffrey Dean and Sanjay Ghemawat of Google (Dean & Ghemawat, 2004). In their paper, “MapReduce: Simplified Data Processing on Large Clusters,” they discussed Google’s approach to collecting and analyzing website data for search optimization. Google’s proprietary MapReduce system ran on the Google File System (GFS). Apache, the open source organization, began using MapReduce in the “Nutch” project, an open source web search engine that is still active today. Hadoop began as a subproject of the Apache Lucene project, which provides text search capabilities across large databases. In 2006, Doug Cutting, an employee of Yahoo!, designed Hadoop, naming it after his son’s toy elephant. Although it was originally a subproject, Cutting released Hadoop as an open source Apache project in 2007 (Hadoop, 2011). In 2008, Hadoop became a top-level project at Apache. In July 2008, an experimental 4,000-node cluster was created using Hadoop, and in 2009, during a performance test, Hadoop sorted a petabyte of data in roughly 17 hours.
In the eCommerce industry today, several of the major players use Hadoop for high-volume data processing (Borthakur, 2009). Amazon, Yahoo!, and Zvents use Hadoop for search processing. It helps them determine search intent from the text entered and optimize future searches using statistical analysis. Facebook, Yahoo!, ContextWeb, Joost, and Last.fm use Hadoop to process logs and mine click stream data. Click stream data records an end user’s activity and profitability on a web site. Facebook and AOL use Hadoop in their data warehouses to store and mine the large amounts of data they collect. The New York Times and Eyealike use Hadoop to store and analyze videos and images.
You don’t need a large investment in infrastructure to use MapReduce. Hadoop is also available from many public cloud providers in a pay-per-use model. Amazon offers the Elastic MapReduce service. Google and IBM offer MapReduce in IBM Blue Cloud.
According to a recent study, Hadoop is slower than two state-of-the-art parallel database systems in performing a variety of analytical tasks by a factor of 3.1 to 6.1 (Pavlo, 2009). Where MapReduce excels, however, is in its support for elastic scalability, i.e., the allocation of more compute nodes from the cloud to speed up computation. Dean and Ghemawat (2004) published initial performance results for an implementation of MapReduce on the Google File System (a proprietary distributed file system at Google). Approximately 1,800 machines participated in this cluster, each with 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, and 320 GB of disk. They ran two benchmarks: a distributed grep that searched for a rare string across 1 terabyte of data, and a sort of 1 terabyte of data. The grep completed in 150 seconds, while the sort completed in roughly 1,000 seconds. RDBMS technologies would require structured data and massively parallel storage in order to match these results.
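To put these timings in perspective, the short calculation below converts them into average throughput figures. It is only a back-of-the-envelope sketch: it assumes a decimal terabyte (10^12 bytes), takes the machine count as exactly 1,800 and the sort time as exactly 1,000 seconds, and averages over the whole run, including startup time. The function name is illustrative.

```python
# Rough throughput arithmetic for the grep and sort benchmarks quoted above.
# Assumptions (not from the paper): 1 TB = 10**12 bytes, exactly 1800 machines,
# and the approximate sort time taken as 1000 seconds.

TERABYTE = 10**12   # bytes, decimal definition
MACHINES = 1800


def throughput(total_bytes, seconds, machines):
    """Return (aggregate bytes/s, bytes/s per machine), averaged over the run."""
    aggregate = total_bytes / seconds
    return aggregate, aggregate / machines


grep_total, grep_per_machine = throughput(TERABYTE, 150, MACHINES)
sort_total, sort_per_machine = throughput(TERABYTE, 1000, MACHINES)

print(f"grep: {grep_total / 1e9:.2f} GB/s total, {grep_per_machine / 1e6:.2f} MB/s per machine")
print(f"sort: {sort_total / 1e9:.2f} GB/s total, {sort_per_machine / 1e6:.2f} MB/s per machine")
```

The per-machine rates are modest. The aggregate rate is high because the work is spread across many nodes, which is the point of the scalability argument.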
MapReduce has many perceived advantages over DBMS parallelism. Many of these advantages are still being debated in the industry.
Simple Coding Model: With MapReduce, the programmer does not have to implement parallelism, distributed data passing, or any of the other complexities they would otherwise face. The programmer supplies a map function and a reduce function, and the framework handles the rest. This greatly simplifies the coding task and reduces the amount of time required to create analytical routines.
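As an illustration of how little the programmer has to write, here is a minimal word count sketch in plain Python. It does not use the Hadoop API. The run_job function is a hypothetical, single-process stand-in for the framework, included only so the example runs. In a real cluster, the grouping step it performs would be distributed across machines. Only map_words and reduce_counts correspond to code the programmer would supply.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple


def map_words(_key: str, line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_job(records, mapper, reducer):
    """Hypothetical stand-in for the framework: map, group by key, reduce.

    A real MapReduce system would run the mapper and reducer on many
    machines; here everything happens in one process.
    """
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))


# Input records are (key, value) pairs; the key here is just a line label.
lines = [("doc1:0", "the quick brown fox"), ("doc1:1", "the lazy dog")]
word_counts = run_job(lines, map_words, reduce_counts)
print(word_counts)
```

Neither user-supplied function mentions threads, sockets, or machines. That separation is what the simple coding model refers to.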
Scalable: Probably the biggest advantage of MapReduce is its high scalability. It has been reported that Hadoop can scale across thousands of nodes (Anand, 2008). The ability to scale horizontally to such a large degree can be attributed to the combination of distributed file systems and the philosophy of running the worker processes near the data, rather than attempting to move the data to the processes.
Supports Unstructured Data: Unstructured data is data that does not follow a specified format. If 20 percent of the data available to enterprises is structured, the other 80 percent is unstructured, so unstructured data is most of the data that you will encounter. Until recently, however, the technology did not support doing much with it beyond storing it or analyzing it manually.
Because MapReduce processes simple key-value pairs, it can support any type of data that fits into this model. This includes images, metadata, large files, etc. Dealing with irregular data is much easier for the programmer in MapReduce than in a DBMS.
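The sketch below shows one way a map function might turn irregular input into key-value pairs. The record formats (JSON objects and space-separated name=value fields), the "user" field, and the function name map_record are all assumptions made up for this example, not part of any Hadoop API. Records that cannot be interpreted are skipped rather than rejected by a schema.

```python
import json
from collections import Counter


def map_record(raw):
    """Emit (user, 1) from a record whose format is not known in advance."""
    raw = raw.strip()
    if raw.startswith("{"):
        # Looks like JSON; a truncated or malformed object is silently skipped.
        try:
            user = json.loads(raw).get("user")
        except json.JSONDecodeError:
            return
    else:
        # Otherwise treat the line as space-separated name=value fields.
        fields = dict(part.split("=", 1) for part in raw.split() if "=" in part)
        user = fields.get("user")
    if user:
        yield user, 1


records = [
    '{"user": "alice", "page": "/home"}',
    "user=bob action=click",
    "corrupted ### line",
    '{"user": "alice"',          # truncated JSON
    "ts=1 user=alice",
]

# Stand-in for the reduce step: add up the values emitted for each key.
clicks = Counter()
for record in records:
    for user, count in map_record(record):
        clicks[user] += count

print(dict(clicks))
```

No table definition or load step is needed before the data can be processed. The map function decides, record by record, what the key and value are.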
Fault Tolerance: Because of its highly distributed nature, MapReduce is very fault tolerant. Typically, the distributed file systems that MapReduce runs on, along with the master process, enable MapReduce jobs to survive hardware failures (Dean & Ghemawat, 2004).
As a computer system scales, it consists of more hardware and more software, and that means there is more to go wrong. On the positive side, it also means that there is more opportunity to automatically overcome a single point of failure.
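A much simplified sketch of this idea follows: when the machine running a task fails, the same task is run again on another machine. This is not Hadoop’s scheduler. The worker names, the DEAD_WORKERS set, and both function names are hypothetical, and the failure is simulated so the example is deterministic.

```python
DEAD_WORKERS = {"node-1"}  # simulated hardware failure


def count_lines(worker, split):
    """A toy task: count the lines in one input split on the given worker."""
    if worker in DEAD_WORKERS:
        raise RuntimeError("worker unreachable")
    return len(split)


def run_with_retries(task, split, workers, max_attempts=3):
    """Try the task on successive workers until one of them succeeds."""
    errors = []
    for worker in workers[:max_attempts]:
        try:
            return worker, task(worker, split)
        except RuntimeError as exc:
            # Record the failure and fall through to the next worker.
            errors.append(f"{worker}: {exc}")
    raise RuntimeError("all attempts failed: " + "; ".join(errors))


worker_used, result = run_with_retries(
    count_lines, ["a", "b", "c"], ["node-1", "node-2", "node-3"]
)
print(worker_used, result)
```

Re-running works here because the task depends only on its input split. The sketch also assumes the split is still readable after the first worker fails, which in practice is the job of the distributed file system.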
Consider a computer system that is distributed across multiple machines connected by a local area network, such as Ethernet. Each machine processes data and produces a subset of the results, working in conjunction with the others toward a common end goal. Passing data and messages between the machines is critical to the successful and efficient functioning of the system.