Abstract of MapReduce
Organizations collect many types of data about the processes they support: marketing, operational, activity logging, etc. For example, “click stream” and log data provide a record of the end user’s activity during past visits to a web site. Shopping cart data provides information on what items a customer intends to purchase, and checkout data records the items that were eventually purchased. Companies like Amazon.com and Netflix provide examples of how information is used by organizations to enhance the end users experience. In the Amazon.com retail web portal, a person can see product recommendations for products based on users with similar purchasing habits, items that were eventually purchased after a product is viewed, and additional typical items purchased, with the product being viewed. In Netflix portal, movie recommendations are provided to users, based on sophisticated analytical models such as k-nearest-neighbor, attempting to find titles of interest based on other customers with similar interests. The ability to find customer insights in data provides an advantage to organizations. As a result, more organizations are capturing and storing large amounts of data. Organizations need an effective method to analyze this data, which in many cases may be distributed across multiple compute nodes.
“MapReduce” is a programming model which was created specifically for processing large data sets on multi-node hardware efficiently using distributed processes. The MapReduce programming model exposes two interfaces which the software engineer must implement: map() and reduce(). The map function processes simple key value data structures, while a reduce function merges all of the intermediate values associated with the same intermediate key (Dean & Ghemawat, 2004).
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.
Companies are using MapReduce to analyze terabytes of distributed data on inexpensive multiple node utility compute environments. Systems that implement MapReduce provide a simple programming paradigm, while abstracting from the programmer the complexity of writing code for parallelization, fault-tolerance, data distribution, load balancing and coordination of distributed processes. Programmers who have little understanding of distributed processing and parallelization are able to leverage MapReduce systems, such as Apache Hadoop, an open source java project.
This research discusses the MapReduce programming model and its benefits to organizations in analyzing extremely large and distributed datasets. MapReduce programming model is reviewed, including the motivation, programming model, architecture, and some of the popular implementations. Finally, the research will discuss several interesting case studies of MapReduce for processing large distributed data. This research is intended for audiences who have knowledge of computer science and parallelism, but who do not have specific knowledge of MapReduce technologies.
Organizations today are in a position to collect millions of details about factors that influence their mission, including customer data, operational data, external data, etc. Electronic channels of communication provide an easy way for interactions to be logged for further analysis. For eCommerce, search engines, social media, and many other large scale systems, the amount of data that firms can collect has steadily increased from gigabytes, to terabytes, and again to petabytes. Amazon.com, the world’s largest online retailer, maintains extensive data on its 59 million active customers to the tune of 42 terabytes (Focus). Google, the search industry king, stores detailed data from about 91 million searches per day, or roughly 50% of all the searches across the internet. To provide personalized and relevant searches, it stores virtual profiles of countless number of users. The National Energy Research Scientific Computing Center (NERSC) in Oakland California stores information about a variety of scientific subjects such as atomic energy research, physics, etc. The NERSC database is the 2nd largest in the world, with 2.8 petabytes (approximately 2800 terabytes) worth of data. To put the NERSC data into perspective, it is believed that the total amount of words spoken throughout the history of humanity is about 5 extabytes. In relative terms, the NERSC database is equivalent to .055% of the size of that figure.
The industry forces that enable organizations to efficiently store these astonishingly large amounts of data primarily are influenced by the emergence of utility computing and distributed processes which are optimized for these environments. Utility computing, the packaging of computing resources such as computation, memory, storage and services as metered services, has emerged as a cost effective way to organize server farms. For the last several years, the information technology industry have steadily migrated from large, centralized compute platforms such as mainframes, to clusters of distributed smaller and less expensive compute nodes. While these utility compute environments greatly reduce the cost of storing large volumes of data, distributed processing is essential to processe the large amounts of data in a timely manner (Kyong-Ha Lee, 2011). MapReduce, popularized by Dean and Ghemawat of Google in their 2004 article, ”MapReduce: Simplified Data Processing on Large Clusters”, is one of the emerging programming models that enables this.
The MapReduce model has been implemented in a several projects, the most widespread of which is Apache Hadoop. Hadoop is Apache’s open source java based framework, which implements both MapReduce pattern, as well as a number of other features that this research will summarize: Hadoop Distributed File System (HDFS), Hive, Pig, and HBase. Hadoop MapReduce is also supported on other utility compute environments such as Amazon S3. Google developed a proprietary platform Google File System (GSF) which supports MapReduce.