Hadoop MapReduce Architecture Overview

MapReduce Architecture


  • Each node is part of an HDFS cluster.
  • Input data is stored in HDFS, spread across the nodes and replicated.
  • The programmer submits a job (mapper, reducer, input) to the job tracker.
  • Job Tracker (master)

            Splits the input data

            Schedules and monitors the various map and reduce tasks

  • Task Tracker (slaves)

            Executes the map and reduce tasks

MapReduce Programming model:

  • A programming model designed by Google, with which a subset of distributed computing problems can be solved by writing simple programs.
  • It provides automatic data distribution and aggregation.
  • It is a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.
  • It partitions the input data and schedules execution across a set of machines.
  • It handles machine failures and manages inter-process communication.
  • Computation of a key-value pair from each piece of input.
  • Grouping of intermediate values by key.
  • Iteration over the resulting groups.
  • MapReduce works on a divide-and-conquer principle over the data.
  • The basic idea is to partition a large problem into smaller sub-problems.
  • These sub-problems can be tackled in parallel by different workers, and the intermediate results from the workers are combined to produce the final result.
  • MapReduce is executed in two main phases, called map and reduce.
  • Each phase is defined by a data processing function, and these functions are called map() and reduce().
  • In the map phase, MR takes the input data and feeds each data element to the mapper.
  • The reducer processes all output from the mapper and arrives at the final output.
  • In order for mapping, reducing, partitioning and shuffling to work together seamlessly, we need to agree on a common structure for the data being processed.
  • It should be flexible and powerful enough to handle most of the target data processing applications.
  • MapReduce uses lists and <key, value> pairs as its fundamental primitive data structures.
  • A MapReduce job can be launched with a single method call, submit(), or with waitForCompletion(), which blocks until the job finishes (see the driver sketch after this list).
  • If the property mapred.job.tracker is set to local, the job will run in a single JVM; when running on a cluster we can specify the job tracker's host and port number.
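
As a rough illustration of the points above, the driver below submits a word-count job with the new org.apache.hadoop.mapreduce API. It is a minimal sketch rather than the article's own example: the class name WordCountDriver and the input/output paths are placeholders, while TokenCounterMapper and IntSumReducer are stock library classes shipped with Hadoop.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenCounterMapper.class);   // emits (word, 1) for each token
        job.setReducerClass(IntSumReducer.class);       // sums the 1s per word
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // submit() returns immediately; waitForCompletion(true) blocks and reports progress
        boolean ok = job.waitForCompletion(true);
        System.exit(ok ? 0 : 1);
    }
}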

There are two versions of MapReduce:

  1. Classic MapReduce (MRv1)
  2. YARN (Yet Another Resource Negotiator), i.e. MapReduce 2

YARN (MapReduce 2):

  • If the cluster size reaches 4,000 nodes or more, there is a scalability bottleneck.
  • If the cluster grows beyond this, the job tracker cannot handle both job scheduling and task monitoring.
  • YARN separates these two roles into independent daemons.

Resource Manager: manages the use of resources across the cluster.

Application Master: manages the life cycle of applications running on the cluster.

Node Manager: runs on the cluster nodes and makes sure that an application does not use more resources than it has been allocated.

  • In contrast to the job tracker, in YARN there is a separate application master for every MapReduce job that runs.
  • If the application master fails, the resource manager stops receiving its heartbeat messages and starts a new instance of the master.
  • If the resource manager fails, the administrator starts a new instance of the resource manager, which recovers from the saved state.
  • An additional daemon in the YARN architecture is the history server. A sketch of how the execution framework is selected follows below.
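
A minimal sketch of how the execution framework is chosen, assuming the Hadoop 2.x property mapreduce.framework.name and the older MRv1 property mapred.job.tracker mentioned earlier; the class name FrameworkConfig and the host:port value are illustrative only.

import org.apache.hadoop.conf.Configuration;

public class FrameworkConfig {
    // Hadoop 2.x: run jobs on YARN ("local" = single JVM, "classic" = MRv1 job tracker)
    public static Configuration yarnConf() {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");
        return conf;
    }

    // MRv1: point at a job tracker host:port (placeholder below), or "local" for a single JVM
    public static Configuration mrv1Conf() {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "jobtracker.example.com:8021");
        return conf;
    }
}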


  • To serve as a mapper, a class implements the Mapper interface and extends the MapReduceBase class.
  • The MapReduceBase class is the base class for both mappers and reducers.

It includes two methods, which act like a constructor and a destructor:

  1. void configure(JobConf conf): in this method you can extract the parameters set either by the configuration XML files or by the main class of your application.
  2. void close(): as the last action before the map task terminates, this method wraps up any loose ends, such as closing database connections and open files.

  • The Mapper interface is responsible for the data processing step. It is parameterized as Mapper<K1, V1, K2, V2>, where the key classes implement the WritableComparable interface and the value classes implement the Writable interface. A single method, map(), processes the individual key-value pairs (see the sketch after this list).
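
A bare mapper skeleton illustrating this structure. It is a sketch assuming the classic org.apache.hadoop.mapred API with a (byte offset, line of text) input and a (Text, IntWritable) output; the class name MyMapper is illustrative only.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    public void configure(JobConf conf) {
        // inherited from MapReduceBase: read job parameters set in the driver or XML files
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // process one (offset, line) pair and emit zero or more intermediate pairs
        output.collect(new Text(value.toString()), new IntWritable(1));
    }

    public void close() throws IOException {
        // inherited from MapReduceBase: release any resources opened in configure()
    }
}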

Types of mappers (predefined in the org.apache.hadoop.mapred.lib package):

1) IdentityMapper: implements Mapper<K, V, K, V> and maps inputs directly to the outputs.

2) InverseMapper: implements Mapper<K, V, V, K> and reverses the key-value pair.

3) RegexMapper: implements Mapper<K, Text, Text, LongWritable> and generates a (match, 1) pair for every regular expression match.

4) TokenCountMapper: implements Mapper<K, Text, Text, LongWritable> and generates a (token, 1) pair when the input value is tokenized.


  • As with any mapper implementation, a reducer must first extend MapReduceBase to allow configuration and clean-up.
  • In addition, it must implement the Reducer interface. The reducer task receives the output from the various mappers.
  • It sorts the incoming data on the key of the (key, value) pairs and groups together all values of the same key; reduce() is then called, and it generates output pairs by iterating over the values associated with a given key.
  • The OutputCollector receives the output of the reduce process and writes it to an output file.
  • The Reporter provides an option to record extra information about the reducer and the task progress.

Types of Reducers (predefined in the org.apache.hadoop.mapred.lib package):

1) IdentityReducer: implements Reducer<K, V, K, V> and maps inputs directly to the outputs.

2) LongSumReducer: implements Reducer<K, LongWritable, K, LongWritable> and sums up all the LongWritable values for the given key. Together with the library mappers above, it can be combined into a complete job without writing any mapper or reducer code, as the sketch below shows.
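
A sketch of a complete word-count job built only from these library classes, using the classic JobConf/JobClient API; the class name LibraryWordCount and the command-line paths are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class LibraryWordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LibraryWordCount.class);
        conf.setJobName("library word count");
        conf.setMapperClass(TokenCountMapper.class);   // emits (token, 1) per word
        conf.setReducerClass(LongSumReducer.class);    // sums the counts per token
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}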

Example: Word count using MapReduce

Mapper example:

Input: <key: byte offset, value: line of a document>

Output: for each word w in the input line, output <key: w, value: 1>

Input: (2133, The quick brown fox jumps over the lazy dog)

Output: (the,1), (quick,1), (brown,1), (fox,1), ... , (the,1), (lazy,1), (dog,1)

Reducer example:

Input: <key: word; value: list<integer>>

Output: sum all values from the input list for the given key and output <key: word, value: count>

Input: (the,[1,1,1,1]), (for,[1,1,1]), ...

Output: (the,4), (for,3), ...

A complete mapper and reducer for this example is sketched below.
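
A hand-written mapper and reducer matching this example, again as a sketch on the classic org.apache.hadoop.mapred API; the class names WordCountMapper and WordCountReducer are illustrative.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    // Mapper: (offset, line) -> (word, 1) for each word in the line
    public static class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            for (String token : value.toString().toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    output.collect(word, one);   // e.g. (the, 1)
                }
            }
        }
    }

    // Reducer: (word, [1,1,...]) -> (word, count)
    public static class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();      // sum all the 1s for this word
            }
            output.collect(key, new IntWritable(sum));   // e.g. (the, 4)
        }
    }
}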




  • The partitioner partitions the key space.
  • The partitioner controls the partitioning of the keys of the intermediate map outputs.
  • The key is used to derive the partition, typically by a hash function.
  • The total number of partitions is the same as the number of reduce tasks for the job.
  • Hence this controls to which of the reduce tasks an intermediate key (and its values) is sent for reduction.

HashPartitioner: It is the default partitioner.

  • If we do not specify any partitioner in our MapReduce program, the HashPartitioner automatically performs the partitioning, i.e. all the values associated with a particular key are sent to a single reducer.
  • There is no duplication of a key across the part-r-* output files.

Custom Partitioner:

  • A custom partitioner only needs to implement two functions: configure() and getPartition().
  • The former uses the Hadoop job configuration to configure the partitioner, and the latter returns an integer between 0 and the number of reduce tasks, indexing the reduce task to which the (key, value) pair will be sent.
  • Between the map and reduce stages, a MapReduce application takes the output of the mapper and distributes the results among the reduce tasks.
  • This process is called shuffling, because the output of a mapper on a single node may be sent to reducers across many nodes.


import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;
public class MyPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf conf) { }   // no job parameters needed here
    public int getPartition(Text key, Text value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;   // non-negative index in [0, numPartitions)
    }
}
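
In the driver, the custom partitioner is wired in roughly as follows (a fragment, assuming a classic JobConf named conf; the reducer count of 4 is illustrative):

conf.setPartitionerClass(MyPartitioner.class);   // replace the default HashPartitioner
conf.setNumReduceTasks(4);                       // the number of partitions equals the number of reduce tasks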


  • The combiner functionality is executed by the MapReduce framework.
  • The combiner reduces the amount of intermediate data before it is sent to the reducers.
  • The combiner is invoked on the map side: when the number of spill files to be merged is 3 or greater (the default threshold), the combiner applies the reducer functionality again, and it is executed on that single node.
  • The combiner reduces the amount of data transferred from one node to another.
  • In this case the reducer functionality is executed twice: once in the combiner on the map side and once in the reduce phase proper.
  • The output of the mapper functionality here is about 16 MB, i.e. a single spill.

Example:

public class MyCombiner extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {   // key/value types must match the map output key/value types
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // your combining logic goes here
    }
}
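
The combiner is registered in the driver roughly as follows (a fragment, assuming a classic JobConf named conf):

conf.setCombinerClass(MyCombiner.class);   // apply MyCombiner to the map output before it is shuffled to the reducers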

Record Reader and Record Writer:

  • The RecordReader reads the data from the input splits and presents it to the mapper as key-value pairs.
  • The RecordWriter writes the reducer output data to an output file (see the fragment below).
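
Which RecordReader and RecordWriter are used is decided by the input and output formats chosen in the driver; a fragment, assuming a classic JobConf named conf:

conf.setInputFormat(TextInputFormat.class);     // its RecordReader yields (byte offset, line) pairs
conf.setOutputFormat(TextOutputFormat.class);   // its RecordWriter writes one "key<TAB>value" line per record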



