Hadoop MapReduce Architecture Overview

MapReduce Architecture

  • Each node is part of an HDFS cluster.
  • Input data is stored in HDFS, spread across the nodes and replicated.
  • The programmer submits a job (mapper, reducer, input) to the JobTracker.
  • JobTracker (master):

          Creates a map task for each split of the input data

          Schedules and monitors the various map and reduce tasks

  • TaskTrackers (slaves):

          Execute the map and reduce tasks

MapReduce Programming Model:

  • MapReduce is a programming model designed by Google, with which a subset of distributed computing problems can be solved by writing simple programs.
  • It provides automatic data distribution and aggregation.
  • It offers a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.
  • It partitions the input data and schedules execution across a set of machines.
  • It handles machine failures and manages inter-process communication.
  • It computes key-value pairs from each piece of input.
  • It groups the intermediate values by key.
  • It iterates over the resulting groups.
  • MapReduce applies a divide-and-conquer approach to the data.
  • The basic idea is to partition a large problem into smaller sub-problems.
  • These can be tackled in parallel by different workers, and the intermediate results from the workers are combined to produce the final result.
  • MapReduce is executed in two main phases, called map and reduce.
  • Each phase is defined by a data processing function; these functions are called map() and reduce().
  • In the map phase, MapReduce takes the input data and feeds each data element to the mapper.
  • The reducer processes all the output from the mappers and arrives at the final output.
  • For mapping, reducing, partitioning and shuffling to work together seamlessly, we need to agree on a common structure for the data being processed.
  • That structure should be flexible and powerful enough to handle most of the targeted data processing applications.
  • MapReduce uses lists and (key, value) pairs as its fundamental data primitives.
  • A MapReduce job can be run with a single method call: submit() or waitForCompletion().
  • If the property mapred.job.tracker is set to local, the job runs in a single JVM. To run on a cluster, set it to the host and port number of the JobTracker.
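To make the three steps (compute key-value pairs, group by key, iterate over the groups) concrete, here is a minimal single-JVM sketch in plain Java. It does not use Hadoop's API: the class name MiniMapReduce, the run() helper and the "sensor,reading" records are all made up for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical single-JVM illustration of the model; this is not Hadoop's API.
class MiniMapReduce {

    // Map: feed every input element to the mapper and collect (key, value) pairs.
    // Group: gather the values of each key into a list (TreeMap keeps keys sorted).
    // Reduce: iterate over each group and produce one result per key.
    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> inputs,
            Function<I, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, R> reducer) {
        Map<K, List<V>> groups = new TreeMap<>();
        for (I input : inputs) {
            for (Map.Entry<K, V> pair : mapper.apply(input)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }
        Map<K, R> results = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            results.put(group.getKey(), reducer.apply(group.getKey(), group.getValue()));
        }
        return results;
    }

    // Made-up example job: highest reading per sensor from "sensor,reading" records.
    static Map<String, Integer> maxPerSensor(List<String> records) {
        return MiniMapReduce.<String, String, Integer, Integer>run(
                records,
                record -> {
                    String[] fields = record.split(",");
                    List<Map.Entry<String, Integer>> out = new ArrayList<>();
                    out.add(new SimpleEntry<>(fields[0], Integer.parseInt(fields[1])));
                    return out;
                },
                (sensor, readings) -> {
                    int max = Integer.MIN_VALUE;
                    for (int reading : readings) {
                        max = Math.max(max, reading);
                    }
                    return max;
                });
    }

    public static void main(String[] args) {
        // Prints {a=9, b=7}
        System.out.println(maxPerSensor(Arrays.asList("a,3", "b,7", "a,9", "b,2")));
    }
}
```

The mapper and reducer only ever see lists and pairs, which is why the same run() helper could serve a different job just by swapping the two functions. A real cluster adds what this sketch leaves out: splitting the input, scheduling, and recovery from machine failure.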

There are two versions of MapReduce:

1. Classic MapReduce, or MRv1
2. YARN (Yet Another Resource Negotiator), or MRv2

YARN (MapReduce 2):-

  • Classic MapReduce runs into a scalability bottleneck once the cluster size reaches 4,000 nodes or more.
  • On clusters that large, the JobTracker cannot keep up with both job scheduling and task monitoring.
  • YARN separates these two roles into two independent daemons.

Resource Manager:- Manages the use of resources across the cluster.

Application Master:- Manages the life cycle of an application running on the cluster.

Node Manager:- Runs on each cluster node and makes sure that an application does not use more resources than it has been allocated.

  • In contrast to the JobTracker, YARN has a separate Application Master for every MapReduce job that runs.
  • If the Application Master fails, the Resource Manager stops receiving its heartbeat messages and starts a new instance of the master.
  • If the Resource Manager fails, the administrator starts a new instance of the Resource Manager, which recovers from the saved state.
  • The YARN architecture has one additional daemon: the history server.

Mapper:-

  • To serve as a mapper, a class implements the Mapper interface and inherits from the MapReduceBase class.
  • The MapReduceBase class is the base class for both mappers and reducers.

It includes two methods that effectively act as the constructor and destructor for the class:

1. Constructor: void configure(JobConf job)
2. Destructor: void close()

  • In configure(), you can extract the parameters set either by the XML configuration files or in the main class of your application. It is called before any data processing begins.
  • close() is called last, before the task terminates. It wraps up any loose ends, e.g. closing database connections, open files and so on.
  • The Mapper interface is responsible for the data processing step. Its key classes and value classes implement the WritableComparable and Writable interfaces respectively, and its single method, map(), processes an individual key-value pair.

Types of mapper:

1) IdentityMapper: implements Mapper and maps inputs directly to outputs.

2) InverseMapper: implements Mapper and reverses the key-value pair.

3) RegexMapper: implements Mapper and generates a (match, 1) pair for every regular expression match.

4) TokenCountMapper: implements Mapper and generates a (token, 1) pair for each token when the input is tokenized.
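The behaviour of these four mappers can be sketched with plain Java functions. The snippet below does not use Hadoop's own classes; MapperKinds and its method names are hypothetical stand-ins, and each function simply returns the list of pairs that such a mapper would emit for one input record.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java stand-ins for the four mapper behaviours; not Hadoop's own classes.
class MapperKinds {

    // Identity: the input pair is emitted unchanged.
    static <K, V> List<Map.Entry<K, V>> identity(K key, V value) {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        out.add(new SimpleEntry<>(key, value));
        return out;
    }

    // Inverse: key and value swap places.
    static <K, V> List<Map.Entry<V, K>> inverse(K key, V value) {
        List<Map.Entry<V, K>> out = new ArrayList<>();
        out.add(new SimpleEntry<>(value, key));
        return out;
    }

    // Regex: one (match, 1) pair for every match of the pattern in the line.
    static List<Map.Entry<String, Long>> regex(String pattern, String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        Matcher matcher = Pattern.compile(pattern).matcher(line);
        while (matcher.find()) {
            out.add(new SimpleEntry<>(matcher.group(), 1L));
        }
        return out;
    }

    // Token count: one (token, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Long>> tokenCount(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokens.nextToken(), 1L));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(identity("k", 5));              // [k=5]
        System.out.println(inverse("k", 5));               // [5=k]
        System.out.println(regex("[0-9]+", "a1 b22 c"));   // [1=1, 22=1]
        System.out.println(tokenCount("the quick the"));   // [the=1, quick=1, the=1]
    }
}
```

Note that the token count function emits (the, 1) twice rather than (the, 2): adding up the ones is left to the reducer.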

Reducer:-

  • As with any mapper implementation, a reducer must first extend the MapReduceBase class to allow for configuration and cleanup.
  • In addition, it must implement the Reducer interface.
  • When the reducer task receives the output from the various mappers, it sorts the incoming data on the key of each (key, value) pair and groups together all values of the same key. The reduce() function is then called, and it generates pairs by iterating over the values associated with a given key.
  • The OutputCollector receives the output of the reduce process and writes it to an output file.
  • The Reporter provides the option to record extra information about the reducer as the task progresses.

Types of Reducers:-

1) IdentityReducer:- It implements Reducer and maps inputs directly to outputs.

2) LongSumReducer:- It implements Reducer with LongWritable values and determines the sum of all values corresponding to the given key.

Example:- Word count using MapReduce

Mapper example:-

Input: (line number, line of text)

Output: for each word w in the input line, output (w, 1)

Input: (2133, The quick brown fox jumps over the lazy dog)

Output: (the,1), (quick,1), (brown,1), (fox,1), (jumps,1), (over,1), (the,1), (lazy,1), (dog,1)

Reducer example:-

Input: (key, list of values)

Output: sum all the values in the input list for the given key and output (key, sum)

Input: (the, [1,1,1,1,1]), (fox, [1,1,1])

Output: (the,5)

(fox,3)
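The word count example can be traced end to end with a short plain-Java sketch. It runs in a single JVM without Hadoop, so WordCountSketch, map(), reduce() and wordCount() are illustrative names rather than framework API. Lower-casing the words and splitting on whitespace are assumptions made here to match the (the,1) pairs shown in the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-JVM illustration of word count; not a Hadoop job.
class WordCountSketch {

    // Mapper: for each word w in the line, emit (w, 1). The line number is ignored.
    static List<Map.Entry<String, Integer>> map(long lineNumber, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reducer: sum all values in the list for the given key.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: run map, group the values by key, then run reduce once per key.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long lineNumber = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(lineNumber++, line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Prints {brown=1, dog=1, fox=1, jumps=1, lazy=1, over=1, quick=1, the=2}
        System.out.println(wordCount(
                Arrays.asList("The quick brown fox jumps over the lazy dog")));
    }
}
```

With only this one sentence as input, "the" is counted twice. A count such as (the,5) would come from a larger input in which the word appears five times.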

Partitioner:-

  • The partitioner partitions the key space.
  • It controls the partitioning of the keys of the intermediate map outputs.
  • The key is used to derive the partition, typically by a hash function.
  • The total number of partitions is the same as the number of reduce tasks for the job.
  • Hence it controls which of the reduce tasks each intermediate key is sent to for reduction.

HashPartitioner:- It is the default partitioner.

  • If we do not specify a partitioner in our MapReduce program, HashPartitioner is applied automatically, i.e. all the values associated with a particular key are sent to a single reducer.
  • As a result, no key is duplicated across the part-r-* output files.

Custom partitioner:-

  • A custom partitioner only needs to implement two functions: configure() and getPartition().
  • The former uses the Hadoop job configuration to configure the partitioner, and the latter returns an integer between 0 and the number of reduce tasks, indexing the reducer to which the (key, value) pair will be sent.
  • Between the map and reduce stages, a MapReduce application takes the output of the mapper and distributes the results among the reduce tasks.
  • This process is called shuffling, because the output of a mapper on a single node may be sent to reducers across multiple nodes.
Example:-

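import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class MyPartitioner implements Partitioner<Text, Text>
{
    // No job-specific configuration is needed for this partitioner.
    public void configure(JobConf job) { }

    public int getPartition(Text key, Text value, int numPartitions)
    {
        // Mask the sign bit so that the index is never negative.
        int hashCode = key.hashCode() & Integer.MAX_VALUE;
        int partitionIndex = hashCode % numPartitions;
        return partitionIndex;
    }
}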
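To see what hash partitioning does to a stream of keys, the following plain-Java sketch distributes a few words over three buckets, one per hypothetical reduce task. It does not depend on Hadoop: PartitionSketch and partition() are made-up names, and String.hashCode() stands in for the key's hash. The sign bit is masked off because hashCode() can be negative, and a negative remainder would not be a valid partition index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: shows how keys map to partitions, without any Hadoop types.
class PartitionSketch {

    // Hash partitioning: clear the sign bit, then take the remainder.
    static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Distribute the keys over numPartitions buckets, one bucket per reduce task.
    static List<List<String>> partition(List<String> keys, int numPartitions) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(getPartition(key, numPartitions)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Every occurrence of the same word lands in the same bucket.
        System.out.println(partition(Arrays.asList("the", "fox", "the", "dog", "fox"), 3));
    }
}
```

Because the partition index depends only on the key, both occurrences of "the" end up in the same bucket, which is what guarantees that a single reducer sees all the values for a key.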

Combiner:-

  • The combiner is executed by the MapReduce framework.
  • The combiner reduces the amount of intermediate data before it is sent to the reducers.
  • The combiner is called when the number of map-side spill files is 3 or more; it then calls the reducer functionality, which is executed on the single node that ran the mapper.
  • The combiner reduces the amount of data transferred from one node to another.
  • In this case, the reducer functionality executes twice: once as the combiner on the map side and once as the reducer.
  • The output of the mapper functionality is equal to 16 MB, i.e. one split.

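Example:-

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// The input and output types should be the same as the map output key/value types.
// IntWritable is assumed here as the value type; substitute your own map output type.
public class MyCombiner extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        // Your logic
    }
}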
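The effect of a combiner can be sketched in plain Java, again without Hadoop. In the sketch below, CombinerSketch, combine() and reduce() are hypothetical names, and each mapper's output is modelled as the list of words it emitted, with every word standing for a (word, 1) pair. The same summing logic runs twice: once per mapper node as the combiner, and once more as the reducer over the partial sums.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: local aggregation before the shuffle, without Hadoop types.
class CombinerSketch {

    // Reduce logic shared by the combiner and the reducer: sum the values of one key.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    // Combiner: applies the reduce logic to the output of ONE mapper, on that node.
    static Map<String, Integer> combine(List<String> wordsFromOneMapper) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : wordsFromOneMapper) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        Map<String, Integer> partial = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            partial.put(group.getKey(), sum(group.getValue()));
        }
        return partial;
    }

    // Reducer: applies the same logic again to the partial sums from all mappers.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> pair : partial.entrySet()) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), sum(group.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> node1 = combine(Arrays.asList("the", "fox", "the", "the"));
        Map<String, Integer> node2 = combine(Arrays.asList("the", "the", "fox", "fox"));
        // 8 map output pairs shrink to 4 partial sums before crossing the network.
        System.out.println(node1 + " " + node2);                 // {fox=1, the=3} {fox=2, the=2}
        System.out.println(reduce(Arrays.asList(node1, node2))); // {fox=3, the=5}
    }
}
```

This only works because summing is associative and commutative, so summing partial sums gives the same totals as summing the raw ones. A reduce function without that property cannot be reused as a combiner unchanged.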

Record Reader and Record Writer:-

  • The RecordReader reads the data from the input splits.
  • The RecordWriter writes the reducer output to an output file.