MapReduce is a programming model designed by Google, with which a subset of distributed computing problems can be solved by writing simple programs.
It provides automatic data distribution and aggregation.
It offers a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.
It partitions the input data and schedules execution across a set of machines.
It handles machine failures and manages inter-process communication.
A MapReduce computation follows three steps:
Computation of key-value pairs from each piece of input.
Grouping of intermediate values by key.
Iteration over each resulting group.
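The three steps can be illustrated on a single machine with plain Java collections. This is only a minimal sketch: the class MiniMapReduce and its methods run() and wordCount() are hypothetical names made up for this illustration and are not part of Hadoop, and nothing here is distributed.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical in-memory illustration; not a Hadoop class.
final class MiniMapReduce {

    // Runs the three steps in one JVM:
    // 1) compute (key, value) pairs, 2) group values by key, 3) iterate over each group.
    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> inputs,
            Function<I, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, R> reducer) {

        // Steps 1 and 2: map every input and group the intermediate values by key.
        Map<K, List<V>> groups = new TreeMap<>();
        for (I input : inputs) {
            for (Map.Entry<K, V> pair : mapper.apply(input)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }

        // Step 3: call the reducer once per group.
        Map<K, R> result = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            result.put(group.getKey(), reducer.apply(group.getKey(), group.getValue()));
        }
        return result;
    }

    // Word count expressed with the helper above (words are lower-cased first).
    static Map<String, Integer> wordCount(List<String> lines) {
        return MiniMapReduce.<String, String, Integer, Integer>run(
                lines,
                line -> {
                    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
                    for (String word : line.toLowerCase().split("\\s+")) {
                        if (!word.isEmpty()) {
                            pairs.add(new SimpleEntry<>(word, 1));
                        }
                    }
                    return pairs;
                },
                (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum());
    }

    public static void main(String[] args) {
        System.out.println(
                wordCount(Arrays.asList("The quick brown fox jumps over the lazy dog")));
    }
}
```

For the sample sentence, the word "the" ends up with a count of 2 and every other word with a count of 1. A real framework performs the grouping step across machines; here a sorted map stands in for it.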
MapReduce works on the divide-and-conquer principle.
The basic idea is to partition a large problem into smaller subproblems.
These can be tackled in parallel by different workers, and the intermediate results from the workers are combined to produce the final result.
MapReduce is executed in two main phases, called map and reduce.
Each phase is defined by a data processing function; these functions are called map() and reduce().
In the map phase, MapReduce takes the input data and feeds each data element into the mapper.
The reducer processes all output from the mapper and arrives at the final output.
In order for mapping, reducing, partitioning, and shuffling to work together seamlessly, we need to agree on a common structure for the data being processed.
It should be flexible and powerful enough to handle most of the target data processing applications.
MapReduce uses lists and (key, value) pairs as its fundamental data primitives.
A MapReduce job can be run with a single method call on the Job object: submit() or waitForCompletion().
If the property mapred.job.tracker is set to local, the job runs in a single JVM; otherwise, the property specifies the host and port number of the job tracker when running on the cluster.

There are two versions of MapReduce:

Classic MapReduce or MRv1
YARN (Yet Another Resource Negotiator)

YARN (MapReduce 2):-

In classic MapReduce, a scalability bottleneck appears once the cluster size reaches 4000 nodes or more.
With a cluster that large, the job tracker cannot keep up with both job scheduling and task monitoring.
YARN separates these two roles into two independent daemons.

Resource Manager: Manages the use of resources across the cluster.

Application Master: Manages the lifecycle of applications running on the cluster.

Node Manager: Runs on the cluster nodes and makes sure that an application does not use more resources than it has been allocated.

In contrast to the job tracker, in YARN there is a separate application master for every MapReduce job that is run.
If the application master fails, the resource manager stops receiving its heartbeat messages and starts a new instance of the master.
If the resource manager fails, the administrator starts a new instance of the resource manager, which recovers from the saved state.
An additional daemon in the YARN architecture is the history server.

Mapper:

To serve as a mapper, a class implements the Mapper interface and extends the MapReduceBase class.
The MapReduceBase class is the base class for both mappers and reducers.

It includes two methods that effectively act as:

1. Constructor
2. Destructor

The constructor role is played by void configure(JobConf job); in this function, you can extract the parameters set either by the XML configuration files or in the main class of your application.
The destructor role is played by void close(), which is called last, before the task terminates, and wraps up any loose ends, e.g. closing database connections and open files.
The Mapper interface is responsible for the data processing step. It takes the form Mapper<K1, V1, K2, V2>, where the key classes and value classes implement the WritableComparable and Writable interfaces respectively. A single method, map(), processes the individual key-value pairs.

Types of mappers are

1) IdentityMapper - Implements Mapper and maps inputs directly to outputs.

2) InverseMapper - Implements Mapper and reverses the key-value pair.

3) RegexMapper - Implements Mapper and generates a (match, 1) pair for every regular expression match.

4) TokenCountMapper - Implements Mapper and generates a (token, 1) pair for each token when the input is tokenized.
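What each of these four mappers emits can be sketched with plain Java functions. The class MapperSketches and its method names are hypothetical stand-ins written for this illustration; they are not the Hadoop classes and do not use the Hadoop Writable types.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-ins for the four library mappers; not the Hadoop classes.
final class MapperSketches {

    // Identity: pass the pair through unchanged.
    static <K, V> Map.Entry<K, V> identity(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Inverse: swap key and value.
    static <K, V> Map.Entry<V, K> inverse(K key, V value) {
        return new SimpleEntry<>(value, key);
    }

    // Regex: emit (match, 1) for every match of the pattern in the line.
    static List<Map.Entry<String, Long>> regex(String pattern, String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        Matcher matcher = Pattern.compile(pattern).matcher(line);
        while (matcher.find()) {
            out.add(new SimpleEntry<>(matcher.group(), 1L));
        }
        return out;
    }

    // Token count: emit (token, 1) for every whitespace-separated token.
    static List<Map.Entry<String, Long>> tokenCount(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokens.nextToken(), 1L));
        }
        return out;
    }
}
```

For example, MapperSketches.tokenCount("the quick the") yields three pairs, one per token, with the repeated word emitted twice; adding up the ones is left to the reducer.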

Reducer:

As with any mapper implementation, a reducer must first extend the MapReduceBase class to allow for configuration and cleanup.
In addition, it must also implement the Reducer interface.
When the reducer task receives the output from the various mappers, it sorts the incoming data on the key of the (key, value) pair and groups together all values of the same key. The reduce() function is then called, and it generates output pairs by iterating over the values associated with a given key.
The OutputCollector receives the output of the reduce process and writes it to an output file.
The Reporter provides the option to record extra information about the reducer as the task progresses.

Types of Reducers:-

1) IdentityReducer - Implements Reducer and maps inputs directly to outputs.

2) LongSumReducer - Implements Reducer with LongWritable values and outputs the sum of all values for the given key.
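The behaviour of the two reducers can be sketched in plain Java. ReducerSketches and its methods are hypothetical names for this illustration and use Long instead of the Hadoop LongWritable type.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the two library reducers; not the Hadoop classes.
final class ReducerSketches {

    // Identity: emit one (key, value) pair per incoming value, unchanged.
    static <K, V> List<Map.Entry<K, V>> identity(K key, Iterator<V> values) {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        while (values.hasNext()) {
            out.add(new SimpleEntry<>(key, values.next()));
        }
        return out;
    }

    // Long sum: emit a single (key, sum of all values) pair.
    static <K> Map.Entry<K, Long> longSum(K key, Iterator<Long> values) {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return new SimpleEntry<>(key, Long.valueOf(sum));
    }
}
```

Both methods take an Iterator because the reducer only sees the grouped values as a stream that it walks once.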

Examples:- Word count using MapReduce

Mapper Example:-

Input: (key, line of text)

Output: for each word w in the input line, output (w, 1)

Input: (2133, The quick brown fox jumps over the lazy dog)

Output: (the,1), (quick,1), (brown,1), (fox,1), (jumps,1), (over,1), (the,1), (lazy,1), (dog,1)

Reducer Example:-

Input: (key, list of values)

Output: the sum of all values in the input list for the given key

Input:

Output: (the,5)

(fox,3)

Partitioner:

The partitioner partitions the key space.
It controls the partitioning of the keys of the intermediate map outputs.
The key is used to derive the partition, typically by a hash function.
The total number of partitions is the same as the number of reduce tasks for the job.
Hence the partitioner controls which of the reduce tasks the intermediate key is sent to for reduction.

Hash partitioner:

It is the default partitioner.
If we do not specify a partitioner in our MapReduce program, the hash partitioner automatically performs the partitioning, i.e. all the values associated with a particular key are sent to a single reducer.
As a result, no key is duplicated across the part-r-* output files.
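The hashing idea can be sketched as follows. HashPartitionSketch is a hypothetical class for this illustration; it follows the usual mask-and-modulo approach on the key's hash code and works on plain strings rather than Hadoop key types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of hash partitioning; not the Hadoop class.
final class HashPartitionSketch {

    // Clear the sign bit so the result is never negative, then take the
    // remainder so it falls in the range [0, numReduceTasks).
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Group keys by the partition (reduce task) they would be sent to.
    static Map<Integer, List<String>> assign(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            byPartition
                    .computeIfAbsent(partitionFor(key, numReduceTasks), p -> new ArrayList<>())
                    .add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        System.out.println(assign(Arrays.asList("the", "fox", "the", "dog", "the"), 3));
    }
}
```

Because the partition depends only on the key, every occurrence of "the" lands in the same partition, which is why a key never appears in more than one reducer's output.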

Custom partitioner:

A custom partitioner only needs to implement two functions: configure() and getPartition().
The former uses the Hadoop job configuration to configure the partitioner, and the latter returns an integer between 0 and the number of reduce tasks, indexing the reducer to which the (key, value) pair will be sent.
Between the map and reduce stages, a MapReduce application takes the output of the mappers and distributes the results among the reduce tasks.
This process is called shuffling, because the output of a mapper on a single node may be sent to reducers across multiple nodes in the cluster.

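Example:-

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class MyPartitioner implements Partitioner<Text, Text>
{
    public void configure(JobConf job) { }   // nothing to configure in this example

    public int getPartition(Text key, Text value, int numPartitions)
    {
        int hashCode = key.hashCode();
        // Clear the sign bit so the index is never negative.
        int partitionIndex = (hashCode & Integer.MAX_VALUE) % numPartitions;
        return partitionIndex;
    }
}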

Combiner:

The combiner functionality is executed by the MapReduce framework.
The combiner reduces the amount of intermediate data before it is sent to the reducers.
The combiner is called when the number of spills is 3 or more; it then calls the reducer functionality, which is executed on a single node.
The combiner reduces the amount of data transferred from one node to another.
In this case, the reducer functionality executes twice: once as the combiner and once as the reducer.
The output of the mapper functionality is 16 MB, i.e. one split.
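The effect of a combiner on the amount of shuffled data can be sketched in plain Java. CombinerSketch and its methods are hypothetical names for this illustration, and the sketch assumes each input line is handled by its own map task, which is a simplification.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a combiner; not a Hadoop class.
final class CombinerSketch {

    // Map function for word count: emit (word, 1) for every word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reduce function: sum the values per key. The same logic serves as the combiner.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    // Returns the records that would be sent to the reducers.
    // Assumption: one map task per line; the combiner runs on that task's output only.
    static List<Map.Entry<String, Integer>> shuffle(List<String> lines, boolean useCombiner) {
        List<Map.Entry<String, Integer>> sent = new ArrayList<>();
        for (String line : lines) {
            List<Map.Entry<String, Integer>> mapOutput = map(line);
            if (useCombiner) {
                sent.addAll(reduce(mapOutput).entrySet()); // local pre-aggregation
            } else {
                sent.addAll(mapOutput);
            }
        }
        return sent;
    }
}
```

For the two lines "the fox and the dog" and "the cat and the hat", shuffle() returns 10 records without the combiner and 8 with it, while reduce() over either list gives the same final counts. This only works because summing can be applied in stages without changing the result.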

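Example:-

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// The type parameters should be the same types as the map output key/value.
// IntWritable is an assumption here; the original value type is not given.
public class MyCombiner extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        // your logic
    }
}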

Record Reader and Record Writer:-

The record reader reads the data from the input splits.
The record writer writes the reducer output to an output file.

Last updated: 04 Apr 2023

About Author

Vinod M

Vinod M is a Big Data expert writer at Mindmajix and contributes in-depth articles on various Big Data technologies. He also has experience in writing for Docker, Hadoop, Microservices, Commvault, and a few BI tools. You can be in touch with him via LinkedIn and Twitter.