Hadoop Mapreduce Architecture Overview
- Each node is part of an HDFS CLUSTER.
- Input data is stored in HDFS Spread across nodes and replicated.
- Programmer submits a job (mapper, reducer, input) to job tracker.
- Job tracker- Master
- Splits input data
- Schedules and monitors various map and reduce tasks
- Task Tracker- slaves
- Execute map and reduce tasks
MapReduce Programming Model :
- A programming model is designed by Google, by using which a subset of distributed computing problems can be solved by writing simple programs.
- It provides automatic data distribution and aggregation.
- A simple and powerful interface that enables automatic parallelism and distribution of large-scale compotators, combined with an implementation of this interface that archives high performance on large clusters of commodity PCs.
- It partitions the input data and schedules exaction across a set of machines.
- Handles machine failure and manages inter-process communication.
- Computation of key-value pair from each piece of input.
- Grouping of intermediate value by key
- Iteration over a resulting group
- MapReduce works on divide and conquers rule on the data.
- The basic idea is to partition a large problem into smaller.
- They can be tackled in parallel by different workers and intermediate results from works are combined to produce the final result.
- MapReduce is executed in two main phases, called map and reduce.
- Each phase B is defined by a data processing function and these functions are called map() and reduce()
- In map phase, MR takes the input data and feeds each data element into mapper.
- The Reducer process all output from the mapper and arrives at the final output.
- In order for mapping, reducing, partitioning and shuffling to seamlessly work together, we need to agree on a common structure for data being processed.
- It should be flexible and powerful enough to handle most of the target data processing application.
- MapReduce use list and pair as its fundamental primitive data structure.
- MapReduce job can run with a single method called submit() or wait for Job completion()
- If the property mapped. Job. Tracker is set to local, the job will run in a single JVM and we can specify the host and port number while running on the cluster.
There are 2 types of Map Reduces
- Classic Map Reduce or MRV1
- YARN (Yet Another Resource Negotiator)
YARN (Map Reduce 2):-
- If the cluster size reaches 4000 nodes or more, there will be a scalability bottleneck.
- If the cluster B is more, then the job tracker cannot handle the job scheduling and task monitoring.
- YARN separates these two roles into two independent daemons.
Resource Manager: It manages the use of resources across the cluster.
An Application Master:-To manage the lifecycle of applications running on the cluster.
Node Master:-Run on the cluster nodes which makes sure that the application does not use more resources than it has been allocated.
- Contrast to the Job Tracker, In YARN, there is an application master for every Map Reduce Job run.
- If Application Master fails, the resource manager won’t get heartbeat messages and will start a new instance of the master.
- If the resource manager is failed, the administrator will start the new instance of a resource manager and will recover from the saved state.
- Additional Daemon for YARN Architecture B History server.
- To serve the mapper, the class implements the mapper interface and inherits the MapReduce class.
- The MapReduce class is the base class for both mappers and reduces.
It includes two methods.
- Function called void config(Job con fig), in this function, you can extract the parameters set either by the xml files or main class of your application and calls.
- Void close() as the last terminates this function and wraps up the loose ends if any i.e. Cpmputing Data box connections/open the files and so on.
- The mapper interface is responsible for data processing steps if we utilize the mapper where the key classes and value classes implement the writable –comparable writable interfaces respectively. A single method process the individual key value pairs.
Frequently Asked MapReduce Interview Questions & Answers
Types of mapper are
1) Identity mapper - Implements mapper and maps i/p s directly to o/p s
2) Inverse mapper - implements mapper and reverses the key-value pair,
3) Regex mapper - implements mapper and generates a (match, 1) pair for every regular expression match.
4) Token count mapper - It implements mapper and generates(Token,1) pair when the input is tokenized.
- As with any mapper implementation, a reducer must first extend the map reduce base to allow the configuration and clean up
- In addition, it must also implement the reducer interface when the reducer task receives the o/p from the various mappers.
- It sorts the incoming data on the [key, value] pair and groups together all values of the same key, then such reducer() is called and it generates pairs by iterating over the values associated with a given key.
- The output collector retrieves the o/p of a reducer process and writes into o/p file.
- The reporter provides an option to record extra information about the reducer and the task processes.
Types of Reduces:-
1) Identify Reducer - It implements a reducer[key, value] and map inputs directly to the outputs.
2) Long sum reducer - It implements a reducer[key, long writable,] to get the given key
Examples:- Word count using map reduce
Output: for each word w in input line output
Input: (2133, The quick brown fox jumps over the lazy dog)
Output: (the,1), (quicker,1),(brown,1)——(fox,1), (the,1),
Output: sum all values from the input for the given key input list of values and out
- Practitioner partitions the key space.
- Partition controls the partitioning of the keys of the intermediate map outputs.
- The key is used to derive the partition, typically by a hash function.
- The total no of partitions is the same as the number of reduce tasks for the job.
- Hence this control which of the reduce tasks, the intermediate key is sent for reduction.
- It is the default practitioner.
- If we are not going to specify anything in our map reduce program automatically Hash practitioner will execute the practitioner functionality i.e all the values which are associated with a particular key will be sent to the single reducer.
- There is no duplication of a key in the part-r-file.
- Custom practitioner only needs to implement 2 functions
- The former users use the Hadoop configuration to configure the partitions and the latest returns an integer b/w the no. if reducer tasks indexing to which the reducer pair will be sent,
- Between the map and reduce stages, a mapped application takes the o/p of the mapper and distribute the results among the reduce tasks.
- The process is called shift ling because the o/p of a mapper on a single node may be sent to reduces across.
Public class my practitioner implements partitioner
Int get partition(Text key, text value, int num partitions)
Int hash code=key. hash code();
Int. partition = hash code mod num partitions;
Rehim partition index;
- Combiner functionality will execute the map-reduce framework.
- Combiner will reduce the amount of intermediate data before sending them to the reducers.
- Combiner will call when the minimum split size is equal to 3 or>3, then combiner will call the reducer functionality and it will be executed on the single node.
- Combiner will reduce the amount of data traversed from one node to the other node.
- In this case, the reducer functionality will execute two times.
- The o/p of mapper functionality is equal to 16MB i.e one split.
Example:- Public class my combiner extends map Reduce base implements
Should be the Same type as map output key/value
Public void reduce(Text key, Iteratorvalue
OutputCollector output, Reporter reporter)
Record Reader and Record writer:-
- Record Reader Will read the data from input splits
- Record Writer will write the data to an output file from reducer o/p.
Explore MapReduce Sample Resumes! Download & Edit, Get Noticed by Top Employers!Download Now!
List of Other Big Data Courses: