Hadoop Mapreduce Architecture Overview
- Each node is part of an HDFS CLUSTER.
- Input data is stored in HDFS Spread across nodes and replicated.
- Programmer submits job (mapper, reducer, input) to job tracher.
- Job tracker- Master
Splits input data
Schedules and monitors various map and reduce tasks
Execute map and reduce tasks
MAPREDUCE PROGRAMMING MODEL:
- A programming model is designed by Google, by using which a sub set of distributed computing problems can be solved by writing simple programs.
- It provides automatic data distribution and aggregation.
- A simple and powerful interface that enables automatic parallelism and distribution of large–scale compotators, combined with an implementation of this interface that archives high performance on large clusters of commodity PC’s.
- It partitions the input data and schedules exaction across a set of machines.
- Handles machine failure and manages inter-process communication.
- Computation of key value pair from each piece of input.
- Grouping of intermediate value by key
- Iteration over resulting group
- MapReduce works on divide and conquer rule on the data.
- The basic idea is to partition a large problem into smaller.
- They can be tackled in parallel by different workers and intermediate results from works are combined to produce the final result.
- MapReduce is executed in two main phases, called map and reduce.
- Each phase B is defined by a data processing function and these functions are called map() and reduce()
- In map phase, MR takes the input data and feeds each data element into mapper.
- The Reducer process all output from the mapper and arrives at the final output.
- In order for mapping, reducing, partitioning and shuffling to seamlessly work together, we need to agree on a common structure for data being processed.
- It should be flexible and powerful enough to handle most of the target data processing application.
- MapReduce use list and pair as its fundamental primitive data structure.
- Mapreduce job can run with a single method called submit() or wait for Job completion()
- If the property mapped. Job. Tracker is set to local, the job will run in a single JVM and we can specify the host and port number while running on the cluster.
There are 2 types of Map Reduces
1. Classic Map Reduce or MRV1
2. YARN (Yet Another Resource Negotiator)
YARN (Map Reduce 2):-
- If the cluster size reaches 4000 nodes or more, there will be a scalability bottleneck.
- If the cluster B is more, then the job tracker cannot handle the job scheduling and task monitoring.
- YARN separates these two roles into two independent daemons.
Resource Manager: It manages the use of resources across the cluster.
An Application Master:-To manage the life cycle of applications running on the cluster.
Node Master:-Run on the cluster nodes which makes sure that the application does not use more resources than it has been allocated.
- Contrast to the Job tracker, In YARN, there is an application master for every Map Reduce Job run.
- If Application Master fails, the resource manager won’t get heart beat messages and will start a new instance of the master.
- If the resource manager is failed, the administrator will start the new instance of a resource manager and will recover from the saved state.
- Additional Daemon for YARN Architecture B History server.
- To serve the mapper, the class implements the mapper inter face and inherits the mapreduce class.
- The mapreduce class is the base class for both mapper and reduces.
It includes two methods.
- Function called void config(Job con fig), in this function, you can extract the parameters set either by the xml files or main class of your application and calls.
- Void close() as the last terminates this function and wraps up the loose ends if any i.e. Cpmputing Data box connections/open the files and so on.
- The mapper interface is responsible for data processing steps if we utilize the mapper where the key classes and value classes implement the writable –comparable writable interfaces respectively. A single method process the individual key value pairs.
Types of mapper are
1) Identity mapper Implements mapper and maps i/p s directly to o/p s
2) Inverse mapper implements mapper and reverses the key value pair,
3) Regex mapper implements mapper and generates a (match, 1) pair for every regular expression match.
4) Token count mapper It implements mapper and generates(Token,1) pair when the input is tokenized.
- As with any mapper implementation, a reducer must first extend the map reduce base to allow the configuration and clean up
- In addition, it must also implement the reducer interface when the reducer task receives the o/p from the various mappers.
- It sorts the incoming data on the [key, value] pair and groups together all values of the same key, then such reducer() is called and it generates pairs by iterating over the values associated with a given key.
- The output collector retrieves the o/p of a reducer process and writes into o/p file.
- The reporter provides an option to record extra information about the reducer and the task processes.
Types of Reduces:-
1) Identify Reducer:- It implements a reducer[key,value] and map inputs directly to the outputs.
2)Long sum reducer It implements a reducer[key, long writable,] to get the given key
Examples:- Word count using map reduce
Out put: for each word w in input line out put
Input: (2133,Tge quick brown fox jumps over the logy dog)
Output: (the,1), (quicker,1),(brown,1)——(fox,1), (the,1),
Output: sum all values from input for the given key input list of values and out
- Practitioner partitions the key space.
- Partition controls the partitioning of the keys of the intermediate map outputs.
- The key is used to derive the partition, typically by a hash function.
- The total no of partitions is the same as the number of reduce tasks for the job.
- Hence this control which of the reduce tasks, the intermediate key is sent for reduction.
Hash practitioner:- It is the default practitioner.
- If we are not going to specify anything in our map reduce program automatically Hash practitioner will execute the practitioner functionality i.e all the values which are associated with a particular key will be sent to the single reducer.
- There is no duplication of key in the part-r-file.
- Custom practitioner only needs to implement 2 functions
- The former users uses the Hadoop configuration to configure the partitions and the latest returns an integer b/w the no., if reducer tasks indexing to which the reducer pair will be sent,
- Between the map and reduce stages, a mapped application takes the o/p of the mapper and distribute the results among the reduce tasks.
- The process is called shift ling because the o/p of a mapper on a single node may be sent to reduces across.
Public class my practitioner implements partitioner
Int get partition(Text key, text value, int num partitions)
Int hash code=key. hash code();
Int. partition = hash code mod num partitions;
Rehim partition index;
- Combiner functionality will execute the map reduce framework.
- Combiner will reduce the amount of intermediate data before sending them to the reducers.
- Combiner will call when the minimum split size is equal to 3 or>3, then combiner will call the reducer functionality and it will be executed on the single node.
- Combiner will reduce the amount of data traversed from one node to the other node.
- In this case, the reducer functionality will execute two times.
- The o/p of mapper functionality is equal to 16MB i.e one split.
Example:- Public class my combiner extends map Reduce base implements
Should be the Same type as map output key/value
Public void reduce(Text key, Iteratorvalue
OutputCollector output, Reporter reporter)
Record Reader and Record writer:-
- Record Reader Will read the data from input splits
- Record Writer will write the data to an output file from reducer o/p.