Overview of Hadoop MapReduce
MapReduce is a soft work framework for easily writing applications which process vast amounts of data (multi tera bytes) in- parallel on large clusters of commodity hard work in a reliable, fault- tolerant manner. (Learn about the basics of MapReduce in the column Mapreduce Tutorial)
Map Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which are then act as inputs to the reduce tasks.
Typically, both input and output of the job are stored in a file system and the framework takes care of scheduling tasks, monitoring them and re executes the failed tasks.
MapReduce framework consists of a single master Job tracker and one slave task tracker per cluster node.
The master is responsible for scheduling the jobs component tasks on the slaves, monitoring them and re-executing the failed tasks and the solves execute the tasks as directed by the master.
Input and output Formats
Hadoop can process many different types of data formats, from flat text files to databases. An input split is a chunk of the input that is processed by a single map and each map processes a single split.
In a database context, a split might correspond to a range of rows from a table and a record to a row in that range
An input format is responsible for creating the input split z and dividing them into records.
The file names which start with a and a are treated as hidden files and one must ignore them while reading as input files.
The default split size is the block size and can be calculated as max (minimum size, min (maximum size, block size))
Hadoop works better with a small number of large files than a large number of small files.
File input format generates splits in such a way that each split is part of a single file.
We can avoid the splitting of the file in 2 ways, either by increasing the block size as the largest file size or by implementing the is split table() method by returning false.
In the map method, we can have the file information using the get input split() of context and cast it to file split and use get path() method.
Below are the different types of input formats:
- Text input format:
- It is a default input format and record is a line of input.
- The key for each record is the byte offset from beginning of each line and value.
- The output format for this is the Text output format and its final output is a key-value pair which is delimited by a tab.
2. Key value text input format:
- If the output of the Text output format is sent to the i/p tab, we can specify the input format for this job, as this acts as a key value pair since the output is a delimited key – value pair.
- We can override the default delimiter using this property
map reduce. Input . key value line record reader. Key. Value separator
3. Nline input Format:
- This is used if we want to move each mapper to receive a fixed number of lines as input.
- Map reduce. Input. Line input format. Lines per map is set properly to the N value.
- This also works as a Text input format, but the difference is in the number of lines per input.
- Map Reduce has support for binary format as well
4. Sequence File input format:
- This file format shores sequences of binary key-value pairs.
- Sequence files are well suited as a format for Map reduce data since they are split table.
- To use the sequence file data as input as input to the map per, we need to mention input format as the sequence file input format.
- Need to mention the key-value data types as per the sequence file key and value types.
5. Sequence File As Text Input Format:
- This is like a sequence file input format, but it converts the sequence files, keys and values to text objects.
- The conversion is performed by calling to storing() on the keys and values.
6. Sequence file As Binary Input Format:
This is live sequence file input format that retrieves the sequence files, keys and values as opaque binary objects and they are encapsulated as Bytes writable objects.
7. DB Input Format:
It is used when reading the data from a relational database using JDBC.
8. Multiple inputs:
- This technique is used when we need to process the data which could be in the same format or in a different format but may have different representation.
- While using this, we need to mention a map per and input format for each input path.
Different output Formats:
1. Text output format:
- This is the default output format, the key and values are separated by tab delimiter.
- The delimiter can be changed using the property.
Map reduce. output. text output format. separator.
- We can support the key or value from the output using Null writable type.
2. Sequence file output format:
It writes sequence files for its output.
3. Sequence File As Binary Output format:
- It writes keys and values in raw binary format into a sequence file container.
- The output format classes generates set of files as output and one file for each reduces and their names are part-r-00000,part-r-00001 etc.
- If we want to write multiple output files for each reduce then we will use multiple output class
- This will generate one output file for each key in the reduces and the name can be pre fined with the key.
Data Types and custom writable:
- Hadoop comes with a large selection of writable classes in the org. a apache. Hadoop. io package
- Hadoop provides all writable wrappers for all the JAVA primiters types except char and hare a get() and set() method for retrieving and storing the data.
- Byte writable- byte, Boolean writable-boolean.
- Short writable-short, int writable and vint writable- int
- Float writable-float, double writable-double.
- Long writable and V long writable- long.
- When we have numeric’s, we can select either fined – length (Int writable and long writable) or variable length (V int writable and v long writable)
Text– It is equivalent to string in Java
Null writable: It is a special type of writable. And No bytes are written to. or read from the stream.
- In Map Reduce, a key or value can be declared as a Null writable when we don’t want to use this in the final output.
- It is an immutable single ton and the instance can be received by null writable. get()
- This will store an empty value in the output.
Writable collections:- There are six writable collection types in Hadoop.
- Array writable, TwoDArrayWritable, Array primitive writable.
- Map writable, StoredMapWritable and EnumSetWritable
- ArrayPrimitiveWritable is a wrapper for arrays of Java Primitives
- Array writable and TwoDArrayWritable are writable implementations for arrays and two – dimensional arrays (array of arrays) of writable instances.
- All the elements of an array writable or two D Array writable must be instances of the same class.
Custom Writable:- Instead of the existing writable classes, if we want to implement our own writable classes, then we can develop a custom writable by implementing a writable comparable interface.
This custom writable can be used as a data type in mapreduce program.