MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. (Learn about the basics of MapReduce in the column MapReduce Tutorial.)
A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which then act as inputs to the reduce tasks.
Typically, both the input and the output of the job are stored in a file system, and the framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node.
The master is responsible for scheduling the job's component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.
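The split, map, sort and reduce flow described above can be imitated in plain Java without any Hadoop dependency. The following is only a toy sketch: the class name MiniMapReduce and the word-count task are illustrative assumptions, and everything runs in a single JVM rather than on a cluster.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MiniMapReduce {
    // "Map" step: emit a (word, 1) pair for every word in one input chunk.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : chunk.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // "Reduce" step: sum all values that share one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> chunks) {
        // Sort/group step: a TreeMap keeps the keys sorted, imitating the
        // sort the framework performs between the map and reduce phases.
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String chunk : chunks) {
            for (Map.Entry<String, Integer> kv : map(chunk)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Each string plays the role of one independent input chunk.
        System.out.println(run(Arrays.asList("to be or", "not to be")));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```

In a real job the chunks would be processed by separate map tasks on different nodes; here they are simply iterated in a loop.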
Hadoop can process many different data formats, from flat text files to databases. An input split is a chunk of the input that is processed by a single map, and each map processes a single split. Each split is divided into records, which the map processes in turn.
In a database context, a split might correspond to a range of rows from a table, and a record to a row in that range.
An input format is responsible for creating the input splits and dividing them into records.
File names that start with an underscore (_) or a period (.) are treated as hidden files and are ignored when reading input files.
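The hidden-file rule can be sketched as a simple name filter. This is a plain-Java illustration, not Hadoop's own code: the class and method names are hypothetical, and it assumes the two hidden prefixes are the underscore and the period.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HiddenFileFilterSketch {
    // A name counts as hidden when it starts with "_" or "." (assumed prefixes).
    static boolean isHidden(String fileName) {
        return fileName.startsWith("_") || fileName.startsWith(".");
    }

    // Keep only the names that would be read as input files.
    static List<String> visibleInputs(List<String> fileNames) {
        List<String> visible = new ArrayList<>();
        for (String name : fileNames) {
            if (!isHidden(name)) {
                visible.add(name);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        System.out.println(visibleInputs(
                Arrays.asList("part-00000", "_SUCCESS", ".part-00000.crc", "data.txt")));
        // prints [part-00000, data.txt]
    }
}
```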
The default split size is the block size, and the split size can be calculated as max(minimumSize, min(maximumSize, blockSize)).
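The formula is easy to check with a few numbers. The sketch below is a stand-alone Java method, not a call into Hadoop; the 128 MB block size and the bounds of 1 byte and Long.MAX_VALUE are assumptions chosen for illustration.

```java
class SplitSizeSketch {
    // Direct transcription of max(minimumSize, min(maximumSize, blockSize)).
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb; // assumed block size

        // With a tiny minimum and a huge maximum, the block size wins.
        System.out.println(computeSplitSize(1L, Long.MAX_VALUE, blockSize) / mb);       // prints 128
        // A minimum above the block size forces larger splits.
        System.out.println(computeSplitSize(256 * mb, Long.MAX_VALUE, blockSize) / mb); // prints 256
        // A maximum below the block size forces smaller splits.
        System.out.println(computeSplitSize(1L, 64 * mb, blockSize) / mb);              // prints 64
    }
}
```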
Hadoop works better with a small number of large files than a large number of small files.
FileInputFormat generates splits in such a way that each split is all or part of a single file.
We can avoid splitting a file in two ways: either by increasing the minimum split size (or the block size) to at least the size of the largest file, or by overriding the isSplitable() method to return false.
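Both approaches lead to one split per file, which can be seen with a rough split count. The sketch below is a simplification under stated assumptions: the class and method names are hypothetical, the file is assumed to be non-empty, and the count uses plain ceiling division, whereas the real framework may merge a small trailing remainder into the last split.

```java
class NoSplitSketch {
    // Split size formula from the text: max(minimumSize, min(maximumSize, blockSize)).
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate number of splits for one non-empty file.
    static long countSplits(long fileSize, long splitSize, boolean splittable) {
        if (!splittable || fileSize <= splitSize) {
            return 1; // the whole file goes to a single map
        }
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long gb = 1024L * mb;
        long blockSize = 128 * mb; // assumed block size

        // Default bounds: a 1 GB file is cut into block-sized splits.
        System.out.println(countSplits(gb, splitSize(1L, Long.MAX_VALUE, blockSize), true));     // prints 8
        // Way 1: a minimum split size larger than the file.
        System.out.println(countSplits(gb, splitSize(2 * gb, Long.MAX_VALUE, blockSize), true)); // prints 1
        // Way 2: the format reports the file as not splittable.
        System.out.println(countSplits(gb, blockSize, false));                                   // prints 1
    }
}
```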
In the map method, we can get information about the file being processed by calling getInputSplit() on the context, casting the result to FileSplit and calling its getPath() method.
mapreduce.input.keyvaluelinerecordreader.key.value.separator
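The effect of this separator property can be sketched in plain Java. This is an illustration rather than the library's record reader: the class name is hypothetical, and it assumes that a line is cut at the first occurrence of the separator (a tab by default) and that a line without a separator becomes a key with an empty value.

```java
class KeyValueLineSketch {
    // Returns {key, value} for one input line.
    static String[] splitKeyValue(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos == -1) {
            // No separator: the whole line is the key, the value is empty (assumed behaviour).
            return new String[] { line, "" };
        }
        // Only the first separator splits; later ones stay inside the value.
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitKeyValue("line1,On the top of the Crumpetty Tree", ',');
        System.out.println(kv[0]); // prints line1
        System.out.println(kv[1]); // prints On the top of the Crumpetty Tree
    }
}
```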
This is like SequenceFileInputFormat, except that it retrieves the sequence file's keys and values as opaque binary objects, which are encapsulated as BytesWritable objects.
It is used when reading the data from a relational database using JDBC.
This is the default output format; the keys and values are separated by a tab delimiter.
The delimiter can be changed using the property mapreduce.output.textoutputformat.separator.
We can suppress the key or the value in the output by using the NullWritable type.
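How the separator and the suppression interact can be shown with a small formatter. This is a sketch with hypothetical names, not the output format's own code: a Java null stands in for a NullWritable key or value, and it assumes that the separator is written only when both parts are present.

```java
class TextOutputLineSketch {
    static final String DEFAULT_SEPARATOR = "\t";

    // Builds one output line; null plays the role of NullWritable (an assumption of this sketch).
    static String formatRecord(String key, String value, String separator) {
        StringBuilder line = new StringBuilder();
        if (key != null) {
            line.append(key);
        }
        if (key != null && value != null) {
            line.append(separator); // only between two present parts
        }
        if (value != null) {
            line.append(value);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatRecord("1950", "22", ","));  // prints 1950,22
        System.out.println(formatRecord(null, "22", ","));    // prints 22
        System.out.println(formatRecord("1950", null, ","));  // prints 1950
    }
}
```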
It writes sequence files for its output.
Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io package.
Hadoop provides Writable wrappers for all the Java primitive types except char, and each has a get() and a set() method for retrieving and storing the data.
For numeric values, we can select either a fixed-length encoding (IntWritable and LongWritable) or a variable-length encoding (VIntWritable and VLongWritable).
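The trade-off between the two encodings is about serialized size. The sketch below estimates how many bytes a variable-length encoding needs compared with a fixed four-byte int. It is modelled on my understanding of the zero-compressed scheme in Hadoop's WritableUtils (one byte for values from -112 to 127, otherwise one header byte plus the significant bytes); treat the exact thresholds as an assumption and verify them against the library before relying on them.

```java
class VarLengthSizeSketch {
    // A fixed-length int always occupies four bytes.
    static int fixedIntSize() {
        return Integer.BYTES;
    }

    // Estimated size in bytes of a variable-length encoded value (assumed scheme).
    static int variableSize(long value) {
        if (value >= -112 && value <= 127) {
            return 1; // small values fit in a single byte
        }
        long magnitude = value < 0 ? ~value : value; // one's complement for negatives
        int bytes = 0;
        while (magnitude != 0) {
            magnitude >>>= 8;
            bytes++;
        }
        return 1 + bytes; // header byte + significant bytes
    }

    public static void main(String[] args) {
        System.out.println(fixedIntSize());                  // prints 4
        System.out.println(variableSize(1));                 // prints 1
        System.out.println(variableSize(163));               // prints 2
        System.out.println(variableSize(Integer.MAX_VALUE)); // prints 5
    }
}
```

Under these assumptions, variable-length encoding saves space when most values are small, while fixed-length encoding is the better fit when values are spread evenly across the whole range.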
It is the equivalent of String in Java.
It is a special type of Writable: no bytes are written to or read from the stream.
• In MapReduce, a key or value can be declared as a NullWritable when we don't want to use it in the final output.
• It is an immutable singleton, and the instance can be retrieved by calling NullWritable.get().
• This will store an empty value in the output.
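The two properties listed above, a single shared instance and a zero-length serialization, can be sketched with a stand-in class. NullValueSketch is a hypothetical name and only imitates the idea; it is not Hadoop's class.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class NullValueSketch {
    private static final NullValueSketch INSTANCE = new NullValueSketch();

    // Private constructor: the only instance is the shared one.
    private NullValueSketch() { }

    static NullValueSketch get() {
        return INSTANCE;
    }

    // Serialization writes nothing to the stream.
    void write(DataOutput out) throws IOException { }

    // Deserialization reads nothing from the stream.
    void readFields(DataInput in) throws IOException { }

    // Helper for the demo: number of bytes produced by write().
    static int serializedSize() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            get().write(new DataOutputStream(buffer));
            return buffer.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(get() == get());   // prints true
        System.out.println(serializedSize()); // prints 0
    }
}
```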
There are six writable collection types in Hadoop.
• ArrayWritable, TwoDArrayWritable and ArrayPrimitiveWritable.
• MapWritable, SortedMapWritable and EnumSetWritable.
• ArrayPrimitiveWritable is a wrapper for arrays of Java primitives.
• ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and two-dimensional arrays (arrays of arrays) of Writable instances.
• All the elements of an ArrayWritable or a TwoDArrayWritable must be instances of the same class.
If the existing Writable classes do not meet our needs, we can develop a custom Writable by implementing the WritableComparable interface.
This custom Writable can then be used as a data type in a MapReduce program.
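The shape of such a custom type can be sketched without Hadoop on the classpath. In the code below, SketchWritableComparable is a local stand-in that I assume mirrors the methods of Hadoop's interface (write, readFields and compareTo), and YearTemperature is a hypothetical two-field record; in a real job the class would implement the Hadoop interface instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in for Hadoop's interface (assumed to expose the same methods).
interface SketchWritableComparable<T> extends Comparable<T> {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

class YearTemperature implements SketchWritableComparable<YearTemperature> {
    private int year;
    private int temperature;

    // No-argument constructor so an empty instance can be created before readFields().
    YearTemperature() { }

    YearTemperature(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    int getYear() { return year; }
    int getTemperature() { return temperature; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read in the same order they were written.
        year = in.readInt();
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTemperature other) {
        int byYear = Integer.compare(year, other.year);
        return byYear != 0 ? byYear : Integer.compare(temperature, other.temperature);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof YearTemperature)) return false;
        YearTemperature that = (YearTemperature) o;
        return year == that.year && temperature == that.temperature;
    }

    @Override
    public int hashCode() {
        return 31 * year + temperature;
    }

    static byte[] serialize(YearTemperature value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            value.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static YearTemperature deserialize(byte[] bytes) {
        try {
            YearTemperature value = new YearTemperature();
            value.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return value;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        YearTemperature original = new YearTemperature(1950, 22);
        byte[] bytes = serialize(original);
        System.out.println(bytes.length);                        // prints 8
        System.out.println(deserialize(bytes).equals(original)); // prints true
    }
}
```

The compareTo() method defines the sort order used when the type serves as a key, which is why this sketch also overrides equals() and hashCode() consistently with it.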