MapReduce Interview Questions

Q. What is Hadoop MapReduce?
Hadoop MapReduce is the framework used for processing large data sets in parallel across a Hadoop cluster. It processes the data in two phases, namely the Map phase and the Reduce phase. This programming model is inherently parallel, so it can easily process large-scale data on commodity hardware.
It is tightly integrated with the Hadoop Distributed File System (HDFS), so processing is distributed across the data nodes of the cluster.
Q. How does Hadoop MapReduce work?
During the map phase, the input data is divided into splits, and each split is analyzed by a map task running in parallel across the Hadoop cluster. During the reduce phase, the map outputs are aggregated across the entire collection. In the classic word-count example, the map phase counts the words in each document, while the reduce phase aggregates the counts per word across all documents.
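The two phases can be sketched in pure Python for the word-count example. The function names (map_phase, group_by_key, reduce_phase) are illustrative, not part of any Hadoop API:

```python
# Toy word count showing the two MapReduce phases plus the grouping
# the framework performs between them. Pure Python, no Hadoop required.

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    return [(word, 1) for doc in documents for word in doc.split()]

def group_by_key(pairs):
    """Stand-in for the framework's shuffle: group map outputs by key."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word across the whole collection."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big cluster", "data node"]
counts = reduce_phase(group_by_key(map_phase(docs)))
# counts == {"big": 2, "data": 2, "cluster": 1, "node": 1}
```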
Q. Explain what is shuffling in MapReduce?
The process by which the system performs the sort and transfers the map outputs to the reducers as input is known as the shuffle.
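As an illustration in plain Python (with made-up sample data), the shuffle can be pictured as sorting the intermediate pairs by key and then handing each reducer one group of values per key:

```python
# Toy model of the shuffle: sort map outputs by key, then group
# contiguous runs of the same key into one reducer input.
from itertools import groupby

map_outputs = [("b", 1), ("a", 1), ("b", 1), ("a", 1), ("c", 1)]

# Sort step: order the intermediate pairs by key.
sorted_pairs = sorted(map_outputs, key=lambda kv: kv[0])

# Transfer/group step: each key's values arrive at a reducer together.
reducer_inputs = {key: [v for _, v in group]
                  for key, group in groupby(sorted_pairs, key=lambda kv: kv[0])}
# reducer_inputs == {"a": [1, 1], "b": [1, 1], "c": [1]}
```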
Q. Explain what is distributed Cache in MapReduce Framework?
Distributed Cache is an important feature provided by the MapReduce framework. When you want to share files across all nodes in a Hadoop cluster, DistributedCache is used. The files can be executable JAR files or simple properties files.
Q. Explain what is NameNode in Hadoop?
The NameNode in Hadoop is the node where Hadoop stores all the file location information for HDFS (Hadoop Distributed File System). In other words, the NameNode is the centrepiece of an HDFS file system. It keeps a record of all the files in the file system and tracks the file data across the cluster's machines.
Q. Explain what is JobTracker in Hadoop? What are the actions followed by Hadoop?
In Hadoop, the JobTracker is used for submitting and tracking MapReduce jobs. The JobTracker runs in its own JVM process.
Hadoop performs the following actions:
1. The client application submits jobs to the JobTracker
2. The JobTracker communicates with the NameNode to determine the data location
3. The JobTracker locates TaskTracker nodes near the data or with available slots
4. It submits the work to the chosen TaskTracker nodes
5. When a task fails, the JobTracker is notified and decides how to handle it
6. The TaskTracker nodes are monitored by the JobTracker
Q. Explain what is heartbeat in HDFS?
A heartbeat is a signal sent between a DataNode and the NameNode, and between a TaskTracker and the JobTracker. If the NameNode or JobTracker does not receive the signal, it assumes there is some issue with the DataNode or TaskTracker.
Q. Explain what a combiner is and when you should use a combiner in a MapReduce job?
Combiners are used to increase the efficiency of a MapReduce program. A combiner reduces the amount of data that needs to be transferred across the network to the reducers. If the operation performed is commutative and associative, you can use your reducer code as the combiner. Note that the execution of the combiner is not guaranteed in Hadoop.
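A minimal sketch of what a combiner buys you, using word count (the sample data and function name are invented): summation is commutative and associative, so the reducer logic can double as a combiner and shrink one mapper's output before it crosses the network:

```python
# Local aggregation on a single mapper's output, before the shuffle.
def combine(pairs):
    partial = {}
    for word, count in pairs:
        partial[word] = partial.get(word, 0) + count
    return sorted(partial.items())

mapper_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
combined = combine(mapper_output)
# Two pairs cross the network instead of four: [("cat", 1), ("the", 3)]
```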
Q. What happens when a datanode fails?
When a DataNode fails:
1. The JobTracker and NameNode detect the failure
2. All tasks on the failed node are re-scheduled
3. The NameNode replicates the user's data to another node
Q. Explain what is Speculative Execution?
In Hadoop, during speculative execution, a certain number of duplicate tasks are launched: multiple copies of the same map or reduce task can be executed on different slave nodes. In simple terms, if a particular node is taking a long time to complete a task, Hadoop will create a duplicate of that task on another node. The copy that finishes first is retained, and the copies that do not finish first are killed.
Q. Explain what are the basic parameters of a Mapper?
The basic parameters of a Mapper are its input and output key/value types, for example:
1. LongWritable and Text (input key and value)
2. Text and IntWritable (output key and value)
Q. Explain what is the function of the MapReduce partitioner?
The function of the MapReduce partitioner is to make sure that all values for a single key go to the same reducer, which eventually helps distribute the map output evenly over the reducers.
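Hadoop's default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; a rough Python analogue, with Python's built-in hash standing in for hashCode, looks like this:

```python
# Sketch of hash partitioning: every occurrence of a key maps to the same
# reducer index, so all values for that key meet in one reduce task.
NUM_REDUCERS = 3

def partition(key, num_reducers=NUM_REDUCERS):
    return hash(key) % num_reducers

# The same key always lands on the same reducer...
same = partition("hadoop") == partition("hadoop")
# ...and every computed index is a valid reducer number.
in_range = all(0 <= partition(w) < NUM_REDUCERS for w in ["a", "bb", "ccc"])
```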
Q. Explain what is difference between an Input Split and HDFS Block?
The logical division of data is known as an input split, while the physical division of data is known as an HDFS block.
Q. Explain what happens in TextInputFormat?
In TextInputFormat, each line of the text file is a record. The value is the content of the line, while the key is the byte offset at which the line starts. For instance, key: LongWritable, value: Text.
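The key/value contract of TextInputFormat can be mimicked in a few lines of Python (the helper name and sample bytes are made up for illustration):

```python
# Each record is (byte offset of the line's start, line content),
# mirroring TextInputFormat's LongWritable key / Text value.
def text_input_records(data: bytes):
    records, offset = [], 0
    for line in data.split(b"\n"):
        records.append((offset, line.decode()))
        offset += len(line) + 1  # +1 for the newline delimiter
    return records

records = text_input_records(b"hello world\nfoo\nbar")
# records == [(0, "hello world"), (12, "foo"), (16, "bar")]
```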
Q. Mention the main configuration parameters that the user needs to specify to run a MapReduce job?
The user of the MapReduce framework needs to specify:
1. Job’s input locations in the distributed file system
2. Job’s output location in the distributed file system
3. Input format
4. Output format
5. Class containing the map function
6. Class containing the reduce function
7. JAR file containing the mapper, reducer and driver classes
Q. Explain what is WebDAV in Hadoop?
WebDAV is a set of extensions to HTTP that support editing and updating files. On most operating systems, WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.
Q. Explain what is sqoop in Hadoop?
Sqoop is a tool used to transfer data between relational database management systems (RDBMS) and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS such as MySQL or Oracle into HDFS, as well as exported from HDFS back to an RDBMS.
Q. Explain how the JobTracker schedules a task?
The TaskTrackers send heartbeat messages to the JobTracker, usually every few seconds, to assure the JobTracker that they are alive and functioning. The message also informs the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
Q. Explain what is Sequencefileinputformat?
SequenceFileInputFormat is used for reading sequence files. A sequence file is a compressed binary file format optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.
Q. Explain what conf.setMapperClass does?
conf.setMapperClass sets the mapper class, and hence everything related to the map phase of the job, such as reading the input data and generating key-value pairs out of the mapper.
Q. What is YARN?
YARN stands for Yet Another Resource Negotiator, which is also called next-generation MapReduce, MapReduce 2, or MRv2.
It was introduced in the Hadoop 0.23 release to overcome the scalability shortcomings of the classic MapReduce framework by splitting the functionality of the JobTracker into a global ResourceManager and a per-application ApplicationMaster.
Q. What is data serialization?
Serialization is the process of converting object data into byte stream data for transmission over a network across different nodes in a cluster or for persistent data storage.
Q. What is deserialization of data?
Deserialization is the reverse process of serialization: it converts byte stream data back into object data, for example when reading data from HDFS. Hadoop provides Writables for serialization and deserialization purposes.
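As a rough analogue of Hadoop's Writable contract (the IntPair class is invented for illustration; Python's struct module stands in for Hadoop's DataOutput/DataInput streams):

```python
import struct

class IntPair:
    """Toy 'writable' holding two ints that can serialize itself."""
    def __init__(self, first=0, second=0):
        self.first, self.second = first, second

    def write(self):
        # Serialization: object state -> big-endian byte stream.
        return struct.pack(">ii", self.first, self.second)

    def read_fields(self, data):
        # Deserialization: byte stream -> object state.
        self.first, self.second = struct.unpack(">ii", data)

p = IntPair(7, 42)
wire = p.write()      # 8 bytes, ready for the network or disk

q = IntPair()
q.read_fields(wire)
# q.first == 7 and q.second == 42
```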
Q. What are the Key/Value Pairs in Mapreduce framework?
Mapreduce framework implements a data model in which data is represented as key/value pairs. Both input and output data to mapreduce framework should be in key/value pairs only.
Q. What are the constraints to Key and Value classes in Mapreduce?
Any data type used for a value field in a mapper or reducer must implement Hadoop's Writable interface, so that the field can be serialized and deserialized.
Key fields must additionally be comparable with each other, so they must implement Hadoop's WritableComparable interface, which in turn extends the Writable and java.lang.Comparable interfaces.
Q. What are the main components of Mapreduce Job?
A main driver class, which provides the job configuration parameters.
A Mapper class, which must extend the org.apache.hadoop.mapreduce.Mapper class and provide an implementation for the map() method.
A Reducer class, which should extend the org.apache.hadoop.mapreduce.Reducer class.
Q. What are the main components of Job flow in YARN architecture?
A MapReduce job flow on YARN involves the following components:
A Client node, which submits the Mapreduce job.
The YARN Resource Manager, which allocates the cluster resources to jobs.
The YARN Node Managers, which launch and monitor the tasks of jobs.
The MapReduce Application Master, which coordinates the tasks running in the MapReduce job.
The HDFS file system, which is used for sharing job files between the above entities.
Q. What is the role of the Application Master in the YARN architecture?
The Application Master negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the tasks.
The Application Master requests containers for all map and reduce tasks. Once containers are assigned to tasks, the Application Master starts them by notifying the corresponding NodeManager. The Application Master collects progress information from all tasks, and aggregate values are propagated to the client node or user.
The Application Master is specific to a single application, which is a single job in classic MapReduce or a cycle of jobs. Once the job execution is completed, the Application Master ceases to exist.
Q. What is identity Mapper?
Identity Mapper is the default Mapper class provided by Hadoop. When no mapper class is specified in a MapReduce job, this mapper is executed.
It doesn't process, manipulate, or perform any computation on the input data; it simply writes the input data to the output. Its class name is org.apache.hadoop.mapred.lib.IdentityMapper.
Q. What is identity Reducer?
It is the reduce-phase counterpart of the Identity Mapper. It simply passes the input key/value pairs through to the output. Its class name is org.apache.hadoop.mapred.lib.IdentityReducer.
When no reducer class is specified in a MapReduce job, this class is picked up by the job automatically.
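In sketch form, the identity mapper and identity reducer simply pass their input through (plain Python with illustrative names, not the Hadoop classes themselves):

```python
# Pass-through behaviour of the identity mapper and reducer.
def identity_map(key, value):
    yield (key, value)

def identity_reduce(key, values):
    for v in values:
        yield (key, v)

mapped = list(identity_map(0, "line"))        # [(0, "line")]
reduced = list(identity_reduce("k", [1, 2]))  # [("k", 1), ("k", 2)]
```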
Q. What is chain Mapper?
The ChainMapper class is a special implementation of the Mapper class through which a set of mapper classes can be run in a chained fashion within a single map task.
In this chained execution pattern, the first mapper's output becomes the input of the second mapper, the second mapper's output the input of the third, and so on until the last mapper.
Its class name is org.apache.hadoop.mapreduce.lib.chain.ChainMapper.
Q. What is chain reducer?
ChainReducer is the counterpart of ChainMapper: it allows a single reducer followed by a chain of mappers to be run within a single reduce task. Unlike ChainMapper, the chain here starts with the reducer, whose output is then passed through the chain of mappers.
Its class name is org.apache.hadoop.mapreduce.lib.chain.ChainReducer.
Q. How can we specify multiple mapper and reducer classes in the ChainMapper or ChainReducer classes?
In ChainMapper, the ChainMapper.addMapper() method is used to add mapper classes.
In ChainReducer, the ChainReducer.setReducer() method is used to specify the single reducer class, and the ChainReducer.addMapper() method can be used to add mapper classes after it.
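The chained-execution pattern can be illustrated in plain Python (the mapper functions below are made-up examples, not Hadoop classes):

```python
# ChainMapper-style execution: each mapper's output feeds the next
# mapper's input, all within one task.
def to_lower(pairs):
    return [(k, v.lower()) for k, v in pairs]

def strip_punct(pairs):
    return [(k, v.strip(".,!")) for k, v in pairs]

def run_chain(mappers, pairs):
    for mapper in mappers:   # first output -> second input, and so on
        pairs = mapper(pairs)
    return pairs

result = run_chain([to_lower, strip_punct], [(0, "Hello,"), (1, "World!")])
# result == [(0, "hello"), (1, "world")]
```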
Q. What is side data distribution in Mapreduce framework?
The extra read-only data needed by a MapReduce job to process the main data set is called side data.
There are two ways to make side data available to all the map or reduce tasks:
Job Configuration
Distributed cache
Q. How to distribute side data using job configuration?
Side data can be distributed by setting arbitrary key-value pairs in the job configuration, using the various setter methods on the Configuration object.
In the task, the data can be retrieved from the configuration returned by the Context's getConfiguration() method.
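For a Hadoop Streaming script, configuration properties are exposed to the task as environment variables with dots replaced by underscores. A hedged sketch (the property name myjob.threshold is invented, and setting the environment variable here only simulates what the framework would do):

```python
import os

# In a real Streaming task the framework would set this from
# -D myjob.threshold=25; here we simulate it for illustration.
os.environ["myjob_threshold"] = "25"

def read_threshold(default=10):
    # "myjob.threshold" appears as "myjob_threshold" in the environment.
    return int(os.environ.get("myjob_threshold", default))

value = read_threshold()   # 25
```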
Q. When can we use side data distribution by job configuration, and when should we not?
Side data distribution by job configuration is useful only when we need to pass a small piece of metadata to the map/reduce tasks.
We shouldn't use this mechanism for transferring more than a few kilobytes of data, because it puts pressure on memory usage, particularly in a system running hundreds of jobs.
Q. What is Distributed Cache in Mapreduce?
The distributed cache mechanism is an alternative way of distributing side data: files and archives are copied to the task nodes in time for the tasks to use them when they run.
To save network bandwidth, files are normally copied to any particular node only once per job.
Q. How to supply files or archives to mapreduce job in distributed cache mechanism?
The files to be distributed can be specified as a comma-separated list of URIs as the argument to the -files option of the hadoop job command. The files can be on the local file system or on HDFS.
Archive files (ZIP files, tar files, and gzipped tar files) can also be copied to task nodes by the distributed cache using the -archives option; these are un-archived on the task node.
The -libjars option adds JAR files to the classpath of the mapper and reducer tasks.
Example jar command with distributed cache:
$ hadoop jar example.jar ExampleProgram -files Inputpath/example.txt input/filename /output/
Q. How distributed cache works in Mapreduce Framework?
When a MapReduce job is submitted with distributed cache options, the NodeManagers copy the files specified by the -files, -archives, and -libjars options from the distributed cache to a local disk. The files are said to be localized at this point.
The local.cache.size property can be configured to set the cache size on the local disk of the NodeManagers. Files are localized under the ${hadoop.tmp.dir}/mapred/local directory on the NodeManager nodes.
Q. What will hadoop do when a task is failed in a list of suppose 50 spawned tasks?
Hadoop will restart the map or reduce task on some other NodeManager; only if the same task fails more than four times will the job be marked as failed. The maximum number of attempts for map and reduce tasks can be configured with the mapreduce.map.maxattempts and mapreduce.reduce.maxattempts properties in the mapred-site.xml file.
The default value for both properties is 4.
Q. Consider case scenario: In Mapreduce system, HDFS block size is 256 MB and we have 3 files of size 256 KB, 266 MB and 500 MB then how many input splits will be made by Hadoop framework?
Hadoop will make 5 splits as follows
– 1 split for 256 KB file
– 2 splits for 266 MB file  (1 split of size 256 MB and another split of size 10 MB)
– 2 splits for 500 MB file  (1 Split of size 256 MB and another of size 244 MB)
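The arithmetic above can be checked with a short sketch (this ignores the small slack factor real Hadoop applies to avoid tiny trailing splits):

```python
import math

BLOCK = 256 * 1024 * 1024          # 256 MB block size
MB = 1024 * 1024

def num_splits(file_size):
    """Each file yields ceil(size / block) splits, at least one."""
    return max(1, math.ceil(file_size / BLOCK))

sizes = [256 * 1024, 266 * MB, 500 * MB]   # 256 KB, 266 MB, 500 MB
splits = [num_splits(s) for s in sizes]    # [1, 2, 2]
total = sum(splits)                        # 5
```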
Q. Why can’t we just have the file in HDFS and have the application read it instead of distributed cache?
The distributed cache copies the file to all NodeManagers at the start of the job. If a NodeManager then runs 10 or 50 map or reduce tasks, they all use the same local copy from the distributed cache.
On the other hand, if a file needs to be read from HDFS in the job, then every map or reduce task accesses it from HDFS, so a NodeManager running 100 map tasks would read the file 100 times from HDFS. Accessing the file from the NodeManager's local file system is much faster than from the HDFS DataNodes.
Q. What mechanism does Hadoop framework provides to synchronize changes made in Distribution Cache during run time of the application?
The distributed cache mechanism only distributes read-only data needed by a MapReduce job, not files that can be updated. So there is no mechanism to synchronize changes to distributed cache files, because changes to them are not allowed.

