Hadoop Installation and Configuration

What is Hadoop?

Hadoop is a widely used, open source software framework written mainly in Java, with some native code in C and shell scripts. It manages large volumes of data, in both structured and unstructured formats, on clusters of computers using simple programming models.

The Hadoop framework provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Hadoop follows the master-slave architecture described below.

Master/Slave Architecture:

In this architecture, the master is the NameNode, the JobTracker, or both, and the slaves are multiple DataNode and TaskTracker pairs ({DataNode, TaskTracker}, ….. {DataNode, TaskTracker}).

Map/Reduce and HDFS are the primary components of a Hadoop cluster.

MapReduce: MapReduce is a programming model and an associated implementation for generating and processing big data sets with parallel, distributed algorithms on a cluster. The following is the Map/Reduce master-slave architecture.

Master: JobTracker

Slaves: {TaskTracker}……{TaskTracker}

  • The master {JobTracker} is the interaction point between users and the map/reduce framework. The JobTracker places pending jobs in a queue, executes them in FIFO (first in, first out) order, and assigns map and reduce tasks to the TaskTrackers.
  • The slaves {TaskTracker} execute tasks as instructed by the master {JobTracker} and handle the data exchange between the map and reduce phases.
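
As a rough local analogy of the map, shuffle, and reduce flow described above (this is not Hadoop code and says nothing about how the JobTracker schedules work), the classic word count can be sketched as a shell pipeline. The function name word_count is made up for this example:

```shell
# Local analogy of a MapReduce word count; word_count is a hypothetical name.
word_count() {
  tr -s ' \t' '\n\n' |            # map: emit one word (key) per line
    sed '/^$/d' |                 # drop empty lines
    sort |                        # shuffle: bring identical keys together
    uniq -c |                     # reduce: count each group of keys
    awk '{ print $2 "\t" $1 }'    # format each result as word<TAB>count
}

printf 'big data big cluster\ndata big\n' | word_count
```

In a real cluster the map and reduce stages run as many tasks on different TaskTrackers; here each stage is a single local process.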

HDFS (Hadoop Distributed File System): HDFS is the distributed file system of the Apache Hadoop project. It is designed to be fault tolerant and to run on commodity hardware.

The following is the HDFS Master-slave architecture.

Master: NameNode

Slaves: {DataNode}…..{DataNode}

  • The master {NameNode} manages file system namespace operations such as opening, closing, and renaming files and directories, determines the mapping of blocks to DataNodes, and regulates client access to files.
  • The slaves {DataNodes} serve read and write requests from the file system's clients and perform tasks such as block creation, deletion, and replication as instructed by the master {NameNode}.

Core-site.xml and hdfs-site.xml configuration files:

This section describes how to edit and set up the deployment configuration files (core-site.xml, hdfs-site.xml, and related files) for HDFS and MapReduce.

Hortonworks provides a set of configuration files that can be used as a reference, but they must be modified to match your cluster environment so that they represent your working MapReduce and Hadoop distributed file system configuration.

The following are the steps to configure files to set up HDFS and MapReduce environment:

  • Step 1: Extract the core Hadoop configuration files into a temporary directory.
  • Step 2: The files are located in the configuration_files/core_hadoop directory, where the companion files were decompressed.
  • Step 3: Make the necessary changes in the configuration files.
  • Step 4: In the temporary directory, locate the files and edit their properties based on your environment.
  • Step 5: Search the files for TODO markers to find the properties to replace.
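
Step 5 can be sketched as a small search helper. This is a minimal sketch, assuming the placeholders contain the text TODO in some letter case (the exact marker in your companion files may differ); list_todo_lines and the sample placeholder TODO-NAMENODE-HOSTNAME are hypothetical names used only for illustration:

```shell
# list_todo_lines DIR: print file:line:text for every line that contains the
# assumed marker "todo" (case-insensitive) in the XML files under DIR.
list_todo_lines() {
  # /dev/null forces grep to print file names even for a single match file.
  find "$1" -type f -name '*.xml' -exec grep -in 'todo' /dev/null {} +
}

# Try it on a scratch directory instead of the real configuration files.
scratch=$(mktemp -d)
printf '<value>TODO-NAMENODE-HOSTNAME:8020</value>\n' > "$scratch/core-site.xml"
list_todo_lines "$scratch"
rm -rf "$scratch"
```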

Edit and modify the following core-site.xml properties according to your environment:

Modify and edit the following hdfs-site.xml properties according to your environment:

<property>
     <name>dfs.namenode.name.dir</name>
     <value>/grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn</value>
     <description>Comma-separated list of paths. Use the list of directories from $DFS_NAME_DIR. For example, /grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn.</description>
</property>

<property>
     <name>dfs.datanode.data.dir</name>
     <value>file:///grid/hadoop/hdfs/dn,file:///grid1/hadoop/hdfs/dn</value>
     <description>Comma-separated list of paths. Use the list of directories from $DFS_DATA_DIR. For example, file:///grid/hadoop/hdfs/dn,file:///grid1/hadoop/hdfs/dn.</description>
</property>

<property>
     <name>dfs.namenode.http-address</name>
     <value>$namenode.full.hostname:50070</value>
     <description>Enter your NameNode hostname for http access.</description>
</property>

<property>
     <name>dfs.namenode.secondary.http-address</name>
     <value>$secondary.namenode.full.hostname:50090</value>
     <description>Enter your Secondary NameNode hostname.</description>
</property>

<property>
     <name>dfs.namenode.checkpoint.dir</name>
     <value>/grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn</value>
     <description>A comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. For example, /grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn.</description>
</property>

<property>
     <name>dfs.namenode.checkpoint.edits.dir</name>
     <value>/grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn</value>
     <description>A comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. For example, /grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn.</description>
</property>

<property>
     <name>dfs.namenode.rpc-address</name>
     <value>namenode_host_name:8020</value>
     <description>The RPC address that handles all client requests.</description>
</property>

<property>
     <name>dfs.namenode.https-address</name>
     <value>namenode_host_name:50470</value>
     <description>The NameNode secure http server address and port.</description>
</property>
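
Every entry above follows the same name, value, and description pattern, and most mistakes in hand-edited files are unbalanced tags. As a hedged sketch, a small shell helper can print such an entry so that the tags always pair up. render_property is a hypothetical name, not a Hadoop tool:

```shell
# render_property NAME VALUE [DESCRIPTION]: print one <property> element.
# Hypothetical helper; it does not escape XML special characters (&, <, >).
render_property() {
  printf '<property>\n'
  printf '     <name>%s</name>\n' "$1"
  printf '     <value>%s</value>\n' "$2"
  if [ -n "${3:-}" ]; then
    printf '     <description>%s</description>\n' "$3"
  fi
  printf '</property>\n'
}

render_property dfs.namenode.rpc-address namenode_host_name:8020 \
  "The RPC address that handles all client requests."
```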

Edit and modify the following yarn-site.xml properties according to your environment:

<property>
     <name>yarn.resourcemanager.scheduler.class</name>
     <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

<property>
     <name>yarn.resourcemanager.resource-tracker.address</name>
     <value>$resourcemanager.full.hostname:8025</value>
     <description>Enter your ResourceManager hostname.</description>
</property>

<property>
     <name>yarn.resourcemanager.scheduler.address</name>
     <value>$resourcemanager.full.hostname:8030</value>
     <description>Enter your ResourceManager hostname.</description>
</property>

<property>
     <name>yarn.resourcemanager.address</name>
     <value>$resourcemanager.full.hostname:8050</value>
     <description>Enter your ResourceManager hostname.</description>
</property>

<property>
     <name>yarn.resourcemanager.admin.address</name>
     <value>$resourcemanager.full.hostname:8141</value>
     <description>Enter your ResourceManager hostname.</description>
</property>

<property>
     <name>yarn.nodemanager.local-dirs</name>
     <value>/grid/hadoop/yarn/local,/grid1/hadoop/yarn/local</value>
     <description>Comma-separated list of paths. Use the list of directories from $YARN_LOCAL_DIR. For example, /grid/hadoop/yarn/local,/grid1/hadoop/yarn/local.</description>
</property>

<property>
     <name>yarn.nodemanager.log-dirs</name>
     <value>/grid/hadoop/yarn/log</value>
     <description>Use the list of directories from $YARN_LOCAL_LOG_DIR. For example, /grid/hadoop/yarn/log,/grid1/hadoop/yarn/log,/grid2/hadoop/yarn/log.</description>
</property>

<property>
     <name>yarn.nodemanager.recovery.dir</name>
     <value>${hadoop.tmp.dir}/yarn-nm-recovery</value>
</property>

<property>
     <name>yarn.log.server.url</name>
     <value>http://$jobhistoryserver.full.hostname:19888/jobhistory/logs/</value>
     <description>URL for job history server</description>
</property>

<property>
     <name>yarn.resourcemanager.webapp.address</name>
     <value>$resourcemanager.full.hostname:8088</value>
     <description>Address of the ResourceManager web UI.</description>
</property>

<property>
     <name>yarn.timeline-service.webapp.address</name>
     <value>$resourcemanager.full.hostname:8188</value>
</property>

Edit and modify the following mapred-site.xml properties according to your environment:

<property>
     <name>mapreduce.jobhistory.address</name>
     <value>$jobhistoryserver.full.hostname:10020</value>
     <description>Enter your JobHistoryServer hostname.</description>
</property>

<property>
     <name>mapreduce.jobhistory.webapp.address</name>
     <value>$jobhistoryserver.full.hostname:19888</value>
     <description>Enter your JobHistoryServer hostname.</description>
</property>

On each node of the cluster, create an empty file named dfs.exclude inside $HADOOP_CONF_DIR:

touch $HADOOP_CONF_DIR/dfs.exclude

Append the following to /etc/profile, setting JAVA_HOME to the location of your Java installation:

JAVA_HOME=
export JAVA_HOME
HADOOP_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR
export PATH=$PATH:$JAVA_HOME:$HADOOP_CONF_DIR
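
The dfs.exclude step can be tried safely before touching a real node. This sketch uses a temporary directory in place of the real configuration directory (an assumption made so that the example does not write under /etc) and checks that the file exists and is empty:

```shell
# Use a scratch directory instead of the real configuration directory.
HADOOP_CONF_DIR=$(mktemp -d)
touch "$HADOOP_CONF_DIR/dfs.exclude"

# -f: a regular file exists; ! -s: the file has size zero.
if [ -f "$HADOOP_CONF_DIR/dfs.exclude" ] && [ ! -s "$HADOOP_CONF_DIR/dfs.exclude" ]; then
  echo "dfs.exclude is present and empty"
fi
```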

Configuring MapReduce to use Snappy compression is optional. To enable it for MapReduce jobs, add the properties listed below to mapred-site.xml and core-site.xml according to your environment requirements:

<property>
     <name>mapreduce.admin.map.child.java.opts</name>
     <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/hdp/current/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
     <final>true</final>
</property>

<property>
     <name>mapreduce.admin.reduce.child.java.opts</name>
     <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/hdp/current/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
     <final>true</final>
</property>

Add the SnappyCodec to the codecs list in core-site.xml:

<property>
     <name>io.compression.codecs</name>
     <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

Defining HDFS Details in hdfs-site.xml:

The following table briefly explains the properties in the hdfs-site.xml file that configure the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes.

PROPERTY | VALUE | DESCRIPTION | DEFAULT
dfs.data.dir | /disk1/hdfs/data,/disk1/hdfs/data | A list of directories where a DataNode stores its blocks. Each block is stored in only one of these directories. | ${hadoop.tmp.dir}/dfs/data
fs.checkpoint.dir | /disk1/hdfs/namesecondary,/disk1/hdfs/namesecondary | A list of directories where the Secondary NameNode stores checkpoints. It stores a copy of the checkpoint in each directory in the list. | ${hadoop.tmp.dir}/dfs/namesecondary

Mapred-site.xml:

The following example shows a mapred-site.xml file with configuration settings for the MapReduce daemons, the master {JobTracker} and the slaves {TaskTrackers}. Set the properties according to your environment:

<configuration>
     <property>
          <name>mapred.job.tracker</name>
          <value>localhost:8021</value>
     </property>
</configuration>

Defining mapred-site.xml: It contains the configuration settings for the MapReduce daemons. The main properties are listed below.

PROPERTY | VALUE | DESCRIPTION
mapred.job.tracker | localhost:8021 | The hostname and port that the JobTracker RPC server runs on. If set to the default value of local, the JobTracker runs in-process on demand when you run a MapReduce job.
mapred.local.dir | ${hadoop.tmp.dir}/mapred/local | A list of directories where MapReduce stores intermediate data for jobs.
mapred.system.dir | ${hadoop.tmp.dir}/mapred/system | The directory, relative to fs.default.name, where shared files are stored during a job run.
mapred.tasktracker.map.tasks.maximum | 2 | The number of map tasks that can run on a TaskTracker at one time.
mapred.tasktracker.reduce.tasks.maximum | 2 | The number of reduce tasks that can run on a TaskTracker at one time.

Installing Java

You can go through the steps to install Java and Hadoop here.

Pre-installation Setup

Before installing Hadoop, we need to know the prerequisites for installation. The following are the prerequisites:

  • Memory
  • Processor model and speed
  • Operating system and network

Memory: A minimum of 8 GB RAM is required.

Processor model and speed: A quad-core, hexa-core, or octa-core processor with a speed of 2 to 2.5 GHz.

Operating system and network requirements: Installing Hadoop on Linux is preferable to installing it on Windows. For learning purposes, install Hadoop in pseudo-distributed mode. The advantage of pseudo-distributed mode is that the whole cluster can be run and controlled from a single system.

Hadoop in Linux environment

Before installing Hadoop in the Linux environment, we need to follow the steps listed below to set up Secure Shell (SSH).

Step 1: Creating a User

Create a separate user for Hadoop to isolate the Hadoop file system from the Unix file system. Follow these steps to create a user:

  • In the command prompt, switch to root with the command “su”.
  • Create a user with the command “useradd name”, where name is the user name.
  • After creating the user account, you can switch to an existing user account with the command “su name”.

EXAMPLE: Open the Linux command prompt and type the following commands:

$ su
   password:
# useradd hadoop
# passwd hadoop
   New passwd:
   Retype new passwd

SSH Setup and Key Generation

SSH setup is required to perform cluster operations such as starting and stopping the distributed daemons and running shell operations on them. To authenticate Hadoop users, generate a public/private key pair for the Hadoop user and share the public key with the other users.

The following commands generate an RSA key pair, append the public key from id_rsa.pub to authorized_keys, and give the owner read and write permissions on authorized_keys:

$ ssh-keygen -t rsa 
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 
$ chmod 0600 ~/.ssh/authorized_keys
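
To see what the last two commands do without touching a real ~/.ssh directory, the sketch below repeats them on a scratch directory. The key text is a made-up placeholder, not a real key, and ssh_dir is a variable introduced only for this example:

```shell
# Scratch stand-in for ~/.ssh so that nothing real is modified.
ssh_dir=$(mktemp -d)
printf 'ssh-rsa PLACEHOLDER-KEY hadoop@localhost\n' > "$ssh_dir/id_rsa.pub"

# Append the public key and restrict the file to owner read/write.
cat "$ssh_dir/id_rsa.pub" >> "$ssh_dir/authorized_keys"
chmod 0600 "$ssh_dir/authorized_keys"

# The first ten characters of ls -l show the file type and permission bits.
ls -l "$ssh_dir/authorized_keys" | cut -c1-10
```

The permission string should show read and write for the owner only. SSH servers commonly refuse an authorized_keys file that other users can write to, which is why the chmod step matters.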

Installing and Configuring Hadoop in Standalone Mode

The following are the steps to install Hadoop 2.4.1 in standalone mode.

Step 1 − Go to the downloaded files:

The following command changes to the directory that contains the downloaded files:

Command: cd Downloads

Step 2 − Create soft links (shortcuts).

The following command is used to create shortcuts:

Command: ln -s ./Downloads/hadoop-2.7.2/ ./hadoop

Step 3 − Configure .bashrc

The following command opens the .bashrc file so that you can modify the PATH variable of the bash shell.

Command:  vi ./.bashrc

Add the following lines to export the variables and extend the path:

export HADOOP_HOME=/home/luck/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Step 4 − Configure Hadoop in Stand-alone mode:

The following command opens Hadoop’s hadoop-env.sh file, where JAVA_HOME is set:

Command:  vi ./hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/home/luck/jdk

Step 5 − Exit and re-open the command prompt

Step 6 − Run a Hadoop job on Standalone cluster

To test the installation, run the hadoop command without arguments. The usage message should be displayed.

Step 7 − Go to the directory where you downloaded the compressed Hadoop file and extract it in the terminal

Command: $ tar -xzvf hadoop-2.7.3.tar.gz

Step 8 − Go to the Hadoop distribution directory. 

Command: $ cd /usr/local/hadoop

Final output after installation and configuration:

By default, Hadoop is configured in standalone mode and runs in non-distributed mode on a single physical system. If Hadoop is installed and configured properly, running the hadoop version command displays a result like the following on the command prompt:

Hadoop 2.4.1 
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768 
Compiled by hortonmu on 2013-10-07T06:28Z 
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4

The above result shows that Hadoop standalone mode setup is working. 

The following XML files must be reconfigured in order to develop Hadoop in Java:

Core-site.xml
Mapred-site.xml
Hdfs-site.xml

Core-site.xml:

The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers.

Open core-site.xml with the following command and add the properties listed below between the <configuration> and </configuration> tags in this file.

Command: ~$ sudo gedit $HADOOP_HOME/etc/hadoop/core-site.xml

<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>Parent directory for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system.</description>
</property>
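
After editing, it is useful to read a value back and confirm that the file says what you intended. The helper below is a line-based sketch, not a real XML parser: it assumes tags are written without inner spaces, that each <name> and <value> sits on its own line, and that the name comes before its value. get_property is a hypothetical name, and the sample file is written to a temporary location:

```shell
# get_property FILE NAME: print the value of property NAME.
# Assumes <name> precedes <value> and each element fits on a single line.
get_property() {
  awk -v want="$2" '
    /<name>/  { n = $0; gsub(/.*<name>[ \t]*|[ \t]*<\/name>.*/, "", n) }
    /<value>/ { v = $0; gsub(/.*<value>[ \t]*|[ \t]*<\/value>.*/, "", v)
                if (n == want) { print v; exit } }
  ' "$1"
}

# Sample file with the two properties from this section.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:54310</value>
</property>
</configuration>
EOF

get_property "$conf" fs.defaultFS
```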

Mapred-site.xml: 

The mapred-site.xml file is used to specify the MapReduce framework currently in use.

Open the mapred-site.xml file with the following command and add the following properties between the <configuration> and </configuration> tags in this file.

Command: ~$ sudo gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml

<property>
<name>mapreduce.jobtracker.address</name>
<value>localhost:54311</value>
<description>MapReduce job tracker runs at this host and port.</description>
</property>

Hdfs-site.xml:

The hdfs-site.xml file contains information such as the replication value, the NameNode path, and the DataNode paths on the local file systems, that is, the places where you want to store the Hadoop infrastructure.

Open the hdfs-site.xml file with the following command and add the properties listed below between the <configuration> and </configuration> tags.

Command: ~$ sudo gedit $HADOOP_HOME/etc/hadoop/hdfs-site.xml

<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hduser_/hdfs</value>
</property>

Installing and configuration of Hadoop in Pseudo Distributed Mode

The following are the steps to install Hadoop 2.4.1 in pseudo-distributed mode.

Step 1 − Setting Up Hadoop

Set the Hadoop environment variables by appending the following commands to the ~/.bashrc file:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME

Apply the changes to the system currently running.

Command: $ source ~/.bashrc 

Step 2 − Hadoop Configuration

Hadoop configuration files are located at “$HADOOP_HOME/etc/hadoop”. Change these configuration files as required by your Hadoop infrastructure.

Command: $ cd $HADOOP_HOME/etc/hadoop

To develop Hadoop programs in Java, set the Java environment variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.

Example:

Command: export JAVA_HOME=/usr/local/jdk1.7.0_71

The following XML files must be reconfigured in order to develop Hadoop in Java:

  • Core-site.xml
  • Hdfs-site.xml
  • Yarn-site.xml
  • Mapred-site.xml

Hdfs-site.xml:

The hdfs-site.xml file contains information such as the replication value, the NameNode path, and the DataNode paths on the local file systems, that is, the places where you want to store the Hadoop infrastructure.

Let’s assume:

dfs.replication (data replication value) = 1

namenode path = /home/hadoop/hadoopinfra/hdfs/namenode

(In the path mentioned above, hadoop is the user name and hadoopinfra/hdfs/namenode is the directory generated by the HDFS file system.)

datanode path = /home/hadoop/hadoopinfra/hdfs/datanode

(In the path mentioned above, hadoopinfra/hdfs/datanode is the directory generated by the HDFS file system.)

Open the hdfs-site.xml file and add the properties listed below between the <configuration> and </configuration> tags. In this file, all the property values are user-defined and can be changed according to your Hadoop infrastructure.

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>

Yarn-site.xml:

Yarn-site.xml.template is a default template. The yarn-site.xml file is used to configure YARN in the Hadoop environment.

Open the yarn-site.xml file and add the following properties between the <configuration> and </configuration> tags in this file.

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

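Mapred-site.xml:

The mapred-site.xml file is used to specify the MapReduce framework currently in use. To replace the default template, first copy mapred-site.xml.template to mapred-site.xml.

The following command is used to copy mapred-site.xml.template to the mapred-site.xml file:

Command: $ cp mapred-site.xml.template mapred-site.xml

Open the mapred-site.xml file and add the following properties between the <configuration> and </configuration> tags in this file.

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>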

Installing and Configuring Hadoop in Fully-Distributed Mode

To configure a Hadoop cluster in fully-distributed mode, we need to configure all the master and slave machines. Although the setup differs from pseudo-distributed mode, the configuration method is the same.

The following are the steps to configure a Hadoop cluster in fully-distributed mode:

Step 1 − Setting Up Hadoop environment variables

The master and all the slaves must have the same user. On every node in the cluster, set the Hadoop environment variables by appending the following export commands to the ~/.bashrc file:

export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
export HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
export PATH=$HADOOP_PREFIX/bin:$JAVA_HOME/bin:$PATH
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export YARN_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop

Apply the changes to the system currently running.

Command: $ source ~/.bashrc 

Step 2: Configuration

Add all the export commands listed below at the start of the etc/hadoop/yarn-env.sh script:

export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
export HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
export PATH=$PATH:$HADOOP_PREFIX/bin:$JAVA_HOME/bin:.
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export YARN_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop

Step 3: Create a folder for hadoop.tmp.dir

Create a temporary folder in HADOOP_PREFIX:

Command: mkdir -p $HADOOP_PREFIX/tmp

Step 4: Tweak config files

On all the machines in the cluster, add the properties mentioned below between the <configuration> and </configuration> tags of the files in the $HADOOP_PREFIX/etc/hadoop folder.

The following XML files must be reconfigured in order to develop Hadoop in Java:

  • Core-site.xml
  • Hdfs-site.xml
  • Yarn-site.xml
  • Mapred-site.xml

Core-site.xml 

The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers.

Open core-site.xml and add the properties listed below between the <configuration> and </configuration> tags in this file.

<property>
<name>fs.default.name</name>
<value>hdfs://Master-Hostname:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/impadmin/hadoop-2.6.4/tmp</value>
</property>

Hdfs-site.xml 

The hdfs-site.xml file contains information such as the replication value, the NameNode path, and the DataNode paths on the local file systems, that is, the places where you want to store the Hadoop infrastructure.

Open the hdfs-site.xml file and add the properties listed below between the <configuration> and </configuration> tags. In this file, all the property values are user-defined and can be changed according to your Hadoop infrastructure.

<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

Mapred-site.xml :

The mapred-site.xml file is used to specify the MapReduce framework currently in use. To replace the default template, first copy mapred-site.xml.template to mapred-site.xml.

Open the mapred-site.xml file and add the following properties between the <configuration> and </configuration> tags in this file.

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

Yarn-site.xml :

Yarn-site.xml.template is a default template. The yarn-site.xml file is used to configure YARN in the Hadoop environment. Remember to replace “Master-Hostname” with the host name of the cluster’s master.

Open the yarn-site.xml file and add the following properties between the <configuration> and </configuration> tags in this file.

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>Master-Hostname:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>Master-Hostname:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>Master-Hostname:8040</value>
</property>
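
Replacing the Master-Hostname placeholder by hand on every machine is error-prone. A minimal sketch of doing it with sed follows, assuming the placeholder appears literally in the file. fill_master_hostname is a hypothetical helper, and master01.example.com is a made-up host name:

```shell
# fill_master_hostname HOST: copy stdin to stdout, replacing every literal
# "Master-Hostname" with HOST. HOST must not contain "/" or "&", which are
# special characters in this sed expression.
fill_master_hostname() {
  sed "s/Master-Hostname/$1/g"
}

printf '<value>Master-Hostname:8025</value>\n' |
  fill_master_hostname master01.example.com
```

Writing the result to a new file and reviewing it before copying it into etc/hadoop keeps the original template available for the other machines.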

Verifying Hadoop Installation

By following these steps, we can verify the Hadoop installation.

Step 1: Setting up NameNode 

Format the NameNode with the “hdfs namenode -format” command:

$ cd ~ 
$ hdfs namenode -format 

The expected result :

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************ 
STARTUP_MSG: Starting NameNode 
STARTUP_MSG:   host = localhost/192.168.1.11 
STARTUP_MSG:   args = [-format] 
STARTUP_MSG:   version = 2.4.1 
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory /home/hadoop/hadoopinfra/hdfs/namenode has been formatted successfully.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0 
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************ 
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11 
************************************************************/

Step 2 − Verifying Hadoop dfs

Executing the “start-dfs.sh” command will start your Hadoop file system.

Command: $ start-dfs.sh

Expected output:

10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Step 3 − Verifying Yarn Script

Executing the “start-yarn.sh” command will start your YARN daemons.

Command: $ start-yarn.sh

Expected output:

starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out
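
Beyond reading the start-up messages, a common check is to list the running Java processes with the JDK's jps tool and confirm that each daemon appears. The helper below is a sketch: missing_daemons is a hypothetical name, the daemon list assumes a single node that runs both HDFS and YARN, and the sample input is made up rather than captured from a real run:

```shell
# missing_daemons: read jps-style lines ("<pid> <name>") on stdin and print
# the name of every expected daemon that does not appear.
missing_daemons() {
  seen=$(awk '{ print $2 }')
  for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # grep -x matches whole lines, so NameNode does not match SecondaryNameNode.
    printf '%s\n' "$seen" | grep -qx "$d" || printf '%s\n' "$d"
  done
}

# Made-up sample in which only the HDFS daemons are running.
printf '4821 NameNode\n4950 DataNode\n5120 SecondaryNameNode\n5300 Jps\n' |
  missing_daemons
```

On a running node, `jps | missing_daemons` prints nothing when all five daemons are up.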

Step 4 − Accessing Hadoop on Web Browser

The default port number for the NameNode web interface is 50070. Use the http://localhost:50070/ URL to access Hadoop services in a web browser. The image below shows the output of this URL on a web page:

 Accessing Hadoop on Web Browser

Step 5 − Verify All Applications of Cluster

The default port number to access all the applications of the cluster is 8088. Use the http://localhost:8088/ URL to visit this service. The image below shows the all applications page in a web browser:

Verify All Applications of Cluster

Use Case of a Hadoop cluster

Hadoop Cluster in Facebook:

Facebook uses Hadoop clusters to store copies of internal log and dimension data sources and uses them as a source for analytics, machine learning, and reporting.

At present, Facebook holds two clusters, and they are as follows:

  • A 1,100-machine cluster with 8,800 cores and about 12 PB raw storage.
  • A 300-machine cluster with 2,400 cores and about 3 PB raw storage.

Every commodity node of the cluster has 8 cores and 12 TB storage.
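
The per-node figure can be cross-checked against the cluster totals with simple arithmetic. The sketch below multiplies the machine counts by 8 cores per node; the variable name is only for illustration:

```shell
# Total cores = number of machines x cores per machine.
cores_per_node=8
echo "1100 machines: $((1100 * cores_per_node)) cores"
echo "300 machines: $((300 * cores_per_node)) cores"
```

The two results are 8800 and 2400 cores respectively.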

Facebook developers have used Hive to build a higher-level data warehousing framework on top of Hadoop, and they use the Java APIs for application development. Facebook has also developed a FUSE application based on HDFS.

Hadoop Cluster in Facebook

Sample Cluster Configuration of Hadoop in Facebook:

The Hadoop cluster has master slave architecture. The image below represents the master slave architecture of Hadoop cluster:

Configuration of Hadoop in Facebook

Individual Configurations of Hadoop Cluster :

Individual Configurations of Hadoop Cluster

The above image explains the configuration of every node in a cluster. 

NameNode: Requires a large amount of RAM but does not need much hard disk space.

Secondary NameNode: Its memory requirement is not as high as that of the primary NameNode.

DataNodes: Each DataNode requires 16 GB of memory. Because the DataNodes store the data, they need a large amount of hard disk space and have multiple drives.

Last updated: 28 Sep 2024
About Author

Vaishnavi Putcha was born and brought up in Hyderabad. She works for Mindmajix e-learning website and is passionate about writing blogs and articles on new technologies such as Artificial intelligence, cryptography, Data science, and innovations in software and, so, took up a profession as a Content contributor at Mindmajix. She holds a Master's degree in Computer Science from VITS. Follow her on LinkedIn.