Hadoop is a widely used open source software framework written mainly in Java, with some native code in C and a set of shell scripts. It manages large volumes of both structured and unstructured data on clusters of computers using simple programming models.
The Hadoop framework provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Hadoop follows a master-slave architecture, which is described below.
In this architecture, the master is the NameNode, the JobTracker, or both, and the slaves are multiple DataNode and TaskTracker pairs ({DataNode, TaskTracker}, ….. {DataNode, TaskTracker}).
MapReduce and HDFS are the primary components of a Hadoop cluster.
MapReduce: MapReduce is a programming model and an associated implementation for generating and processing big data sets with parallel, distributed algorithms on a cluster. The following is the MapReduce master-slave architecture.
Master: JobTracker
Slaves: {TaskTracker}……{TaskTracker}
HDFS (Hadoop Distributed File System): HDFS is the file system of the Apache Hadoop project. It is designed to be fault-tolerant and to run on commodity hardware.
The following is the HDFS master-slave architecture.
Master: NameNode
Slaves: {DataNode}…..{DataNode}
This section of the article describes how to edit and set up the deployment configuration files (core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml) for HDFS and MapReduce.
Hortonworks provides a set of configuration files that can be taken as a reference, but they need to be modified to match our cluster environment.
The following are the steps to configure the files that set up the HDFS and MapReduce environment:
Edit and modify the following core-site.xml properties according to your environment:
Edit the following hdfs-site.xml properties according to your environment:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/grid/Hadoop/hdfs/nn,/grid1/Hadoop/hdfs/nn</value>
  <description>Comma-separated list of paths. Use the list of directories from $DFS_NAME_DIR. For example, /grid/Hadoop/hdfs/nn,/grid1/Hadoop/hdfs/nn.</description>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///grid/Hadoop/hdfs/dn,file:///grid1/Hadoop/hdfs/dn</value>
  <description>Comma-separated list of paths. Use the list of directories from $DFS_DATA_DIR. For example, file:///grid/Hadoop/hdfs/dn,file:///grid1/Hadoop/hdfs/dn.</description>
</property>
<property>
  <name>dfs.namenode.http-address</name>
  <value>$namenode.full.hostname:50070</value>
  <description>Enter your NameNode hostname for http access.</description>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>$secondary.namenode.full.hostname:50090</value>
  <description>Enter your Secondary NameNode hostname.</description>
</property>
<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>/grid/Hadoop/hdfs/snn,/grid1/Hadoop/hdfs/snn,/grid2/Hadoop/hdfs/snn</value>
  <description>A comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. For example, /grid/Hadoop/hdfs/snn,/grid1/Hadoop/hdfs/snn,/grid2/Hadoop/hdfs/snn.</description>
</property>
<property>
  <name>dfs.namenode.checkpoint.edits.dir</name>
  <value>/grid/Hadoop/hdfs/snn,/grid1/Hadoop/hdfs/snn,/grid2/Hadoop/hdfs/snn</value>
  <description>A comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. For example, /grid/Hadoop/hdfs/snn,/grid1/Hadoop/hdfs/snn,/grid2/Hadoop/hdfs/snn.</description>
</property>
<property>
  <name>dfs.namenode.rpc-address</name>
  <value>namenode_host_name:8020</value>
  <description>The RPC address that handles all client requests.</description>
</property>
<property>
  <name>dfs.namenode.https-address</name>
  <value>namenode_host_name:50470</value>
  <description>The NameNode secure http server address and port.</description>
</property>
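Typing these property blocks by hand is error-prone, since a single missing tag makes the file unparseable. As a rough sketch, a small shell helper can print each block in a consistent shape. The function name render_property is our own and is not part of Hadoop, and the helper does no XML escaping, so it assumes plain values such as host names, ports, and paths.

```shell
# Print one <property> element for a Hadoop *-site.xml file.
# Usage: render_property NAME VALUE [DESCRIPTION]
# Hypothetical helper: arguments are not XML-escaped, so avoid &, < and >.
render_property() {
  printf '<property>\n'
  printf '  <name>%s</name>\n' "$1"
  printf '  <value>%s</value>\n' "$2"
  # The description is optional; print it only when a third argument is given.
  if [ -n "${3:-}" ]; then
    printf '  <description>%s</description>\n' "$3"
  fi
  printf '</property>\n'
}

# Example: print the RPC address property with a placeholder host name.
render_property dfs.namenode.rpc-address namenode_host_name:8020
```

The printed block can be pasted between the configuration tags of hdfs-site.xml.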
Edit the following yarn-site.xml properties according to your environment:
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>$resourcemanager.full.hostname:8025</value>
  <description>Enter your ResourceManager hostname.</description>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>$resourcemanager.full.hostname:8030</value>
  <description>Enter your ResourceManager hostname.</description>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>$resourcemanager.full.hostname:8050</value>
  <description>Enter your ResourceManager hostname.</description>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>$resourcemanager.full.hostname:8141</value>
  <description>Enter your ResourceManager hostname.</description>
</property>
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/grid/Hadoop/yarn/local,/grid1/Hadoop/yarn/local</value>
  <description>Comma-separated list of paths. Use the list of directories from $YARN_LOCAL_DIR. For example, /grid/Hadoop/yarn/local,/grid1/Hadoop/yarn/local.</description>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/grid/Hadoop/yarn/log</value>
  <description>Use the list of directories from $YARN_LOCAL_LOG_DIR. For example, /grid/Hadoop/yarn/log,/grid1/Hadoop/yarn/log,/grid2/Hadoop/yarn/log.</description>
</property>
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>${hadoop.tmp.dir}/yarn-nm-recovery</value>
</property>
<property>
  <name>yarn.log.server.url</name>
  <value>http://$jobhistoryserver.full.hostname:19888/jobhistory/logs/</value>
  <description>URL for job history server</description>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>$resourcemanager.full.hostname:8088</value>
  <description>Address of the ResourceManager web interface.</description>
</property>
<property>
  <name>yarn.timeline-service.webapp.address</name>
  <value>$resourcemanager.full.hostname:8188</value>
</property>
Edit the following mapred-site.xml properties according to your environment:
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>$jobhistoryserver.full.hostname:10020</value>
  <description>Enter your JobHistoryServer hostname.</description>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>$jobhistoryserver.full.hostname:19888</value>
  <description>Enter your JobHistoryServer hostname.</description>
</property>
On each node of the cluster, create an empty file named dfs.exclude inside $HADOOP_CONF_DIR:
touch $HADOOP_CONF_DIR/dfs.exclude
Append the following to /etc/profile, setting JAVA_HOME to the path of your Java installation:
JAVA_HOME=
export JAVA_HOME
HADOOP_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR
export PATH=$PATH:$JAVA_HOME:$HADOOP_CONF_DIR
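When the same setup is repeated on many nodes, it helps if appending to a profile file is safe to run more than once. The sketch below shows one way to do that. The function name append_once is our own, and the example call is left as a comment because writing to /etc/profile needs root privileges.

```shell
# Append LINE to FILE only if FILE does not already contain that exact line.
# Hypothetical helper, not part of Hadoop.
append_once() {
  local file=$1 line=$2
  touch "$file"
  # -x matches whole lines, -F treats the line as a fixed string, -q is silent.
  if ! grep -qxF -- "$line" "$file"; then
    printf '%s\n' "$line" >> "$file"
  fi
}

# Example (run as root):
#   append_once /etc/profile 'export HADOOP_CONF_DIR'
```

Running the same call twice leaves a single copy of the line in the file.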
Configuring MapReduce to use Snappy compression is optional. To enable it for MapReduce jobs, add the properties listed below to mapred-site.xml, adjusted to your environment:
<property>
  <name>mapreduce.admin.map.child.java.opts</name>
  <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/hdp/current/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
  <final>true</final>
</property>
<property>
  <name>mapreduce.admin.reduce.child.java.opts</name>
  <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/hdp/current/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
  <final>true</final>
</property>
Add the SnappyCodec to the codecs list in core-site.xml:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
The following table briefly explains the hdfs-site.xml configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes.
PROPERTY | VALUE | DESCRIPTION |
dfs.data.dir | /disk1/hdfs/data, /disk2/hdfs/data | A list of directories where a DataNode stores its blocks. Each block is stored in only one of these directories. Default: ${hadoop.tmp.dir}/dfs/data |
fs.checkpoint.dir | /disk1/hdfs/namesecondary, /disk2/hdfs/namesecondary | A list of directories where the Secondary NameNode stores checkpoints. It stores a copy of the checkpoint in each directory in the list. Default: ${hadoop.tmp.dir}/dfs/namesecondary |
The following table describes the mapred-site.xml configuration settings for the MapReduce daemons, the master (JobTracker) and the slaves (TaskTrackers), with properties to be set according to our environment:
mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
Defining mapred-site.xml: It contains the configuration settings for the MapReduce daemons, the JobTracker and the TaskTrackers.
Property | Value | Description |
mapred.job.tracker | localhost:8021 | The hostname and port that the JobTracker RPC server runs on. If set to the default value of local, the JobTracker runs in-process, on demand, when a MapReduce job is run. |
mapred.local.dir | ${hadoop.tmp.dir}/mapred/local | A list of directories where MapReduce stores intermediate data for jobs. |
mapred.system.dir | ${hadoop.tmp.dir}/mapred/system | The directory, relative to fs.default.name, where shared files are stored during a job run. |
mapred.tasktracker.map.tasks.maximum | 2 | The number of map tasks that can run on a TaskTracker at a time. |
mapred.tasktracker.reduce.tasks.maximum | 2 | The number of reduce tasks that can run on a TaskTracker at a time. |
Pre-installation Setup
Before installing Hadoop, we need to know the prerequisites for installation. The following are the prerequisites:
Memory: A minimum of 8 GB RAM is required.
Processor model and speed: A quad-, hexa-, or octa-core processor running at 2 to 2.5 GHz.
Operating system and network requirements: Installing Hadoop on Linux is preferable to installing it on Windows. For learning purposes, install Hadoop in pseudo-distributed mode, whose advantage is that the whole cluster can be controlled from a single system.
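The memory and processor prerequisites can be checked from the shell before starting. The following is a minimal sketch for Linux. The function names has_min_ram and mem_total_kb are our own, and the check reads MemTotal from /proc/meminfo, which excludes memory reserved by the kernel, so a machine sold as 8 GB may report slightly less.

```shell
# Succeed if TOTAL_KB (a kB figure as found in /proc/meminfo) is at least MIN_GB.
# Hypothetical helper, not part of Hadoop.
has_min_ram() {
  local total_kb=$1 min_gb=$2
  [ "$total_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Print the MemTotal value (in kB) from a meminfo-style file.
mem_total_kb() {
  awk '/^MemTotal:/ {print $2}' "${1:-/proc/meminfo}"
}

# Report on the current machine when /proc/meminfo is available (Linux only).
if [ -r /proc/meminfo ]; then
  if has_min_ram "$(mem_total_kb)" 8; then
    echo "RAM: at least 8 GB"
  else
    echo "RAM: below 8 GB"
  fi
  echo "CPU cores: $(getconf _NPROCESSORS_ONLN)"
fi
```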
Before installing Hadoop in a Linux environment, follow the steps listed below to set up Secure Shell (SSH).
Step 1: Creating a User
Create a separate user for Hadoop to isolate the Hadoop file system from the Unix file system. Follow the steps to create a user:
EXAMPLE: Open the Linux command prompt and type the following commands:
$ su
password:
# useradd Hadoop
# passwd Hadoop
New passwd:
Retype new passwd
SSH setup is required to perform different operations on a cluster, such as starting and stopping the distributed daemons through shell scripts. To authenticate different users of Hadoop, a public/private key pair must be generated for a Hadoop user and shared with the other users.
The public key is copied from id_rsa.pub to authorized_keys, and the owner is given read and write permissions on authorized_keys.
The following shell commands generate the key pair and authorize the public key:
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
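The commands above prompt for a file name and a passphrase. For scripted setups, the same steps can be sketched without prompts, as below. The function name setup_ssh_key is our own, and the empty passphrase is an assumption that suits a learning cluster rather than a production one. The key directory is a parameter so that the sketch can be tried in a scratch directory first.

```shell
# Create a passphrase-less RSA key pair in KEY_DIR (default ~/.ssh) unless one
# already exists, then authorize the public key exactly once.
# Hypothetical helper, not part of Hadoop.
setup_ssh_key() {
  local key_dir=${1:-"$HOME/.ssh"}
  mkdir -p "$key_dir"
  chmod 700 "$key_dir"
  if [ ! -f "$key_dir/id_rsa" ]; then
    # -q: quiet, -N '': empty passphrase, -f: output file
    ssh-keygen -q -t rsa -N '' -f "$key_dir/id_rsa"
  fi
  touch "$key_dir/authorized_keys"
  # Append the public key only if that exact line is not already present.
  if ! grep -qxF -f "$key_dir/id_rsa.pub" "$key_dir/authorized_keys"; then
    cat "$key_dir/id_rsa.pub" >> "$key_dir/authorized_keys"
  fi
  chmod 0600 "$key_dir/authorized_keys"
}

# Example: setup_ssh_key            # uses ~/.ssh
```

After running it for the Hadoop user, `ssh -o BatchMode=yes localhost true` should succeed without asking for a password, provided an SSH server is running on the machine.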
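The following are the steps to install Hadoop in standalone mode. Substitute the version you downloaded for the version numbers shown in the commands.
Step 1 − Extract all downloaded files:
The following command changes to the directory that holds the downloaded archive, where it will be extracted:
Command: cd Downloads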
Step 2 − Create soft links (shortcuts).
The following command is used to create shortcuts:
Command: ln -s ./Downloads/hadoop-2.7.2/ ./hadoop
Step 3 − Configure .bashrc
The following command opens .bashrc so that we can modify the PATH variable in the bash shell.
Command: vi ./.bashrc
The following lines add Hadoop to the PATH:
export HADOOP_HOME=/home/luck/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Step 4 − Configure Hadoop in Stand-alone mode:
The following command opens Hadoop’s hadoop-env.sh file for editing:
Command: vi ./hadoop/etc/hadoop/hadoop-env.sh
Set JAVA_HOME in this file:
export JAVA_HOME=/home/luck/jdk
Step 5 − Exit and re-open the command prompt
Step 6 − Run a Hadoop job on Standalone cluster
To test the installation, run the hadoop command without arguments. The usage message must be displayed.
Step 7 − Go to the directory where you downloaded the compressed Hadoop file and extract it from the terminal
Command: $ tar -xzvf hadoop-2.7.3.tar.gz
Step 8 − Go to the Hadoop distribution directory.
Command: $ cd /usr/local/hadoop
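The extract and soft-link steps above can be combined into one repeatable function. This is only a sketch: the name unpack_and_link is our own, and it assumes the archive holds a single top-level directory (such as hadoop-2.7.3/) and that its first entry is that directory.

```shell
# Unpack ARCHIVE into DEST_DIR and point LINK at the unpacked directory.
# Hypothetical helper, not part of Hadoop.
unpack_and_link() {
  local archive=$1 dest_dir=$2 link=$3 top
  mkdir -p "$dest_dir"
  tar -xzf "$archive" -C "$dest_dir"
  # First path component of the first archive entry, e.g. hadoop-2.7.3
  top=$(tar -tzf "$archive" | sed -n '1s|/.*||p')
  # -s: symbolic, -f: replace an existing link, -n: do not follow an old link
  ln -sfn "$dest_dir/$top" "$link"
}

# Example:
#   unpack_and_link ~/Downloads/hadoop-2.7.3.tar.gz ~/Downloads ~/hadoop
```

Upgrading later then only means unpacking the new archive and re-pointing the link, while HADOOP_HOME keeps referring to the same path.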
Final output after installation and configuration:
By default, Hadoop is configured in standalone mode and runs in non-distributed mode on a single physical system. If the setup is installed and configured properly, the hadoop version command displays a result similar to the following:
Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
The above result shows that Hadoop standalone mode setup is working.
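In a setup script, it can be handy to pull just the version number out of that output. The sketch below reads the output on standard input, so it can be tried without Hadoop installed. The function name parse_hadoop_version is our own.

```shell
# Print the version number from the first line of `hadoop version` output,
# read from standard input. Hypothetical helper, not part of Hadoop.
parse_hadoop_version() {
  # Keep only line 1 and capture the digits and dots after "Hadoop ".
  sed -n '1s/^Hadoop \([0-9][0-9.]*\).*$/\1/p'
}

# Example:
#   hadoop version | parse_hadoop_version
```

If the first line does not start with "Hadoop" followed by a version, the function prints nothing, which a script can treat as a failed installation check.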
The following XML files must be reconfigured in order to develop Hadoop programs in Java:
core-site.xml
mapred-site.xml
hdfs-site.xml
The core-site.xml file contains information such as the memory allocated for the file system, the port number used for the Hadoop instance, the size of the read/write buffers, and the memory limit for storing data.
Open core-site.xml with the following command and add the properties listed below between the <configuration> and </configuration> tags in this file.
Command: ~$ sudo gedit $HADOOP_HOME/etc/hadoop/core-site.xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>Parent directory for other temporary directories.</description>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.</description>
</property>
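Instead of editing the file in a graphical editor, the whole file can also be written from a script. The sketch below writes a minimal core-site.xml with the two properties shown above. The function name write_core_site and its three parameters are our own, and the function overwrites any existing core-site.xml in the target directory.

```shell
# Write a minimal core-site.xml into CONF_DIR.
# Usage: write_core_site CONF_DIR FS_URI TMP_DIR
# Hypothetical helper, not part of Hadoop; it replaces an existing file.
write_core_site() {
  local conf_dir=$1 fs_uri=$2 tmp_dir=$3
  cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>$tmp_dir</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>$fs_uri</value>
  </property>
</configuration>
EOF
}

# Example:
#   write_core_site "$HADOOP_HOME/etc/hadoop" hdfs://localhost:54310 /app/hadoop/tmp
```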
The mapred-site.xml file is used to specify the MapReduce framework currently in use.
Open the mapred-site.xml file with the following command and add the following properties between the <configuration> and </configuration> tags in this file.
Command: ~$ sudo gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml
<property>
  <name>mapreduce.jobtracker.address</name>
  <value>localhost:54311</value>
  <description>MapReduce job tracker runs at this host and port.</description>
</property>
The hdfs-site.xml file contains information such as the replication value, the NameNode path, and the DataNode paths on the local file systems, that is, the places where you want to store the Hadoop infrastructure.
Open the hdfs-site.xml file with the following command and add the properties listed below between the <configuration> and </configuration> tags.
Command: ~$ sudo gedit $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.</description>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/hduser_/hdfs</value>
</property>
The following are the steps to install Hadoop 2.4.1 in pseudo-distributed mode.
Step 1 − Setting Up Hadoop
Append the following Hadoop environment variables to the ~/.bashrc file:
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
Apply the changes to the system currently running.
Command: $ source ~/.bashrc
Step 2 − Hadoop Configuration
Hadoop configuration files are located at “$HADOOP_HOME/etc/hadoop”. Change those files as required to match your Hadoop infrastructure.
Command: $ cd $HADOOP_HOME/etc/hadoop
To develop Hadoop programs in Java, set the Java environment variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on the system.
Example:
Command: export JAVA_HOME=/usr/local/jdk1.7.0_71
The following XML files must be reconfigured in order to develop Hadoop programs in Java:
hdfs-site.xml:
The hdfs-site.xml file contains information such as the replication value, the NameNode path, and the DataNode paths on the local file systems, that is, the places where you want to store the Hadoop infrastructure.
Let’s assume:
dfs.replication (data replication value) = 1
namenode path = /home/Hadoop/Hadoopinfra/hdfs/namenode
(In the path mentioned above, Hadoop is the user name and Hadoopinfra/hdfs/namenode is the directory used by the HDFS file system.)
datanode path = /home/Hadoop/Hadoopinfra/hdfs/datanode
(In the path mentioned above, Hadoopinfra/hdfs/datanode is the directory used by the HDFS file system.)
Open the hdfs-site.xml file and add the properties listed below inside the <configuration> element. In this file, all the property values are user-defined and can be changed according to the Hadoop infrastructure.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/Hadoop/Hadoopinfra/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/Hadoop/Hadoopinfra/hdfs/datanode</value>
  </property>
</configuration>
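It can save a failed start later to create the two storage directories up front, as the Hadoop user, so that they exist and are writable. The following sketch does that and prints the matching file:// values for the configuration. The function name make_hdfs_dirs is our own, and BASE is assumed to be an absolute path.

```shell
# Create namenode/ and datanode/ under BASE and print the file:// URIs to use
# as the dfs.name.dir and dfs.data.dir values.
# Hypothetical helper, not part of Hadoop; BASE must be an absolute path.
make_hdfs_dirs() {
  local base=$1
  mkdir -p "$base/namenode" "$base/datanode"
  # printf repeats the format once per argument, giving one URI per line.
  printf 'file://%s\n' "$base/namenode" "$base/datanode"
}

# Example:
#   make_hdfs_dirs /home/Hadoop/Hadoopinfra/hdfs
```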
yarn-site.xml:
The yarn-site.xml file is used to configure YARN in the Hadoop environment.
Open the yarn-site.xml file and add the following property inside the <configuration> element.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
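mapred-site.xml:
This file is used to specify the MapReduce framework in use. Hadoop ships only a template of this file, so first copy mapred-site.xml.template to a mapred-site.xml file.
The following command is used to copy mapred-site.xml.template to mapred-site.xml:
Command: $ cp mapred-site.xml.template mapred-site.xml
Open the mapred-site.xml file and add the following property inside the <configuration> element.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>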
To configure a Hadoop cluster in fully-distributed mode, we need to configure all the master and slave machines. Although this mode differs from pseudo-distributed mode, the configuration method is the same.
The following are the steps to configure a Hadoop cluster in fully-distributed mode:
Step 1 − Setting Up Hadoop environment variables
The master and all the slaves must have the same user. On every node in the cluster, append the following export commands to the ~/.bashrc file:
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
export HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
export PATH=$HADOOP_PREFIX/bin:$JAVA_HOME/bin:$PATH
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export YARN_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
Apply the changes to the system currently running.
Command: $ source ~/.bashrc
Step 2: configuration
Add all the export commands listed below at the start of the etc/hadoop/yarn-env.sh script:
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
export HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
export PATH=$PATH:$HADOOP_PREFIX/bin:$JAVA_HOME/bin:.
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export YARN_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
Step 3: Create a folder for hadoop.tmp.dir
Create a temporary folder in HADOOP_PREFIX
Command: mkdir -p $HADOOP_PREFIX/tmp
Step 4: Tweak config files
On all the machines in the cluster, add the properties mentioned below inside the configuration tag of the corresponding file in the $HADOOP_PREFIX/etc/hadoop folder:
The following XML files must be reconfigured in order to develop Hadoop programs in Java:
core-site.xml
The core-site.xml file contains information such as the memory allocated for the file system, the port number used for the Hadoop instance, the size of the read/write buffers, and the memory limit for storing data.
Open the core-site.xml file and add the properties listed below between the <configuration> and </configuration> tags in this file.
<property>
  <name>fs.default.name</name>
  <value>hdfs://Master-Hostname:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/impadmin/hadoop-2.6.4/tmp</value>
</property>
hdfs-site.xml
The hdfs-site.xml file contains information such as the replication value, the NameNode path, and the DataNode paths on the local file systems, that is, the places where you want to store the Hadoop infrastructure.
Open the hdfs-site.xml file and add the properties listed below between the <configuration> and </configuration> tags. In this file, all the property values are user-defined and can be changed according to the Hadoop infrastructure.
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
mapred-site.xml:
The mapred-site.xml file is used to specify the MapReduce framework currently in use. Hadoop ships only a template of this file, so first copy mapred-site.xml.template to a mapred-site.xml file.
Open the mapred-site.xml file and add the following property between the <configuration> and </configuration> tags in this file.
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
yarn-site.xml:
The yarn-site.xml file is used to configure YARN in the Hadoop environment. Remember to replace “Master-Hostname” with the host name of the cluster’s master.
Open the yarn-site.xml file and add the following properties between the <configuration> and </configuration> tags in this file.
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>Master-Hostname:8025</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>Master-Hostname:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>Master-Hostname:8040</value>
</property>
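Because the Master-Hostname placeholder appears in several files, replacing it by hand on every machine invites typos. As a sketch, the replacement can be scripted as below. The function name set_master_hostname is our own, and the host name is assumed to contain only letters, digits, dots, and hyphens, so it is safe to use inside the sed expression.

```shell
# Replace the literal placeholder Master-Hostname with HOST in every
# *-site.xml file under CONF_DIR. Hypothetical helper, not part of Hadoop.
set_master_hostname() {
  local conf_dir=$1 host=$2 f
  for f in "$conf_dir"/*-site.xml; do
    # Skip the unexpanded pattern when no file matches.
    [ -f "$f" ] || continue
    # Write to a temporary file and move it into place, which avoids
    # relying on the non-portable sed -i option.
    sed "s/Master-Hostname/$host/g" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
  done
}

# Example:
#   set_master_hostname "$HADOOP_PREFIX/etc/hadoop" master01.example.com
```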
By following these steps, we can verify the Hadoop installation.
Step 1: Setting up NameNode
Set up the NameNode by formatting it with the “hdfs namenode -format” command:
$ cd ~
$ hdfs namenode -format
The expected result:
10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory /home/Hadoop/Hadoopinfra/hdfs/namenode has been formatted successfully.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/
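Formatting erases the existing file system metadata, so it should happen only once per cluster. A formatted NameNode directory contains a current/VERSION file, and the sketch below uses that to skip the format step on later runs. The function names are our own, and the directory argument is assumed to be the local path configured as the NameNode directory.

```shell
# Succeed if NAME_DIR already holds a formatted NameNode image.
# Hypothetical helpers, not part of Hadoop.
namenode_is_formatted() {
  [ -f "$1/current/VERSION" ]
}

# Run the format command only when NAME_DIR has not been formatted yet.
format_namenode_once() {
  local name_dir=$1
  if namenode_is_formatted "$name_dir"; then
    echo "NameNode directory $name_dir is already formatted; skipping."
  else
    hdfs namenode -format
  fi
}

# Example:
#   format_namenode_once /home/Hadoop/Hadoopinfra/hdfs/namenode
```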
Step 2 − Verifying Hadoop dfs
Running the start-dfs.sh script starts the Hadoop file system daemons.
Command: $ start-dfs.sh
Expected output:
10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/Hadoop/Hadoop 2.4.1/logs/Hadoop-Hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/Hadoop/Hadoop 2.4.1/logs/Hadoop-Hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]
Step 3 − Verifying Yarn Script
Running the start-yarn.sh script starts the YARN daemons.
Command: $ start-yarn.sh
Expected output:
starting yarn daemons
starting resourcemanager, logging to /home/Hadoop/Hadoop 2.4.1/logs/yarn-Hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/Hadoop/Hadoop 2.4.1/logs/yarn-Hadoop-nodemanager-localhost.out
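The start scripts only report that each daemon was launched, not that it stayed up. One common check is the JDK's jps tool, which lists running Java processes by class name. The sketch below reads jps output on standard input and prints the daemons that are missing on a pseudo-distributed node. The function name missing_daemons is our own, and it assumes that jps is on the PATH.

```shell
# Read `jps` output ("<pid> <name>" per line) on standard input and print
# each expected Hadoop daemon that is not listed.
# Hypothetical helper, not part of Hadoop.
missing_daemons() {
  local running d
  # Keep only the process names (second column).
  running=$(awk '{print $2}')
  for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # -x matches the whole line, so NameNode does not match SecondaryNameNode.
    if ! printf '%s\n' "$running" | grep -qx "$d"; then
      echo "$d"
    fi
  done
}

# Example:
#   jps | missing_daemons
```

An empty result means all five daemons are running. Any name printed points to the log file to inspect first.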
Step 4 − Accessing Hadoop on Web Browser
The NameNode web interface listens on port 50070 by default. Use the http://localhost:50070/ URL to reach the Hadoop services in a web browser. The image below shows the output of this URL on a web page:
Step 5 − Verify All Applications of Cluster
The default port number to access all applications of the cluster is 8088. Use the http://localhost:8088/ URL to visit this service. The image below shows the all-applications page in a web browser:
Hadoop Cluster in Facebook:
Facebook uses Hadoop clusters to store copies of internal log and dimension data sources, and as a source for analytics, machine learning, and reporting.
At present, Facebook holds two clusters. Every commodity node of the clusters has 8 cores and 12 TB of storage.
Facebook developers have used Hive to build a higher-level data warehousing framework and use the Java APIs for application development. Facebook has also developed a FUSE application based on HDFS.
Sample Cluster Configuration of Hadoop in Facebook:
The Hadoop cluster has a master-slave architecture. The image below represents the master-slave architecture of a Hadoop cluster:
Individual Configurations of Hadoop Cluster:
The above image explains the configuration of every node in a cluster.
NameNode: Requires a lot of RAM but does not require much hard disk space.
Secondary NameNode: Its memory requirement is not as high as that of the primary NameNode.
DataNodes: Each DataNode requires 16 GB of memory. Because DataNodes store the data, they need large hard disk capacity and have multiple drives.
Vaishnavi Putcha was born and brought up in Hyderabad. She works as a content contributor for the Mindmajix e-learning website and is passionate about writing blogs and articles on new technologies such as artificial intelligence, cryptography, data science, and innovations in software. She holds a Master's degree in Computer Science from VITS.