Nowadays, every business organization processes a big amount of data to perform its day to day activities. In such a scenario, it becomes very difficult to host or manage the data in a centralized location due to constraints like capacity and cost. But, with HDFS, this task can be accomplished easily. HDFS stands for Hadoop Distributed File System.
It is a Block-Structured file system which divides every file into blocks having a predetermined size. These blocks are stored over the cluster of single or multiple machines. HDFS follows the Master/Slave Architecture in which a cluster includes the Single NameNode known as Master Node and the other available nodes are known as DataNodes (Slaves Nodes). HDFS works upon the concept of storing large files in less number rather than storing small files in huge numbers.
Hadoop HDFS Architecture - Table of Contents
HDFS Architecture comprises Slave/Master Architecture where the Master is NameNode in which MetaData is stored and Slave is the DataNode in which actual data is stored. This architecture can be deployed over the broad spectrum of machines which support Java. However, a user can run the multiple DataNodes on a single machine. But, in the real world, the DataNodes are spread across the multiple machines. The architecture is explained in detail further in the article.
Hardware Failure is not an exception anymore. HDFS instance incorporates thousands of server machines and every machine stores part of data of file system. There are a large number of components which are prone to the machine or hardware failure. This simply indicates that there are always some non-functional components. Thus, the goal of HDFS Architecture is automatic and quick fault detection, and its recovery.
[Related Article: An Overview Of Hadoop Hive ]
HDFS applications require streaming access to the data sets. HDFS Architecture is designed basically to perform batch processing rather its interactive use by the users. The main focus is on the high data throughput instead of data access to low latency. It has a main focus on data retrieving the fastest speed during the log analysis.
HDFS usually works with big data sets. In HDFS, the standard size of file ranges from gigabytes to terabytes. The HDFS architecture is designed in such a manner that the huge amount of data can be stored and retrieved in an easy manner. HDFS must deliver a high data bandwidth and must be able to scale hundreds of nodes using a single cluster. The architecture must be efficient enough to handle tens of millions of files in just a single instance.
[Related Article: Hadoop Configuration With ECLIPSE ON Windows]
The model follows the write-once and read-many concept for files. Once the file has been created, written and closed must not be changed. This helps in resolving the data-coherency problem and allow to have high throughput data access. This model perfectly supports Web-Crawler applications and Mapreduce based application.
If an application performs the computation near the data it is operating, the results are more efficient. This process delivers more useful results when a user is dealing with big datasets. One of the biggest advantages of this is that it increases the system’s overall throughput and reduces network congestion. Thus, moving computation closer to data is more beneficial than moving the data to computation.
[Related Article: Installation Of Hadoop Eco-Systems]
HDFS supports the portable properties which are portable from one platform to another platform. It allows having widespread HDFS adoption. This platform is best to use when you have to deal with the big sets of data.
Hadoop Distributed File System 9HDFS) Architecture is a block-structured file system in which the division of file is done into the blocks having predetermined size. These blocks are stored on the different clusters. HDFS follows the master/slave architecture in which clusters comprise single NameNode referred to as Master Node and other nodes are referred to as the DataNodes or Slave Nodes. The deployment of HDFS architecture can be done over the broad spectrums of machines which support Java. However, one is capable of running different DataNodes on the single machine whereas, in reality, these DataNodes are present on the different machines.
HDFS allows a rapid data transfer between the computer nodes. At the initial stage, HDFS is closely coupled with MapReduce which is a programming Framework used for processing the data.
HDFS divides the information into separate blogs and distributes those blogs to various nodes present in the cluster. Thus, it enables efficient parallel processing.
HDFS architecture has high fault tolerance. The filesystem copies or replicates every piece of data multiple times and then distributes the copies to the different nodes. This makes sure that, if the data present on a node crashes, it can be easily found somewhere else within the cluster. This allows you to have regular data processing during the data recovery.
Hdfs architecture works on the master/slave architecture. At the initial stage, every Hadoop cluster consisted of a single NameNode that manages the file system operations and support data nodes that manage data storage on the particular computer nodes. HDFS elements are combined together for supporting application having big data sets.
Hdfs architecture NameNode is also known as the master node. HDFS NameNode stores the metadata. For example - a number of replicas, data blocks, and other information. This metadata is present in the master memory for a quick data retrieval. NameNode manages and maintains the slaves' nodes and is responsible for assigning tasks to them. NameNode must be deployed over the reliable hardware as it works as a centerpiece od HDFS of architecture.
Data nodes represent the slave nodes in HDFS architecture. Unlike the NameNodes, DataNode is basically commodity hardware which is a non-expensive system that is not of high availability or high quality. DataNodes stores the actual data in HDFS and execute the read and write operation according to the request of the client. It can also deploy commodity hardware.
DataNode is a block server which stores the data in local files such as ext3 or ext4.
[Related Page: Understanding Data Parallelism in MapReduce]
[Related Page: Prerequisites for Learning Hadoop]
When a NameNode starts in HDFS, firstly it read the state of HDFS from FsImage ( image file). later, from the edit log file, it applies the edits and then, NameNode write the new state of HDFS to the image file. After that, it will start the normal operations with an empty edits file. During the startup time, NameNode merge the edits files and image files, so that edit log file becomes larger with time. The limitation of big edits files is that the upcoming NameNode restart will take a long time.
Secondary NameNode resolves this problem. Secondary NameNode downloads the FsImage and EditLogs from the NameNode and later it merge the FSImage and EditNodes together. It keeps the size of Edit Logs within the limit and uses the persistent storage for storing the modified FsImage. This can be later used in the case of failure of the NameNode.
[Related Article: Introduction to HBase for Hadoop]
Checkpoint node is the node that creates the Namespace checkpoints on a periodical basis. Hadoop Checkpoint node will first download the Edits and FsImage from the Active Namenode and then it combines the both ( EditLogs and FsImage). At last, it will upload the new image to the NameNode. It maintains the latest checkpoint in the directory whose structure is similar to the directory of NameNode. This allows the checkpoint image to be present always for NameNode for reading if required.
[Related Page: Leading Hadoop Vendors in BigData]
Backup Node in HDFS provides similar checkpointing functionality as the Checkpoint node provides. In the Hadoop, the Backup node is used to store an in-memory and updated file system namespace copy. The backup node is always synchronized with an active NameNode space.
In the HDFS Architecture, Backup Node does not require to download the edit files and FsImage from the active NameNode for creating the checkpoint. It has already an updated state of name state in the memory. The process of Backup Node checkpoint is more efficient it as it only requires to save the namespace into the local FsImage files and reset the edits.
NameNode supports only a single Backup Node at a particular interval of time.
Frequently Asked Hadoop Interview Questions
As we all know that, in HDFS Architecture Data is stored on the different Data blocks which are known as Blocks.
A block is the smallest continuous location present on the hard drive where you have stored the data. In other words, In the file system, data is stored as the collection of blocks. Similarly, in the HDFS, files are stored in the form of blocks which are shared across the Apache Hadoop Cluster.
By default, the size of every Block is 128 MB in the Apache Hadoop 2.x ((64 MB in Apache Hadoop 1.x). You can configure this according to your requirements.
It is not mandatory that, every file in HDFS is stored in the exact multiples of configured block size like 128 MB, 256 MB and many more.
[Related Page: Hadoop Jobs Salary Trends in the USA ]
For example - You have a file called 'example.txt' of size 514 MB as displayed in the picture below and you use the default configuration of block size i.e 128 MB.
example.txt | ||||
514 MB | ||||
a |
b |
c |
d |
e |
128MB |
128MB |
128MB |
128MB |
2MB |
A number of blocks in the file: First four blocks will of size- 128 MB and the size of the last block will be of 2MB size only.
Now you might be thinking that why you require to have a big Block size of 128 MB?
If you talk about the HDFS, you cannot ignore the big data sets which have data in Terabytes or Petabytes of data. So, let's say, you have a block size of 4 KB, in the Linux File System. In this case, you will be having too much data and a huge number of blocks. thus, it will become very difficult to manage these data and blocks. Thus, a Big block size is required.
[Related Page: Hadoop HDFS Commands]
HDFS architecture offers the most efficient medium of storing the huge amount of data in the distributed environment in the form of data blocks. These blocks are then replicated for providing fault tolerance.
By default, Replication Factor = 3. Thus, you can configure it.
Below the figure is provided in which every block is replicated three times and stored on the different data node ( it considers the default replication factor).
Thus, if you store a file of size 128 MB in the HDFS by using the default configuration, you will consume the space up to 3*128 MB which is equal to 384 MB. The replication of each block is done three time and each replica resides on the different node.
Important Note- The NameNode collects the Block report from DataNode on regular basis for maintaining the replication factor. Thus, In case, the block is under-replicated or over-replicated the NameNode add or delete the replicas according to the requirement.
[Related Page: Cloudera Hadoop Certification]
In a large Hadoop cluster, for managing the traffic on network during the read/write operations in the HDFS file, NameNode selects the DataNode that is present closer to the similar or nearby rack to Read/Write request.
The Rack information is extracted by the NameNode through maintaining the rack ids of every DataNode. Hadoop Rack Awareness is the concept with select the DataNodes according to the rack information
NameNode always makes sure that all the replicas are not being stored in the single rack or the similar rack. It follows the In-built Rack Awareness Algorithm for reducing the latency and providing fault tolerance.
As the replication factor is 3, according to the Rack Awareness Algorithm, first block replica will be stored on the distinct (remote) rack but, on the different DataNode present within that (remote) rack as mentioned in the figure given below. In case, there are more replicas, the left replicas will be placed on some random Data Nodes provided not more than 2 replicas residing on a similar rack ( if it is possible).
This is the actual representation of Hadoop production cluster. Here, the user has multiple racks populated with different DataNodes.
[Related Page: Reasons to Learn to Hadoop & Hadoop Administration]
Following are the reasons why you need the Rack Awareness Algorithm:
[Related Page: Hadoop Ecosystem]
When the client wants to write a file to HDFS Architecture, it will interact with the NameNode for metadata. After that, NameNode reverts back with the block numbers, replicas, their location, and the other information. Depending upon the information from the NameNode, the client divides the file into different blocks. Later, the client starts sending the blocks to the first DataNode.
First of all, the client will send the Block A to DataNode 1 with the details of the other two DataNodes. When this block is delivered to the DataNode 1, the DataNode 1 share this block with DataNode 2 present on this rack.
As both the DataNodes are present in a similar rack, so the block transfer is done via a switch. Now, the DataNode 2 will copy the same block to the DataNode 3.
As both of the DataNodes are present in the different racks, block transfer takes place through the out-of-rack switch.
When the DataNode receives the block from the client, it sends a write confirmation to the NameNode.
This process is repeated for every block created for the file.
[Related Article: Big Data Analytics]
To read from the HDFS Architecture, first of all, the client interacts with the NameNode for metadata. The client gets the details like files name and their location from the NameNode.
The NameNode reverts with the number of blocks, replicas, their location, and the other information.
Now the client will interact with the DataNodes. The client will start reading the data from the DataNodes depending upon the information received from the NameNodes. After receiving all the blocks of the file, declined or application will combine these blocks together in the form of the original file.
[Related Page: Hadoop Installation and Configuration]
Following are the Key features supported by the HDFS:
[Related Page: MapReduce In Bigdata ]
The aforementioned information is providing the complete details about HDFS Architecture. With the help of NameNode and DataNode, a large amount of data and files can be stored in a very reliable manner across the different machines present in a big cluster. Thus, it becomes easy for the organization to manage its big amount of data without compromising its resources. HDFS stores the files in a sequence of blocks and the block replication helps in the tolerance of faults. It offers higher data availability, as the data is available all the time despite any hardware failure. If any hardware or machines crash, a user can access the data through some other path.
Hadoop Administration | MapReduce |
Big Data On AWS | Informatica Big Data Integration |
Bigdata Greenplum DBA | Informatica Big Data Edition |
Hadoop Hive | Impala |
Hadoop Testing | Apache Mahout |
Our work-support plans provide precise options as per your project tasks. Whether you are a newbie or an experienced professional seeking assistance in completing project tasks, we are here with the following plans to meet your custom needs:
Name | Dates | |
---|---|---|
Hadoop Training | Nov 23 to Dec 08 | View Details |
Hadoop Training | Nov 26 to Dec 11 | View Details |
Hadoop Training | Nov 30 to Dec 15 | View Details |
Hadoop Training | Dec 03 to Dec 18 | View Details |
Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.