What is Hadoop HDFS?
Nowadays, every business organization processes a big amount of data to perform its day to day activities. In such a scenario, it becomes very difficult to host or manage the data in a centralized location due to constraints like capacity and cost. But, with HDFS, this task can be accomplished easily. HDFS stands for Hadoop Distributed File System.
It is a Block-Structured file system which divides every file into blocks having a predetermined size. These blocks are stored over the cluster of single or multiple machines. HDFS follows the Master/Slave Architecture in which a cluster includes the Single NameNode known as Master Node and the other available nodes are known as DataNodes (Slaves Nodes). HDFS works upon the concept of storing large files in less number rather than storing small files in huge numbers.
Introduction to Hadoop HDFS Architecture
HDFS Architecture comprises Slave/Master Architecture where the Master is NameNode in which MetaData is stored and Slave is the DataNode in which actual data is stored. This architecture can be deployed over the broad spectrum of machines which support Java. However, a user can run the multiple DataNodes on a single machine. But, in the real world, the DataNodes are spread across the multiple machines. The architecture is explained in detail further in the article.
HDFS Assumption and Goals
Hardware Failure is not an exception anymore. HDFS instance incorporates thousands of server machines and every machine stores part of data of file system. There are a large number of components which are prone to the machine or hardware failure. This simply indicates that there are always some non-functional components. Thus, the goal of HDFS Architecture is automatic and quick fault detection, and its recovery.
[Related Article: An Overview Of Hadoop Hive ]
Streaming data access
HDFS applications require streaming access to the data sets. HDFS Architecture is designed basically to perform batch processing rather its interactive use by the users. The main focus is on the high data throughput instead of data access to low latency. It has a main focus on data retrieving the fastest speed during the log analysis.
HDFS usually works with big data sets. In HDFS, the standard size of file ranges from gigabytes to terabytes. The HDFS architecture is designed in such a manner that the huge amount of data can be stored and retrieved in an easy manner. HDFS must deliver a high data bandwidth and must be able to scale hundreds of nodes using a single cluster. The architecture must be efficient enough to handle tens of millions of files in just a single instance.
[Related Article: Hadoop Configuration With ECLIPSE ON Windows]
Simple coherency model
The model follows the write-once and read-many concept for files. Once the file has been created, written and closed must not be changed. This helps in resolving the data-coherency problem and allow to have high throughput data access. This model perfectly supports Web-Crawler applications and Mapreduce based application.
Moving computation is cheaper than moving data
If an application performs the computation near the data it is operating, the results are more efficient. This process delivers more useful results when a user is dealing with big datasets. One of the biggest advantages of this is that it increases the system’s overall throughput and reduces network congestion. Thus, moving computation closer to data is more beneficial than moving the data to computation.
[Related Article: Installation Of Hadoop Eco-Systems]
Portability across heterogeneous hardware and software platforms
HDFS supports the portable properties which are portable from one platform to another platform. It allows having widespread HDFS adoption. This platform is best to use when you have to deal with the big sets of data.
What is HDFS Architecture?
Hadoop Distributed File System 9HDFS) Architecture is a block-structured file system in which the division of file is done into the blocks having predetermined size. These blocks are stored on the different clusters. HDFS follows the master/slave architecture in which clusters comprise single NameNode referred to as Master Node and other nodes are referred to as the DataNodes or Slave Nodes. The deployment of HDFS architecture can be done over the broad spectrums of machines which support Java. However, one is capable of running different DataNodes on the single machine whereas, in reality, these DataNodes are present on the different machines.
[Related Article: Hadoop HBase Schema and Versioning ]
Working of HDFS Architecture
HDFS allows a rapid data transfer between the computer nodes. At the initial stage, HDFS is closely coupled with MapReduce which is a programming Framework used for processing the data.
HDFS divides the information into separate blogs and distributes those blogs to various nodes present in the cluster. Thus, it enables efficient parallel processing.
HDFS architecture has high fault tolerance. The filesystem copies or replicates every piece of data multiple times and then distributes the copies to the different nodes. This makes sure that, if the data present on a node crashes, it can be easily found somewhere else within the cluster. This allows you to have regular data processing during the data recovery.
Hdfs architecture works on the master/slave architecture. At the initial stage, every Hadoop cluster consisted of a single NameNode that manages the file system operations and support data nodes that manage data storage on the particular computer nodes. HDFS elements are combined together for supporting application having big data sets.
What is HDFS NameNode?
Hdfs architecture NameNode is also known as the master node. HDFS NameNode stores the metadata. For example - a number of replicas, data blocks, and other information. This metadata is present in the master memory for a quick data retrieval. NameNode manages and maintains the slaves' nodes and is responsible for assigning tasks to them. NameNode must be deployed over the reliable hardware as it works as a centerpiece od HDFS of architecture.
Different tasks managed by NameNode
- It is responsible for managing the file system namespace.
- It regulates the access of customers to the files.
- It is liable to take care of replication factors of the different blogs.
- It helps in executing the file system execution. For example- opening, closing, naming the files and directories.
- All the data node sends the block reports and Heartbeat to the NameNode present in the Hadoop cluster. This is done to make sure that DataNodes are still alive. A block report consists of the list of all the blocks on the DataNodes.
- It records each and every modification made in the filesystem metadata. For example - if any of the files are deleted in HDFS, the Namenode will immediately record the modification in the Edit logs.
- It keeps the track of each block present in the HDFS and the Nodes in which these blocks are located.
Following are the files present in the NameNode metadata:
- FsImage: FsImage file present in the NameNode is an 'Image File'. This file consists of an entire file system namespace and stored in the form of a file in the local file system of the NameNode. It consists of a serialized form of all the files and directory inodes present in the file system. Every inode is an internal representation of the directory or file metadata.
- Edit logs: Edit Logs consists of all the latest advancement made to the file system on the latest FsImage. The client sends create/update/delete request to the NameNode and later, this request is first recorded to the edits files.
[Related Page: Hadoop Heartbeat and Data Block Rebalancing]
What is HDFS DataNode?
Data nodes represent the slave nodes in HDFS architecture. Unlike the NameNodes, DataNode is basically commodity hardware which is a non-expensive system that is not of high availability or high quality. DataNodes stores the actual data in HDFS and execute the read and write operation according to the request of the client. It can also deploy commodity hardware.
DataNode is a block server which stores the data in local files such as ext3 or ext4.
[Related Page: Understanding Data Parallelism in MapReduce]
Different functions performed by DataNodes.
- DataNodes are the slaves or daemons which execute on every slave machine.
- DataNodes store the actual data.
- Block replica creation, updating, deletion, and replication is done as per the NameNode instructions.
- DataNode performs read and write request that is of low level from the client's file system.
- DataNodes send the heartbeats to NameNode on a regular interval of time to inform about the overall health of HDFS, this frequency is set to three seconds.
[Related Page: Prerequisites for Learning Hadoop]
What is the Secondary NameNode?
Subscribe to our youtube channel to get new updates..!
When a NameNode starts in HDFS, firstly it read the state of HDFS from FsImage ( image file). later, from the edit log file, it applies the edits and then, NameNode write the new state of HDFS to the image file. After that, it will start the normal operations with an empty edits file. During the startup time, NameNode merge the edits files and image files, so that edit log file becomes larger with time. The limitation of big edits files is that the upcoming NameNode restart will take a long time.
Secondary NameNode resolves this problem. Secondary NameNode downloads the FsImage and EditLogs from the NameNode and later it merge the FSImage and EditNodes together. It keeps the size of Edit Logs within the limit and uses the persistent storage for storing the modified FsImage. This can be later used in the case of failure of the NameNode.
[Related Article: Introduction to HBase for Hadoop]
Secondary NameNode performs the following functions:
- Secondary NameNode is responsible for reading all the file system and metadata from the RAM of the NameNode. Then, it writes it into the file system or hard disk.
- Secondary NameNode is responsible to combine the FsImage and Edit Logs together from the NameNode.
- Secondary NameNode perform a regular checkpoint in the HDFS. So, It is also referred to as checkpoint NameNode.
What is Checkpoint Node?
Checkpoint node is the node that creates the Namespace checkpoints on a periodical basis. Hadoop Checkpoint node will first download the Edits and FsImage from the Active Namenode and then it combines the both ( EditLogs and FsImage). At last, it will upload the new image to the NameNode. It maintains the latest checkpoint in the directory whose structure is similar to the directory of NameNode. This allows the checkpoint image to be present always for NameNode for reading if required.
[Related Page: Leading Hadoop Vendors in BigData]
What is a Backup Node?
Backup Node in HDFS provides similar checkpointing functionality as the Checkpoint node provides. In the Hadoop, the Backup node is used to store an in-memory and updated file system namespace copy. The backup node is always synchronized with an active NameNode space.
In the HDFS Architecture, Backup Node does not require to download the edit files and FsImage from the active NameNode for creating the checkpoint. It has already an updated state of name state in the memory. The process of Backup Node checkpoint is more efficient it as it only requires to save the namespace into the local FsImage files and reset the edits.
NameNode supports only a single Backup Node at a particular interval of time.
What are Blocks in HDFS Architecture?
As we all know that, in HDFS Architecture Data is stored on the different Data blocks which are known as Blocks.
A block is the smallest continuous location present on the hard drive where you have stored the data. In other words, In the file system, data is stored as the collection of blocks. Similarly, in the HDFS, files are stored in the form of blocks which are shared across the Apache Hadoop Cluster.
By default, the size of every Block is 128 MB in the Apache Hadoop 2.x ((64 MB in Apache Hadoop 1.x). You can configure this according to your requirements.
It is not mandatory that, every file in HDFS is stored in the exact multiples of configured block size like 128 MB, 256 MB and many more.
[Related Page: Hadoop Jobs Salary Trends in the USA ]
For example - You have a file called 'example.txt' of size 514 MB as displayed in the picture below and you use the default configuration of block size i.e 128 MB.
A number of blocks in the file: First four blocks will of size- 128 MB and the size of the last block will be of 2MB size only.
Now you might be thinking that why you require to have a big Block size of 128 MB?
If you talk about the HDFS, you cannot ignore the big data sets which have data in Terabytes or Petabytes of data. So, let's say, you have a block size of 4 KB, in the Linux File System. In this case, you will be having too much data and a huge number of blocks. thus, it will become very difficult to manage these data and blocks. Thus, a Big block size is required.
[Related Page: Hadoop HDFS Commands]
What is Replication Management?
HDFS architecture offers the most efficient medium of storing the huge amount of data in the distributed environment in the form of data blocks. These blocks are then replicated for providing fault tolerance.
By default, Replication Factor = 3. Thus, you can configure it.
Below the figure is provided in which every block is replicated three times and stored on the different data node ( it considers the default replication factor).
Thus, if you store a file of size 128 MB in the HDFS by using the default configuration, you will consume the space up to 3*128 MB which is equal to 384 MB. The replication of each block is done three time and each replica resides on the different node.
Important Note- The NameNode collects the Block report from DataNode on regular basis for maintaining the replication factor. Thus, In case, the block is under-replicated or over-replicated the NameNode add or delete the replicas according to the requirement.
[Related Page: Cloudera Hadoop Certification]
What is Rack Awareness in HDFS Architecture?
In a large Hadoop cluster, for managing the traffic on network during the read/write operations in the HDFS file, NameNode selects the DataNode that is present closer to the similar or nearby rack to Read/Write request.
The Rack information is extracted by the NameNode through maintaining the rack ids of every DataNode. Hadoop Rack Awareness is the concept with select the DataNodes according to the rack information
NameNode always makes sure that all the replicas are not being stored in the single rack or the similar rack. It follows the In-built Rack Awareness Algorithm for reducing the latency and providing fault tolerance.
[Related Page: What is Pig Latin Hadoop?]
As the replication factor is 3, according to the Rack Awareness Algorithm, first block replica will be stored on the distinct (remote) rack but, on the different DataNode present within that (remote) rack as mentioned in the figure given below. In case, there are more replicas, the left replicas will be placed on some random Data Nodes provided not more than 2 replicas residing on a similar rack ( if it is possible).
This is the actual representation of Hadoop production cluster. Here, the user has multiple racks populated with different DataNodes.
Rack Awareness is necessary for improving:
- Data reliability and high availability.
- The cluster performance.
- Network Bandwidth.
[Related Page: Reasons to Learn to Hadoop & Hadoop Administration]
Advantages of Rack Awareness
Following are the reasons why you need the Rack Awareness Algorithm:
- For improving the network performance: The interaction between the nodes present on the distinct racks is directed through a switch. In other words, there will be increased bandwidth between the Machines present on a similar rack as compared to the Machines present on the distinct racks. Thus, the Rack Awareness allows you to have reduced right traffic between the different racks and provide you an improved write performance. In addition, you will be having an improved read performance as you are using the bandwidth of different racks.
- For preventing the data loss: There is no need to get stressed about the data even if the entire rack failure occurs as a due to a power failure or a switch failure as the Rack Awareness Architecture will manage them all.
[Related Page: Hadoop Ecosystem]
HDFS Read/Write Operation
When the client wants to write a file to HDFS Architecture, it will interact with the NameNode for metadata. After that, NameNode reverts back with the block numbers, replicas, their location, and the other information. Depending upon the information from the NameNode, the client divides the file into different blocks. Later, the client starts sending the blocks to the first DataNode.
First of all, the client will send the Block A to DataNode 1 with the details of the other two DataNodes. When this block is delivered to the DataNode 1, the DataNode 1 share this block with DataNode 2 present on this rack.
As both the DataNodes are present in a similar rack, so the block transfer is done via a switch. Now, the DataNode 2 will copy the same block to the DataNode 3.
As both of the DataNodes are present in the different racks, block transfer takes place through the out-of-rack switch.
When the DataNode receives the block from the client, it sends a write confirmation to the NameNode.
This process is repeated for every block created for the file.
[Related Article: Big Data Analytics]
To read from the HDFS Architecture, first of all, the client interacts with the NameNode for metadata. The client gets the details like files name and their location from the NameNode.
The NameNode reverts with the number of blocks, replicas, their location, and the other information.
Now the client will interact with the DataNodes. The client will start reading the data from the DataNodes depending upon the information received from the NameNodes. After receiving all the blocks of the file, declined or application will combine these blocks together in the form of the original file.
[Related Page: Hadoop Installation and Configuration]
HDFS key features
Following are the Key features supported by the HDFS:
- Fault tolerance: In Hadoop HDFS, fault tolerance helps to strengthen the system by tackling the unfavorable conditions. HDFS Framework divides the data into different blocks and after that, it creates multiple copies of the different blocks and shares it with different machines present in the cluster. So, if any machine stops working, the client will easily able to access the data from another machine which contains similar data blocks copy.
- High availability: Hadoop Architecture is a highly available file system. Data replication is done among the nodes present in the Hadoop cluster by making a replica of blocks on the other Data Nodes present in the HDFS cluster. Thus, when a client wants to access the data, they can access it easily through Data Nodes which contains its blocks. During the unfavorable conditions like Node failure, a user can access the data easily from some other node. This is because of the replicated copies of blocks at present on the different nodes.
- Highly reliable: HDFS offers highly reliable storage of data. It is capable of storing data in the range of hundreds of petabytes. Data Storage is done on the cluster in a very reliable manner. Division of data is done into the blocks. Hadoop framework stores the blocks on the nodes present in the HDFS cluster.
- HDFS stores the data in a very reliable manner by creating the replica of blocks available in the cluster. Thus, it has a high fault tolerance. If one node fails, data can be accessed from the other node. By default, HDFS creates the 3 replicas of every block available in the nodes. Thus, the data is easily available for the users.
- Replication: Data Replication resolves the data loss issue in case of node crashing and hardware failure. HDFS maintains the replication process at regular intervals as creates the replicated copies of data on the different machines.
- Scalability: As HDFS stores the data on multiple nodes present in the cluster. So, if any requirement comes, a client can scale the cluster. The two scalability mechanism are followed: Vertical Scalability and Horizontal Scalability.
- Distributed data storage: All the features are accessible through distributed storage as in HDFS data is stored in the distributed manner across the different nodes. Data division is done in the form of blocks and stored on the nodes. So, easily accessible by the user.
[Related Page: MapReduce In Bigdata ]
The aforementioned information is providing the complete details about HDFS Architecture. With the help of NameNode and DataNode, a large amount of data and files can be stored in a very reliable manner across the different machines present in a big cluster. Thus, it becomes easy for the organization to manage its big amount of data without compromising its resources. HDFS stores the files in a sequence of blocks and the block replication helps in the tolerance of faults. It offers higher data availability, as the data is available all the time despite any hardware failure. If any hardware or machines crash, a user can access the data through some other path.