Introduction to HDFS (Hadoop Distributed File System)
What is HDFS?
HDFS stands for Hadoop Distributed File System. It is a core component of the Hadoop framework and can store and serve many files at the same time.
HDFS is the component of the Hadoop architecture that takes care of data storage. The storage is spread across many machines, which reduces cost and increases reliability.
In this article we will briefly discuss the components that will help you understand HDFS better, starting with its architecture.
Introduction to HDFS Architecture:
In this section of the article, we will discuss the HDFS architecture in detail.
- HDFS is a block-structured file system. Within this system, every file is divided into blocks.
- The blocks are stored across the machines of a cluster.
- HDFS follows a master/slave architecture.
- Within this architecture, the NameNode is the master node and all the DataNodes are slaves.
The above figure shows the NameNode and its connections to the associated DataNodes.
- The NameNode is the master node in the HDFS architecture.
- It manages all the DataNodes.
- It controls access to files by clients.
- The architecture is designed so that user data is stored only on DataNodes, never on the NameNode.
Functions of the NameNode:
- It stores the metadata of every file in the cluster, for example the location of the blocks, the file size, the file permissions, and the directory hierarchy.
- The metadata is persisted in two files, the FsImage and the EditLog.
- Every transaction or activity is recorded by the NameNode. For example, if a file is deleted from HDFS, the deletion is recorded in the EditLog.
- The NameNode is responsible for maintaining the replication factor of every block.
- When a DataNode fails, the NameNode chooses new DataNodes for the lost replicas and manages the communication traffic accordingly.
- DataNodes are the slaves in HDFS.
- A DataNode is a server that is primarily used to store data.
- The data is stored on the DataNode's local file system, such as ext3 or ext4.
Functions of DataNode:
- All low-level read and write requests from clients are performed on the DataNodes.
- The DataNodes send periodic heartbeats to the NameNode to report the overall health of HDFS; the default heartbeat interval is 3 seconds. The default replication factor is also 3.
- All user data is stored on the DataNodes only.
- DataNodes act as slaves within the HDFS architecture.
So far we have gone through the functions carried out by the NameNode and the DataNodes. Every operation goes through the NameNode, so if the NameNode fails, the entire system is down.
To reduce the impact of such a failure, HDFS has a concept called the “Secondary NameNode”. It keeps a recent checkpoint of the file system metadata, which shortens NameNode restarts and helps recovery.
In this section, we will go through the concept of “Secondary NameNode” and the functions associated.
- The Secondary NameNode runs alongside the primary NameNode as a helper.
- The Secondary NameNode is not a backup NameNode and does not take over if the primary NameNode fails. It keeps a recent copy of the metadata held by the primary NameNode.
- This copy helps the NameNode restart quickly and limits how much metadata can be lost.
The Secondary NameNode first downloads the latest FsImage and EditLogs from the NameNode and stores them in its own FsImage folder. The process is shown in the above screenshot.
- All the activities recorded in the EditLogs are merged into the FsImage.
- The merged FsImage is copied back to the NameNode, which uses it the next time it restarts.
- The Secondary NameNode performs these checkpoints at regular intervals. Because of this role, it is also called the CheckpointNode.
All the information is stored in the form of blocks and is internally available within the DataNodes.
- A block is a contiguous area on the hard drive where data is stored.
- As in any standard file system, all the data is stored in the form of blocks.
- The default block size is 128 MB in Hadoop 2.x and 64 MB in Hadoop 1.x.
- The block size can be configured as needed.
The above screenshot shows a file of 514 MB stored in the form of blocks. To store the file, HDFS creates 5 blocks across which the data is distributed. The number of blocks is obtained by dividing the file size by the block size (in this case the default of 128 MB) and rounding up.
So the 514 MB file is divided into 4 full blocks of 128 MB each, and the remaining 2 MB goes into a fifth block. The last block occupies only 2 MB on disk, not a full 128 MB.
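The block arithmetic can be checked with a few lines of Python. This is only an illustrative sketch: the function name `split_into_blocks` is made up for this example and is not part of any Hadoop API.

```python
from typing import List


def split_into_blocks(file_size_mb: int, block_size_mb: int = 128) -> List[int]:
    """Return the size in MB of each block needed for a file of the given size."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


print(split_into_blocks(514))  # [128, 128, 128, 128, 2]
```

With a 64 MB block size (the Hadoop 1.x default), the same file would need 9 blocks: eight of 64 MB and one of 2 MB.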
In this section of the article, we will go through the replication management process. The replication management process works based on the inputs received from the DataNode.
- HDFS stores huge amounts of data in a distributed environment in the form of data blocks.
- The blocks are replicated to provide fault tolerance.
- The replication factor is configurable; the default is 3.
With the default setting, each data block is replicated three times and the replicas are stored on different DataNodes. The block representation is shown in the above screenshot.
So in general, if you store a 128 MB file in HDFS with the default replication factor, it occupies 384 MB of raw storage (i.e. 3 × 128 MB).
Every replica of a block resides on a different DataNode.
The NameNode regularly collects block reports from the DataNodes and uses them to maintain the replication factor. If a block is under-replicated or over-replicated, the NameNode adds or deletes replicas as needed.
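As a rough illustration of this bookkeeping, the sketch below computes the raw storage a file occupies and, from a simplified block report, how many replicas would need to be added or deleted per block. The function names and the report format (block ID mapped to a list of DataNode names) are assumptions made for this example; the real NameNode logic is considerably more involved.

```python
from typing import Dict, List


def raw_storage_mb(file_size_mb: int, replication_factor: int = 3) -> int:
    """Raw cluster storage consumed by a file once all replicas are written."""
    return file_size_mb * replication_factor


def replication_actions(block_reports: Dict[str, List[str]],
                        replication_factor: int = 3) -> Dict[str, int]:
    """Map each mis-replicated block to replicas to add (+) or delete (-)."""
    actions = {}
    for block_id, datanodes in block_reports.items():
        delta = replication_factor - len(set(datanodes))
        if delta != 0:
            actions[block_id] = delta
    return actions


reports = {
    "blk_1": ["dn1", "dn2", "dn3"],          # correctly replicated
    "blk_2": ["dn1", "dn4"],                 # under-replicated
    "blk_3": ["dn1", "dn2", "dn3", "dn4"],   # over-replicated
}
print(raw_storage_mb(128))           # 384
print(replication_actions(reports))  # {'blk_2': 1, 'blk_3': -1}
```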
We already know that the NameNode collects all the block information from the DataNodes and decides how the blocks are replicated. In the same way, the NameNode makes sure that the replicas of a block are not all stored on a single rack. The replicas are allocated based on the Rack Awareness Algorithm.
What does Rack Awareness Algorithm do?
The Rack Awareness Algorithm reduces latency and provides fault tolerance. With the default replication factor of 3, the first replica of a block is stored on a DataNode in the local rack, and the next two replicas are stored on two different DataNodes in a different (remote) rack. The same is displayed in the image below.
If the replication factor is higher than the default, the additional replicas are spread across the racks, and no DataNode holds more than one replica of the same block.
The image below shows the typical structure of a Hadoop production cluster. In this structure you can see multiple racks, each populated with DataNodes.
The DataNodes in each rack are connected to a rack switch, and the rack switches are connected to one another through core switches. The connections are shown in the above screenshot.
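A minimal sketch of rack-aware placement for a replication factor of 3 follows. It assumes the commonly documented default policy (first replica on the writer's own DataNode, the other two on two different DataNodes of one remote rack). The topology dictionary and the function `place_replicas` are hypothetical, and the sketch picks nodes deterministically, whereas a real cluster chooses among eligible nodes at random and also considers load.

```python
from typing import Dict, List


def place_replicas(topology: Dict[str, List[str]], writer_node: str) -> List[str]:
    """Pick three DataNodes: the writer's node, then two nodes on one remote rack."""
    local_rack = next(rack for rack, nodes in topology.items() if writer_node in nodes)
    # A remote rack needs at least two DataNodes to hold replicas 2 and 3.
    remote_racks = sorted(rack for rack, nodes in topology.items()
                          if rack != local_rack and len(nodes) >= 2)
    if not remote_racks:
        raise ValueError("need a remote rack with at least two DataNodes")
    second, third = sorted(topology[remote_racks[0]])[:2]
    return [writer_node, second, third]


topology = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5", "dn6"],
}
print(place_replicas(topology, "dn1"))  # ['dn1', 'dn3', 'dn4']
```

With this layout, losing all of rack1 or all of rack2 still leaves at least one replica of the block reachable.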
Advantages of Rack Awareness
In this section of the article, we will go through the advantages of the Rack Awareness Algorithm in detail.
- Improves the network performance
- Prevents data loss
Network performance is improved
- Nodes that reside on different racks communicate through switches, so traffic between racks has less bandwidth available than traffic within a rack.
- Rack awareness reduces the write traffic between racks, resulting in better write performance.
Prevents data loss
- The replicas of the data are kept on different racks. So in the event of a power failure or switch failure on one rack, the data is still available from another rack.
HDFS Read/ Write Architecture
In this section of the article, we will discuss the Read and Write operations briefly.
HDFS follows a write-once, read-many model.
So a file that has been written to HDFS cannot be modified in place, but it can be read any number of times.
Now let us go through an example that depicts the process of writing a file into HDFS.
HDFS Write Architecture
Let us consider a situation where the user needs to write a file of 248 MB into HDFS.
The file is named “example.txt”.
The file size is 248 MB.
To store the 248 MB file, data blocks are created.
With the default setting, the block size is 128 MB, so two data blocks are needed to store the file.
Block A will hold 128 MB.
Block B will hold the rest, i.e. 120 MB.
- The HDFS client first contacts the NameNode with a write request for the two blocks, i.e. Block A and Block B.
- The NameNode grants the write permission and returns the IP addresses of the DataNodes to write to.
- The DataNodes are selected based on their availability, the replication factor, and rack awareness.
- If the replication factor is set to the default, i.e. “3”, each block is replicated three times and stored on different DataNodes.
- Once the data is written to the first DataNode, it is replicated to the other DataNodes. The copy process is executed in three steps:
- Setup of the pipeline
- Data streaming and replication
- Shutdown of the pipeline
In the above screenshot, you can observe that the two blocks, i.e. Block A and Block B, are stored on different DataNodes located on different racks. The rack and DataNode allocation is handled entirely by the system.
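The allocation for example.txt can be sketched as follows. The function `plan_write` and the round-robin choice of DataNodes are assumptions made purely for illustration; a real NameNode also weighs rack placement and DataNode load when it answers a write request.

```python
import string
from typing import Dict, List, Tuple


def plan_write(file_size_mb: int, datanodes: List[str], block_size_mb: int = 128,
               replication_factor: int = 3) -> Dict[str, Tuple[int, List[str]]]:
    """Return {block name: (size in MB, target DataNodes)} for one file."""
    if len(datanodes) < replication_factor:
        raise ValueError("not enough DataNodes for the replication factor")
    plan = {}
    offset, index = 0, 0
    while offset < file_size_mb:
        size = min(block_size_mb, file_size_mb - offset)
        # Hypothetical round-robin choice of distinct DataNodes for this block.
        targets = [datanodes[(index + k) % len(datanodes)]
                   for k in range(replication_factor)]
        # Letter names only cover the first 26 blocks, which is enough here.
        plan["Block " + string.ascii_uppercase[index]] = (size, targets)
        offset += size
        index += 1
    return plan


plan = plan_write(248, ["dn1", "dn2", "dn3", "dn4"])
for name, (size, targets) in plan.items():
    print(name, size, "MB ->", targets)
# Block A 128 MB -> ['dn1', 'dn2', 'dn3']
# Block B 120 MB -> ['dn2', 'dn3', 'dn4']
```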
HDFS Read Architecture:
- To read a file, the client contacts the NameNode and asks for the file's metadata.
- The NameNode responds with the list of blocks that make up the file and the DataNodes that hold each block.
- The client then connects to the relevant DataNodes to read the data.
- The blocks are read in parallel from different DataNodes, and once all of them have been received, they are combined to form the original file.
- Once the file is assembled, the user can view the data.
In the above screenshot, you can observe that the block and DataNode information is provided by the NameNode. Also, when retrieving data, the replicas closest to the client in the network are tried first. This keeps the transfer short and lets the data stream quickly.
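The idea of preferring the closest replica can be sketched with a simple distance function. The values 0, 2, and 4 (same node, same rack, different rack) follow the tree-distance convention commonly described for Hadoop's network topology, but they should be read as an assumption here; the names `network_distance` and `order_replicas` are invented for this example.

```python
from typing import List, Tuple

Location = Tuple[str, str]  # (rack, node)


def network_distance(a: Location, b: Location) -> int:
    """Assumed tree distance: 0 same node, 2 same rack, 4 different racks."""
    if a == b:
        return 0
    if a[0] == b[0]:
        return 2
    return 4


def order_replicas(client: Location, replicas: List[Location]) -> List[Location]:
    """Sort replica locations so that the closest one is tried first."""
    return sorted(replicas, key=lambda replica: network_distance(client, replica))


client = ("rack1", "dn2")
replicas = [("rack2", "dn5"), ("rack1", "dn1"), ("rack2", "dn6")]
print(order_replicas(client, replicas))
# [('rack1', 'dn1'), ('rack2', 'dn5'), ('rack2', 'dn6')]
```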
Assumptions and Goals
In this section of the article, we will discuss the assumptions and goals of the HDFS system.
Hardware Failure
- Hardware failure is not an exception; it is a normal event that can happen at any point in time.
- An HDFS instance can consist of several thousand server machines, each of which stores part of the data.
- With that many components, some hardware is bound to fail. Detecting failures and recovering from them quickly, with no significant downtime, is therefore a fundamental architectural goal of HDFS.
Streaming Data Access
Applications that run on HDFS need streaming access to their data sets, so that the data can be read continuously. Unlike traditional applications, they do not access data interactively in response to user input; HDFS is designed for batch processing, with the emphasis on high throughput rather than low latency.
Large Data Sets
- Applications that run on HDFS work with large data sets, and HDFS is tuned to support them. A single HDFS instance should be able to hold around ten million files, so both the file system and the applications need to be tuned for this scale.
- A typical file in HDFS is gigabytes to terabytes in size.
Simple Coherency Model
- Applications on HDFS follow a write-once, read-many access model, i.e. once a file has been created, written, and closed, it cannot be changed later. After that point users have read access only.
- For example, a web crawler or a MapReduce application fits perfectly into this model.
Moving computation is cheaper than Moving Data
- Computation should be moved close to where the data is stored, which reduces network congestion and the time needed to fetch the data. This matters because data sets in HDFS are huge.
- So moving the computation to the data is more efficient than moving the data to the computation.
- HDFS provides interfaces that let applications move themselves closer to where the data is located.
Portability across hardware and software platforms
- HDFS is designed to be easily portable from one platform to another.
- This portability makes HDFS easy to adopt.
The File System Namespace
In this section of the article, we will discuss the File System within the HDFS system and understand the core points of managing the File System.
- HDFS supports the traditional hierarchical file organization, where a user or an application can create directories and store files inside them.
- The namespace hierarchy is similar to that of most other existing file systems.
- As in a typical file system, a user can create and delete files, move a file from one directory to another, and so on.
- The file system namespace is maintained by the NameNode.
- Any change to the file system namespace is recorded by the NameNode.
- The number of replicas of a file, i.e. its replication factor, is stored by the NameNode.
In this section of the article, we will discuss the concepts associated with data replication.
- HDFS is built to store very large files reliably across the machines of a cluster.
- Each file is stored as a sequence of blocks.
- The blocks are replicated for fault tolerance.
- The block size and the replication factor can be configured per file.
- The replication factor is specified when the file is created and can be changed later.
- All files in HDFS are write-once, and a file has only one writer at any time.
- All decisions about block replication are made by the NameNode.
Replica Placement: The First Baby Steps
- In this section of the article, we will discuss replica placement and its importance in detail.
- Within HDFS, the placement of replicas is vital, because poor placement affects both reliability and performance.
- Replica placement is one of the things that distinguishes HDFS from most other distributed file systems.
- Rack-aware replica placement improves data availability, reliability, and network bandwidth utilization.
- To reduce global bandwidth consumption and read latency, HDFS tries to serve each read request from the replica closest to the reader, so the request is fulfilled in less time.
- On startup, the NameNode enters a special state called Safemode.
- While the NameNode is in Safemode, data blocks are not replicated.
- The NameNode receives Heartbeat and Blockreport messages about the blocks from the DataNodes.
- A block is considered safely replicated when the minimum number of replicas configured in the NameNode has been reported for it.
The Persistence of File System Metadata
- The HDFS namespace is stored by the NameNode.
- Every change to the file system metadata is recorded in a transaction log called the EditLog, so each change can be tracked.
- For example, when a user creates a new file in HDFS, the NameNode inserts a record into the EditLog. Likewise, when the replication factor of a file changes, the NameNode records another entry in the EditLog.
- The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in a file called the FsImage.
- Whenever the NameNode starts, it reads the FsImage and the EditLog. It applies all the transactions recorded in the EditLog to the in-memory representation of the FsImage and then writes the updated version out as a new FsImage.
- Once the latest transactions have been written to the FsImage, the old EditLog is truncated. This process of applying the transactions to the FsImage and truncating the EditLog is referred to as a “checkpoint”.
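The checkpoint idea can be modelled in a few lines: replay the logged operations onto a copy of the namespace, then start again with an empty log. The data structures below (a dictionary for the FsImage and a list of tuples for the EditLog) and the three operation names are simplifications invented for this sketch; the real on-disk formats are binary and far richer.

```python
from typing import Dict, List, Tuple

Edit = Tuple[str, str, dict]  # (operation, path, arguments)


def checkpoint(fsimage: Dict[str, dict], editlog: List[Edit]):
    """Replay the edit log onto a copy of the FsImage; return (new image, empty log)."""
    namespace = {path: dict(meta) for path, meta in fsimage.items()}
    for op, path, args in editlog:
        if op == "create":
            namespace[path] = dict(args)
        elif op == "set_replication":
            namespace[path]["replication"] = args["replication"]
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError("unknown operation: " + op)
    return namespace, []  # the old edits are now folded into the image


fsimage = {"/data/a.txt": {"replication": 3}}
editlog = [
    ("create", "/data/b.txt", {"replication": 3}),
    ("set_replication", "/data/b.txt", {"replication": 2}),
    ("delete", "/data/a.txt", {}),
]
fsimage, editlog = checkpoint(fsimage, editlog)
print(fsimage)  # {'/data/b.txt': {'replication': 2}}
print(editlog)  # []
```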
- All HDFS communication protocols are layered on top of TCP/IP.
- A client establishes a connection to a configurable TCP port on the NameNode machine.
- The client talks to the NameNode using the ClientProtocol.
- The DataNodes talk to the NameNode using the DataNode Protocol.
- A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DataNode Protocol.
- The NameNode never initiates an RPC; it only responds to RPC requests issued by DataNodes or clients.
In this section of the article, we will discuss the robustness of HDFS and the different types of failures that can occur.
The primary objective of HDFS is to store data reliably even in the presence of failures. The common types of failures seen in HDFS are:
- NameNode failures
- DataNode failures
- Network issues and network partitions
The following sections describe how HDFS handles the different types of failures:
Disk Failure, Heartbeats, and Re-replication:
- Every DataNode sends a heartbeat message to the NameNode at regular intervals. If the NameNode stops receiving heartbeats from a DataNode, it considers the connection to that DataNode lost and marks the DataNode as dead.
- The NameNode does not forward any new IO requests to a dead DataNode.
- The data on a dead DataNode is no longer available, so the number of replicas of some blocks may fall below the replication factor specified in the NameNode.
- Re-replication may become necessary for several reasons:
- The DataNode is unavailable
- The replica might be corrupted
- The hard disk on the DataNode may fail
- The replication factor of a file might have been increased
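A simplified model of heartbeat tracking and the resulting re-replication work is sketched below. The timeout value, the timestamps, and the function names are illustrative assumptions, not Hadoop defaults or APIs.

```python
from typing import Dict, List, Set


def find_dead_datanodes(last_heartbeat: Dict[str, float], now: float,
                        timeout_s: float) -> Set[str]:
    """DataNodes whose last heartbeat is older than the timeout."""
    return {dn for dn, seen in last_heartbeat.items() if now - seen > timeout_s}


def under_replicated(block_locations: Dict[str, List[str]], dead: Set[str],
                     replication_factor: int = 3) -> Dict[str, int]:
    """Map each block to the number of replicas that must be re-created."""
    missing = {}
    for block_id, nodes in block_locations.items():
        live = [node for node in nodes if node not in dead]
        if len(live) < replication_factor:
            missing[block_id] = replication_factor - len(live)
    return missing


# Timestamps in seconds; the 30 second timeout is chosen only for the example.
last_heartbeat = {"dn1": 100.0, "dn2": 98.0, "dn3": 40.0, "dn4": 99.0}
dead = find_dead_datanodes(last_heartbeat, now=101.0, timeout_s=30.0)
blocks = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn1", "dn2", "dn4"]}
print(sorted(dead))                    # ['dn3']
print(under_replicated(blocks, dead))  # {'blk_1': 1}
```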
- The HDFS architecture is compatible with data rebalancing schemes that run dynamically.
- With data rebalancing, data can be moved automatically from one DataNode to another when the free space on a DataNode falls below a threshold.
- When a particular file is suddenly in high demand, a rebalancing scheme can create additional replicas of it and rebalance the other data in the cluster.
- In some cases, a block of data fetched from a DataNode arrives corrupted. The corruption can be caused by:
- Faults in the storage devices
- Network faults
- Buggy software
- If the client detects that the data it retrieved is corrupted, it can fetch the same block from another DataNode that holds a replica.
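Corruption is typically detected by comparing a checksum computed at write time with one computed over the data that was read. The sketch below uses Python's `zlib.crc32` as a stand-in checksum and a plain dictionary in place of real DataNodes; the function `read_block` is hypothetical and only illustrates the fallback to another replica.

```python
import zlib
from typing import Dict, Optional


def read_block(replicas: Dict[str, bytes], expected_crc: int) -> Optional[bytes]:
    """Return the first replica whose CRC32 matches, trying DataNodes in order."""
    for datanode, data in replicas.items():
        if zlib.crc32(data) == expected_crc:
            return data
        print("checksum mismatch on " + datanode + ", trying another replica")
    return None  # every replica failed verification


original = b"hello hdfs"
expected = zlib.crc32(original)  # checksum recorded when the block was written
replicas = {"dn1": b"hellx hdfs", "dn2": original, "dn3": original}
print(read_block(replicas, expected) == original)  # True
```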
Metadata Disk Failure:
- The FsImage and the EditLog are the central data structures of HDFS. If these files are corrupted, the HDFS instance can become non-functional.
- To guard against this, the NameNode can be configured to maintain multiple copies of the FsImage and the EditLog.
- Whenever either file is updated, all the copies are updated synchronously.
- The synchronous updates can slow down namespace transactions, but this is acceptable because HDFS applications are data intensive rather than metadata intensive.
- When the NameNode restarts, it selects the latest consistent FsImage and EditLog to use.
- The NameNode machine is a single point of failure for an HDFS cluster.
- If the NameNode machine fails, manual intervention is needed.
- A snapshot stores a copy of the data as it was at a particular instant of time.
- This feature is useful for rolling back a corrupted HDFS instance to a previously known good point in time.
In this section of the article, we will go through the key aspects and activities of data organization within an HDFS system.
HDFS is designed to support huge files. Applications generally write their data once but read it many times, so reads need to be served at streaming speeds. The typical block size used by HDFS is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.
Staging is an intermediate step in which file data is held on a temporary basis. The files handled by HDFS are huge, so the data is buffered in staging before it is written out to the DataNodes, which helps the write complete cleanly.
The NameNode knows the replication factor of every file. Based on the replication factor, it selects the DataNodes that will hold a block. For example, if the replication factor is “3”, then 3 DataNodes are selected. The data is passed from DataNode 1 to DataNode 2 and finally to DataNode 3. The data flow is therefore a pipeline, in which the data moves from one DataNode to the next, and the length of the pipeline is determined by the replication factor.
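The pipeline can be pictured as each DataNode storing a small packet and handing it on to the next node. The following is a toy, single-process model: the packet size of 4 bytes and the function `stream_through_pipeline` are assumptions for illustration, and acknowledgements, failures, and real network transfer are left out.

```python
from typing import Dict, List


def stream_through_pipeline(data: bytes, pipeline: List[str],
                            packet_size: int = 4) -> Dict[str, bytes]:
    """Forward data packet by packet along the pipeline; return what each node stored."""
    stored = {datanode: bytearray() for datanode in pipeline}
    for start in range(0, len(data), packet_size):
        packet = data[start:start + packet_size]
        # Each DataNode writes the packet locally, then passes it to the next one.
        for datanode in pipeline:
            stored[datanode].extend(packet)
    return {datanode: bytes(buf) for datanode, buf in stored.items()}


result = stream_through_pipeline(b"example block", ["dn1", "dn2", "dn3"])
print(all(copy == b"example block" for copy in result.values()))  # True
```

Because the data moves in small packets, DataNode 2 can start receiving a block while DataNode 1 is still receiving later packets of the same block.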
In this section of the article, we will discuss the different approaches to access an HDFS system.
- HDFS can be accessed in several ways.
- By default, HDFS provides a Java API for applications to use.
- A C language wrapper for the Java API is also available.
- The files of an HDFS instance can be browsed with an HTTP browser.
- HDFS allows user data to be organized into files and directories.
- Users can interact with HDFS through a command line interface called the “FS shell”.
- The syntax is similar to that of other shells, so it is easy for users to interact with the system and review the data.
- The DFSAdmin command set is used to administer an HDFS cluster.
- These commands can be executed only by an HDFS administrator.
- So having admin credentials is very important.
- A typical HDFS installation also configures a web server that exposes the HDFS namespace through a configurable TCP port.
- Through this port, the user can navigate the namespace and view the contents of the files in a web browser.
In this section of the article, we will go through what happens when a file is deleted and when the replication factor is decreased.
- When a user deletes a file, it is not removed from HDFS immediately.
- Instead, the deleted file is moved to the /trash directory.
- The time a file is kept in the /trash directory is configurable, and the file remains available there until that time has elapsed.
- After expiry, the file is removed from the HDFS namespace and its blocks are freed.
- There can therefore be a considerable delay between the time a file is deleted and the time the free space shows up in HDFS.
- While the file is in the /trash directory it can be restored; once it has passed its expiry time, it cannot be retrieved.
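The expiry rule amounts to comparing how long each file has been in the trash with the configured retention time. The sketch below models that check; the dictionary of deletion timestamps, the one-hour retention value, and the function name are assumptions made for illustration.

```python
from typing import Dict, List


def expired_trash_entries(trash: Dict[str, float], now: float,
                          retention_s: float) -> List[str]:
    """Return the trash paths that have been kept longer than the retention time."""
    return sorted(path for path, deleted_at in trash.items()
                  if now - deleted_at > retention_s)


# Deletion times in seconds; retention of one hour chosen only for the example.
trash = {"/trash/a.txt": 0.0, "/trash/b.txt": 3000.0}
print(expired_trash_entries(trash, now=4000.0, retention_s=3600.0))
# ['/trash/a.txt']
```

Here a.txt has been in the trash for 4000 seconds and is removed for good, while b.txt (1000 seconds) can still be restored.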
Decrease the Replication Factor:
- The replication factor of a particular file can be reduced.
- Once the value is reduced, the NameNode selects the excess replicas of that file that can be deleted.
- This information is passed to the DataNodes, which then delete the corresponding blocks.
- Deleting the data takes time, so there can be a delay before the free space shows up in the cluster.
HDFS Key Features
In this section of the article, we will discuss the key features provided by HDFS in detail.
- HDFS is fully scalable.
- It provides the storage layer for the big data platform.
- It supports data management, data processing, and data analytics workloads.
HDFS key features:
- Bulk data storage: The system is capable of storing terabytes and petabytes of data. It is known for its data management and processing.
- Hadoop can manage thousands of nodes simultaneously without operational glitches.
- High computing power: Using Hadoop, developers can use distributed and parallel computing at the same time.
- Hadoop is designed to scale out rather than scale up, and nodes can be added without downtime.
- After an upgrade, the system allows users to roll back to the previous version.
- The system automatically deals with corrupted data.
- The servers communicate over TCP-based protocols.
- Hadoop is meant for huge data sets, but it can also work with ordinary file systems such as FAT and NTFS.
We now live in a data-driven world, where most of our day-to-day activities involve creating or accessing data. Managing and retrieving that data well requires the right tool. When the data is very large, HDFS addresses this need, and with systems like Hadoop the entire data management process has become easier and more effective.