A file storage framework allows files to be stored using the backend of a document library. In this article, we will talk about HDFS (Hadoop Distributed File System), a popular file storage framework that offers massive, scalable storage for all types of data. So, let's get started.
In this article, we will briefly discuss the following components, which will help you understand the HDFS system better:
HDFS stands for Hadoop Distributed File System. It is a core component of the Hadoop framework and is capable of storing and retrieving multiple files at the same time.
HDFS is one of the prominent components of the Hadoop architecture and takes care of data storage. Hadoop's storage is spread across many machines, which reduces cost and increases reliability.
In this section of the article, we will discuss the HDFS architecture in detail.
The above figure shows the components of the NameNode and the connections between the NameNode and its associated DataNodes.
Functions of the NameNode:
Functions of DataNode:
So far we have gone through the different functions carried out by the NameNode and the DataNodes. Since every operation goes through the NameNode, a major flaw in the NameNode would bring the entire system down.
To reduce the impact of such a failure, HDFS provides a concept called the “Secondary NameNode”. The Secondary NameNode periodically checkpoints the NameNode's metadata, which keeps operations running smoothly and shortens any downtime.
In this section, we will go through the concept of “Secondary NameNode” and the functions associated.
First, the latest FsImage is downloaded from the NameNode. Then the contents of the FsImage are updated in the FsImage folder of the Secondary NameNode. The process is displayed in the screenshot attached above.
All the information is stored in the form of blocks and is internally available within the DataNodes.
The above screenshot shows that a file of 514 MB is stored in the form of blocks. To store the file, the HDFS system creates 5 individual blocks across which the data is spread. The blocks are created by dividing the file size by the default block size (128 MB in this case).
So the 514 MB file is divided into 4 blocks of 128 MB each, and the remaining 2 MB goes into a fifth, smaller block.
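The block arithmetic above can be sketched in a few lines of Python (an illustration only; HDFS performs this internally, and the function name here is made up):

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes of the blocks a file would be cut into.

    Mirrors the example above: a file is divided into full-size
    blocks, plus one final block holding the remainder.
    """
    full_blocks = file_size_mb // block_size_mb
    remainder = file_size_mb % block_size_mb
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes

print(split_into_blocks(514))  # [128, 128, 128, 128, 2] -> 5 blocks
```

With a 514 MB file, the function yields four 128 MB blocks plus one 2 MB block, matching the screenshot's 5 blocks.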
In this section of the article, we will go through the replication management process. The replication management process works based on the inputs received from the DataNode.
The data blocks are replicated three times and stored on different DataNodes. The blocks are replicated thrice because the default replication factor is 3. The block representation is shown in the screenshot above.
So in general, if you store a 128 MB file in the HDFS system with the default replication factor, the system will end up occupying 384 MB of space (i.e. 3 × 128 MB).
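The storage overhead is simply file size times replication factor, which can be checked with a tiny helper (hypothetical name, consistent with the article's example):

```python
def physical_storage_mb(file_size_mb, replication_factor=3):
    """Total disk space consumed across the cluster for one file.

    Every block, including the final partial block, is replicated
    replication_factor times, so the total is just a multiplication.
    """
    return file_size_mb * replication_factor

print(physical_storage_mb(128))  # 384, i.e. 3 * 128 MB
```

The earlier 514 MB file would likewise occupy 1542 MB across the cluster under the default replication factor.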
Each replica resides on a different DataNode.
The NameNode collects information about blocks from the DataNodes on a regular basis and maintains the replication factor. If blocks are under-replicated or over-replicated, the NameNode steps in and adds or deletes replicas as needed.
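The NameNode's bookkeeping can be sketched as a simple reconciliation step (hypothetical names and made-up block IDs; the real logic lives inside the NameNode):

```python
def reconcile(block_replica_counts, target=3):
    """Decide, per block, how many replicas to add or remove.

    A positive number means schedule that many new replicas
    (under-replicated); a negative number means delete that many
    surplus replicas (over-replicated). Blocks already at the
    target replication factor need no action.
    """
    return {block: target - count
            for block, count in block_replica_counts.items()
            if count != target}

# Counts gathered from periodic DataNode block reports:
print(reconcile({"blk_1": 2, "blk_2": 3, "blk_3": 4}))
# {'blk_1': 1, 'blk_3': -1}
```

Here `blk_1` needs one more replica, `blk_2` is healthy, and one replica of `blk_3` can be removed.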
[Related Article: Hadoop HBase Schema and Versioning]
We already know that the NameNode collects all block information from the DataNodes and coordinates replication. The NameNode also makes sure that all replicas of a block are not saved on a single rack. The replicas are allocated based on the Rack Awareness Algorithm.
The Rack Awareness Algorithm reduces latency and provides fault tolerance. With the default replication factor of 3, the first replica of a block is stored on the local rack, and the next two replicas are stored on a different rack, on different DataNodes. The same is displayed in the image below.
If you configure more block replicas than the default, the additional replicas are spread across the remaining racks so that no single rack holds all the copies of a block.
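The placement described above can be sketched as a small function (a simplified model of the article's description, not the actual NameNode policy; names are made up):

```python
def place_replicas(local_rack, all_racks, replication_factor=3):
    """Pick one rack per replica: the first replica stays on the
    writer's local rack, and the remaining replicas go to the other
    racks, cycling through them if there are more replicas than racks.
    """
    other_racks = [r for r in all_racks if r != local_rack]
    placement = [local_rack]
    for i in range(replication_factor - 1):
        placement.append(other_racks[i % len(other_racks)])
    return placement

print(place_replicas("rack1", ["rack1", "rack2", "rack3"]))
# ['rack1', 'rack2', 'rack3']
```

With three racks and the default replication factor, each replica lands on a distinct rack, so losing any one rack still leaves two copies of the block.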
The image below shows the typical structure of a Hadoop production cluster. You can see that it has multiple racks, each populated with DataNodes.
The DataNodes in each rack connect to a rack switch, and the rack switches connect to each other through core switches. The connection is shown in the screenshot above.
In this section of the article, we will go through the advantages of the Rack Awareness Algorithm in detail.
- Network performance is improved
- Data loss is prevented
In this section of the article, we will discuss the Read and Write operations briefly.
HDFS follows a write-once, read-many model.
A file written once to the HDFS system cannot be modified, but it can be accessed multiple times to read the information.
Now let us go through an example that depicts the process of writing a file into the HDFS system.
Let us consider a situation where the user needs to write a file of 248 MB into the HDFS system.
The file is named as “example.txt”
The file size is 248 MB
So to store the 248 MB file, the data blocks will be created.
As per the default setting, the block size is 128 MB, so two data blocks are needed to store this file.
Block A will accommodate 128 MB
Block B will accommodate the rest, i.e. 120 MB
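The split into Block A and Block B can be verified with a short sketch (illustration only; the block labels are for readability):

```python
def blocks_for(file_size_mb, block_size_mb=128):
    """Return (label, size) pairs for the blocks of a file,
    labelled A, B, C, ... in order."""
    sizes = []
    remaining = file_size_mb
    while remaining > 0:
        sizes.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return [(chr(ord("A") + i), size) for i, size in enumerate(sizes)]

print(blocks_for(248))  # [('A', 128), ('B', 120)]
```

For the 248 MB "example.txt", Block A takes a full 128 MB and Block B holds the remaining 120 MB, as stated above.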
[Related Article: Hadoop with BODS Integration]
In the above screenshot, you can observe that the two blocks, Block A and Block B, are stored on different DataNodes, which are on different racks. The entire rack and DataNode allocation happens within the system.
[Related Article: Hadoop Installation and Configuration]
In the above screenshot, you can observe that the requested information (i.e. the DataNodes and blocks) is provided by the NameNode. Also, while retrieving information, the DataNodes and blocks that are closest on the network are contacted first, so data transmission starts sooner and streaming happens quickly.
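Picking the closest replica can be sketched as a minimal selection step (the distance values below are a made-up metric for illustration, e.g. smaller means fewer network hops):

```python
def pick_nearest_replica(replicas, distance):
    """Choose the replica with the smallest network distance
    to the reading client, so streaming starts quickly."""
    return min(replicas, key=distance)

# Hypothetical distances from the client to each DataNode:
distances = {"dn1": 4, "dn2": 2, "dn3": 4}
print(pick_nearest_replica(["dn1", "dn2", "dn3"],
                           lambda dn: distances[dn]))  # dn2
```

Here `dn2` (say, a node in the client's own rack) wins over the two replicas on remote racks.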
[Related Article: An Overview Of Hadoop Hive]
In this section of the article, we will discuss the assumptions and goals of the HDFS system.
[Related Article: HBase Vs RDBMS]
All applications on the HDFS system need streaming access so that data can be read continuously. Unlike traditional applications, the data is not accessed based on user inputs.
[Related Article: Introduction to HBase for Hadoop ]
In this section of the article, we will discuss the File System within the HDFS system and understand the core points of managing the File System.
Below are the key points by which the entire file system namespace is managed:
In this section of the article we will discuss the concepts that are associated with Data replication.
[Related Article: Apache Pig User Defined Functions ]
[Related Article: Hadoop Apache Pig Execution Types ]
In this section of the article, we will discuss the robustness of the HDFS system and the different types of failures that exist in a system.
A primary goal of the HDFS system is to store data reliably even in the case of system failures. The common types of failures seen within an HDFS system are classified as:
[Related Article: What Is Hadoop Hive Query Language]
[Related Article: Hadoop Hive Data Types with Examples]
In this section of the article, we will go through the key aspects and activities of data organization within an HDFS system.
HDFS systems are designed to support huge files. Applications generally write data once but read it multiple times. Since files are read many times, streaming throughput matters more than low latency. The typical block size used by the HDFS system is 64 MB or 128 MB, depending on the Hadoop version.
Staging is an intermediary environment where files are stored on a temporary basis. The files handled by the HDFS system are huge, so the staging environment is used to make sure the writing process completes correctly.
The NameNode knows the replication factor configured for each file, and a pipeline of DataNodes is chosen accordingly. For example, if the replication factor is 3, three DataNodes form the pipeline: data is passed from DataNode 1 to DataNode 2 and then finally to DataNode 3. The data flow is therefore a pipeline process, where data flows from one DataNode to the next, and the length of the pipeline is defined by the replication factor.
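The pipeline idea can be sketched as a tiny simulation (hypothetical names; real HDFS streams packets and acknowledgements rather than whole blocks):

```python
def write_through_pipeline(packet, pipeline):
    """Simulate the HDFS write pipeline: the client sends a packet
    to the first DataNode, which keeps a copy and forwards it to
    the next, and so on down the chain. The pipeline length equals
    the replication factor."""
    stored = {}
    for datanode in pipeline:      # data flows DN1 -> DN2 -> DN3
        stored[datanode] = packet  # each node stores a copy, then forwards
    return stored

replicas = write_through_pipeline(b"block-A-bytes", ["dn1", "dn2", "dn3"])
print(sorted(replicas))  # ['dn1', 'dn2', 'dn3']
```

After the pipeline completes, each of the three DataNodes holds an identical copy of the packet, satisfying the replication factor of 3.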
[Related Article: Hadoop Sqoop Usage]
In this section of the article, we will discuss the different approaches to access an HDFS system.
[Related Article: Apache Hadoop Sqoop]
In this section of the article, we will go through the concepts of “file deletes” and “decrease the replication factor”.
In this section of the article, we will discuss the key features provided by the HDFS system in detail.
| HDFS key features | Description |
| --- | --- |
| Bulk data storage | The system is capable of storing terabytes and petabytes of data, and is known for its data management and processing. |
| Minimum intervention | The Hadoop system can manage thousands of nodes simultaneously without operational glitches. |
| High computing power | Using the Hadoop system, developers can utilize distributed and parallel computing at the same time. |
| Scaling out | The Hadoop system is designed to scale out rather than scale up, and the process is managed without downtime. |
| Rollback | After an upgrade, the system allows users to return to the previous version. |
| Data integrity | The system automatically deals with corrupted data. |
| Communication | The servers communicate and connect via TCP-based protocols. |
| File systems | The Hadoop system is meant for huge data sets, but it can also work with normal file systems, e.g. FAT and NTFS. |
We now live in a data-driven world, where most of our day-to-day activities revolve around creating or accessing data. Effective data management and retrieval therefore require the right tool, especially when data comes in large volumes, and the HDFS system resolves exactly this problem. With systems like Hadoop, the entire data management process has become easy and effective.
Technical Content Writer