Hadoop Archive Files In HDFS

Hadoop Archive

•  HDFS stores small files inefficiently, since each file is stored in a block and block metadata is held in memory by the Name Node.
•  Thus, a large number of small files can take up a lot of memory on the Name Node. Small files do not waste disk space, however: for example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.
•  Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing Name Node memory usage while still allowing transparent access to files.
•  Hadoop Archives can be used as input to MapReduce.
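
To get a feel for the numbers, here is a rough Python sketch of the Name Node memory cost of many small files compared with the same data packed together. It assumes the commonly quoted rule of thumb of about 150 bytes of Name Node memory per file or block object; that figure, the function names, and the file counts are illustrative assumptions, not measurements. A real HAR also stores a few index files, so the packed figure is a lower bound.

```python
# Assumed rule of thumb: each file or block object costs roughly
# 150 bytes of Name Node memory. This is an estimate, not an exact figure.
BYTES_PER_OBJECT = 150
MB = 1024 * 1024


def blocks_needed(file_size, block_size):
    """Number of blocks a file of file_size bytes occupies (ceiling division)."""
    return -(-file_size // block_size)


def namenode_bytes(file_sizes, block_size):
    """Rough Name Node memory: one object per file plus one per block."""
    objects = sum(1 + blocks_needed(size, block_size) for size in file_sizes)
    return objects * BYTES_PER_OBJECT


block_size = 128 * MB
small_files = [1 * MB] * 10_000  # ten thousand 1 MB files (hypothetical workload)

# Stored separately: every small file needs its own file and block entries.
separate = namenode_bytes(small_files, block_size)

# Packed together: modelled here as one large data file holding all the bytes.
packed = namenode_bytes([sum(small_files)], block_size)

# Disk usage is the actual data size, not number of files times block size.
disk_used = sum(small_files)

print(f"separate: {separate} bytes, packed: {packed} bytes, disk: {disk_used // MB} MB")
```

Under these assumptions the separate files need about 3 MB of Name Node memory while the packed data needs about 12 KB, and the disk usage is 10,000 MB in both cases.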


Using Hadoop Archives:
→ A Hadoop Archive is created from a collection of files using the archive tool. The tool runs a MapReduce job to process the input files in parallel, so you need a running MapReduce cluster to use it.
→ Here are some files in the Hadoop Distributed File System (HDFS) that you would like to archive:
% hadoop fs -lsr /my/files
→ Now we can run the archive command:
% hadoop archive -archiveName files.har /my/files /my
HAR files always have a .har extension, which is mandatory.
→ Here we are archiving only one source, the files in /my/files in HDFS, but the tool accepts multiple source trees. The final argument is the output directory for the HAR file.
→ The archive created by the above command can be seen by listing the output directory:
% hadoop fs -ls /my
Found 2 items
drwxr-xr-x   - tom supergroup          0 2009-04-09 19:13 /my/files
drwxr-xr-x   - tom supergroup          0 2009-04-09 19:13 /my/files.har
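
If you script the archive step, the rules described above (a mandatory .har extension, one or more source trees, and the output directory as the final argument) can be checked before the job is launched. The following is a minimal Python sketch; `build_archive_command` is a hypothetical helper, not part of Hadoop, and it only builds the argument list in the form used in this article. Some Hadoop releases also expect a parent path option (`-p`), so check the usage message of `hadoop archive` on your version before relying on this shape.

```python
def build_archive_command(archive_name, sources, destination):
    """Build the argument list for the archive command shown in this article.

    Hypothetical helper: it validates the inputs and returns the arguments
    without running anything.
    """
    if not archive_name.endswith(".har"):
        raise ValueError("archive name must end with .har")
    if not sources:
        raise ValueError("at least one source tree is required")
    # Source trees come first; the final argument is the output directory.
    return ["hadoop", "archive", "-archiveName", archive_name, *sources, destination]


command = build_archive_command("files.har", ["/my/files"], "/my")
print(" ".join(command))
```

The resulting list could be handed to `subprocess.run` on a machine where the `hadoop` command is available.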
→ In HDFS, blocks are replicated across multiple machines known as data nodes. The default replication is three-fold, i.e. each block exists on three different machines.
→ A master node called the Name Node keeps track of which blocks make up a file and where those blocks are located. This information is known as the metadata.

Example:
•  The Name Node holds metadata for the two files, Foo.txt and Bar.txt:
   Foo.txt: blk-001, blk-002, blk-003
   Bar.txt: blk-004, blk-005
•  Data nodes hold the actual blocks.
•  Each block will be 64 MB or 128 MB in size.
•  Each block is replicated three times on the cluster, on different machines.
•  The Name Node daemon must be running at all times. If the Name Node stops, the cluster becomes inaccessible, so the system administrator must take care to ensure that the Name Node hardware is reliable.
•  The Name Node holds all of its metadata in RAM for fast access, and it keeps a record of changes on disk for crash recovery.
•  A separate daemon known as the secondary Name Node takes care of some housekeeping tasks for the Name Node. Be careful: the secondary Name Node is not a backup Name Node.
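
The metadata in this example can be pictured as two in-memory mappings: file to blocks, and block to the data nodes holding its replicas. The Python sketch below models that idea only. The node names and the round-robin placement are assumptions made for illustration; real HDFS uses a rack-aware placement policy.

```python
REPLICATION = 3  # default replication factor described above

# File-to-block mapping, as in the example above.
file_blocks = {
    "Foo.txt": ["blk-001", "blk-002", "blk-003"],
    "Bar.txt": ["blk-004", "blk-005"],
}

# Hypothetical data node names.
data_nodes = ["node1", "node2", "node3", "node4"]


def place_replicas(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    A simplification for illustration, not the HDFS placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many data nodes as replicas")
    locations = {}
    for index, block in enumerate(blocks):
        locations[block] = [
            nodes[(index + offset) % len(nodes)] for offset in range(replication)
        ]
    return locations


all_blocks = [block for blocks in file_blocks.values() for block in blocks]
block_locations = place_replicas(all_blocks, data_nodes)

for name, blocks in file_blocks.items():
    for block in blocks:
        print(name, block, block_locations[block])
```

If both mappings were lost, the blocks would still sit on the data nodes, but nothing would say which blocks belong to which file.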

Note:
→ Although files are split into 64 MB or 128 MB blocks, if a file is smaller than this, the full 64 MB / 128 MB will not be used.
→ Blocks are stored as standard files on the data nodes, in a set of directories specified in the Hadoop configuration files. These are set by the system administrator.
→ Without the metadata on the Name Node, there is no way to access the files in the HDFS cluster.
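
The first note can be made concrete with a little arithmetic. This Python sketch computes the sizes of the blocks that a file would be split into; `block_sizes` is a hypothetical helper written for this illustration, and only the last block can be smaller than the configured block size.

```python
MB = 1024 * 1024


def block_sizes(file_size, block_size):
    """Sizes in bytes of the blocks for a file: full blocks, then one
    final partial block holding only the remaining bytes (if any)."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


# A 200 MB file with 128 MB blocks: one full block and one 72 MB block.
large = block_sizes(200 * MB, 128 * MB)

# A 1 MB file with 128 MB blocks: a single 1 MB block, not 128 MB.
small = block_sizes(1 * MB, 128 * MB)

print([size // MB for size in large], [size // MB for size in small])
```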

When a client application wants to read a file:
a) It communicates with the Name Node to determine which blocks make up the file and which data nodes those blocks reside on.
b) It then communicates directly with the data nodes to read the data.
c) Because the data is read directly from the data nodes, the Name Node will not be a bottleneck.
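
The read steps above can be sketched as a small simulation in Python. The classes and method names here (`NameNode`, `DataNode`, `get_block_locations`, `read_block`) are hypothetical stand-ins written for this illustration and are not the HDFS client API. The point is that the Name Node only answers a metadata lookup, while all file bytes are served by the data nodes.

```python
class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self, file_blocks, block_locations):
        self.file_blocks = file_blocks
        self.block_locations = block_locations
        self.bytes_served = 0  # stays 0: file data never passes through here

    def get_block_locations(self, path):
        return [(block, self.block_locations[block]) for block in self.file_blocks[path]]


class DataNode:
    """Stores block contents and serves them directly to clients."""

    def __init__(self, blocks):
        self.blocks = blocks  # block id -> bytes
        self.bytes_served = 0

    def read_block(self, block_id):
        data = self.blocks[block_id]
        self.bytes_served += len(data)
        return data


def read_file(path, name_node, data_nodes):
    """Step a: ask the Name Node for locations. Step b: read from data nodes."""
    chunks = []
    for block_id, nodes in name_node.get_block_locations(path):
        # Read from the first listed replica; a real client would pick a close one.
        chunks.append(data_nodes[nodes[0]].read_block(block_id))
    return b"".join(chunks)


data_nodes = {
    "dn1": DataNode({"blk-001": b"hello ", "blk-002": b"hdfs"}),
    "dn2": DataNode({"blk-001": b"hello ", "blk-002": b"hdfs"}),
}
name_node = NameNode(
    {"Foo.txt": ["blk-001", "blk-002"]},
    {"blk-001": ["dn1", "dn2"], "blk-002": ["dn2", "dn1"]},
)

content = read_file("Foo.txt", name_node, data_nodes)
print(content, name_node.bytes_served)
```

In this toy setup the Name Node serves zero bytes of file data, which is why it does not limit read throughput as more data nodes are added.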

Last updated: 04 Apr 2023
About Author

Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.
