Hadoop Archive Files In HDFS

Hadoop Archive

•  HDFS stores small files inefficiently, since each file is stored in a block and block metadata is held in memory by the Name Node.
•  Thus, a large number of small files can take a lot of memory on the Name Node. Small files do not waste disk space, though: for example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.
•  Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing Name Node memory usage while still allowing transparent access to files.
•  Hadoop Archives can be used as input to MapReduce.
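
To make the memory cost concrete, here is a rough Python sketch. The figure of about 150 bytes of Name Node memory per file or block object is a commonly quoted rule of thumb and an assumption here, not something this article states, and the function names are made up for illustration. Storing the same data as one large file is only a stand-in for what packing achieves; a real HAR has its own index and part files.

```python
# Rough estimate only. BYTES_PER_OBJECT is an assumed rule of thumb,
# not a measured value for any particular Hadoop release.
BYTES_PER_OBJECT = 150
MB = 1024 * 1024

def blocks_needed(file_size, block_size):
    """Number of HDFS blocks a file of file_size bytes occupies."""
    return -(-file_size // block_size)  # ceiling division

def namenode_bytes(file_sizes, block_size):
    """Approximate Name Node memory: one object per file plus one per block."""
    objects = sum(1 + blocks_needed(size, block_size) for size in file_sizes)
    return objects * BYTES_PER_OBJECT

def disk_bytes(file_sizes):
    """Disk used before replication: a block only takes the space of its data."""
    return sum(file_sizes)

block_size = 128 * MB
small_files = [1 * MB] * 1_000_000   # one million 1 MB files
one_big_file = [1_000_000 * MB]      # the same amount of data in a single file

print(namenode_bytes(small_files, block_size))    # 300000000
print(namenode_bytes(one_big_file, block_size))   # 1172100
print(disk_bytes(small_files) == disk_bytes(one_big_file))  # True
```

Under these assumptions the disk usage is identical, while the estimated Name Node memory differs by a factor of roughly 250.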

Using Hadoop Archives:
→ A Hadoop Archive is created from a collection of files using the archive tool, which runs a MapReduce job to process the input files in parallel, so you need a running MapReduce cluster to use it.
→ Here are some files in the Hadoop Distributed File System (HDFS) that you would like to archive:
% hadoop fs -lsr /my/files
→ Now we can run the archive command:
% hadoop archive -archiveName files.har /my/files /my
HAR files always have a .har extension, which is mandatory.
→ Here we are archiving only one source, the files in /my/files in HDFS, but the tool accepts multiple source trees. The final argument is the output directory for the HAR file.
→ The archive created by the above command is:
% hadoop fs -ls /my
Found 2 items
drwxr-xr-x   - tom supergroup          0 2009-04-09 19:13 /my/files
drwxr-xr-x   - tom supergroup          0 2009-04-09 19:13 /my/files.har
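
If the archive command is launched from a script, the argument order described above can be captured in a small helper. The following is a minimal sketch: `build_archive_command` is a hypothetical name, and it only builds the argument list without running anything. Some Hadoop releases also expect a `-p <parent path>` argument, which is left out here to match the command as given, so check the Archives guide for your version.

```python
import shlex

def build_archive_command(archive_name, sources, dest_dir):
    """Return the argument list for a `hadoop archive` call (hypothetical helper)."""
    if not archive_name.endswith(".har"):
        raise ValueError("archive name must end in .har")
    if not sources:
        raise ValueError("at least one source tree is required")
    # Same order as the command above: name, source trees, then output directory.
    return ["hadoop", "archive", "-archiveName", archive_name, *sources, dest_dir]

cmd = build_archive_command("files.har", ["/my/files"], "/my")
print(shlex.join(cmd))  # hadoop archive -archiveName files.har /my/files /my
```

The resulting list could be passed to `subprocess.run` on a machine where the `hadoop` client is installed.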
→ In HDFS, blocks are replicated across multiple machines known as data nodes. The default replication is three-fold, i.e. each block exists on three different machines.
→ A master node called the Name Node keeps track of which blocks make up a file and where those blocks are located. This information is known as the metadata.

•  The Name Node holds metadata for the two files, Foo.txt and Bar.txt:
   Foo.txt: blk-001, blk-002, blk-003
   Bar.txt: blk-004, blk-005
•  Data nodes hold the actual blocks
•  Each block will be 64 MB or 128 MB in size
•  Each block is replicated three times on the cluster
•  The Name Node daemon must be running at all times. If the Name Node stops, the cluster becomes inaccessible, so the system administrator must take care to ensure that the Name Node hardware is reliable.
•  The Name Node holds all of its metadata in RAM for fast access, and it keeps a record of changes on disk for crash recovery.
•  A separate daemon known as the Secondary Name Node takes care of some housekeeping tasks for the Name Node. Be careful: the Secondary Name Node is not a backup Name Node.
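
The metadata described in the list above can be pictured as two in-memory maps: file name to block IDs, and block ID to the data nodes holding a replica. This is a toy sketch, not Hadoop's real classes. The class name, the data node names, and the round-robin placement are all assumptions made for illustration; actual HDFS placement is more involved.

```python
from dataclasses import dataclass, field

REPLICATION = 3  # default replication factor described above

@dataclass
class NameNodeMetadata:
    """Toy in-memory metadata store (illustrative only)."""
    file_blocks: dict = field(default_factory=dict)      # file name -> block ids
    block_locations: dict = field(default_factory=dict)  # block id -> data nodes

    def add_file(self, name, block_ids, data_nodes):
        # Needs at least REPLICATION data nodes for the replicas to be distinct.
        self.file_blocks[name] = list(block_ids)
        for i, block_id in enumerate(block_ids):
            # Simplified placement: REPLICATION consecutive nodes, round-robin.
            self.block_locations[block_id] = [
                data_nodes[(i + r) % len(data_nodes)] for r in range(REPLICATION)
            ]

nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical data node names
meta = NameNodeMetadata()
meta.add_file("Foo.txt", ["blk-001", "blk-002", "blk-003"], nodes)
meta.add_file("Bar.txt", ["blk-004", "blk-005"], nodes)

print(meta.file_blocks["Bar.txt"])        # ['blk-004', 'blk-005']
print(meta.block_locations["blk-002"])    # ['dn2', 'dn3', 'dn4']
```

Because both maps live only in this one object, losing it means no file can be mapped back to its blocks, which is why the Name Node's reliability matters so much.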

→ Although files are split into 64 MB or 128 MB blocks, if a file is smaller than this the full 64 MB / 128 MB will not be used.
→ Blocks are stored as standard files on the data nodes, in a set of directories specified in the Hadoop configuration files. These directories are set by the system administrator.
→ Without the metadata on the Name Node, there is no way to access the files in the HDFS cluster.
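
As a small worked example of the block-splitting rule, the sketch below computes the size of each block for a given file size. The function name and the 300 MB sample file are illustrative assumptions.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=128 * MB):
    """Return the size in bytes of each block a file would be split into."""
    sizes = []
    remaining = file_size
    while remaining > 0:
        # Every block is full except possibly the last one.
        sizes.append(min(block_size, remaining))
        remaining -= sizes[-1]
    return sizes

print([s // MB for s in split_into_blocks(300 * MB)])  # [128, 128, 44]
print([s // MB for s in split_into_blocks(1 * MB)])    # [1]
```

The last block only occupies as much space as the data it holds, which matches the point that a small file does not consume a full block on disk.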

When a client application wants to read a file:
a) It communicates with the Name Node to determine which blocks make up the file and which data nodes those blocks reside on.
b) It then communicates directly with the data nodes to read the data.
c) Because the file data itself never passes through the Name Node, the Name Node will not be a bottleneck.
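
The read steps above can be sketched with two toy classes. `ToyNameNode`, `ToyDataNode`, and `read_file` are hypothetical names used only for this illustration and bear no relation to Hadoop's client API. The point is that the Name Node is asked once for block locations and the bytes come from the data nodes.

```python
class ToyDataNode:
    def __init__(self):
        self.blocks = {}        # block id -> bytes stored on this node

class ToyNameNode:
    def __init__(self):
        self.files = {}         # file name -> ordered list of block ids
        self.locations = {}     # block id -> list of ToyDataNode holders
        self.lookups = 0        # how many metadata requests were served

    def get_block_locations(self, name):
        self.lookups += 1
        return [(b, self.locations[b]) for b in self.files[name]]

def read_file(name_node, name):
    data = b""
    # Step a: one metadata request to the Name Node.
    for block_id, holders in name_node.get_block_locations(name):
        # Step b: read the block directly from a data node that holds it.
        data += holders[0].blocks[block_id]
    return data

dn1, dn2 = ToyDataNode(), ToyDataNode()
dn1.blocks["blk-001"] = b"hello "
dn2.blocks["blk-002"] = b"world"

nn = ToyNameNode()
nn.files["Foo.txt"] = ["blk-001", "blk-002"]
nn.locations = {"blk-001": [dn1], "blk-002": [dn2]}

print(read_file(nn, "Foo.txt"))  # b'hello world'
print(nn.lookups)                # 1
```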

Last updated: 05 June 2023
About Author
Ravindra Savaram

Ravindra Savaram is a Content Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.
