Hadoop Archive Files In HDFS

Hadoop Archive

• HDFS stores small files inefficiently, since each file is stored in a block and block metadata is held in memory by the Name Node.
• Thus, a large number of small files can take up a lot of memory on the Name Node. Small files do not waste disk space, however: for example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.
• Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing Name Node memory usage while still allowing transparent access to files.
• Hadoop Archives can be used as input to MapReduce.
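
To make the memory argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a figure of about 150 bytes of Name Node memory per file or block object; that number is a commonly quoted rule of thumb and does not come from this article, so treat the results as orders of magnitude only. The function names are made up for illustration.

```python
# Rough estimate of Name Node memory used by file and block metadata.
# ASSUMPTION: ~150 bytes per namespace object (file or block) is a commonly
# quoted rule of thumb, not a figure taken from this article.
BYTES_PER_OBJECT = 150
MB = 1024 * 1024


def blocks_needed(file_size, block_size):
    """Number of HDFS blocks a file of file_size bytes occupies."""
    return -(-file_size // block_size)  # ceiling division


def namenode_bytes(num_files, file_size, block_size=128 * MB):
    """Approximate Name Node memory: one object per file plus one per block."""
    objects = num_files * (1 + blocks_needed(file_size, block_size))
    return objects * BYTES_PER_OBJECT


def disk_bytes(num_files, file_size):
    """Disk used before replication: a block only holds the bytes written to it."""
    return num_files * file_size


# 10,000 files of 1 MB each, versus the same volume stored as 128 MB files.
small = namenode_bytes(10_000, 1 * MB)
large = namenode_bytes(79, 128 * MB)  # 10,000 MB / 128 MB, rounded up
print(small, large)                   # 3000000 23700
print(disk_bytes(1, 1 * MB) // MB)    # 1
```

Under these assumptions the small files need over a hundred times more Name Node memory for the same amount of data, while the disk usage of each 1 MB file stays at 1 MB.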

Using Hadoop Archives:
→ A Hadoop Archive is created from a collection of files using the archive tool, which runs a MapReduce job to process the input files in parallel, so you need a running MapReduce cluster to use it.
→ Here are some files in Hadoop Distributed File System (HDFS) that you would like to archive:
% hadoop fs -lsr /my/files
→ Now we can run the archive command:
% hadoop archive -archiveName files.har /my/files /my
HAR files always have a .har extension, which is mandatory.
→ Here we are archiving only one source tree, the files in /my/files in HDFS, but the tool accepts multiple source trees. The final argument is the output directory for the HAR file.
→ The archive created by the above command is:
% hadoop fs -ls /my
Found 2 items
drwxr-xr-x - tom supergroup 0 2009-04-09 19:13 /my/files
drwxr-xr-x - tom supergroup 0 2009-04-09 19:13 /my/files.har
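
The argument order of the archive command can be summarized with a small Python sketch. The helper `build_archive_command` is hypothetical and only assembles the argument list in the form shown above; it does not run anything. Some Hadoop releases also expect a `-p <parent>` option before the source paths, which this sketch does not cover, so check the usage message of your version.

```python
def build_archive_command(archive_name, sources, dest_dir):
    """Return the argv list for a `hadoop archive` call in the form shown above.

    Hypothetical helper: it only assembles arguments and executes nothing.
    """
    if not archive_name.endswith(".har"):
        raise ValueError("archive name must end in .har")
    if not sources:
        raise ValueError("at least one source tree is required")
    # Source trees come first; the output directory is always the last argument.
    return ["hadoop", "archive", "-archiveName", archive_name, *sources, dest_dir]


cmd = build_archive_command("files.har", ["/my/files"], "/my")
print(" ".join(cmd))  # hadoop archive -archiveName files.har /my/files /my
```

On a machine with a Hadoop client installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`.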
→ In HDFS, blocks are replicated across multiple machines known as data nodes. The default replication is three-fold, i.e. each block exists on three different machines.
→ A master node called the Name Node keeps track of which blocks make up a file and where those blocks are located. This information is known as the metadata.

Example:
• The Name Node holds metadata for the two files, Foo.txt and Bar.txt:
Foo.txt: blk-001, blk-002, blk-003
Bar.txt: blk-004, blk-005
• Data nodes hold the actual blocks
• Each block will be 64 MB or 128 MB in size
• Each block is replicated three times on the cluster
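
The example can be modelled with two small lookup tables, sketched here in Python. The data node names (`dn1` to `dn4`) and the round-robin placement are assumptions made purely for illustration; this is not the placement policy HDFS actually uses.

```python
from itertools import cycle

REPLICATION = 3

# File-to-block mapping held by the Name Node, as in the example.
file_blocks = {
    "Foo.txt": ["blk-001", "blk-002", "blk-003"],
    "Bar.txt": ["blk-004", "blk-005"],
}

# Hypothetical data node names; a real cluster uses host names or addresses.
data_nodes = ["dn1", "dn2", "dn3", "dn4"]


def place_replicas(file_blocks, data_nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct data nodes, round-robin.

    ASSUMPTION: round-robin placement is for illustration only.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    ring = cycle(data_nodes)
    locations = {}
    for blocks in file_blocks.values():
        for block in blocks:
            locations[block] = [next(ring) for _ in range(replication)]
    return locations


block_locations = place_replicas(file_blocks, data_nodes)
for name, blocks in file_blocks.items():
    for block in blocks:
        print(name, block, block_locations[block])
```

The two dictionaries together are the metadata: one answers "which blocks make up this file", the other "where does each block live".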

• The Name Node daemon must be running at all times. If the Name Node stops, the cluster becomes inaccessible, so the system administrator must take care to ensure that the Name Node hardware is reliable.
• The Name Node holds all of its metadata in RAM for fast access, and it keeps a record of changes on disk for crash recovery.
• A separate daemon known as the Secondary Name Node takes care of some housekeeping tasks for the Name Node. Be careful: the Secondary Name Node is not a backup Name Node.
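
The idea of keeping metadata in RAM while recording changes on disk can be sketched as a tiny change log that is replayed after a crash. This is a simplified model, not how the Name Node is implemented: the `Namespace` class, the JSON record format, and the use of a Python list in place of a file on disk are all assumptions for illustration.

```python
import json


class Namespace:
    """In-memory metadata plus an append-only log of changes.

    Hypothetical model: a Python list stands in for the log file on disk.
    """

    def __init__(self, edit_log=None):
        self.files = {}      # path -> list of block ids, held in RAM
        self.edit_log = []   # serialized changes, in the order they happened
        for line in edit_log or []:
            # Recovery: rebuild the in-memory state by replaying the log.
            self._apply(json.loads(line))
            self.edit_log.append(line)

    def _apply(self, edit):
        if edit["op"] == "add":
            self.files[edit["path"]] = edit["blocks"]
        elif edit["op"] == "delete":
            self.files.pop(edit["path"], None)

    def record(self, edit):
        """Log the change first, then apply it to the in-memory state."""
        self.edit_log.append(json.dumps(edit))
        self._apply(edit)


ns = Namespace()
ns.record({"op": "add", "path": "Foo.txt", "blocks": ["blk-001", "blk-002", "blk-003"]})
ns.record({"op": "add", "path": "Bar.txt", "blocks": ["blk-004", "blk-005"]})
ns.record({"op": "delete", "path": "Bar.txt"})

# Simulate a restart: only the log survives, and replaying it restores the state.
recovered = Namespace(ns.edit_log)
print(recovered.files)  # {'Foo.txt': ['blk-001', 'blk-002', 'blk-003']}
```

A log like this grows with every change, which is one reason a real system needs periodic housekeeping on it.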

Note:
→ Although files are split into 64 MB or 128 MB blocks, if a file is smaller than this, the full 64 MB / 128 MB will not be used.
→ Blocks are stored as standard files on the data nodes, in a set of directories specified in the Hadoop configuration files. This is set by the system administrator.
→ Without the metadata on the Name Node, there is no way to access the files in the HDFS cluster.
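
The first point of the note can be checked with a few lines of arithmetic. The Python sketch below uses a hypothetical helper, `split_into_blocks`, and assumes a 128 MB block size; it shows that only the last block of a file is smaller than the block size, and that a small file occupies a single block of its own size.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the size in bytes of each block a file of file_size bytes occupies."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    full, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


print([s // MB for s in split_into_blocks(300 * MB)])  # [128, 128, 44]
print([s // MB for s in split_into_blocks(1 * MB)])    # [1]
```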

When a client application wants to read a file:
a) It communicates with the Name Node to determine which blocks make up the file and which data nodes those blocks reside on.
b) It then communicates directly with the data nodes to read the data.
c) Because file data never passes through the Name Node, the Name Node does not become a bottleneck.
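
The read path above can be sketched with a toy model in Python. The classes `NameNode` and `DataNode` and the function `read_file` are hypothetical stand-ins, not the Hadoop client API; the point is only that the Name Node answers a metadata question while the bytes come straight from the data nodes.

```python
class NameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self, file_blocks, block_locations):
        self.file_blocks = file_blocks
        self.block_locations = block_locations

    def get_block_locations(self, path):
        """Return (block id, data node names) pairs for a file, in order."""
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


class DataNode:
    """Holds the actual block contents."""

    def __init__(self, blocks):
        self.blocks = blocks

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, name_node, data_nodes):
    """Step a: ask the Name Node for locations. Step b: read from data nodes."""
    data = b""
    for block_id, nodes in name_node.get_block_locations(path):
        # Read from the first listed replica; a real client would pick a close one.
        data += data_nodes[nodes[0]].read_block(block_id)
    return data


# Hypothetical two-block file spread over two data nodes.
name_node = NameNode(
    {"Foo.txt": ["blk-001", "blk-002"]},
    {"blk-001": ["dn1", "dn2"], "blk-002": ["dn2", "dn1"]},
)
contents = {"blk-001": b"Hello, ", "blk-002": b"HDFS"}
data_nodes = {"dn1": DataNode(contents), "dn2": DataNode(contents)}
print(read_file("Foo.txt", name_node, data_nodes))  # b'Hello, HDFS'
```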

Last updated: 04 Apr 2023

About Author

Ravindra Savaram

Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.
