• HDFS stores small files inefficiently, since each file is stored in a block and block metadata is held in memory by the Name Node.
• Thus, a large number of small files can consume a lot of memory on the Name Node. Small files do not, however, take up more disk space than they need: for example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.
• Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing Name Node memory usage while still allowing transparent access to files.
• Hadoop Archives can be used as input to MapReduce.
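The memory argument above can be made concrete with a rough estimate. The sketch below is a back-of-the-envelope model, not how the Name Node actually accounts for memory: the figure of about 150 bytes per file or block object is a commonly quoted rule of thumb and is treated here as an assumption, and the function names are made up for illustration.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB      # block size used in the example above
BYTES_PER_OBJECT = 150     # assumption: rule-of-thumb heap cost per file or block object


def blocks_needed(file_size, block_size=BLOCK_SIZE):
    """Number of blocks a file occupies (integer ceiling division)."""
    return (file_size + block_size - 1) // block_size


def disk_usage(file_size):
    """Bytes used on disk per replica: a block only holds as much as the file contains."""
    return file_size


def namenode_memory(file_sizes, bytes_per_object=BYTES_PER_OBJECT):
    """Rough Name Node heap estimate: one object per file plus one per block."""
    objects = sum(1 + blocks_needed(size) for size in file_sizes)
    return objects * bytes_per_object


# One million 1 MB files versus the same amount of data in a single file.
many_small = namenode_memory([1 * MB] * 1_000_000)
one_large = namenode_memory([1_000_000 * MB])
print(many_small, one_large)
```

Under these assumptions the million small files cost the Name Node a few hundred times more memory than the single large file, even though both hold the same amount of data on disk.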
Using Hadoop Archives:
→ A Hadoop Archive is created from a collection of files using the archive tool. The tool runs a MapReduce job to process the input files in parallel, so you need a running MapReduce cluster to use it.
→ Here are some files in Hadoop Distributed File System (HDFS) that you would like to archive:
→ Now we can run the archive command:
% hadoop archive -archiveName files.har /my/files /my
HAR files always have a .har extension, which is mandatory.
→ Here we are archiving only one source, the files in /my/files in HDFS, but the tool accepts multiple source trees. The final argument is the output directory for the HAR file.
→ The archive created by the above command appears in the output directory:
Found 2 items
drwxr-xr-x - tom supergroup 0 2009-04-09 19:13 /my/files
drwxr-xr-x - tom supergroup 0 2009-04-09 19:13 /my/files.har
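If the archive command is issued from a script, its argument list can be assembled and checked first. The following is a minimal sketch, assuming the argument order used in this article (archive name, one or more source trees, then the output directory); `build_archive_command` is a hypothetical helper, not part of Hadoop. Newer Hadoop releases also expect a `-p <parent>` argument, which this sketch leaves out.

```python
def build_archive_command(archive_name, sources, dest):
    """Return the argument list for the archive command used in this article."""
    if not archive_name.endswith(".har"):
        raise ValueError("archive name must end with .har")
    if not sources:
        raise ValueError("at least one source tree is required")
    # Sources come before the final argument, which is the output directory.
    return ["hadoop", "archive", "-archiveName", archive_name, *sources, dest]


cmd = build_archive_command("files.har", ["/my/files"], "/my")
print(" ".join(cmd))  # hadoop archive -archiveName files.har /my/files /my

# To run it on a machine with Hadoop installed and a cluster running:
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.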
→ In HDFS, blocks are replicated across multiple machines known as data nodes. The default replication is three-fold, i.e. each block exists on three different machines.
→ A master node called the Name Node keeps track of which blocks make up a file and where those blocks are located; this information is known as the metadata.
• The Name Node holds metadata for the two files, Foo.txt and Bar.txt:
Foo.txt: blk-001, blk-002, blk-003
Bar.txt: blk-004, blk-005
• Data nodes hold the actual blocks
• Each block will be 64 MB or 128 MB in size
• Each block is replicated three times on the cluster
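The block and replication figures above translate into simple arithmetic. Here is a small sketch, assuming a 128 MB block size and a hypothetical 300 MB file; the function names are illustrative only.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the size of each block; only the last block may be smaller."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


def raw_storage(file_size, replication=3):
    """Total bytes stored across the cluster once every block is replicated."""
    return file_size * replication


blocks = split_into_blocks(300 * MB)
print([b // MB for b in blocks])     # [128, 128, 44]
print(raw_storage(300 * MB) // MB)   # 900
```

The last block holds only the remaining 44 MB, and three-fold replication means the cluster stores 900 MB in total for a 300 MB file.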
When a client application wants to read a file:
a) It communicates with the Name Node to determine which blocks make up the file and which data nodes those blocks reside on.
b) It then communicates directly with the data nodes to read the data.
c) As a result, the Name Node does not become a bottleneck, because file data never passes through it.
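The read steps above can be sketched as a toy model. This is not the HDFS client API: the classes, the data node names `dn1` and `dn2`, and the block contents are all hypothetical, and the client simply takes the first listed replica instead of choosing the closest one.

```python
class NameNode:
    """Holds metadata only: which blocks make up a file and where replicas live."""

    def __init__(self, file_blocks, block_locations):
        self.file_blocks = file_blocks          # filename -> ordered block IDs
        self.block_locations = block_locations  # block ID -> data node names

    def get_block_locations(self, filename):
        return [(blk, self.block_locations[blk]) for blk in self.file_blocks[filename]]


class DataNode:
    """Holds the actual block contents."""

    def __init__(self, blocks):
        self.blocks = blocks  # block ID -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, datanodes, filename):
    # Step a: ask the Name Node which blocks make up the file and where they are.
    located = namenode.get_block_locations(filename)
    # Step b: fetch each block directly from a data node that holds a replica.
    parts = []
    for block_id, nodes in located:
        parts.append(datanodes[nodes[0]].read_block(block_id))
    return b"".join(parts)


# Hypothetical two-node cluster; replication is reduced to 2 to keep it short.
namenode = NameNode(
    file_blocks={"Foo.txt": ["blk-001", "blk-002", "blk-003"]},
    block_locations={
        "blk-001": ["dn1", "dn2"],
        "blk-002": ["dn2", "dn1"],
        "blk-003": ["dn1", "dn2"],
    },
)
contents = {"blk-001": b"one ", "blk-002": b"two ", "blk-003": b"three"}
datanodes = {"dn1": DataNode(dict(contents)), "dn2": DataNode(dict(contents))}

data = read_file(namenode, datanodes, "Foo.txt")
print(data)  # b'one two three'
```

In this model the Name Node only answers the lookup in step a, while all block bytes are returned by the data nodes, which mirrors why the Name Node is kept off the data path.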
Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.
Copyright © 2013 - 2023 MindMajix Technologies