HDFS Architecture, Features & How to Access HDFS
- Applications can read and write HDFS files directly via the Java API.
- Typically, files are created on a local file system and must be moved into HDFS.
- Likewise, files stored in HDFS may need to be moved to a machine's local file system.
- Access to HDFS from the command line is achieved with the hadoop fs command.
- HDFS has a master/slave architecture.
- An HDFS cluster consists of a single Name Node, a master server that manages the file system namespace and regulates access to files by clients.
- In addition, there are a number of data nodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on.
- In HDFS, a stored file is split into one or more blocks, and these blocks are stored in a set of data nodes.
- The Name Node executes file system namespace operations such as opening, closing, and renaming files and directories.
- It also determines the mapping of blocks to data nodes.
- The data nodes perform block creation, deletion, and replication upon instruction from the Name Node.
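The block splitting and block-to-node mapping described above can be pictured with a small, self-contained sketch. It is a toy model rather than HDFS code: the function names, the round-robin placement, the 128-unit block size, and the replication factor of 3 are all assumptions made for illustration (a real Name Node uses a more elaborate, rack-aware placement policy).

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def assign_blocks(num_blocks, data_nodes, replication):
    """Map each block index to `replication` distinct data nodes.

    Hypothetical round-robin placement, used here only to show the idea
    of a block-to-node mapping kept by the master.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    mapping = {}
    for block in range(num_blocks):
        mapping[block] = [
            data_nodes[(block + r) % len(data_nodes)] for r in range(replication)
        ]
    return mapping


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)
    placement = assign_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
    for index, (offset, length) in enumerate(blocks):
        print(index, offset, length, placement[index])
```

The last block is shorter than the others because the file size is not a multiple of the block size.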
1. Hardware Failure
- Hardware failure is the norm rather than the exception.
- An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data.
- Files are replicated to tolerate hardware failure, and HDFS detects faults and recovers from them.
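As a rough illustration of recovery through replication, the sketch below removes a failed node from every block's replica list and copies the block to another live node until the target replica count is restored. The function name, the node names, and the "first available node" choice are hypothetical; how a failure is detected is outside this sketch.

```python
def handle_node_failure(block_map, failed_node, live_nodes, replication):
    """Return a new block map with replicas on failed_node replaced.

    block_map: dict of block id -> list of nodes holding a replica.
    live_nodes: nodes that are still healthy (hypothetical input).
    """
    repaired = {}
    for block_id, replicas in block_map.items():
        survivors = [node for node in replicas if node != failed_node]
        candidates = [
            node for node in live_nodes
            if node != failed_node and node not in survivors
        ]
        # Re-replicate onto other live nodes until the target count is met
        # or no candidate nodes remain.
        while len(survivors) < replication and candidates:
            survivors.append(candidates.pop(0))
        repaired[block_id] = survivors
    return repaired


if __name__ == "__main__":
    before = {0: ["dn1", "dn2", "dn3"], 1: ["dn2", "dn3", "dn4"]}
    after = handle_node_failure(before, "dn2", ["dn1", "dn3", "dn4"], replication=3)
    print(after)
```

If too few live nodes remain, a block simply stays under-replicated in this model.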
2. Streaming Data Access
- Applications that run on HDFS need streaming access to their data sets.
- HDFS is designed more for batch processing than for interactive use.
- The emphasis is on high throughput of data access rather than low latency of data access.
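A back-of-the-envelope model can show why large streaming reads favor throughput. The sketch below assumes each read pays one fixed positioning delay and then transfers data at a constant rate. The 10 ms delay and 100 MB/s rate are illustrative assumptions, not measurements from HDFS.

```python
SEEK_SECONDS = 0.010          # assumed fixed delay before data starts flowing
BANDWIDTH = 100 * 10**6       # assumed sustained transfer rate, bytes per second


def effective_throughput(read_bytes, seek_seconds=SEEK_SECONDS, bandwidth=BANDWIDTH):
    """Bytes per second achieved when one read pays one seek plus transfer time."""
    total_seconds = seek_seconds + read_bytes / bandwidth
    return read_bytes / total_seconds


if __name__ == "__main__":
    for label, size in [("4 KiB", 4 * 1024), ("128 MiB", 128 * 1024 * 1024)]:
        share = effective_throughput(size) / BANDWIDTH
        print(f"{label}: {share:.1%} of raw bandwidth")
```

Under these assumptions a small read spends almost all of its time waiting, while a large sequential read runs close to the raw transfer rate. That is the trade-off behind choosing throughput over latency.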
3. Large Data Sets
- Applications that run on HDFS have large data sets.
- A typical file in HDFS is gigabytes to terabytes in size.
- Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.
4. Simple Coherency Model
- HDFS applications need a write-once-read-many access model for files.
- A file, once created, written, and closed, need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access.
- A MapReduce application or a web crawler application fits perfectly with this model.
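The write-once-read-many idea can be sketched as an in-memory object that accepts writes only until it is closed. The class name and its methods are hypothetical and do not correspond to any HDFS API; the sketch only shows why readers never have to worry about the contents changing after close.

```python
class WriteOnceFile:
    """Toy model of a file that becomes immutable once closed."""

    def __init__(self):
        self._chunks = []
        self._closed = False

    def write(self, data: bytes) -> None:
        # Writes are allowed only while the file is still open.
        if self._closed:
            raise PermissionError("file is closed; its contents cannot change")
        self._chunks.append(data)

    def close(self) -> None:
        self._closed = True

    def read(self) -> bytes:
        # After close, every reader sees the same bytes.
        return b"".join(self._chunks)


if __name__ == "__main__":
    f = WriteOnceFile()
    f.write(b"crawled page 1\n")
    f.write(b"crawled page 2\n")
    f.close()
    print(f.read().decode())
```

Because nothing can change after `close()`, any number of readers can share the data without coordination.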
5. Moving Computation Is Cheaper than Moving Data
- A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the data set is huge.
- This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running.
- HDFS provides interfaces for applications to move themselves closer to where the data is located.
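A scheduler that follows this principle might look roughly like the sketch below: given the nodes that hold a block and the nodes that currently have spare capacity, it prefers a node that already stores the data. The function and its inputs are hypothetical and only illustrate the decision, not any real Hadoop scheduler.

```python
def pick_node_for_task(block_id, block_locations, free_nodes):
    """Return (node, is_local) for a task that reads block_id.

    block_locations: dict of block id -> nodes storing a replica.
    free_nodes: list of nodes that can accept a task right now.
    """
    if not free_nodes:
        raise RuntimeError("no free nodes available")
    for node in block_locations.get(block_id, []):
        if node in free_nodes:
            # Data-local choice: the block does not cross the network.
            return node, True
    # Fallback: the block has to be read remotely.
    return free_nodes[0], False


if __name__ == "__main__":
    locations = {"blk_1": ["dn2", "dn3"], "blk_2": ["dn4"]}
    print(pick_node_for_task("blk_1", locations, ["dn1", "dn3"]))
    print(pick_node_for_task("blk_2", locations, ["dn1", "dn3"]))
```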
6. Portability Across Heterogeneous Hardware and Software Platforms
- HDFS has been designed to be easily portable from one platform to another.
- This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
- HDFS can be accessed from applications in many different ways.
- Natively, HDFS provides a Java API for applications to use. The other access methods are the FS Shell command-line interface and a browser interface.
Command-Line Interface to HDFS
- To get a directory listing of the user's home directory in HDFS: hadoop fs -ls
- To get a directory listing of the HDFS root directory: hadoop fs -ls /
- To copy the file foo.txt from the local disk to the user's home directory in HDFS:
hadoop fs -copyFromLocal foo.txt foo.txt
This will copy the file to /user/username/foo.txt.
- To display the contents of an HDFS file:
Ex: File path is /user/fred/bar.txt
Command is hadoop fs -cat /user/fred/bar.txt
- To copy the file to the local disk as baz.txt: hadoop fs -copyToLocal /user/fred/bar.txt baz.txt
- To create a directory called input under the user's home directory: hadoop fs -mkdir input
- To delete the directory and all its contents: hadoop fs -rmr input (newer Hadoop releases deprecate -rmr in favor of hadoop fs -rm -r input)
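For scripting, the commands above can be wrapped in a thin helper. The sketch below only builds the argument lists. Actually running them through `run` assumes a `hadoop` executable on the PATH and a reachable cluster. The helper names are hypothetical, and the recursive delete uses the newer `-rm -r` spelling.

```python
import subprocess


def hadoop_fs(*args):
    """Build the argument list for one `hadoop fs` invocation."""
    return ["hadoop", "fs", *args]


def run(command):
    """Execute a command built by hadoop_fs and return its standard output.

    Requires a working Hadoop installation; not called in this sketch.
    """
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


# File names and paths are the ones used in the examples above.
COMMANDS = {
    "list_home": hadoop_fs("-ls"),
    "list_root": hadoop_fs("-ls", "/"),
    "upload": hadoop_fs("-copyFromLocal", "foo.txt", "foo.txt"),
    "show": hadoop_fs("-cat", "/user/fred/bar.txt"),
    "download": hadoop_fs("-copyToLocal", "/user/fred/bar.txt", "baz.txt"),
    "make_dir": hadoop_fs("-mkdir", "input"),
    "remove_dir": hadoop_fs("-rm", "-r", "input"),
}

if __name__ == "__main__":
    for name, command in COMMANDS.items():
        print(name, "->", " ".join(command))
```

Passing arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.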
% hadoop fs -ls /my/files.har
Found 3 items
-rw-r--r--  10 tom supergroup 165 2009-04-09
-rw-r--r--  10 tom supergroup  23 2009-04-09 19:13 /my/files.har/_masterindex
-rw-r--r--   1 tom supergroup   2 2009-04-09
- This directory listing shows that a HAR file is made of two index files and a collection of part files. The part files consist of the contents of a number of the original files concatenated together.
- The following command recursively lists the files in the archive: % hadoop fs -lsr har:///my/files.har
drw-r--r--   - tom supergroup 0 2009-04-09 19:13 /my/files.har/my
drw-r--r--   - tom supergroup 0 2009-04-09 19:13 /my/files.har/my/files
-rw-r--r--  10 tom supergroup 0 2009-04-09 19:13 /my/files.har/my/files/a
drw-r--r--   - tom supergroup 0 2009-04-09 19:13 /my/files.har/my/files/dir
drw-r--r--  10 tom supergroup 1 2009-04-09 19:13 /my/files.har/my/files
- This is quite straightforward if the file system that the HAR file is on is the default file system.
- If you want to refer to a HAR file on a different file system, you need to use a different form of the path URI than usual.
- To delete a HAR file, you need to use the recursive form of delete, since from the underlying file system's point of view the HAR file is a directory: % hadoop fs -rmr /my/files.har
- There are a few limitations to be aware of with HAR files.
- Creating an archive creates a copy of the original files, so you need as much disk space as the files you are archiving to create the archive.
- There is currently no support for archive compression, although the files that go into the archive can be compressed.
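The different URI form for a HAR file on a non-default file system embeds the underlying scheme and host in the authority part, as in har://scheme-host:port/path. The helper below sketches how such URIs could be assembled. The function name is hypothetical, and the host localhost and port 8020 in the usage lines are assumed values for illustration.

```python
def har_uri(archive_path, inner_path="", scheme=None, host=None, port=None):
    """Build a har:// URI for a path inside a Hadoop archive.

    With no scheme, the archive is assumed to live on the default file
    system (har:///...); otherwise the underlying scheme and host are
    joined with a hyphen in the authority (har://scheme-host:port/...).
    """
    if not archive_path.startswith("/"):
        raise ValueError("archive_path must be absolute")
    inner = "/" + inner_path.lstrip("/") if inner_path else ""
    if scheme is None:
        return f"har://{archive_path}{inner}"
    if not host:
        raise ValueError("host is required when a scheme is given")
    authority = f"{scheme}-{host}"
    if port is not None:
        authority += f":{port}"
    return f"har://{authority}{archive_path}{inner}"


if __name__ == "__main__":
    print(har_uri("/my/files.har"))
    print(har_uri("/my/files.har", "my/files/dir",
                  scheme="hdfs", host="localhost", port=8020))
```

The resulting string can be passed to hadoop fs -ls in place of a plain path.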