Introduction to HDFS (Distributed File System)
[HDFS] Hadoop Distributed File System
- When a data set outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. File systems that manage storage across a network of machines are called distributed file systems.
- Since they are network-based, all the complications of network programming kick in, making distributed file systems more complex than regular disk file systems.
- For example, one of the biggest challenges is making the file system tolerate node failure without suffering data loss.
- Hadoop comes with a distributed file system called HDFS, which is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
HDFS Basic Concepts:
- HDFS is a file system written in Java, based on Google's GFS, which sits on top of a native file system such as ext3, ext4, or XFS.
- HDFS provides redundant storage for massive amounts of data using cheap unreliable computers.
- HDFS performs best with a ‘modest’ number of large files, i.e. millions of files rather than billions, with each file typically 100 MB or more.
- Files in HDFS are write once and read many times.
- In HDFS, no random writes to files are allowed. Append support is included in Cloudera's Distribution Including Apache Hadoop (CDH) for HBase reliability, but is not recommended for general use.
- HDFS is optimized for large, streaming reads of files rather than random reads (see the sketch after this list).
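As a concrete illustration of the write-once, read-many and streaming-read points above, the following is a minimal sketch using the Hadoop FileSystem Java API. The NameNode URI hdfs://namenode:8020 and the path /user/demo/events.log are hypothetical placeholders, not values taken from these notes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class HdfsWriteOnceReadMany {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; normally picked up from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/events.log");  // hypothetical path

            // Write once: create() opens a new file for sequential writing.
            // HDFS does not allow random (in-place) updates after the file is closed.
            try (FSDataOutputStream out = fs.create(file)) {
                out.write("first and only write\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read many times: open() returns a stream suited to large,
            // sequential (streaming) reads.
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }

            fs.close();
        }
    }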
HDFS File Storage:
- In HDFS, files are split into blocks, and each block is usually 64 MB or 128 MB.
- Data is distributed across many machines at load time.
- Different blocks from the same file will be stored on different machines; this also provides for efficient parallel processing (a sketch of inspecting block placement follows this list).
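The block size and block placement described above can be inspected programmatically. Below is a small sketch using the Hadoop Java API, assuming an existing file at the hypothetical path /user/demo/big-dataset.csv and a cluster configuration available on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlockInfo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/big-dataset.csv");  // hypothetical path
            FileStatus status = fs.getFileStatus(file);

            // Block size this file was written with (commonly 64 MB or 128 MB).
            System.out.println("Block size: " + status.getBlockSize() + " bytes");

            // One BlockLocation per block, listing the DataNodes that hold it.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("Offset " + block.getOffset()
                        + " length " + block.getLength()
                        + " hosts " + String.join(",", block.getHosts()));
            }

            fs.close();
        }
    }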