
Introduction to HDFS (Hadoop Distributed File System)


[HDFS] Hadoop Distributed File System

  • When a data set outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. File systems that manage storage across a network of machines are called distributed file systems.
  • Since they are network based, all the complications of network programming kick in, making distributed file systems more complex than regular disk file systems.
  • For example, one of the biggest challenges is making the file system tolerate node failure without suffering data loss.
  • Hadoop comes with a distributed file system called HDFS, which is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
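
To see why keeping several copies of each block lets a file system tolerate node failure, here is a minimal Python sketch. It is a toy model, not HDFS code: the function names, node names, and block IDs are made up for illustration, the round-robin placement is a simplification, and the replication factor of 3 is an assumption of this example.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin).

    Hypothetical helper for illustration; this is not how HDFS itself
    chooses where replicas go.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        # Pick `replication` consecutive nodes, wrapping around the node list.
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def lost_blocks(placement, failed_nodes):
    """Return the blocks that have no surviving replica after the failures."""
    failed = set(failed_nodes)
    return [b for b, holders in placement.items() if not set(holders) - failed]


nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(["blk_0", "blk_1", "blk_2"], nodes, replication=3)

# One node fails, but every block still has copies elsewhere.
print(lost_blocks(placement, ["node2"]))  # []
```

With `replication=1` the same failure would lose every block that lived only on `node2`, which is the data loss a distributed file system has to avoid.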

HDFS Basic concepts:

  • HDFS is a file system written in Java, based on Google's GFS, that sits on top of a native file system such as ext3, ext4 or xfs.
  • HDFS provides redundant storage for massive amounts of data using cheap, unreliable computers.
  • HDFS performs best with a ‘modest’ number of large files, i.e. millions rather than billions of files, with each file typically 100 MB or more.
  • Files in HDFS are written once and read many times.
  • In HDFS, no random writes to files are allowed. Append support is included in Cloudera's Distribution including Apache Hadoop (CDH) for HBase reliability, but it is not recommended for general use.
  • HDFS is optimized for large, streaming reads of files rather than random reads.
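
The write-once, read-many model can be pictured with a small Python sketch. This is only a toy in-memory model, not the HDFS client API: the class `WriteOnceFile` and its methods are hypothetical names chosen for this example.

```python
class WriteOnceFile:
    """Toy model of a write-once, read-many file (not the HDFS client API)."""

    def __init__(self):
        self._data = bytearray()
        self._closed = False

    def write(self, chunk: bytes) -> None:
        # Writes are sequential only: data is always added at the end,
        # and there is no way to overwrite bytes in the middle.
        if self._closed:
            raise PermissionError("file is closed; contents are immutable")
        self._data.extend(chunk)

    def close(self) -> None:
        # After closing, the contents can no longer change.
        self._closed = True

    def read_stream(self, chunk_size: int = 4):
        # Streaming read: yield the contents front to back in fixed-size chunks.
        for start in range(0, len(self._data), chunk_size):
            yield bytes(self._data[start:start + chunk_size])


f = WriteOnceFile()
f.write(b"hello ")
f.write(b"world")
f.close()

# The file can now be read as many times as needed, always front to back.
print(b"".join(f.read_stream()))  # b'hello world'
```

Calling `f.write(...)` after `close()` raises an error in this sketch, which mirrors the idea that a stored file is not modified in place.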

HDFS File Storage:

  • In HDFS, files are split into blocks, and each block is usually 64 MB or 128 MB.
  • Data is distributed across many machines at load time.
  • Different blocks from the same file are stored on different machines, which also provides for efficient ...
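
The block arithmetic is easy to check with a short Python sketch. It assumes a 128 MB block size and a hypothetical 300 MB file. The machine names and the round-robin assignment are illustrative only and do not reflect how HDFS actually picks machines.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, one of the common block sizes


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of `file_size` bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than the block size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def assign_to_machines(blocks, machines):
    """Spread consecutive blocks over machines in round-robin order."""
    return {i: machines[i % len(machines)] for i in range(len(blocks))}


file_size = 300 * 1024 * 1024  # hypothetical 300 MB file
blocks = split_into_blocks(file_size)

print(len(blocks))  # 3 (128 MB + 128 MB + 44 MB)
print(assign_to_machines(blocks, ["m1", "m2", "m3"]))  # {0: 'm1', 1: 'm2', 2: 'm3'}
```

Because each block sits on a different machine in this sketch, the three pieces of the file could be read in parallel rather than one after another from a single disk.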
