Introduction to Hadoop – Big Data Overview
- Hadoop is a term you will hear and over again when discussing the processing of big data information.
- Apache Hadoop is an open-source soft ware frame work that supports data intensive distributed applications.
- Hadoop supports the running of applications.
- Hadoop supports the running of applications on large clusters of commodity hard ware.
- The Hadoop frame work transparently provides both reliability and data motion to applications.
- Hadoop enables applications to work with huge amounts of data on various servers.
- Hadoop functions allow the existing data to be pulled from various places and use the map Reduce technology to push the query code and run a proper analysis, there fore returning the desired results.
- For the more specific functions, Hadoop has a large scale file system called Hadoop distributed file system(HDFS)which can write programs and manager the distribution of programs then accepts the results and then generates a data result set.
- Here, map reduce application is divided into many small fragments of work and each of which may be executed or re executed on any node in the cluster.
- Both map Reduce and Hadoop Distributed file system are designed so that node failures are automatically handled by the frame work.
- It enables the applications to work with thousands of computation – independent computers and peta bytes of data
- Hadoop was derived from Goggles map reduce and Goggles file system papers.
- Hadoop is written in the java programming language and is a top-level Apache Project being built and used by a global community of contributors.
- Hadoop and its related projects have many contributors from across the eco system.
- To start Hadoop, we must have Hadoop common package which contains a necessary JAR files and scripts.
A contribution section that includes projects from the Hadoop
Understanding distributed system and Hadoop:
- To understand the popularity of distributed system and Hadoop, consider the price perform once of current I/O technology.
- A high-end machine with four I/O channels each having a through put of 100MB/Sec will require 3 hours to read a 4 TB data set.
- With Hadoop, the same data set will be divided in to smaller blocks that are spread among many machines in the cluster via Hadoop Distributed File system and block will be typically 64MB.
- With a modest degree of replication, the cluster machines can read the data set in parallel and provide a much higher through put and such a cluster of commodity machines turns out to be cheaper than one high–end server.
Comparing SQL data bases and Hadoop:
- In RDBMS, The data will be store in the form of tables and structured data.
- In Hadoop, we can store any type of data like unstructured data, images, Google maps etc.
- In RDBMS, we can store GB’s of data only In Hadoop, we can store any amount of data i.e. there is no limitation.
- In RDBMS, assigning the primary key value, it will not allow the duplicate values, at that time it will give primary key constant errors. In Hadoop, there is no such type of keys or constraints and instead of that we are having 3 replicas.
Hadoop Overview and its Eco- System
- Hadoop is a open source implementation of the map reduce plat form and distributed file system, written in java
- Hadoop is actually a collection of tools, and an eco system built on top of the tools.
- The problem Hadoop solves is how to store and process big data. And when we need to store and process peta bytes of information, the monolithic approach to computing no longer makes sense
- When data is loaded into the system, it is split into blocks i.e typically 64MB or 128 MB.
- The first part of the Map Reduce system is to work on relatively some portions of data in a single block.
- A master program allocates work to nodes such that a map task will work on a block of data stored locally on that node when ever possible and many nodes work in parallel, each on their own part of the overall data set.
Hadoop consists of two core components:
1. The Hadoop Distributed file system(HDFS)
2. Map Reduce.
- There are many other projects based around core Hadoop often referred to as the Hadoop Eco system.
- The Hadoop Eco system act Pig, Hive, H Base, Flume, Oozie, sqoop, Zookeeper.
- A set of machines running HDFS and Map Reduce is known as a Hadoop Cluster.
- In Hadoop Cluster, Individual machines are known as nodes and a cluster can here as few as one node, as many as several thousands.
- If there are more nodes in a Hadoop cluster, performance is better.
Hadoop is comprised of fire separate daemons they are
- Name Node: Holds the meta data for HDFS
- Secondary Name Node: Performs house keeping functions for the Name Node and is not a back up or hot stand by for the Name Node.
- Data Node: Stores actual HDFS data blocks.
- Job Tracker: Manages Map Reduce jobs, distributes individual tasks to machines running.
- Task Tracker: Instantiates and monitors individual
→ Each Daemon runs in its own Java virtual machine.
→ No node on a real cluster will run all fire Daemons although this is technically possible.
→ We can consider nodes to be in two different categories:
Master Nodes: Run the Name Node, Secondary Name Node, Job Tracker daemons.
Slave Nodes: Run the Data node and task tracker daemons and a slave node will run both of these daemons
Basic Cluster Configuration:
- On very small clusters, the Name Node, Job Tracker and secondary Name Node can all reside on a single machine and it is typical to put them on separator machines as the cluster grows beyond 20-30 Nodes
- Each dotted box on the previous diagram represents separate java virtual machine(JVM)
Enroll for Live Instructor Led Online Hadoop Training