- Hadoop is a term you will hear and over again when discussing the processing of big data information.
- Apache Hadoop is an open-source software framework that supports data-intensive distributed applications.
- Hadoop supports the running of applications.
- Hadoop supports the running of applications on large clusters of commodity hardware.
- The Hadoop framework transparently provides both reliability and data motion to applications.
- Hadoop enables applications to work with huge amounts of data on various servers.
- Hadoop functions allow the existing data to be pulled from various places and use the map Reduce technology to push the query code and run a proper analysis, therefore returning the desired results.
- For the more specific functions, Hadoop has a large scale file system called Hadoop distributed file system(HDFS)which can write programs and manage the distribution of programs then accepts the results and then generates a data result set.
- Here, a map-reduce application is divided into many small fragments of work and each of which may be executed or re-executed on any node in the cluster.
- Both Map Reduce and Hadoop Distributed file system are designed so that node failures are automatically handled by the framework.
- It enables the applications to work with thousands of computation – independent computers and petabytes of data
- Hadoop was derived from Goggles map reduce and Goggles file system papers.
- Hadoop is written in the Java programming language and is a top-level Apache Project being built and used by a global community of contributors.
- Hadoop and its related projects have many contributors from across the ecosystem.
- To start Hadoop, we must have Hadoop common package which contains necessary JAR files and scripts.
A contribution section that includes projects from the Hadoop
Understanding distributed system and Hadoop:
- To understand the popularity of distributed system and Hadoop, consider the price perform once of current I/O technology.
- A high-end machine with four I/O channels each having a throughput of 100MB/Sec will require 3 hours to read 4 TB data set.
- With Hadoop, the same data set will be divided into smaller blocks that are spread among many machines in the cluster via Hadoop Distributed Filesystem and block will be typically 64MB.
- With a modest degree of replication, the cluster machines can read the data set in parallel and provide a much higher throughput and such a cluster of commodity machines turns out to be cheaper than one high–end server.
Comparing SQL databases and Hadoop:
- In RDBMS, The data will be stored in the form of tables and structured data.
- In Hadoop, we can store any type of data like unstructured data, images, Google maps etc.
- In RDBMS, we can store GB’s of data only In Hadoop, we can store any amount of data i.e. there is no limitation.
- In RDBMS, assigning the primary key value, it will not allow the duplicate values, at that time it will give primary key constant errors. In Hadoop, there is no such type of keys or constraints and instead of that, we are having 3 replicas.
Hadoop Overview and its Eco-System
- Hadoop is an open source implementation of the map-reduce platform and distributed file system, written in java
- Hadoop is actually a collection of tools, and an ecosystem built on top of the tools.
- The problem Hadoop solves is how to store and process big data. And when we need to store and process petabytes of information, the monolithic approach to computing no longer makes sense
- When data is loaded into the system, it is split into blocks i.e typically 64MB or 128 MB.
- The first part of the Map-Reduce system is to work on relatively some portions of data in a single block.
- A master program allocates work to nodes such that a map task will work on a block of data stored locally on that node whenever possible and many nodes work in parallel, each on their own part of the overall data set.
Subscribe to our youtube channel to get new updates..!
Hadoop consists of two core components:
- The Hadoop Distributed file system(HDFS)
- Map Reduce.
- There are many other projects based around core Hadoop often referred to as the Hadoop Ecosystem.
- The Hadoop Ecosystem act Pig, Hive, H Base, Flume, Oozie, sqoop, Zookeeper.
- A set of machines running HDFS and Map Reduce is known as a Hadoop Cluster.
- In Hadoop Cluster, Individual machines are known as nodes and a cluster can here as few as one node, as many as several thousands.
- If there are more nodes in a Hadoop cluster, performance is better.
Hadoop is comprised of fire separate daemons they are
- Name Node: Holds the metadata for HDFS
- Secondary Name Node: Performs housekeeping functions for the Name Node and is not a backup or hot standby for the Name Node.
- Data Node: Stores actual HDFS data blocks.
- Job Tracker: Manages Map Reduce jobs, distributes individual tasks to machines running.
- Task Tracker: Instantiates and monitors individual
- Each Daemon runs in its own Java virtual machine.
- No node on a real cluster will run all fire Daemons although this is technically possible.
We can consider nodes to be in two different categories:
Master Nodes: Run the Name Node, Secondary Name Node, Job Tracker daemons.
Slave Nodes: Run the Data node and task tracker daemons and a slave node will run both of these daemons
Basic Cluster Configuration:
- On very small clusters, the Name Node, Job Tracker, and secondary Name Node can all reside on a single machine and it is typical to put them on separator machines as the cluster grows beyond 20-30 Nodes
- Each dotted box on the previous diagram represents separate Java virtual machine(JVM)
List of Other Big Data Courses:
|Big Data On AWS||Informatica Big Data Integration|
|Bigdata Greenplum DBA||Informatica Big Data Edition|
|Hadoop Testing||Apache Mahout|