This tutorial gives an overview of the fundamentals of Big Data and Hadoop.
- Hadoop is a term you will hear over and over again when discussing the processing of big data.
- Apache Hadoop is an open-source software framework that supports data-intensive distributed applications.
- Hadoop supports the running of applications on large clusters of commodity hardware.
- The Hadoop framework transparently provides both reliability and data motion to applications. Learn much more during our Hadoop online training.
- Hadoop enables applications to work with huge amounts of data on various servers.
- Hadoop allows existing data to be pulled from various places and uses MapReduce to push the query code to the data and run the analysis, returning the desired results.
- For storage, Hadoop has a large-scale file system called the Hadoop Distributed File System (HDFS). The framework manages the distribution of programs across the cluster, accepts their results, and generates a result data set.
- Here, a MapReduce application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
- Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.
- It enables applications to work with thousands of computationally independent computers and petabytes of data.
- Hadoop was derived from Google's MapReduce and Google File System papers.
- Hadoop is written in the Java programming language and is a top-level Apache project being built and used by a global community of contributors.
- Hadoop and its related projects have many contributors from across the ecosystem.
- To start Hadoop, we must have the Hadoop Common package, which contains the necessary JAR files and scripts. A contribution section that includes projects from the Hadoop
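The map, shuffle and reduce flow described in the points above can be sketched in a few lines of plain Python. This is only an illustration that runs on one machine, not Hadoop's API: the function names (`map_fragment`, `shuffle`, `reduce_counts`, `run_job`) and the sample input are assumptions made for this example.

```python
from collections import defaultdict


def map_fragment(lines):
    """Map step: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(lines, fragment_size=2):
    """Split the input into small fragments, map each one, then reduce."""
    fragments = [lines[i:i + fragment_size]
                 for i in range(0, len(lines), fragment_size)]
    pairs = []
    for fragment in fragments:
        # map_fragment has no side effects, so a fragment whose node failed
        # could simply be run again on another node with the same result.
        pairs.extend(map_fragment(fragment))
    return reduce_counts(shuffle(pairs))


if __name__ == "__main__":
    sample = ["big data hadoop", "hadoop stores data", "hadoop processes big data"]
    print(run_job(sample))
```

Because each fragment is processed independently, the work can be spread over many nodes, and re-running a fragment does not change the final counts.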
Understanding distributed systems and Hadoop
- To understand the popularity of distributed systems and Hadoop, consider the price/performance of current I/O technology.
- A high-end machine with four I/O channels, each having a throughput of 100 MB/sec, will require about 3 hours to read a 4 TB data set.
- With Hadoop, the same data set is divided into smaller blocks, typically 64 MB each, that are spread among many machines in the cluster via the Hadoop Distributed File System.
- With a modest degree of replication, the cluster machines can read the data set in parallel and provide a much higher throughput. Such a cluster of commodity machines turns out to be cheaper than one high-end server.
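The figures above can be checked with a short calculation. The sketch below assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), which the text does not specify. The 100-machine cluster is a hypothetical example, and the parallel estimate ignores network and coordination overhead.

```python
import math

MB = 10**6   # assumption: decimal megabyte
TB = 10**12  # assumption: decimal terabyte


def read_time_seconds(data_bytes, channels, throughput_bytes_per_sec):
    """Time to read the data when all channels stream in parallel."""
    return data_bytes / (channels * throughput_bytes_per_sec)


def block_count(data_bytes, block_bytes):
    """Number of blocks needed to hold the data set."""
    return math.ceil(data_bytes / block_bytes)


# One high-end machine: four channels at 100 MB/sec each.
single_machine = read_time_seconds(4 * TB, channels=4,
                                   throughput_bytes_per_sec=100 * MB)

# Hypothetical cluster: 100 commodity machines, one 100 MB/sec channel each.
cluster = read_time_seconds(4 * TB, channels=100,
                            throughput_bytes_per_sec=100 * MB)

blocks = block_count(4 * TB, 64 * MB)

if __name__ == "__main__":
    print(f"single machine: {single_machine / 3600:.2f} hours")
    print(f"100-machine cluster: {cluster / 60:.1f} minutes")
    print(f"64 MB blocks in 4 TB: {blocks}")
```

Under these assumptions the single machine needs 10,000 seconds, just under 3 hours, which matches the figure quoted above.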
Comparing SQL databases and Hadoop
- In an RDBMS, data is stored in tables as structured data.
- In Hadoop, we can store any type of data, such as unstructured data, images, Google Maps data, etc.
- An RDBMS is typically used to store gigabytes of data. In Hadoop, storage grows by adding nodes to the cluster, so there is no fixed limit on the amount of data.
- In an RDBMS, a primary key does not allow duplicate values; inserting a duplicate produces a primary key constraint error. In Hadoop, there are no such keys or constraints; instead, each block of data is stored as 3 replicas by default.
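The primary key behaviour in the last point can be seen with a small example. Here SQLite, from Python's standard library, stands in for an RDBMS, and a plain Python list stands in for records written to a file in HDFS. The table name, column names and helper functions are assumptions made for this illustration.

```python
import sqlite3


def insert_duplicate_key():
    """Insert the same primary key twice and return the resulting error text."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("INSERT INTO users VALUES (1, 'alice')")
        try:
            # Reusing id 1 violates the primary key constraint.
            conn.execute("INSERT INTO users VALUES (1, 'bob')")
        except sqlite3.IntegrityError as exc:
            return str(exc)
        return None
    finally:
        conn.close()


def append_records(records):
    """Stand-in for appending lines to a file: nothing checks for duplicates."""
    stored = []
    for record in records:
        stored.append(record)
    return stored


if __name__ == "__main__":
    print("RDBMS error:", insert_duplicate_key())
    print("Stored records:", append_records([(1, "alice"), (1, "bob")]))
```

Note that the 3 replicas in Hadoop are copies of the same block kept for fault tolerance. They are not a replacement for key constraints, so any uniqueness rule has to be enforced by the application that processes the data.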
For more information about Hadoop, please refer to http://hadoop.apache.org/
We have compiled a few more articles to get you acquainted with our Hadoop training programme. We will cover these in depth in our Hadoop online training sessions.
DISTRIBUTED FILE SYSTEMS IN HADOOP
Installing Cloudera QuickStart VM in Oracle VirtualBox
This section describes how to get started with Hadoop by importing the CDH virtual machine into Oracle VirtualBox. This 64-bit VM contains a single-node Apache Hadoop cluster with lots of examples, which makes it a good starting point for experimenting with Hadoop.
Host: The machine where VirtualBox or other virtualization software is installed.
Guest: The virtual machine run by the virtualization software.
The CDH 5 guest works well with as little as 4 GB of RAM, but for Cloudera Manager to work the memory requirement goes up to 8 GB.
The software versions used are:
1. Oracle VM VirtualBox Installation on Host
Download the version of VirtualBox for your platform from the Oracle VirtualBox website.
2. Cloudera CDH 5 VM Guest
Download the version of the QuickStart VM that matches your virtualization software from Cloudera’s website.
3. Importing the CDH VM
Start VirtualBox and click on ‘Import an appliance’ in the menu:
Select the .ovf file you downloaded in the previous step. The guest’s settings will be displayed here. Select the ‘Reinitialize the MAC address of all network cards’ checkbox:
4. VM Startup
On startup, Hue will be started automatically in a browser.
Enter the following login details:
5. View Tutorials
Click on ‘Get Started’ to view the various tutorials: