Before we look at where Hadoop fits in, we first need to understand what ‘Big Data’ is. Big Data is no different from the ordinary data we handle in our daily lives, such as the data we have at work or the mail in our Gmail or Yahoo Mail inboxes, with one small twist: it is the whole of the data that an organization holds, or all of the data that Google hosts on its Gmail servers or Yahoo hosts on its servers, and the like.
When we have that much data at hand and need to deduce specific details from it, is it not like searching for a needle in a haystack? Definitely yes, and frameworks like Hadoop are the answer to that problem. The next few sections of this article are dedicated to understanding what this framework can do as a whole. Now that we have introduced what data constitutes ‘Big Data’, let us look at its characteristics.
Broadly, Big Data can be classified into three categories, as described below:
Structured data is data that can be stored, accessed and processed in a fixed format. The most familiar example is the data held in an RDBMS (databases and tables).
Unstructured data is data that has no known form or structure; being very large in size is another trait of this category. Because the data is not structured at all, storing, accessing and processing it is also a tough process. A familiar example is the set of search results returned by Google.
Semi-structured data is a combination of the two forms described above. The most familiar example is the data held in XML files.
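To make the three categories concrete, here is a minimal Python sketch that reads one small sample of each. The sample records, field names and helper functions are all invented for illustration and are not part of Hadoop; the sketch only shows how much the available structure changes what a program can do with the data.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented purely for illustration.
STRUCTURED_CSV = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
SEMI_STRUCTURED_XML = (
    "<users>"
    "<user id='1'><name>Asha</name><city>Pune</city></user>"
    "<user id='2'><name>Ravi</name></user>"  # no <city>: fields may be missing
    "</users>"
)
UNSTRUCTURED_TEXT = "Asha moved to Pune last year. Ravi still lives somewhere in Delhi."


def read_structured(text):
    """Fixed format: every row has the same columns, so access is by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def read_semi_structured(text):
    """Tags describe the data, but each record may carry a different set of fields."""
    records = []
    for user in ET.fromstring(text).findall("user"):
        record = {"id": user.get("id")}
        for child in user:
            record[child.tag] = child.text
        records.append(record)
    return records


def read_unstructured(text):
    """No schema at all: without further analysis we can only split into words."""
    return text.replace(".", " ").split()


structured_rows = read_structured(STRUCTURED_CSV)
semi_rows = read_semi_structured(SEMI_STRUCTURED_XML)
words = read_unstructured(UNSTRUCTURED_TEXT)
```

With the structured sample, a city can be looked up directly by column name. With the XML sample, the second record simply has no city field. With the free text, finding the city would need further analysis beyond splitting into words.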
Hadoop is a Java-based programming framework for storing and processing extremely large datasets in distributed computing environments. It is also open source and is an Apache project, sponsored by the Apache Software Foundation.
That being said, what were the traditional methodologies used before Hadoop, and what made Hadoop an instant success in the realm of Big Data and Analytics? Hadoop stores data in its own file system, named HDFS (Hadoop Distributed File System), which handles data in almost the same way as any other file system. Traditionally, if you were to query your RDBMS servers to deduce the needed details, you would need a dedicated server that lets go of normalization in order to satisfy the reporting requirements.
HDFS is more like a container for Hadoop: it stores all the data that you want to analyze further to deduce valuable information. The processing itself is done by a Java-based system known as MapReduce. Whether you think in terms of SQL or NoSQL, Hadoop is not exactly a database. It is more like a data warehouse system that needs a technique such as MapReduce to actually process the data held on HDFS.
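The idea behind MapReduce can be sketched in a few lines of plain Python. This is a single-process simulation of the classic word-count example, not Hadoop's Java API: the `CHUNKS` list stands in for pieces of a file stored on HDFS, and the function names `map_phase`, `shuffle_phase` and `reduce_phase` are hypothetical names chosen for this illustration.

```python
from collections import defaultdict

# Hypothetical input: each string stands in for one chunk of a file stored on HDFS.
CHUNKS = [
    "big data needs big storage",
    "hadoop stores big data",
]


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key so each word's counts sit together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each word's list of counts into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # In a real cluster the map calls would run in parallel on different machines;
    # here they run one after another in a single process.
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle_phase(pairs))


counts = word_count(CHUNKS)
```

The point of splitting the work this way is that each map call only needs its own chunk, so the chunks can be processed independently wherever they are stored, and only the grouped results have to be brought together for the reduce step.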
With the necessary background in place, let us focus on the most important question of this article: what is required of an individual to be successful with Hadoop? Let us break it down into a few points to understand the importance of each.
1. Java: Since Hadoop is written in Java, you need at least the basics of this programming language to get your hands dirty. There is no need to be disappointed if you come from a totally different background; there are still opportunities to work on Hadoop, as opportunities in the field of Big Data keep growing.
2. Linux: Hadoop is mostly run on Linux, where it performs better than on Windows, so basic knowledge of Linux will suffice, and the more you know, the merrier.
3. Understanding of Big Data: Though this is not a strict requirement for learning the Hadoop framework, an individual should definitely understand what he/she is stepping into.
Once you have a good understanding of the above 3 points, you can proceed with your quest to excel in Hadoop. Not knowing Java won’t stop you from making your mark in the Hadoop framework if you hail from a different technical background altogether, but it does make the path a bit more difficult, as the learning curve is bigger.
In this article, we have seen what Big Data actually is and in what context Hadoop fits in. With that understanding, we have also introduced the basics of Hadoop and how it works.
Coming to the most important point of the article: if you are a Java developer, your learning curve to excel in Hadoop is small, and if you hail from a different technical background, your learning curve is a little bigger (as you need to understand and appreciate the inner workings of Hadoop).