Day in and day out, often without realizing it, we create loads of data through social media platforms such as Facebook and LinkedIn. With the advent of such platforms, the volume of data is growing exponentially. To understand the data that gets created on a daily basis, it helps to categorize it:
Structured Data: Data that can be stored in a database in an organized manner, with a specific set of columns.
Semi-structured Data: Data that carries some organizing markers, such as tags or keys, but does not follow a fixed set of columns, so only part of it maps cleanly onto a database and the rest has to reside on a different means of storage. Example: XML records.
Unstructured Data: Any data that doesn’t fit in the first two categories, such as free text, images or video. Much of what is commonly referred to as “Big Data” falls into this category.
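To make the three categories concrete, here is a minimal Python sketch. The sample records and field names are made up for illustration; the point is only how much structure each kind of data offers to a program.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: every record has the same fixed set of columns.
structured = io.StringIO("id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n")
rows = list(csv.DictReader(structured))

# Semi-structured: fields are tagged, but records may differ in shape
# (the second user below has no <phone> element).
xml_text = """
<users>
  <user id="1"><name>Asha</name><phone>555-0100</phone></user>
  <user id="2"><name>Ravi</name></user>
</users>
""".strip()
users = [
    {"id": user.get("id"), **{child.tag: child.text for child in user}}
    for user in ET.fromstring(xml_text)
]

# Unstructured: free text with no fields at all.
unstructured = "Met Ravi in Delhi yesterday, great chat about Hadoop!"

print(rows[0]["city"])        # a column lookup always works
print(users[1].get("phone"))  # a field may simply be missing
print(unstructured.split())   # only generic text processing is possible
```

The structured rows can be loaded straight into a table, the XML records need per-record handling for missing fields, and the free text has to be interpreted before it can be queried at all.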
With this background, we can introduce Apache Hadoop, a widely used framework for storing, processing and analyzing Big Data. It is the name you are most likely to hear whenever the talk turns to analyzing TBs (terabytes), PBs (petabytes) or even ZBs (zettabytes) of data. The Hadoop ecosystem typically serves the following two purposes:
Storing enormous amounts of data:
This is attained by adding more and more nodes to the Hadoop cluster. The default block size in the Hadoop Distributed File System (HDFS) is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later, as against the few kilobytes (commonly 4 KB) used by a conventional file system.
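A quick back-of-the-envelope calculation shows why large blocks matter. The sketch below counts how many blocks a 1 TB file would occupy at different block sizes. The 4 KB figure is an assumed block size for a conventional local file system, and `blocks_needed` is a helper made up for this illustration.

```python
KB = 1024
MB = 1024 * KB
TB = 1024 * 1024 * MB

def blocks_needed(file_size: int, block_size: int) -> int:
    """Return how many blocks of block_size bytes are needed for file_size bytes."""
    return -(-file_size // block_size)  # ceiling division on integers

file_size = 1 * TB
for label, block_size in [("4 KB", 4 * KB), ("64 MB", 64 * MB), ("128 MB", 128 * MB)]:
    print(f"{label:>6} blocks: {blocks_needed(file_size, block_size):,}")
```

With 128 MB blocks the file is tracked as 8,192 blocks, whereas 4 KB blocks would mean more than 268 million entries. Fewer, larger blocks keep the block metadata manageable and favor long sequential reads.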
Bringing computation to data:
Traditionally, data is pulled out to the client machines, which then run the actual computation. Hadoop does the opposite: because the stored data is so huge, it is more efficient to send the computation to the nodes that hold the data, which is mainly done through MapReduce jobs.
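The idea can be sketched in plain Python as a toy word count. This is not Hadoop code: the two "nodes" are just dictionary entries, and the function names are chosen for this illustration. What it shows is that the map step runs against each node's own lines, and only small (word, count) pairs travel on to the reduce step.

```python
from collections import Counter, defaultdict

# Hypothetical cluster: each node stores its own block of lines.
node_blocks = {
    "node1": ["hive pig hadoop", "pig pig"],
    "node2": ["hadoop hive", "hive"],
}

def map_phase(lines):
    """Runs where the lines live; emits locally pre-summed (word, count) pairs."""
    local = Counter()
    for line in lines:
        local.update(line.split())
    return list(local.items())

def shuffle(mapped_outputs):
    """Collects the counts emitted by every node under their word."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Sums the per-node counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

mapped = [map_phase(lines) for lines in node_blocks.values()]
totals = reduce_phase(shuffle(mapped))
print(totals)
```

Summing inside `map_phase` plays the role of a combiner: each node ships one pair per distinct word instead of its raw lines.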
The Hadoop ecosystem contains more than 20 components, and we will look only at the two that matter for this article: Apache Pig and Apache Hive.
Hive is very similar to a SQL interface for Hadoop: the query operations of a traditional RDBMS have direct counterparts in Hive, such as the select, where, group by and order by clauses. The data typically resides in HDFS (Hive can also query HBase tables through a storage handler) and is accessible via HiveQL. Hive is a simple solution for those who have little experience writing MapReduce jobs on Hadoop, because internally the queries are transformed into MapReduce jobs.
Hive can thus be described as follows:
Data warehouse infrastructure
Provider of HiveQL similar to SQL
Provider of utility tools for extracting, transforming and loading data (ETL)
Allows embedding custom mappers and reducers
Provider of powerful statistics functions
Hive is gaining popularity, helped by the support it gets from Hue
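To give a feel for the declarative style, the sketch below runs a query with select, where, group by and order by clauses. It uses Python's built-in `sqlite3` module purely as a stand-in, since a Hive installation cannot be assumed here. The `employees` table and its rows are made up, and while a statement of this shape should look familiar in HiveQL, the example makes no claim about Hive's exact dialect.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive table (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Asha", "eng", 90000),
        ("Ravi", "eng", 70000),
        ("Mina", "ops", 60000),
        ("Omar", "ops", 40000),
    ],
)

# The query states WHAT is wanted; the engine decides HOW to compute it.
query = """
    SELECT dept, COUNT(*) AS headcount
    FROM employees
    WHERE salary > 50000
    GROUP BY dept
    ORDER BY dept
"""
result = conn.execute(query).fetchall()
print(result)  # [('eng', 2), ('ops', 1)]
conn.close()
```

In Hive, a query like this would be planned into one or more MapReduce jobs behind the scenes, which is exactly the work the analyst no longer has to write by hand.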
Pig is a platform for creating and executing MapReduce jobs on big datasets in an ad-hoc manner. The prime motive behind developing Pig was to cut down the development time that hand-written MapReduce jobs require. It is a high-level data flow system that provides a simple language, termed Pig Latin, for manipulating and querying data. The following are the reasons that make Pig popular among the components available in the Hadoop ecosystem:
Follows a multi-query approach to avoid multiple scans of the datasets.
Pig is easy to pick up if you are familiar with SQL
Pig provides nested data types like Maps, Tuples and Bags, which are not available in plain MapReduce
Pig also supports major data operations like ordering, filters and joins
Performance-wise, Pig's built-in optimizations keep scripts competitive with hand-written MapReduce while requiring far less code
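The data flow style is easiest to see as a chain of named steps, each taking the previous relation and producing a new one. The Python sketch below imitates that shape with ordinary lists and tuples. It is not Pig Latin; the page-view records are made up, and the comments only indicate which Pig operation each step loosely corresponds to.

```python
from itertools import groupby
from operator import itemgetter

# LOAD: a relation is a bag (here, a list) of tuples: (user, page, visits).
views = [
    ("asha", "home", 3),
    ("ravi", "cart", 1),
    ("asha", "cart", 2),
    ("mina", "home", 4),
]

# FILTER: keep only tuples with at least two visits.
frequent = [v for v in views if v[2] >= 2]

# GROUP: each result is (key, bag of tuples), a nested structure.
by_user = sorted(frequent, key=itemgetter(0))
grouped = [(user, list(rows)) for user, rows in groupby(by_user, key=itemgetter(0))]

# FOREACH ... GENERATE: derive one output tuple per group.
totals = [(user, sum(row[2] for row in rows)) for user, rows in grouped]

# ORDER: sort by total visits, highest first.
ordered = sorted(totals, key=itemgetter(1), reverse=True)
print(ordered)  # [('asha', 5), ('mina', 4)]
```

Every intermediate relation has a name and can be inspected on its own, which is what makes this procedural style comfortable for programmers and a little unfamiliar for those used to a single SQL statement.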
Many factors set these two components of the Hadoop ecosystem apart, and they differ both in how they are used and in the purposes they serve. Let us compare them point by point.
| Hadoop Hive | Hadoop Pig |
|---|---|
| Mainly used by data analysts | Generally used by researchers and programmers |
| Used against completely structured data | Used against semi-structured data |
| Builds on SQL expertise, so the learning curve is very small for SQL developers | Pig Latin borrows from SQL but departs from it considerably, so the learning curve is noticeably steeper |
| Has a declarative, SQL-like language termed HiveQL | Has a procedural, data-flow language termed Pig Latin |
| Mainly used for creating reports | Mainly used for programming |
| Operates on the server side of an HDFS cluster | Operates on the client side of an HDFS cluster |
| Very helpful in the areas of ETL | A wonderful ETL tool for Big Data, thanks to its powerful transformation and processing capabilities |
| Can start an optional Thrift-based server, to which queries can be sent from anywhere for direct execution | Offers no such server |
| Uses SQL DDL to define tables upfront and stores the schema details in a metastore database (a local one by default) | Has no dedicated metadata database, so schemas and data types are defined in the scripts themselves |
| Early releases had no Avro support; newer releases handle Avro through the AvroSerDe | Supports Avro |
| Interaction is primarily shell based | Can be installed very easily |
| Supports partitions, so that subsets of the data (for example, by date) can be processed | Has no direct notion of partitions, but a similar effect can be achieved with filters |
| Has no counterpart to ILLUSTRATE | Renders sample data for each step of a script through the ILLUSTRATE operator |
| Provides access to raw data | Raw data access through Pig Latin scripts is not as fast as with HiveQL |
| A user can join, order and sort data dynamically (though in an aggregated manner) | Can perform outer joins using the COGROUP feature |
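The last row deserves a short illustration. COGROUP collects, for every key, the matching tuples from each relation into separate bags, and an outer join falls out of that once an empty bag is treated as a missing value. The Python sketch below imitates the idea. The `cogroup` and `full_outer_join` functions and the sample relations are made up for this example, and `None` plays the role of a null.

```python
from collections import defaultdict

# Two hypothetical relations keyed on their first field.
users = [(1, "asha"), (2, "ravi"), (3, "mina")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "desk")]

def cogroup(left, right):
    """Group both relations by key: key -> (bag of left rows, bag of right rows)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[0]][0].append(row)
    for row in right:
        groups[row[0]][1].append(row)
    return dict(groups)

def full_outer_join(left, right):
    """Flatten the cogrouped bags, keeping rows that have no match on the other side."""
    joined = []
    for key, (left_bag, right_bag) in sorted(cogroup(left, right).items()):
        # An empty bag is replaced by a single None so the unmatched row survives.
        for left_row in left_bag or [None]:
            for right_row in right_bag or [None]:
                joined.append((left_row, right_row))
    return joined

for pair in full_outer_join(users, orders):
    print(pair)
```

User 2 has no orders and order 4 has no user, yet both appear in the result, paired with `None`, which is the behavior a full outer join is expected to have.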
To conclude: having gone through the essentials of Big Data, the Hadoop ecosystem and the two components this article focuses on, Hive and Pig, it is clear that there is no real battle between them. Each provides its own way of analyzing Big Data, with Pig favored by the programmer community and Hive admired by database developers.