Difference between Pig and Hive

Introduction:

Day in and day out, without even realizing it, we create enormous amounts of data on social media platforms such as Facebook and LinkedIn. With the advent of such platforms, data volumes are growing exponentially. To make sense of the data created every day, it helps to categorize it into three types:

Structured Data: Data that can be stored in a database in an organized manner, with a fixed set of columns.

Semi-structured Data: Data that only partially fits a database schema, while the rest must reside in some other form of storage. For example, XML records can be only partially mapped onto database tables.

Unstructured Data: Any data that does not fit into the first two categories. Large collections of unstructured data are what is most commonly meant by the term "Big Data".

What is Big Data?

Big Data is the buzzword for datasets so large that traditional tools cannot store or analyze them: terabytes (TB), petabytes (PB), or even zettabytes (ZB) of data. With that introduction in place, we can now safely introduce Apache Hadoop, the best-known framework for storing, analyzing, and processing such volumes. The Hadoop ecosystem typically serves the following two purposes:

  1. Storing enormous amounts of data:

This is achieved by adding more and more nodes to the Hadoop cluster. The block size within the Hadoop Distributed File System (HDFS) is typically 64 MB or 128 MB, as against the few kilobytes (commonly 4 KB) used by ordinary file systems.

  2. Bringing computation to data:

Traditionally, data is extracted and moved to the clients that perform the computation. Apache Hadoop inverts this: because the stored data is so large, it is far more efficient to ship the computation to the nodes where the data already resides, which happens mainly through MapReduce jobs.
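As a concrete aside on the storage side, the HDFS block size mentioned above is a per-cluster configuration setting. A sketch of how it might be raised to 128 MB in hdfs-site.xml (this is an illustrative fragment; the `dfs.blocksize` property name applies to Hadoop 2.x and later):

```xml
<!-- hdfs-site.xml: set the default block size for newly written files to 128 MB -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 * 1024 * 1024 bytes -->
</property>
```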

The Hadoop ecosystem contains more than 20 components in its complete, fully functional form, but we will look at only the two that matter for this article: Apache Pig and Apache Hive.
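The MapReduce jobs mentioned above follow a simple model: a map phase emits key/value pairs, the framework shuffles them so that all values for a key end up together, and a reduce phase aggregates each group. A toy, single-machine Python sketch of that model (not the real Hadoop API) counting words:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data", "hadoop processes data"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result["hadoop"])  # 2
```

In real Hadoop the map and reduce functions run in parallel on the nodes holding the data blocks, which is exactly the "computation to data" idea.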

What is Apache Hive?

Hive provides a SQL-like interface on Hadoop: the familiar DML operations of a traditional RDBMS map onto Hive's select, where, group by, and order by clauses. The data itself is stored in HDFS (with table definitions kept in the Hive metastore) and is accessible via HiveQL. Hive is a simple entry point for those with little exposure to writing MapReduce jobs on Hadoop, since internally the queries are transformed into MapReduce jobs.

Hive can thus be described as the following:

  • A data warehouse infrastructure

  • A provider of HiveQL, a SQL-like query language

  • A provider of utility tools for data extraction, transformation, and loading (ETL)

  • A way to embed customized mappers and reducers

  • A provider of powerful statistical functions

Hive is also gaining popularity because it is supported by Hue.
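To give a feel for Hive in practice, the sketch below shows a hypothetical table and query (the table, field names, and file path are illustrative, not from this article). Note how closely HiveQL resembles ordinary SQL:

```sql
-- Define a table over files in HDFS; names here are illustrative.
CREATE TABLE page_views (
  user_id   STRING,
  url       STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Load a file already sitting in HDFS into the table.
LOAD DATA INPATH '/data/page_views.tsv' INTO TABLE page_views;

-- A familiar SQL-style aggregation; Hive compiles this into MapReduce jobs.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC;
```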

What is Apache Pig?

Pig offers an ad-hoc way to create and execute MapReduce jobs on big datasets. The prime motivation behind developing Pig was to curtail the development time that queries require. It is a high-level data-flow system providing a simple language, called Pig Latin, for manipulating and querying data. The following reasons make Pig popular among the components of the Hadoop ecosystem:

  • It follows a multi-query approach, avoiding multiple scans of the same dataset.

  • Pig is easy to pick up if you are already familiar with SQL.

  • Pig provides nested data types (maps, tuples, and bags) that are not available in plain MapReduce.

  • Pig supports the major data operations: ordering, filtering, and joins.

  • In terms of development effort, Pig is far more productive than writing raw MapReduce code.
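The same kind of aggregation shown for Hive can be written as a Pig Latin data flow. In the sketch below (file path, field names, and relation aliases are illustrative), each statement names an intermediate relation, which is what makes Pig procedural rather than declarative:

```pig
-- Load a tab-separated file and name its fields.
views = LOAD '/data/page_views.tsv'
        AS (user_id:chararray, url:chararray, view_time:chararray);

-- Keep only the rows we care about.
recent = FILTER views BY view_time > '2018-01-01';

-- Group by URL and count; COUNT runs over each group's bag of tuples.
by_url = GROUP recent BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(recent) AS views;

DUMP counts;
```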

Difference between Apache Pig and Apache Hive:

Many factors distinguish these two components: they differ both in how they are used and in the purpose they serve within the Hadoop ecosystem. Let us try to understand what each is used for.

| Hadoop Hive | Hadoop Pig |
| --- | --- |
| Mainly used by data analysts | Generally used by researchers and programmers |
| Used on completely structured data | Used on semi-structured data |
| Leverages existing SQL expertise, so the learning curve for SQL developers is very small | Pig Latin is SQL-like, but it differs enough that its learning curve is considerably steeper |
| Offers a declarative, SQL-like language called HiveQL | Offers a procedural, data-flow language called Pig Latin |
| Basically used for generating reports | Basically used for programming |
| Operates on the server side of an HDFS cluster | Operates on the client side of an HDFS cluster |
| Very helpful for ETL work | A powerful ETL tool for Big Data, owing to its transformation and processing capabilities |
| Can start an optional Thrift-based server, so queries can be sent to the Hive server from anywhere for execution | Provides no comparable server |
| Uses SQL DDL to define tables up front, storing the schema details in its metastore database | Maintains no dedicated metadata database; schemas and data types are declared inside the scripts themselves |
| Early Hive releases had no built-in Avro support (it was added later via the AvroSerDe) | Supports Avro |
| Needs no installation as such, being entirely shell-based in its interaction | Can likewise be installed very easily |
| Supports partitioning data, so subsets can be processed by date or in chronological order | Has no direct notion of partitions, though the same effect can be achieved with filters |
| Has no equivalent of Illustrate | The ILLUSTRATE operator renders sample data for each step of a script |
| Provides access to raw data | Raw data access via Pig Latin scripts is not as fast as via HiveQL |
| Lets users join, order, and sort data dynamically (in an aggregated manner) | Supports outer joins via the COGROUP operator |
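To make the COGROUP point concrete, a hedged Pig Latin sketch (relation names and file paths are illustrative):

```pig
-- COGROUP collects matching tuples from two relations under one key,
-- keeping keys that appear in only one relation (an outer-join effect).
users  = LOAD '/data/users.tsv'  AS (id:chararray, name:chararray);
orders = LOAD '/data/orders.tsv' AS (uid:chararray, amount:int);

grouped = COGROUP users BY id, orders BY uid;
-- Each output tuple has the shape (key, {bag of users}, {bag of orders});
-- an empty bag on either side marks a key with no match in that relation.
DUMP grouped;
```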
 

Conclusion:

Having gone through the details needed to understand Big Data, the Hadoop ecosystem, and the two components this article focuses on, it should be clear that there is no real battle between Hive and Pig. Each provides its own way of analyzing Big Data: Pig is favored by the programmer community, while Hive is admired by database developers.

 
