
What is Pig Latin Hadoop?



In today’s fast-paced world, organizations gather enormous amounts of data posted online. Popular platforms such as Facebook, Instagram, and email services rely on ‘Big Data’ technology to store and analyze this data for later use.


Big data is supported by a framework called ‘Hadoop’ that meets these requirements. Although Hadoop itself is written in Java, the framework gives users the flexibility to write jobs in other languages such as C, C++, and Python. 


Programmers who prefer scripting or SQL-like syntax use Pig and Hive instead of writing raw MapReduce code. Nowadays, many people employ Pig for its features and benefits in data manipulation. Go through this article to learn the essentials of Apache Pig. 



What is Apache Pig?


Pig is a high-level scripting platform commonly used with Apache Hadoop to analyze large data sets. The Pig platform offers its own scripting language, known as Pig Latin, to developers who are already familiar with scripting languages and query languages like SQL. 


The major benefit of Pig is that it works with data obtained from various sources and stores the results in HDFS (Hadoop Distributed File System). Programmers write scripts in the Pig Latin language, which are then converted into Map and Reduce tasks by the Pig Engine component (Apache Pig has a component called Pig Engine that accepts Pig Latin scripts and converts them into MapReduce jobs). 



 

MapReduce 


MapReduce is a programming model widely used for processing large amounts of data. The MapReduce algorithm consists of two tasks: Map and Reduce. The Map task takes a data set and converts it into another set of data, in which individual elements are broken down into key/value pairs called tuples. The Reduce task then takes the output of the Map task and combines those tuples into a smaller set of results. 
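To make the two phases concrete, here is a sketch of the classic word-count job written in Pig Latin (the input file name 'input.txt' is an assumption): the FOREACH/TOKENIZE line plays the role of the Map phase, and the GROUP/COUNT lines the role of the Reduce phase.

```pig
-- word count: TOKENIZE/FLATTEN corresponds to the Map phase,
-- GROUP/COUNT to the Reduce phase
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
DUMP counts;
```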


History of Apache Pig


Apache Pig was developed by Yahoo in 2006 to create and manipulate MapReduce tasks on large datasets. It was open sourced through the Apache Incubator in 2007, the first release came in 2008, and Pig became a top-level Apache project in 2010. 



 

Why should we use Apache Pig?


Generally, Apache Pig provides an abstraction that reduces the complexity of developing MapReduce programs. A common reason for using Pig is that it lets developers accomplish in a few lines of Pig Latin what would otherwise take much longer programs. 


The features of Apache Pig let programmers get more done with less code than comparable frameworks. Pig also eases the life of a data engineer maintaining various ad hoc queries on data sets. In short, Apache Pig is a boon for programmers, which is why it is widely recommended for data management. 




Advantages of Pig


As said before, Apache Pig is mainly used to analyze huge sets of data and to represent them as data flows. Its programming model yields several advantages for its users. Here are the major advantages of using Pig with large data sets.


  • Ease of Programming – Pig Latin is similar to SQL, so it is easy to pick up for programmers who already know SQL. 
  • Helpful for Programmers – Programmers with little Java knowledge face many difficulties in Hadoop. Apache Pig lets them handle such tasks, especially MapReduce jobs, without writing Java. 
  • Multi-Query Approach – Apache Pig's multi-query approach reduces the length of the code, which in turn reduces development time. 
  • Optimization Opportunities – In Apache Pig, tasks are optimized automatically, which lets programmers focus on the semantics of the language rather than on execution efficiency. 
  • Extensibility – The existing operators can be combined with user-defined functions to read, write, and process data. 
  • Additional Operators – Pig provides built-in operators for data operations such as joins, filters, and ordering. You can also use nested data types like tuples, bags, and maps that are not directly available in MapReduce. 
  • User Defined Functions – Pig allows you to create user-defined functions in other programming languages such as Java, Ruby, Perl, and Python, and invoke them from Pig Latin scripts. 
  • Handles All Kinds of Data – You can analyze all kinds of data, both structured and unstructured, collected from various sources, and store the results in HDFS. 
  • No Separate Compilation – Since Pig converts its operators into MapReduce jobs internally, the developer needs no separate compilation step. 
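Several of these advantages can be seen in one place in a few lines of Pig Latin. The file names and schemas below are assumptions chosen for illustration; each of these operators would otherwise require substantial MapReduce code.

```pig
-- assumed input files and schemas, for illustration only
users  = LOAD 'users.csv'  USING PigStorage(',') AS (id:int, name:chararray, age:int);
orders = LOAD 'orders.csv' USING PigStorage(',') AS (uid:int, amount:double);
adults = FILTER users BY age >= 18;           -- built-in filter operator
joined = JOIN adults BY id, orders BY uid;    -- built-in join operator
sorted = ORDER joined BY amount DESC;         -- built-in ordering operator
STORE sorted INTO 'output' USING PigStorage(',');
```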



The Architecture of Apache Pig


The architecture of Apache Pig is built on two components:


Pig Latin – Language of Apache Pig


At the core of Apache Pig's architecture is its own language, Pig Latin, which lets developers write data processing and analysis programs. 


A Runtime environment – Platform for running Pig Latin programs


The Pig Latin compiler serves as the runtime environment, converting Pig source code into executable code. Most of that executable code takes the form of MapReduce jobs. 


First, programmers write Pig scripts. These scripts are processed by the Apache Pig components in sequence: the parser, the optimizer, the compiler, and finally the execution engine. 


The resulting executable code is converted into MapReduce tasks, which read their input from and store their results in the Hadoop Distributed File System (HDFS).
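You can watch this pipeline at work with Pig's EXPLAIN operator, which prints the plans produced for a relation (the input file name and schema here are assumptions):

```pig
A = LOAD 'data' AS (name:chararray, age:int);
B = FILTER A BY age > 21;
EXPLAIN B;  -- prints the logical, physical, and MapReduce execution plans for B
```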




How to Download and Install Apache Pig?


The main prerequisites for Apache Pig are that Java and Hadoop must already be installed on your system. Once they are set up, go through the following steps to download and install Apache Pig.

 

Steps for Downloading Apache Pig 


First, download the latest version of Apache Pig from the official website: https://pig.apache.org/ 




Step 1: Open the homepage of the Pig website and click the Release Page link under the News section. 


Step 2: You will be taken to the Apache Pig Releases page. Under the Download section you will find two links: Pig 0.8 and later and Pig 0.7 and before. To get the latest releases, click Pig 0.8 and later. This opens a page listing a set of download mirrors.


Step 3: Choose one of the mirrors and click it, for example http://www.us.Apache.org/dist/pig 


Step 4: The mirror you selected redirects you to the Pig Releases page, which lists the available versions of Apache Pig. Click the latest version. 


Step 5: The page contains folders with the source and binary files of Apache Pig in different distributions. Download the tar files of the source and binary distributions of Apache Pig 0.15: pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.

 
 
The download of Apache Pig is now complete; the files can be found in your Downloads folder.
 
 
 

Steps for Installing Apache Pig


After downloading the Apache Pig software, install it in your Linux environment as follows. 


Step 1: Create a directory named Pig in the location where the installation directories of Java, Hadoop, and other software are usually kept. 

$ mkdir Pig
 
 
Step 2: Extract the downloaded tar files as given below. 
 
$ cd Downloads/ 
$ tar zxvf pig-0.15.0-src.tar.gz 
$ tar zxvf pig-0.15.0.tar.gz 



Step 3: Move the contents of the extracted pig-0.15.0-src folder to the Pig directory created earlier, as shown below. 

$ mv pig-0.15.0-src/* /home/Hadoop/Pig/
 
Once these steps are complete, Apache Pig is installed on your system. 
 

Configuring the Apache Pig

 
After the successful installation of Apache Pig, you need to configure it. Two files are involved: .bashrc and pig.properties. 
 
.bashrc file
 
Set the following variables in the .bashrc file:

  • PIG_HOME – the installation folder of Apache Pig
  • PATH – append the bin folder of the Pig installation
  • PIG_CLASSPATH – the configuration (conf) folder of the Hadoop installation
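A sketch of the resulting .bashrc entries, assuming Pig was installed under /home/Hadoop/Pig as in the steps above (adjust the paths to your own Pig and Hadoop locations):

```shell
# assumed installation paths -- adjust to your system
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
```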

pig.properties file


In the Pig configuration folder you will find pig.properties, where various parameters can be set. To list the supported properties, run the command below. 


pig -h properties 
 
 
 

Verifying the Installation of Apache Pig

 
 
Once the configuration is complete, verify the installation of Apache Pig by running the version command. If the installation succeeded, Pig prints its version and compile information.
 
$ pig --version 
 
Apache Pig version 0.15.0 (r1682971)  
compiled Jun 01 2015, 11:44:35

Apache Pig Run Modes


Basically, Apache Pig has two execution (run) modes: 


  • Local Mode: Pig runs in a single JVM (Java Virtual Machine) and uses the local file system to store data; it is started with the command pig -x local. Local mode is suitable for analyzing small data sets with Apache Pig. 
  • MapReduce Mode: Pig Latin queries are converted into MapReduce jobs that run on a Hadoop cluster; this is the default mode, started with pig -x mapreduce or simply pig. MapReduce mode on a fully distributed Hadoop cluster is best for executing large datasets.  



Components of Pig Latin


Pig Latin is built on a nested data model that permits complex, non-atomic data types. The main components of this model are: 


  • Field: A single piece of data, or atomic value, of any data type. The atomic types in Pig are int, long, float, double, chararray, and bytearray. Ex: '12' or 'Apache'.
  • Tuple: A record formed by an ordered set of fields, each of which can be of any data type. Tuples are similar to the rows in an RDBMS table. Ex: (30, Apache)
  • Bag: An unordered collection of tuples, represented by the symbol {}. The tuples in a bag need not contain the same number or type of fields. Ex: {(5, Pig), (10, Apache)}
  • Map: A set of key-value pairs. The key must be unique and of type chararray; the value can be of any type. Ex: [name#Apache, version#0.15]
  • Relation: A bag of tuples, and the outermost structure in a Pig Latin script. Relations are unordered.
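These nested types can appear together in a LOAD schema. A minimal sketch (the file name and field names are assumptions):

```pig
-- a relation whose fields are a tuple, a bag of tuples, and a map
students = LOAD 'students' AS (
             info:tuple(name:chararray, age:int),
             grades:bag{t:(course:chararray, score:int)},
             attrs:map[]
           );
```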



Pig Data Types


The data types in Apache Pig fall into two categories: primitive and complex.


Type  Description 
Primitive Data Types  
Int  32-bit signed integer
Long  64-bit signed integer
Float  32-bit floating point
Double  64-bit floating point
CharArray  character array (string)
ByteArray  byte array (binary blob)
Complex Data Types  
Tuple  an ordered set of fields
Bag  a collection of tuples
Map  a collection of key-value pairs
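These types are used when declaring schemas and when casting between types. A minimal sketch (the input file and schema are assumptions):

```pig
A = LOAD 'data' USING PigStorage(',') AS (id:int, name:chararray, gpa:float);
B = FOREACH A GENERATE id, name, (double)gpa;  -- cast the float field to double
```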


 

Pig UDF (User Defined Functions)


User Defined Functions (UDFs) let you define your own processing functions in Apache Pig. UDF support is provided in six programming languages: Java, Jython, JavaScript, Python, Ruby, and Groovy. Java UDFs can take part in every stage of processing, including data loading and storing, column transformation, and aggregation.


Apache Pig also has a Java repository for UDFs called Piggybank. Piggybank gives you access to Java UDFs written by others and lets you contribute your own. There are three types of UDFs in Java:


  • Filter function 
  • Eval function
  • Algebraic function

You can create your own UDF for Apache Pig by writing it in Java, generating a jar file, and registering the jar in your script. Eval UDFs must extend org.apache.pig.EvalFunc and override the exec method. Here is an example: an EVAL function that converts a string to upper case.




  • Compile the class below and package it into a jar file named myudfs.jar.

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}

  • Write the script in a file and save it with a .pig extension.

At last, execute the script in the terminal to get the output.


-- script.pig  
REGISTER myudfs.jar;  
A = LOAD 'data' AS (name: chararray, age: int, gpa: float);  
B = FOREACH A GENERATE myudfs.UPPER(name);  
DUMP B;



Example Pig Script 


For instance, create a Pig script that finds the number of products sold in each country. 


The input to the sample Pig script is a CSV file, SalesJan2009.csv


Step 1: Start Hadoop on your system 


Step 2: In MapReduce mode, Pig reads files from HDFS and stores the results back in HDFS. Copy the file SalesJan2009.csv from the local file system to your HDFS home directory. 


Step 3: Configure Pig


Navigate to $PIG_HOME/conf 


Open pig.properties in a text editor and set the path of the log file using the pig.logfile property


The logger writes errors to this file, which helps with debugging


Step 4: Type 'pig' at the command prompt to start the Grunt shell, Pig's interactive query shell. 




Step 5: In the Grunt command prompt for Pig, run the following commands in order


a.    Load the file that contains the data 


b.    Group the data by the field Country 


c.    For each tuple in 'GroupByCountry', generate strings of the form: Name of Country: No. of products sold 


d.    Store the results in a directory named 'pig_output_sales' on HDFS
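A sketch of steps a–d as Pig Latin commands; the column names in the LOAD schema are assumptions based on a typical SalesJan2009.csv layout:

```pig
-- a. load the sales file from HDFS (schema assumed for illustration)
salesTable = LOAD 'SalesJan2009.csv' USING PigStorage(',') AS
             (Transaction_date:chararray, Product:chararray, Price:chararray,
              Payment_Type:chararray, Name:chararray, City:chararray,
              State:chararray, Country:chararray);
-- b. group the records by the Country field
GroupByCountry = GROUP salesTable BY Country;
-- c. for each group, generate a string "Country: number of products sold"
CountByCountry = FOREACH GroupByCountry
                 GENERATE CONCAT((chararray)$0, CONCAT(': ', (chararray)COUNT($1)));
-- d. store the result in the pig_output_sales directory on HDFS
STORE CountByCountry INTO 'pig_output_sales' USING PigStorage('\t');
```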


Give the command some time to execute; once it finishes, the results are stored on HDFS. 




Step 6: View the result either through the command interface or through the web interface.


Results through the web interface:


Open http://localhost:50070/ in the browser


Select ‘Browse the filesystem’ and navigate to /user/hduser/pig_output_sales


Open part-r-00000


Now, you can see the result of the Apache Pig script for the given input data. 




Final Thoughts


These are the essentials you need to know about Apache Pig and how it analyzes data in Hadoop. It is a powerful ETL tool that manages data-flow workloads. Learn Apache Pig and use it in the Hadoop ecosystem to handle huge datasets. 


 

 


Ravindra Savaram
About The Author

Ravindra Savaram is a Content Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.

