Organizations today collect enormous amounts of data from online sources. Popular services such as Facebook, Instagram, and email providers rely on ‘Big Data’ technology to store and analyze that data for later use.
Big data workloads are commonly handled with a framework called ‘Hadoop’. Although Hadoop itself is written in Java, it also gives developers the flexibility to work in other languages such as C, C++, and Python.
Programmers who prefer scripting often turn to Pig, while those who are comfortable with SQL tend to use Hive. Pig has become a popular choice for data manipulation, and this article covers the essentials of Apache Pig.
Apache Pig is a high-level platform commonly used with Apache Hadoop to analyze large data sets. It provides its own scripting language, Pig Latin, which is easy to pick up for developers who already know other scripting languages or query languages like SQL.
The major benefit of Pig is that it works with data obtained from various sources and stores the results in HDFS (Hadoop Distributed File System). Programmers write scripts in Pig Latin, and a component called the Pig Engine accepts those scripts and converts them into MapReduce jobs.
MapReduce is a programming model that is widely used for processing large amounts of data. A MapReduce job consists of two tasks: Map and Reduce. The Map task takes a data set and converts it into another set of data in which the individual elements are broken down into tuples, i.e., key/value pairs. The Reduce task then takes the output of the Map task and combines those tuples into a smaller set of tuples.
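To make the two tasks concrete, here is a minimal in-memory sketch of a word count written as a Map step followed by a Reduce step. It is plain Java rather than Hadoop API code, and the class and method names (`MapReduceSketch`, `map`, `reduce`, `wordCount`) are assumptions made for illustration only.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative, in-memory sketch of the Map and Reduce steps; not Hadoop API code.
final class MapReduceSketch {

    // Map step: break each input line into (word, 1) key/value pairs.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<>(word.toLowerCase(), 1));
                }
            }
        }
        return pairs;
    }

    // Reduce step: combine all pairs that share a key into one total per key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Runs the Map step and feeds its output to the Reduce step.
    static Map<String, Integer> wordCount(List<String> lines) {
        return reduce(map(lines));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("pig hive pig", "hive hadoop");
        System.out.println(wordCount(lines)); // prints {hadoop=1, hive=2, pig=2}
    }
}
```

On a real cluster the Map and Reduce steps run in parallel on different machines, with a shuffle phase in between that routes pairs with the same key to the same reducer. The sketch keeps everything in one process to show only the data transformation.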
Apache Pig was developed at Yahoo in 2006 as a way to create and run MapReduce jobs on large data sets. It was open-sourced through the Apache Incubator in 2007, had its first release in 2008, and became a top-level Apache project in 2010.
Apache Pig provides an abstraction that reduces the complexity of writing MapReduce programs. A common reason for using Pig is that it lets developers express a data flow in a short script instead of a long Java program.
Its built-in operators also make life easier for data engineers who maintain many ad hoc queries over their data sets, which is why Pig is often recommended for data processing on Hadoop.
As mentioned earlier, Apache Pig is mainly used to analyze huge data sets and to represent the processing as data flows. Here are the major advantages of using Pig.
Ease of Programming – Pig Latin is similar to SQL, so it is easy to learn for programmers who already know SQL.
Helpful for Programmers – Programmers who are not fluent in Java often struggle to write Hadoop jobs. Apache Pig lets them carry out such tasks, especially MapReduce jobs, without writing Java code.
Multi-Query Approach – Apache Pig uses a multi-query approach that reduces the length of the code and therefore the development time.
Optimization Opportunities – Apache Pig optimizes tasks automatically, so programmers can focus on the semantics of the language.
Extensibility – Users can build on the existing operators to develop their own functions for reading, writing, and processing data.
Additional Operators – Pig provides built-in operators for data operations such as joins, filters, and ordering. It also offers nested data types such as tuples, bags, and maps, which are not available in MapReduce.
User-Defined Functions – Apache Pig lets you write user-defined functions in other programming languages such as Java, Python, and Ruby, and invoke them from Pig scripts.
Handles All Kinds of Data – Pig can analyze both structured and unstructured data collected from various sources and store the results in HDFS.
No Compilation – Apache Pig converts each operator internally into MapReduce jobs, so the programmer does not need a separate compilation step.
The architecture of Apache Pig can be described in terms of two components:
1) Pig Latin – Language of Apache Pig
Pig Latin is Pig's own language, which developers use to write data processing and analysis programs.
2) A Runtime environment – Platform for running Pig Latin programs
The runtime environment includes the Pig Latin compiler, which converts Pig source code into executable code. In most cases, the executable code takes the form of MapReduce jobs.
Programmers first write Pig scripts. The scripts are then processed by a series of Apache Pig components: the parser, the optimizer, the compiler, and finally the execution engine.
The result is a set of MapReduce jobs generated from the Pig Latin script. These jobs run on the Hadoop cluster, and their results are stored in the Hadoop Distributed File System (HDFS).
Before downloading Apache Pig, make sure that Java and Hadoop are installed on your system. Once they are set up, follow the steps below to download Apache Pig.
First, download the latest version of Apache Pig from the official website.
Step 1: Open the homepage of the Pig website and click the Release Page link under the News section.
Step 2: You will be taken to the Apache Pig Releases page. Under the Download section there are two links: Pig 0.8 and later, and Pig 0.7 and before. To get the latest releases, click the Pig 0.8 and later link. This redirects you to a page that lists a set of mirrors.
Step 3: Choose one of the mirrors and click it.
Step 4: The mirror takes you to the Pig Releases page, which lists the available versions of Apache Pig. Click the latest version.
Step 5: The version folder contains the source and binary files of Apache Pig in different distributions. Download the tar files of the source and binary distributions of Apache Pig 0.15: pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.
The download is now complete, and the files can be found in your downloads folder.
After downloading Apache Pig, install it in your Linux environment.
Step 1: Create a directory named Pig in the same location as the installation directories of Java, Hadoop, and other software.
$ mkdir Pig
Step 2: Extract the downloaded tar files.
$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz
Step 3: Move the contents of the extracted pig-0.15.0-src directory to the Pig directory created earlier.
$ mv pig-0.15.0-src/* /home/Hadoop/Pig/
Once these steps are complete, Apache Pig is installed on your system.
After installing Apache Pig, you need to configure it. Two files are involved: .bashrc and pig.properties.
Set the following variables in the .bashrc file:
Set PIG_HOME to the installation folder of Apache Pig.
Add the bin folder of the Pig installation to the PATH environment variable.
Set the PIG_CLASSPATH environment variable to the configuration folder of the Hadoop installation.
pig.properties file
The pig.properties file is located in the conf folder of the Pig installation, and you can set various parameters in it. The following command lists the supported properties.
$ pig -h properties
Once the configuration is complete, verify the installation by running the version command. If the installation was successful, Pig prints its version and build information, as shown below.
$ pig -version
Apache Pig version 0.15.0 (r1682971)
compiled Jun 01 2015, 11:44:35
Apache Pig Run Modes
Apache Pig has two execution (run) modes:
Local Mode: In local mode, Pig runs in a single JVM (Java Virtual Machine) and uses the local file system to store data. This mode is suitable for analyzing small data sets.
MapReduce Mode: In MapReduce mode, Pig Latin queries are converted into MapReduce jobs that run on a Hadoop cluster. MapReduce mode with a fully distributed Hadoop cluster is best suited to processing large data sets.
Pig Latin has a nested data model that permits complex, non-atomic data types. Its main elements are:
Field: A single piece of data, or atomic value, is referred to as a field (atom). An atom holds a single value of any data type. It is stored as a string and can be used as either a string or a number. The atomic types in Pig are int, long, float, double, chararray, and bytearray. Ex: ‘12’ or ‘Apache’.
Tuples: A record formed by an ordered set of fields is called a tuple, and its fields can be of any type. A tuple is similar to a row in an RDBMS table. Ex: (30, Apache)
Bags: A bag is an unordered collection of tuples and can contain any number of them. Bags are written with curly braces, {}. The tuples in a bag do not need to have the same number of fields, or fields of the same type. Ex: {(5, Pig), (10, Apache)}.
Map: A map (or data map) is a set of key-value pairs. Each key must be unique and of type chararray, while the value can be of any type. Ex: [name#Apache, age#30]
Relation: A relation is a bag of tuples. Relations in Pig Latin are unordered.
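The nesting described above can be pictured with ordinary Java collections. The sketch below is only an analogy under stated assumptions: it uses `java.util` lists and maps as stand-ins and does not use Pig's own classes, and the names `PigDataModelSketch`, `tuple`, and `bag` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain Java collections standing in for Pig's data model; these are not Pig classes.
final class PigDataModelSketch {

    // Tuple: an ordered list of fields, each of any type.
    static List<Object> tuple(Object... fields) {
        return Arrays.asList(fields);
    }

    // Bag: a collection of tuples; the tuples may differ in length and field types.
    @SafeVarargs
    static List<List<Object>> bag(List<Object>... tuples) {
        return Arrays.asList(tuples);
    }

    public static void main(String[] args) {
        // (30, Apache)
        List<Object> t = tuple(30, "Apache");
        // {(5, Pig), (10, Apache, 0.15)}: tuples of different lengths in one bag
        List<List<Object>> b = bag(tuple(5, "Pig"), tuple(10, "Apache", 0.15));
        // Map: string keys, values of any type, including nested complex types
        Map<String, Object> m = new HashMap<>();
        m.put("name", "Pig");
        m.put("rows", b);

        System.out.println(t.get(1));               // second field of the tuple
        System.out.println(b.size());               // number of tuples in the bag
        System.out.println(m.containsKey("rows"));  // the bag is nested inside the map
    }
}
```

A Java list keeps its elements in order, whereas a Pig bag is unordered, so the list here only approximates a bag.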
The data types in Apache Pig fall into two categories: primitive and complex.
Type | Description |
---|---|
Primitive Data Type | |
int | 32-bit signed integer |
long | 64-bit signed integer |
float | 32-bit floating point |
double | 64-bit floating point |
chararray | Character array (string) |
bytearray | Byte array |
Complex Data Type | |
tuple | Ordered set of fields |
bag | Collection of tuples |
map | Set of key-value pairs |
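Pig's numeric types correspond to the Java types of the same name, so their widths can be checked directly in Java. The short sketch below does that. The class name `PigTypeWidths` and the helper `bits` are hypothetical and exist only for this illustration.

```java
// Bit widths of the Java types that correspond to Pig's numeric types.
final class PigTypeWidths {

    // Returns the width in bits for a Pig numeric type name.
    static int bits(String pigType) {
        switch (pigType) {
            case "int":    return Integer.SIZE;
            case "long":   return Long.SIZE;
            case "float":  return Float.SIZE;
            case "double": return Double.SIZE;
            default:
                throw new IllegalArgumentException("not a numeric Pig type: " + pigType);
        }
    }

    public static void main(String[] args) {
        for (String type : new String[] {"int", "long", "float", "double"}) {
            System.out.println(type + " = " + bits(type) + " bits");
        }
    }
}
```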
Apache Pig lets you define your own functions as User Defined Functions (UDFs). UDF support is provided in six programming languages: Java, Jython, Python, JavaScript, Ruby, and Groovy. With Java, you can write UDFs for many parts of the processing, such as data loading and storing, column transformation, and aggregation.
Apache Pig also has a Java repository for UDFs called Piggybank. Piggybank gives you access to Java UDFs written by other users and lets you contribute your own. There are three types of UDFs in Java:
Filter function
Eval function
Algebraic function
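Of the three, the algebraic function is the least obvious. It is an aggregate that can be computed in stages over pieces of a bag, with the partial results merged afterwards, which is what allows the work to be spread across a cluster. The sketch below shows the idea for a count in plain Java. It does not implement Pig's interface, and the names `AlgebraicCountSketch`, `initial`, `intermediate`, and `finalCount` are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the staged computation behind an algebraic count.
// Plain Java only; the method names are hypothetical and are not Pig's interface.
final class AlgebraicCountSketch {

    // Initial stage: produce a partial count for one chunk of the input.
    static long initial(List<String> chunk) {
        return chunk.size();
    }

    // Intermediate stage: merge several partial counts into one partial count.
    static long intermediate(List<Long> partials) {
        long sum = 0;
        for (long partial : partials) {
            sum += partial;
        }
        return sum;
    }

    // Final stage: merge the remaining partial counts into the result.
    static long finalCount(List<Long> partials) {
        return intermediate(partials);
    }

    public static void main(String[] args) {
        long a = initial(Arrays.asList("x", "y", "z"));
        long b = initial(Arrays.asList("p", "q"));
        long c = initial(Arrays.asList("r"));
        long merged = intermediate(Arrays.asList(a, b));
        System.out.println(finalCount(Arrays.asList(merged, c))); // prints 6
    }
}
```

The result is the same no matter how the input is split into chunks or in which order the partial counts are merged, and that property is what makes a function algebraic.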
You can create your own UDF for Apache Pig by writing the function in Java, packaging it in a jar file, and registering the jar in your script. An eval function must extend "org.apache.pig.EvalFunc" and override the ‘exec’ method. The following example is an eval function that converts a string to upper case.
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval function that returns its first argument converted to upper case.
public class UPPER extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for a missing or empty input tuple.
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
Compile the above code and package it as myudfs.jar.
Finally, run the following script in the terminal to get the output.
-- script.pig
REGISTER myudfs.jar;
A = LOAD 'data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;
Example Pig Script
As an example, let us write a Pig script that finds the number of products sold in each country.
The input to the sample Pig script is a CSV file, SalesJan2009.csv.
Step 1: Start Hadoop on your system.
Step 2: In MapReduce mode, Pig reads the input file from HDFS and stores the results back to HDFS. Copy the file SalesJan2009.csv from the local file system to your HDFS home directory.
Step 3: Configure Pig.
Navigate to $PIG_HOME/conf.
Open pig.properties in a text editor and set the path of the log file using pig.logfile.
Pig's logger writes errors to this file, which helps when troubleshooting.
Step 4: Type pig at the command prompt to start Grunt, Pig's interactive shell.
Step 5: At the Grunt prompt, run the following operations in order.
a. Load the file that contains the data.
b. Group the data by the country field.
c. For each tuple in ‘GroupbyCountry’, generate a string of the form Name of Country: No. of products sold.
d. Store the results in a directory named ‘pig_output_sales’ on HDFS.
The last command takes some time to execute.
Step 6: The result can be viewed through the command interface.
Results can also be viewed through a web interface:
Select ‘Browse the filesystem’ and navigate to /user/hduser/pig_output_sales.
Open part-r-00000.
This file contains the result that the Pig script produced for the given input data.
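To show what the load, group, and generate steps compute, here is a small in-memory Java sketch of the same logic. It is not how Pig runs the script, and it makes several assumptions: the class and method names are hypothetical, the sample rows are made up, and the country is assumed to sit in the column index passed to the method.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// In-memory illustration of "group by country, then count"; not Pig code.
final class SalesByCountrySketch {

    // Groups the rows by the country column and formats each group as "Country:count".
    static List<String> countByCountry(List<String[]> rows, int countryIndex) {
        Map<String, Long> counts = rows.stream()
                .collect(Collectors.groupingBy(
                        row -> row[countryIndex],   // grouping key: the country field
                        TreeMap::new,               // sorted map, for a stable output order
                        Collectors.counting()));    // number of rows per country
        return counts.entrySet().stream()
                .map(entry -> entry.getKey() + ":" + entry.getValue())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample rows of the form {product, country}.
        List<String[]> rows = Arrays.asList(
                new String[] {"Product1", "United States"},
                new String[] {"Product2", "India"},
                new String[] {"Product1", "United States"});
        System.out.println(countByCountry(rows, 1)); // prints [India:1, United States:2]
    }
}
```

The Pig script reaches the same result with a few lines of Pig Latin and runs it as MapReduce jobs over the full file in HDFS.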
These are the essentials of Apache Pig, which analyzes data in Hadoop. It is a popular tool for ETL and data flow workloads, so learn Apache Pig and put it to use in the Hadoop ecosystem to manage large data sets.
Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.