
What is Pig Latin Hadoop?



Pig is made up of two components: the first is the language itself, called Pig Latin, and the second is a runtime environment in which Pig Latin programs are executed.

Hence, Pig Latin is a DATA FLOW LANGUAGE rather than a procedural or declarative language.

  •  It supports nested types and operates on files in HDFS.
  •  A Pig Latin program consists of a collection of statements.
  •  A statement can be thought of as an operation or a command.

For example, the GROUP operation is a type of statement.

Pig Latin also has a very rich syntax. It supports operators for the following operations:

  • Loading and storing of data
  • Streaming data
  • Filtering data
  • Grouping and joining data
  • Sorting data
  • Combining and splitting data
Pig Latin even supports a wide variety of types, expressions, functions, diagnostic operators, macros, and file system commands, as the short example below illustrates.
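For instance, a minimal data-flow sketch that chains several of these operators might look like this (the file name, field names, and threshold are hypothetical, purely for illustration):

emp = LOAD 'employees.txt' USING PigStorage(',') AS (name:chararray, dept:chararray, salary:int);
high = FILTER emp BY salary > 30000;
by_dept = GROUP high BY dept;
counts = FOREACH by_dept GENERATE group, COUNT(high);
STORE counts INTO 'dept-counts';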

Relational operators:

CROSS:-

Computes the cross product of two or more relations.

Ex:- We have two text files, cross1.txt and cross2.txt:

cross1.txt        cross2.txt
1 2 3             4 5 6
2 3 4             5 6 7
3 4 5

Create a script file for Pig, i.e.

#> vi crossscript.pig

Write the script as below:

A = LOAD 'cross1.txt' USING PigStorage(' ') AS (p:int, q:int, r:int);
B = LOAD 'cross2.txt' USING PigStorage(' ') AS (x:int, y:int, z:int);
C = CROSS A, B;
D = ORDER C BY $0;
STORE D INTO 'cross-output';

Save the file and run the script, i.e.

#> pig -x local crossscript.pig
> cd cross-output
cross-output> cat part-r-00000

Output is:

1 2 3 4 5 6
1 2 3 5 6 7
2 3 4 4 5 6
2 3 4 5 6 7
3 4 5 4 5 6
3 4 5 5 6 7

UNION:-

We have two text files, T1.txt and T2.txt:

T1.txt        T2.txt
1 2 3         4 5 6
2 3 4         5 6 7
3 4 5
Write the script as below:
#> vi unionscript.pig
A = LOAD 'T1.txt' USING PigStorage(' ') AS (p:int, q:int, r:int);
B = LOAD 'T2.txt' USING PigStorage(' ') AS (x:int, y:int, z:int);
C = UNION A, B;
STORE C INTO 'union-output';
Run the script as below:
#> pig -x local unionscript.pig
> cd union-output
union-output> cat part-r-00000
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7

JOIN:- Performs an inner join of two or more relations based on common field values.

Cmd:> vi join1.txt
100   Gopal      23000    Hyd
101   Raj        24000    Pune
102   Rajesh     25000    Kerala
103   Rakesh     45000    Bangalore
>vi join2.txt
300   Jaypal     20000    Hyd
101   Raj        24000    Pune

> vi joinscript.pig
data1 = LOAD 'join1.txt' USING PigStorage(' ') AS (id:int, name:chararray, salary:int, address:chararray);
data2 = LOAD 'join2.txt' USING PigStorage(' ') AS (id:int, name:chararray, salary:int, address:chararray);
joindata = JOIN data1 BY id, data2 BY id;
STORE joindata INTO 'join-output';
> pig -x local joinscript.pig
cd join-output
cat part-r-00000
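Given the two sample files above, only id 101 appears in both, so (assuming the schemas in the script) the stored output should contain a single joined record: the fields of the matching row from join1.txt followed by those from join2.txt:

101 Raj 24000 Pune 101 Raj 24000 Pune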

Aggregate Functions:-

 #> vi Agg.txt
 
111/Raja/29
222/Rama/24
333/Ram/24
444/Rakesh/24
555/Rakesh/29
666/Ravi varma/23
777/Raju/23
grunt> A = LOAD 'Agg.txt' USING PigStorage('/') AS (id:int, name:chararray, age:int);
grunt> B = GROUP A BY age;
grunt> C = FOREACH B GENERATE group, COUNT(A.id);
grunt> DUMP C;
 
Output:
(23,2)
(24,3)
(29,2)
 
MAX:-
grunt> C = FOREACH B GENERATE group, MAX(A.id);

This gives the maximum id within each age group.
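For the Agg.txt data above, each age group's largest id is returned, so the output should be:

(23,777)
(24,444)
(29,555)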
 
SUM:-
grunt> C = FOREACH B GENERATE group, SUM(A.id);
 
Output:
(23, 1443)
(24,999)
(29,666)
 
TOKENIZE:-
grunt> A = LOAD 'Token.txt' AS (record:chararray);
grunt> B = FOREACH A GENERATE TOKENIZE(record);
grunt> DUMP B;
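TOKENIZE splits a chararray into a bag of words. As a minimal sketch (assuming a hypothetical Token.txt containing the single line "hello world hello"), DUMP B would print ({(hello),(world),(hello)}). Wrapping the call in FLATTEN turns the bag into one tuple per word:

grunt> C = FOREACH A GENERATE FLATTEN(TOKENIZE(record));
grunt> DUMP C;
(hello)
(world)
(hello)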
DISTINCT:-
  • Removes duplicate tuples in a relation.
  • DISTINCT does not preserve the original order of the contents. To eliminate duplicates, Pig must first sort the data.
 


You cannot use DISTINCT on a subset of fields.
Ex:- grunt> A = LOAD 'data' AS (a1:int, a2:int, a3:int);
grunt> DUMP A;
Output: (8,3,4)(4,3,3)(1,2,3)
grunt> X = DISTINCT A;
grunt> DUMP X;
 
O/P:
(1,2,3)
(4,3,3)
(8,3,4)
 
FILTER:-

Selects tuples from a relation based on some condition.

Ex:- grunt> A = LOAD 'data' AS (f1:int, f2:int, f3:int);
grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
 
grunt> X = FILTER A BY f3 == 3;
grunt> DUMP X;
(tuples whose third field equals 3)
(1,2,3)
(4,3,3)
(8,4,3)
grunt> X = FILTER A BY (f3 == 8) OR (NOT (f2 + f3 > f1));
grunt> DUMP X;
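Working through the same relation A, this filter keeps a tuple when its third field is 8 or when f2 + f3 is not greater than f1, so the output should be:

(4,2,1)
(8,3,4)
(7,2,5)
(8,4,3)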

GROUP:-

Groups the data in one or more relations.
Note:- The GROUP and COGROUP operators are identical.

Both operators work with one or more relations.
  • GROUP is used in statements involving one relation.
  • COGROUP is used in statements involving two or more relations.
  • You can COGROUP up to but not more than 127 relations at a time.
 
Ex:-
grunt> A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
grunt> DESCRIBE A;
A: {name: chararray, age: int, gpa: float}
grunt> DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
Now, suppose we group relation A on field "age" to form relation B:
grunt> B = GROUP A BY age;
grunt> DESCRIBE B;
grunt> ILLUSTRATE B;
 
O/P: etc ...
| B | group: int | A: bag({name: chararray, age: int, gpa: float}) |
|   | 18         | {(John,18,4.0),(Joe,18,3.8)}                    |
|   | 20         | {(Bill,20,3.9)}                                 |
 
grunt> DUMP B;

O/P:
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
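As noted above, COGROUP follows the same pattern over two or more relations. A minimal sketch (the relation 'results' and its fields are hypothetical, not part of the original example):

grunt> R = LOAD 'results' AS (name:chararray, score:int);
grunt> G = COGROUP A BY name, R BY name;

Each tuple of G holds the group key followed by one bag of matching tuples from A and one from R.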

PARTITION BY:-

To use the Hadoop Partitioner, add the PARTITION BY clause to the appropriate operator:
 
Ex:-
grunt> A = LOAD 'input_data';
grunt> B = GROUP A BY $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner PARALLEL 2;

Code for SimpleCustomPartitioner:-

public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> {
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        // If the key is an integer, partition by its value modulo the number of partitions.
        if (key.getValueAsPigType() instanceof Integer) {
            int ret = ((Integer) key.getValueAsPigType()).intValue() % numPartitions;
            return ret;
        } else {
            // Otherwise fall back to the key's hash code.
            return key.hashCode() % numPartitions;
        }
    }
}
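For the PARTITION BY clause to work, the partitioner class has to be visible to Pig at run time. One way to do this (an assumption, not stated in the original text) is to package the class into a jar and register it before the GROUP statement; the jar name here is hypothetical:

grunt> REGISTER simplecustompartitioner.jar;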

LIMIT:-

Limits the number of output tuples.

If the specified number of output tuples is equal to or exceeds the number of tuples in the relation, all tuples in the relation are returned.

 
Note:- The LIMIT operator allows Pig to avoid processing all tuples in a relation.

In most cases, a query that uses LIMIT will run more efficiently than an identical query that does not use LIMIT, so it is always a good idea to use LIMIT if you can.

Ex:- grunt> A = LOAD 'a.txt';
grunt> B = GROUP A ALL;
grunt> C = FOREACH B GENERATE COUNT(A) AS sum;
grunt> D = ORDER A BY $0;
grunt> E = LIMIT D C.sum/100;

Suppose we have relation A:
grunt> A = LOAD 'data' AS (a1:int, a2:int, a3:int);
grunt> DUMP A;
O/P:
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

In this example, the output is limited to 3 tuples; note that there is no guarantee which 3 tuples will be returned.

grunt> X = LIMIT A 3;
grunt> DUMP X;

o/p:
(1,2,3)
(4,3,3)
(7,2,5)
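If a deterministic subset is needed, one option (a sketch not shown in the original example) is to sort the relation before applying LIMIT:

grunt> Y = ORDER A BY a1;
grunt> Z = LIMIT Y 3;
grunt> DUMP Z;

This would return the three tuples with the smallest a1 values.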
 
