What is Pig Latin Hadoop?

Pig is made up of two components: the first is the language itself, called Pig Latin, and the second is a runtime environment in which Pig Latin programs are executed.
 
Pig Latin is a DATA FLOW LANGUAGE rather than a procedural or declarative one.
 
It supports nested types and operates on files in HDFS.
 
A Pig Latin program consists of a collection of statements.
 
A statement can be thought of as an operation or a command.
 
     For example, the GROUP operation is a type of statement.
 
Pig Latin also has a very rich syntax. It supports operators for the following operations:
  • Loading and storing of data
  • Streaming data
  • Filtering data
  • Grouping and joining data
  • Sorting data
  • Combining and splitting data
Pig Latin also supports a wide variety of types, expressions, functions, diagnostic operators, macros, and file system commands.
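As a quick illustration of these operators, here is a minimal sketch that loads, filters, groups, and stores data (the file employees.txt and its fields are hypothetical):

emp = LOAD 'employees.txt' USING PigStorage('\t') AS (id:int, name:chararray, salary:int);
high = FILTER emp BY salary > 30000;
by_sal = GROUP high BY salary;
counts = FOREACH by_sal GENERATE group, COUNT(high);
STORE counts INTO 'salary-counts';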
 

Relational operations:

CROSS:-

Computes the cross product of two or more relations.
 
Ex:- We have two text files, cross1.txt and cross2.txt:
 
cross1.txt      cross2.txt
1 2 3           4 5 6
2 3 4           5 6 7
3 4 5
 
Create a script file for Pig:
 
#> vi crossscript.pig
 
Write the script as below:
 
A = LOAD 'cross1.txt' USING PigStorage(' ') AS (p:int, q:int, r:int);
B = LOAD 'cross2.txt' USING PigStorage(' ') AS (x:int, y:int, z:int);
C = CROSS A, B;
D = ORDER C BY $0;
STORE D INTO 'cross-output';
 
Save the file and run the script:
 
#> pig -x local crossscript.pig
> cd cross-output
cross-output> cat part-r-00000
 
Output is:
 
1 2 3 5 6 7
1 2 3 4 5 6
2 3 4 5 6 7
2 3 4 4 5 6
3 4 5 4 5 6
3 4 5 5 6 7
 
UNION:-
 
Computes the union of two or more relations.
 
Ex:- Two text files, T1.txt and T2.txt:
 
T1.txt      T2.txt
1 2 3       4 5 6
2 3 4       5 6 7
3 4 5
 
Write the script as below:
 
#> vi unionscript.pig
 
A = LOAD 'T1.txt' USING PigStorage(' ') AS (p:int, q:int, r:int);
B = LOAD 'T2.txt' USING PigStorage(' ') AS (x:int, y:int, z:int);
C = UNION A, B;
STORE C INTO 'union-output';
 
Run the script as below:
 
#> pig -x local unionscript.pig
> cd union-output
union-output> cat part-r-00000
 
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7
 
JOIN:- Performs an inner join of two or more relations based on common field values.
 
Cmd:> vi join1.txt

100   Gopal      23000    Hyd

101   Raj        24000    Pune

102   Rajesh     25000    Kerala

103   Rakesh     45000    Bangalore

>vi join2.txt

300   Jaypal     20000    Hyd

101   Raj        24000    Pune

#> vi joinscript.pig
 
data1 = LOAD 'join1.txt' USING PigStorage('\t') AS (id:int, name:chararray, salary:int, address:chararray);
data2 = LOAD 'join2.txt' USING PigStorage('\t') AS (id:int, name:chararray, salary:int, address:chararray);
joindata = JOIN data1 BY id, data2 BY id;
STORE joindata INTO 'join-output';
 
#> pig -x local joinscript.pig
 
> cd join-output
join-output> cat part-r-00000
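Only id 101 appears in both join1.txt and join2.txt, so the inner join produces a single line of output:

101 Raj 24000 Pune 101 Raj 24000 Pune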

Aggregate Functions:-

#> vi Agg.txt
 
111/Raja/29
222/Rama/24
333/Ram/24
444/Rakesh/24
555/Rakesh/29
666/Ravi varma/23
777/Raju/23
grunt> A = LOAD 'Agg.txt' USING PigStorage('/') AS (id:int, name:chararray, age:int);
grunt> B = GROUP A BY age;
grunt> C = FOREACH B GENERATE group, COUNT(A.id);
grunt> DUMP C;
 
Output:
(23,2)
(24,3)
(29,2)
 
MAX:- grunt> C = FOREACH B GENERATE group, MAX(A.id);
 
Gives the maximum id within each age group.
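With the Agg.txt data above, the largest id in each age group would be:

(23,777)
(24,444)
(29,555)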
 
SUM:- grunt> C = FOREACH B GENERATE group, SUM(A.id);
 
Output:
(23,1443)
(24,999)
(29,666)
 
TOKENIZE:-
 
Splits a chararray into words and returns them as a bag of one-word tuples.
 
grunt> A = LOAD 'Token.txt' AS (record:chararray);
grunt> B = FOREACH A GENERATE TOKENIZE(record);
grunt> DUMP B;
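For example, if Token.txt contained the single line "hello world pig", DUMP B would print the bag of one-word tuples:

({(hello),(world),(pig)})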
 
DISTINCT:-
 
Removes duplicate tuples in a relation.
 
DISTINCT does not preserve the original order of the contents. To eliminate duplicates, Pig must first sort the data.
 
You cannot use DISTINCT on a subset of fields (see the sketch after the example below).
 
Ex:- grunt> A = LOAD 'data' AS (a1:int, a2:int, a3:int);
grunt> DUMP A;
 
Output:
(8,3,4)
(4,3,3)
(1,2,3)
 
grunt> X = DISTINCT A;
grunt> DUMP X;
 
O/P:
(1,2,3)
(4,3,3)
(8,3,4)
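If you only need distinct values of some fields, the usual approach is a nested DISTINCT inside a FOREACH. A minimal sketch, reusing relation A from above (the inner alias uniq is just illustrative):

B = GROUP A BY a1;
C = FOREACH B {
        uniq = DISTINCT A.a2;
        GENERATE group, COUNT(uniq);
};

For each distinct a1, this yields the number of distinct a2 values.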
 
FILTER:-

Selects tuples from a relation based on some condition.

Ex:- grunt> A = LOAD 'data' AS (f1:int, f2:int, f3:int);
grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
 
The condition below keeps the tuples whose third field equals 3:
 
grunt> X = FILTER A BY f3 == 3;
grunt> DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)
 
grunt> X = FILTER A BY (f3 == 8) OR (NOT (f2 + f3 > f1));
grunt> DUMP X;
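With the data above, only (1,2,3) and (4,3,3) are rejected (their f3 is not 8 and their f2 + f3 exceeds f1), so the result would be:

(4,2,1)
(8,3,4)
(7,2,5)
(8,4,3)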
 
GROUP:-
 
Groups the data in one or more relations.
 
Note:- The GROUP and COGROUP operators are identical.
 
  • Both operators work with one or more relations.
  • GROUP is used in statements involving one relation.
  • COGROUP is used in statements involving two or more relations (a COGROUP sketch follows the GROUP example below).
  • You can COGROUP up to, but not more than, 127 relations at a time.
 
Ex:-
grunt> A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
 
grunt> DESCRIBE A;
 
A: {name: chararray, age: int, gpa: float}
 
grunt> DUMP A;
 
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)
 
Now, suppose we group relation A on field “age” to form relation B:
 
grunt> B = GROUP A BY age;
 
grunt> DESCRIBE B;
 
grunt> ILLUSTRATE B;
 
O/P:
--------------------------------------------------------------------
| B | group: int | A: bag({name: chararray, age: int, gpa: float}) |
--------------------------------------------------------------------
|   | 18         | {(John, 18, 4.0F), (Joe, 18, 3.8F)}             |
|   | 20         | {(Bill, 20, 3.9F)}                              |
--------------------------------------------------------------------
 
grunt> DUMP B;
 
O/P:
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
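As noted above, COGROUP takes the same form but groups two or more relations at once. A minimal sketch, assuming a second, hypothetical file 'scores' keyed by student name:

A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
S = LOAD 'scores' AS (name:chararray, score:int);
C = COGROUP A BY name, S BY name;
DUMP C;

Each output tuple contains the group key, a bag of the matching tuples from A, and a bag of the matching tuples from S.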
 
PARTITION BY:-
 
To use the Hadoop Partitioner, add the PARTITION BY clause to the appropriate operator:
 
Ex:-
grunt> A = LOAD 'input_data';
 
grunt> B = GROUP A BY $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner PARALLEL 2;
 
Code for SimpleCustomPartitioner:-
 
// imports needed by the custom partitioner
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.pig.impl.io.PigNullableWritable;
 
public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> {
 
    @Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        // integer keys are spread across reducers by value, everything else by hash code
        if (key.getValueAsPigType() instanceof Integer) {
            int ret = (((Integer) key.getValueAsPigType()).intValue() % numPartitions);
            return ret;
        }
        else {
            return (key.hashCode()) % numPartitions;
        }
    }
}
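Before the PARTITION BY clause can find this class, the compiled class has to be on Pig's classpath. One common way is to package it into a jar (the jar name below is just an example) and register it at the top of the script:

REGISTER simplecustompartitioner.jar;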
 
LIMIT:-
 
  Limit the number of output tuples.
 
  If the specified number of output tuples is equal to or exceeds the number of tuples in the relation, all tuples in the relation are returned.
 
Note:- The LIMIT operator allows Pig to avoid processing all tuples in a relation.
 
  In most cases, a query that uses LIMIT will run more efficiently than an identical query that does not use LIMIT, so it is always a good idea to use LIMIT if you can.
 
Ex:- grunt> A = LOAD 'a.txt';
 
grunt> B = GROUP A ALL;
 
grunt> C = FOREACH B GENERATE COUNT(A) AS sum;
 
grunt> D = ORDER A BY $0;
 
grunt> E = LIMIT D C.sum/100;
 
Suppose we have relation A:
grunt> A = LOAD 'data' AS (a1:int, a2:int, a3:int);
 
grunt> DUMP A;
 
O/P:
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
 
In this example, the output is limited to 3 tuples; note that there is no guarantee which 3 tuples will be output.
grunt> X = LIMIT A 3;
grunt> DUMP X;
 
o/p:
(1,2,3)
(4,3,3)
(7,2,5)
