What is Pig Latin Hadoop?

Pig Latin Hadoop

Pig is made up of two components: the first is the language itself, which is called Pig Latin, and the second is a runtime environment where Pig Latin programs are executed.

  • Pig Latin is a data flow language rather than a procedural or declarative one.

  • It supports nested types and operates on files in HDFS.

  • A Pig Latin program consists of a collection of statements.

  • A statement can be thought of as an operation or a command. For example, the GROUP operation is a type of statement.

Pig Latin also has a very rich syntax. It supports operators for the following operations:

  • Loading and storing of data

  • Streaming data

  • Filtering data

  • Grouping and joining data

  • Sorting data

  • Combining and splitting data

Pig Latin even supports a wide variety of types, expressions, functions, diagnostic operators, macros, and file system commands.

Relational operations:


CROSS:- Computes the cross product of two or more relations.

         Ex:- We have two text files, cross1.txt and cross2.txt:

         cross1.txt   cross2.txt

         123          456

         234          567


           Create a script file for Pig, i.e.

#> vi crossscript.pig
   Write the script as below:
-- assumes tab-separated fields; no schema is declared, so fields are referenced by position ($0)
A = LOAD 'cross1.txt' USING PigStorage('\t');
B = LOAD 'cross2.txt' USING PigStorage('\t');
C = CROSS A, B;
D = ORDER C BY $0;
STORE D INTO 'cross-output';

    Save the file and run the script.


#> pig -x local crossscript.pig
> cd cross-output
cross-output> cat part-r-00000

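Output is:

123 567
123 456
234 567
234 456
345 456
345 567

The rows beginning with 345 imply a third line, 345, in cross1.txt that the file listing above does not show.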
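To make the cross product concrete, here is a minimal Python sketch of what a cross product computes. It is a model of the semantics only, not how Pig runs the job; the helper name `cross` is hypothetical, and the rows are the two lines per file shown in the listing.

```python
from itertools import product

def cross(a, b):
    """Pair every tuple of `a` with every tuple of `b` (cross product)."""
    return [left + right for left, right in product(a, b)]

# One field per row, as in the cross1.txt / cross2.txt listing.
cross1 = [("123",), ("234",)]
cross2 = [("456",), ("567",)]

# sorted() stands in for the ORDER step of the script.
result = sorted(cross(cross1, cross2))
for row in result:
    print(*row)
```

Two rows crossed with two rows always give four rows, which is why the output grows quickly with input size.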


UNION:- Computes the union of two or more relations.

         Ex:- We have two text files, T1.txt and T2.txt:

         T1.txt   T2.txt
         123      456
         234      567

Write the script as below:

#> vi unionscript.pig
-- assumes tab-separated fields; no schema is declared
A = LOAD 'T1.txt' USING PigStorage('\t');
B = LOAD 'T2.txt' USING PigStorage('\t');
C = UNION A, B;
STORE C INTO 'union-output';

        Run the script as below:

> pig -x local unionscript.pig
> cd union-output
union-output> cat part-r-00000
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7

The row 3 4 5 implies a third line in T1.txt that the file listing does not show.
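As a rough illustration of the union semantics, the Python sketch below simply concatenates relations. Pig's UNION does not remove duplicates and does not guarantee tuple order; the helper name `union` is hypothetical, and the rows are the two lines per file shown in the listing.

```python
def union(*relations):
    """Concatenate relations, keeping duplicates (no de-duplication)."""
    out = []
    for rel in relations:
        out.extend(rel)
    return out

t1 = [(1, 2, 3), (2, 3, 4)]
t2 = [(4, 5, 6), (5, 6, 7)]

merged = union(t1, t2)
for row in merged:
    print(*row)
```

If duplicates must be removed, a DISTINCT step has to follow the union.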

JOIN:- Performs an inner join of two or more relations based on common field values.

Cmd:> vi join1.txt
100   Gopal      23000    Hyd
101   Raj        24000    Pune
102   Rajesh     25000    Kerala
103   Rakesh     45000    Bangalore
> vi join2.txt
300   Jaypal     20000    Hyd
101   Raj        24000    Pune
> vi joinscript.pig
-- assumes tab-separated fields (PigStorage's default delimiter)
data1 = LOAD 'join1.txt' USING PigStorage('\t') AS (id:int, name:chararray, salary:int, address:chararray);
data2 = LOAD 'join2.txt' USING PigStorage('\t') AS (id:int, name:chararray, salary:int, address:chararray);
JoinData = JOIN data1 BY id, data2 BY id;
STORE JoinData INTO 'join-output';
> pig -x local joinscript.pig

> cd join-output

> cat part-r-00000
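The inner join can be pictured with a small Python model: index one relation by the join key, then emit a combined tuple for every match. This is only a sketch of the semantics; the function name `inner_join` is hypothetical, and the rows are taken from join1.txt and join2.txt above.

```python
from collections import defaultdict

def inner_join(left, right, key=0):
    """Emit left+right tuples for every pair that shares the same key field."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [l + r for l in left for r in index.get(l[key], [])]

join1 = [
    (100, "Gopal", 23000, "Hyd"),
    (101, "Raj", 24000, "Pune"),
    (102, "Rajesh", 25000, "Kerala"),
    (103, "Rakesh", 45000, "Bangalore"),
]
join2 = [
    (300, "Jaypal", 20000, "Hyd"),
    (101, "Raj", 24000, "Pune"),
]

joined = inner_join(join1, join2)
print(joined)
```

Only id 101 appears in both files, so the model yields a single combined tuple; rows without a partner are dropped, which is what makes the join an inner join.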

Aggregate Functions:-

#> vi Agg.txt

666/Ravi varma/23

COUNT:-

grunt> A = LOAD 'Agg.txt' USING PigStorage('/') AS (id:int, name:chararray, age:int);

grunt> B = GROUP A BY age;

grunt> C = FOREACH B GENERATE group, COUNT(A.id);

grunt> DUMP B; DUMP C;

Output:- (23,2)

MAX:- grunt> C = FOREACH B GENERATE group, MAX(A.id);

      Gives the maximum id within each age group.

SUM:- grunt> C = FOREACH B GENERATE group, SUM(A.id);

Output:- (23,1443)



TOKENIZE:- Splits a string into a bag of words.

grunt> A = LOAD 'Token.txt' AS (record:chararray);
grunt> B = FOREACH A GENERATE TOKENIZE(record);
grunt> DUMP B;
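A rough Python equivalent of the TOKENIZE step is shown below. It assumes the default separators are space, double quote, comma, parentheses and asterisk; the helper name `tokenize` and the sample record are hypothetical.

```python
import re

def tokenize(record):
    """Split a record into words on the assumed default separators."""
    return [tok for tok in re.split(r'[ ",()*]+', record) if tok]

words = tokenize("Pig Latin, on (Hadoop)")
print(words)
```

In Pig the result is a bag of single-field tuples rather than a list, so a FLATTEN is typically applied before counting words.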


DISTINCT:-

  • Removes duplicate tuples in a relation.

  • DISTINCT does not preserve the original order of the contents. To eliminate duplicates, Pig must first sort the data.

  • You cannot use DISTINCT on a subset of fields.

Ex:- grunt> A = LOAD 'data' AS (a1:int, a2:int, a3:int);

Output: (8,3,4) (4,3,3) (1,2,3)

grunt> X = DISTINCT A;

grunt> DUMP X;
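The sort-then-deduplicate idea behind DISTINCT can be sketched in Python. The input reuses the three tuples quoted above, with two duplicates added as an assumption so that there is something to remove; `distinct` is a hypothetical helper.

```python
def distinct(relation):
    """Sort the tuples, then keep each one only the first time it appears."""
    out = []
    for row in sorted(relation):
        if not out or out[-1] != row:
            out.append(row)
    return out

a = [(8, 3, 4), (1, 2, 3), (4, 3, 3), (4, 3, 3), (1, 2, 3)]
x = distinct(a)
print(x)
```

Because whole tuples are compared, the original order is lost and duplicates are detected only across all fields, which mirrors the two restrictions listed above.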





FILTER:-

  • Selects tuples from a relation based on some condition.

Ex:- grunt> A = LOAD 'data' AS (f1:int, f2:int, f3:int);

grunt> DUMP A;







grunt> X = FILTER A BY f3 == 3;    -- keeps tuples whose third field equals 3

grunt> DUMP X;




grunt> X = FILTER A BY (f3 == 8) OR (NOT (f2+f3 > f1));

grunt> DUMP X;
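Both FILTER conditions can be modelled in Python as predicates over tuples. The sample tuples are an assumption made for illustration, since the contents of the input relation are not shown; fields f1, f2, f3 map to positions 0, 1, 2.

```python
a = [(8, 3, 4), (4, 3, 3), (1, 2, 3)]  # assumed sample data

# FILTER A BY f3 == 3
x1 = [t for t in a if t[2] == 3]

# FILTER A BY (f3 == 8) OR (NOT (f2+f3 > f1))
x2 = [t for t in a if t[2] == 8 or not (t[1] + t[2] > t[0])]

print(x1)
print(x2)
```

For (8,3,4) the sum f2+f3 is 7, which is not greater than f1, so the NOT branch keeps it even though f3 is not 8.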


GROUP:-

  • Groups the data in one or more relations.

Note:- The GROUP and COGROUP operators are identical. Both operators work with one or more relations. For readability:

  • GROUP is used in statements involving one relation.

  • COGROUP is used in statements involving two or more relations.

  • You can COGROUP up to, but not more than, 127 relations at a time.


grunt> A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
grunt> DESCRIBE A;
A: {name: chararray, age: int, gpa: float}
grunt> DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)

  • Now, suppose we group relation A on field "age" to form relation B:

grunt> B = GROUP A BY age;

grunt> DESCRIBE B;

B: {group: int, A: {(name: chararray, age: int, gpa: float)}}

grunt> DUMP B;

(18,{(John,18,4.0F),(Joe,18,3.8F)})

(19,{(Mary,19,3.8F)})

(20,{(Bill,20,3.9F)})
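The shape of a grouped relation, one key plus a bag of the original tuples, can be reproduced with a short Python sketch using the student rows above. The helper name `group_by` is hypothetical.

```python
from collections import defaultdict

students = [
    ("John", 18, 4.0),
    ("Mary", 19, 3.8),
    ("Bill", 20, 3.9),
    ("Joe", 18, 3.8),
]

def group_by(relation, key_index):
    """Map each key value to the list (bag) of whole tuples that carry it."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    return dict(groups)

b = group_by(students, 1)
for age in sorted(b):
    print(age, b[age])
```

Note that each bag holds complete tuples, including the age field, which is why the grouped output repeats the key inside every inner tuple.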


PARTITION BY:-

  • To use the Hadoop Partitioner, add a PARTITION BY clause to the appropriate operator:


grunt> A = LOAD 'input.data';
grunt> B = GROUP A BY $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner PARALLEL 2;

Code for SimpleCustomPartitioner:-

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.pig.impl.io.PigNullableWritable;

public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> {
    @Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        // Integer keys are partitioned by value; all other keys by hash code.
        if (key.getValueAsPigType() instanceof Integer) {
            return ((Integer) key.getValueAsPigType()).intValue() % numPartitions;
        }
        return key.hashCode() % numPartitions;
    }
}
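The partitioning rule itself is easy to try outside Hadoop. The Python sketch below mirrors the logic only (integer keys by value, everything else by hash); `get_partition` is a hypothetical stand-in, not part of any Pig or Hadoop API.

```python
def get_partition(key, num_partitions):
    """Integer keys go to key % n; other keys go to hash(key) % n."""
    if isinstance(key, int):
        return key % num_partitions
    return hash(key) % num_partitions

for key in (7, 10, 101):
    print(key, "->", get_partition(key, 2))
```

With two partitions, even integer keys land in partition 0 and odd keys in partition 1. One difference worth knowing: Python's `%` never returns a negative result for a positive divisor, whereas Java's can for negative keys, so a production partitioner usually guards against that case.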


LIMIT:-

  • Limits the number of output tuples.

  • If the specified number of output tuples is equal to or exceeds the number of tuples in the relation, all tuples in the relation are returned.

Note:- The LIMIT operator allows Pig to avoid processing all tuples in a relation.

  • In most cases, a query that uses LIMIT will run more efficiently than an identical query that does not use LIMIT. It is always a good idea to use LIMIT if you can.

grunt> B = GROUP A ALL;
grunt> C = FOREACH B GENERATE COUNT(A) AS sum;
grunt> D = ORDER A BY $0;
grunt> E = LIMIT D C.sum/100;    -- keeps the first 1% of the sorted tuples

  • Suppose we have relation A:

grunt> A = LOAD 'data' AS (a1:int, a2:int, a3:int);
grunt> DUMP A;








  • In this example, the output is limited to 3 tuples. Note that there is no guarantee which 3 tuples will be output.

grunt> X = LIMIT A 3;
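Why LIMIT can save work is easiest to see with a lazy source. In this Python sketch the hypothetical generator `scan` records how many tuples were actually read, and `limit` stops pulling as soon as it has enough; it is an analogy for the idea, not a description of Pig's execution plan.

```python
from itertools import islice

def limit(relation, n):
    """Return at most n tuples, reading no more of the input than needed."""
    return list(islice(relation, n))

consumed = []

def scan():
    """Hypothetical lazy source that records every tuple it hands out."""
    for i in range(1000):
        consumed.append(i)
        yield (i,)

x = limit(scan(), 3)
print(x, "read:", len(consumed))
```

Only three of the thousand tuples are read. When the limit is at least as large as the relation, every tuple comes back, for example `limit([(1,), (2,)], 5)` returns both tuples.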








Copy Rights Reserved © Mindmajix.com All rights reserved. Disclaimer.