
Hadoop’s Pig Data Types and Syntax

Pig Data Types

Every piece of data in Pig has one of these four types:

Data Atom: a simple atomic data value. It is stored as a string but can be used as either a string or a number.
Examples: ‘apache.org’ and ‘1.0’
 
Tuple: a data record consisting of a sequence of “fields”. Each field is a piece of data of any type (data atom, tuple or data bag).
 
  We denote tuples with <> bracketing.
  Example of a tuple: <apache.org, 1.0>
 
Data Bag: a collection of tuples (duplicate tuples are allowed).
 
  Think of it as a “table”, except that Pig does not require that the tuple field types match, or even that the tuples have the same number of fields. Bags are denoted with {} bracketing.
 
Data Map: a map from keys that are string literals to values that can be of any data type.
 
  Think of it as a HashMap<String, X>, where X can be any of the four Pig data types.
 
  A data map supports the expected get and put interface.
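
To make the four types concrete, here is a rough sketch in plain Python. It is only an analogy: Python strings, tuples, lists, and dicts stand in for atoms, tuples, bags, and maps, and none of this is Pig's own API. The ‘flickr.com’ and ‘0.8’ values and all variable names are made up for illustration.

```python
# Plain-Python stand-ins for Pig's four data types (an analogy, not Pig's API).

# Data atom: stored as a string, usable as either a string or a number.
atom = "1.0"
as_text = atom            # used as a string
as_number = float(atom)   # used as a number

# Tuple: an ordered sequence of fields.
record = ("apache.org", "1.0")

# Bag: a collection of tuples. Duplicates are allowed, and the tuples
# need not have the same number of fields or matching field types.
bag = [
    ("apache.org", "1.0"),
    ("apache.org", "1.0"),           # duplicate tuple is kept
    ("flickr.com", "0.8", "extra"),  # hypothetical row with three fields
]

# Map: string keys mapped to values of any of the four types.
data_map = {}
data_map["site"] = "apache.org"   # put an atom
data_map["visits"] = bag          # put a bag
site = data_map["site"]           # get
```

A Python list is used for the bag rather than a set because a set would silently drop the duplicate tuple, which a Pig bag keeps.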
 
Data Types in Pig:

Other languages    Pig
int                int
string             chararray
float              float
long               long
double             double
boolean            boolean

Different Transformations in Pig:

REGISTER: To register a JAR file with the Pig runtime

DEFINE: To create an alias for a macro, UDF, streaming script, or command specification

IMPORT: To import macros defined in a separate file into a script

Typical Transformations:

LOAD: To load data from the file system

FILTER: To remove unwanted rows from a relation

FOREACH: To process each tuple of a relation, typically to select particular columns

GENERATE: Used with FOREACH to add or remove fields from a relation

GROUP: To group data in a single relation.

COGROUP: To group or join data in two or more relations

UNION: To merge the contents of two or more relations

SPLIT: To partition the contents of a relation into multiple relations

JOIN (Inner or Outer): To join the data in two or more relations

ORDER: To sort a relation by one or more fields

LIMIT: To limit the size of a relation to a maximum number of tuples
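
The behaviour of these operators can be approximated in plain Python. The sketch below is not Pig Latin and does not use any Pig API; it only mimics what each transformation does to a relation, with a relation modelled as a list of tuples. The employee data, field layout, and variable names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical relation of (name, dept, salary) tuples.
employees = [
    ("alice", "eng", 120),
    ("bob", "eng", 100),
    ("carol", "ops", 90),
    ("dave", "ops", 60),
]

# FILTER: keep only the tuples that satisfy a condition.
well_paid = [t for t in employees if t[2] >= 90]

# FOREACH ... GENERATE: produce selected fields for each tuple.
names_and_depts = [(name, dept) for (name, dept, _salary) in well_paid]

# GROUP: collect the tuples that share a key into one bag per key.
by_dept = defaultdict(list)
for t in employees:
    by_dept[t[1]].append(t)

# UNION: merge the contents of two relations (duplicates would be kept).
contractors = [("erin", "eng", 80)]
everyone = employees + contractors

# SPLIT: partition one relation into several relations by condition.
eng = [t for t in everyone if t[1] == "eng"]
ops = [t for t in everyone if t[1] == "ops"]

# JOIN (inner): pair tuples from two relations whose keys match.
dept_names = [("eng", "Engineering"), ("ops", "Operations")]
joined = [e + d for e in employees for d in dept_names if e[1] == d[0]]

# ORDER then LIMIT: sort by salary, highest first, and keep at most two tuples.
top_two = sorted(employees, key=lambda t: t[2], reverse=True)[:2]
```

Note that the joined tuples carry the key from both sides, so each one has five fields; this mirrors how an inner join concatenates the matching tuples.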

Debugging Pig Latin:  

  Pig Latin provides operators that help you debug Pig Latin statements.
 
DUMP: To display the results to your terminal screen
 
DESCRIBE: To review the schema of a relation
 
EXPLAIN: To view the logical, physical, or MapReduce execution plans used to compute a relation
 
ILLUSTRATE: To view the step-by-step execution of a series of statements.
