Hadoop’s Pig Data Types and Syntax

Pig Data Types

Every piece of data in Pig has one of these four types:

Data Atom: a simple atomic data value. It is stored as a string but can be used as either a string or a number.

Examples: ‘apache.org’ and ‘1.0’

Tuple: a data record consisting of a sequence of “fields”, where each field is a piece of data of any type (data atom, tuple, data bag or data map).

We denote tuples with <> bracketing.

An example of a tuple is <apache.org, 1.0>.

Data Bag: a collection of tuples (duplicate tuples are allowed).

Think of it as a “table”, except that Pig does not require that the tuple field types match, or even that the tuples have the same number of fields. A bag could be {<apache.org, 1.0>, <flickr.com, 0.8>}.

Data Map: a map from keys that are string literals to values that can be of any data type.

Think of it as a HashMap<String, X>, where X can be any of the four Pig data types.

A data map supports the expected get and put interface.
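
The four types can be pictured with a small Python analogy. This is only a sketch to make the nesting concrete: it is not Pig Latin and not a Pig API, and the variable names and sample values beyond those in the text are assumptions made for illustration.

```python
# Illustrative Python analogy only; this is not Pig Latin or a Pig API.

# Data atom: stored as a string, usable as a string or as a number.
atom_site = "apache.org"
atom_rank = "1.0"
rank_as_number = float(atom_rank)  # the same atom read as a number

# Tuple: an ordered sequence of fields, <apache.org, 1.0> in the text's notation.
t1 = ("apache.org", "1.0")
t2 = ("flickr.com", "0.8")

# Data bag: a collection of tuples. Duplicates are allowed, and the tuples
# need not share field types or even the same number of fields.
bag = [t1, t2, t1, ("pig.apache.org",)]  # last tuple is a hypothetical one-field record

# Data map: string keys mapped to values of any of the four types.
data_map = {
    "site": atom_site,        # atom
    "entry": t1,              # tuple
    "links": bag,             # bag
    "meta": {"lang": "en"},   # nested map (hypothetical content)
}

# "get" and "put" in the sense used above.
data_map["rank"] = atom_rank        # put
looked_up = data_map.get("site")    # get
```

A Python list stands in for the bag here because, like a bag, it keeps duplicate tuples; a Python set would silently drop them.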

Data Types in Pig:

Other Languages    Pig
int                int
string             chararray
float              float
long               long
double             double
boolean            boolean

Different Transformations in Pig:

REGISTER: Registers a JAR file with the Pig runtime.

DEFINE: Creates an alias for a macro, UDF, streaming script, or command specification.

IMPORT: Imports macros defined in a separate file into a script.

Typical Transformations:

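LOAD: Loads data from the file system.

FILTER: Removes unwanted rows (tuples) from a relation.

FOREACH: Iterates over the tuples of a relation; used together with GENERATE, for example to select particular columns.

GENERATE: Used with FOREACH to add or remove fields from a relation.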

GROUP: To group the data in a single relation.

COGROUP: To group or join the data in two or more relations.

UNION: To merge the contents of two or more relations.

SPLIT: To partition the contents of a relation into multiple relations.

JOIN (Inner or Outer): To join the data in two or more relations.

ORDER: To sort a relation by one or more fields.

LIMIT: To limit the size of a relation to a maximum number of tuples.
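
To make the effect of these operators concrete, the following Python sketch mimics each one on small in-memory lists of tuples. It is an analogy under stated assumptions, not Pig Latin: the relations `pages`, `extra` and `categories`, their field layouts, and the helper `group_by` are all hypothetical, and a real Pig script would run these steps as distributed jobs and does not guarantee tuple order except after ORDER.

```python
from collections import defaultdict

# Hypothetical relations: bags of (url, category, rank) and (category, refresh) tuples.
pages = [
    ("apache.org", "news", 1.0),
    ("flickr.com", "photo", 0.8),
    ("cnn.com", "news", 0.9),
    ("example.com", "photo", 0.2),
]
extra = [("pig.apache.org", "docs", 0.7)]
categories = [("news", "daily"), ("photo", "weekly"), ("video", "monthly")]

# FILTER: keep only the tuples that satisfy a condition.
good = [t for t in pages if t[2] > 0.5]

# FOREACH ... GENERATE: produce selected (or computed) fields for each tuple.
urls = [(t[0],) for t in good]

# GROUP: collect the tuples of one relation under each key value.
def group_by(relation, key_index):
    groups = defaultdict(list)
    for t in relation:
        groups[t[key_index]].append(t)
    return dict(groups)

by_category = group_by(good, 1)

# COGROUP: for each key, one bag per input relation (a bag may be empty).
keys = sorted({p[1] for p in pages} | {c[0] for c in categories})
cogrouped = {
    k: ([p for p in pages if p[1] == k], [c for c in categories if c[0] == k])
    for k in keys
}

# UNION: merge the contents of two relations (duplicates would be kept).
merged = pages + extra

# SPLIT: partition one relation into several by condition.
high = [t for t in merged if t[2] >= 0.8]
low = [t for t in merged if t[2] < 0.8]

# JOIN (inner): pair tuples from two relations whose keys match.
joined = [p + c for p in pages for c in categories if p[1] == c[0]]

# ORDER, then LIMIT: sort by rank descending and keep at most two tuples.
ordered = sorted(merged, key=lambda t: t[2], reverse=True)
top2 = ordered[:2]
```

Note how COGROUP keeps the key "video" with an empty bag on the `pages` side, whereas the inner JOIN drops it because no page has that category.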

Debugging Pig Latin:  

Pig Latin provides operators that help you debug Pig Latin statements.

DUMP: To display the results to your terminal screen

DESCRIBE: To review the schema of a relation.

EXPLAIN: To view the logical, physical or MapReduce execution plans used to compute a relation.

ILLUSTRATE: To view the step-by-step execution of a series of statements.

