Apache Pig Interview Questions

Last Updated: 06.02.2018

If you're looking for Apache Pig interview questions, whether you are experienced or a fresher, you are at the right place. There are a lot of opportunities at many reputed companies around the world. According to research, Apache Pig has a market share of about 0.7%, so you still have the opportunity to move ahead in your career in Apache Pig engineering. Mindmajix offers Advanced Apache Pig Interview Questions 2018 that help you crack your interview and land your dream career as an Apache Pig engineer.

1. What do we understand by Pig?

Pig is an Apache open-source project that runs on Hadoop and provides an engine for parallel data flows. It includes a language called Pig Latin for expressing these data flows. Pig Latin offers operations such as sort, join, and filter, and lets users write UDFs (User Defined Functions) for reading, writing, and processing data. Pig uses HDFS for storage and MapReduce for processing all of its tasks.

2. What is the difference between Pig and SQL?

* Pig Latin departs sharply from SQL's declarative style of coding, whereas Hive's query language is similar to SQL.
* Pig sits on top of Hadoop and, in principle, could sit on top of Dryad too.
* Both Hive and Pig commands compile to MapReduce jobs.

3. Explain the need for MapReduce when programming in Apache Pig.

Apache Pig programs are written in a query language called Pig Latin, which is analogous to SQL. Running a query requires an execution engine: the Pig engine converts the queries into MapReduce jobs, so MapReduce acts as the underlying execution engine that runs the programs.

4. Explain BloomMapFile.

BloomMapFile is a class that extends the MapFile class. It is used in the HBase table format to provide a fast membership test for keys, using dynamic Bloom filters.

5. What is a bag in Pig?

In Apache Pig, a collection of tuples is known as a bag.

6. Why do we need the FOREACH operation in Pig scripts?

The FOREACH operation applies an action to every element of a data bag, so that the action is performed on each element to generate new data items.

7. Explain the different data types in Pig.

Apache Pig supports the following three complex data types:

* Map: a key-value store in which each key is joined to its value with #.
* Tuple: similar to a row in a table, with the fields separated by commas. A tuple can have multiple attributes.
* Bag: an unordered collection of tuples, which may contain duplicate tuples.

8. What is the function of FLATTEN in Pig?

Sometimes the data in a tuple or bag needs to have a level of nesting removed. In those cases, Pig's built-in FLATTEN modifier is used. FLATTEN un-nests bags and tuples. For a tuple, FLATTEN substitutes the fields of the tuple in place of the tuple itself, whereas un-nesting a bag is more complex because it requires creating new tuples.
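The two cases can be illustrated outside Pig with a small plain-Java sketch. It is not Pig code: tuples are modelled as lists, a bag as a list of tuples, and the class and method names (FlattenSketch, flattenTuple, flattenBag) are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of FLATTEN semantics (assumed model: tuple = List, bag = List of tuples).
final class FlattenSketch {

    // Flattening a nested tuple: its fields take the place of the tuple itself.
    static List<Object> flattenTuple(Object outerField, List<Object> nestedTuple) {
        List<Object> row = new ArrayList<>();
        row.add(outerField);
        row.addAll(nestedTuple);
        return row;
    }

    // Flattening a bag: one new output tuple is created per tuple in the bag.
    static List<List<Object>> flattenBag(Object outerField, List<List<Object>> bag) {
        List<List<Object>> rows = new ArrayList<>();
        for (List<Object> inner : bag) {
            rows.add(flattenTuple(outerField, inner));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<Object>> bag = Arrays.asList(
                Arrays.<Object>asList("x", 1),
                Arrays.<Object>asList("y", 2));
        System.out.println(flattenTuple("k", Arrays.<Object>asList("x", 1)));
        System.out.println(flattenBag("k", bag));
    }
}
```

Flattening the tuple yields a single wider row, while flattening the bag yields one row per inner tuple, which is why the bag case has to create new tuples.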

9. What are DESCRIBE and EXPLAIN in Apache Pig scripts?

EXPLAIN and DESCRIBE are important debugging utilities in Apache Pig.

DESCRIBE is helpful to developers writing Pig scripts because it displays the schema of a relation in the script. Developers who are new to Apache Pig can use it to understand how each operator transforms the data. A Pig script can contain multiple DESCRIBE statements.

EXPLAIN is extremely helpful to Hadoop developers when they are trying to optimize Pig Latin scripts or debug errors. EXPLAIN can be applied to a specific alias in a script or to the entire script in the Grunt interactive shell. It produces several graphs in text format, which can be printed to a file.

10. How does the user communicate with the shell in Apache Pig?

Users interact with HDFS or the local file system through Grunt, Apache Pig's interactive shell. To start Grunt, invoke Pig without a script or command:

* Running the command "pig -x local" brings up the prompt: grunt>
* Pig Latin scripts can run in either local mode or cluster mode, depending on the configuration set in PIG_CLASSPATH.
* To exit the Grunt shell, press CTRL+D or type quit.

11. What is the function of ILLUSTRATE in Apache Pig?

Running Pig scripts on very large data sets is usually time-consuming, so developers often run their scripts on sample data instead. The risk is that the selected sample may not exercise the script properly. For example, if the script contains a join operator, the sample must contain at least a few records with the same key, or the join will return no results. To handle such issues, developers use ILLUSTRATE. It takes a sample of the data, and whenever it encounters operators that remove data, such as filter or join, it makes sure that some records pass through and some do not, modifying records where necessary so that they meet the condition. ILLUSTRATE shows the output of every step but does not run any MapReduce jobs.

12. What do we know about the case sensitivity of Pig?

At first, it is hard to say whether Pig is case sensitive or case insensitive. User-defined functions, field names, and relation names are case sensitive: the function COUNT is not the same as count, and X = load 'foo' is not the same as x = load 'foo'. Keywords in Pig, on the other hand, are case insensitive: LOAD is the same as load.

13. Distinguish between the physical and logical plans in an Apache Pig script.

Both a logical plan and a physical plan are generated while a Pig script is executed. Pig scripts are first checked by the interpreter. The logical plan is generated after parsing and semantic checking; no data is processed while the logical plan is being generated. The logical plan contains the collection of operators in the script but does not contain the edges between the operators. Once the logical plan has been generated, script execution moves on to the physical plan. The physical plan describes the physical operators that Pig will use to execute the script. It is more or less a sequence of MapReduce jobs, although the plan itself makes no reference to how it will be executed in MapReduce. During generation of the physical plan, the logical operator cogroup is converted into three physical operators: Local Rearrange, Global Rearrange, and Package.

14. Is COGROUP a group of more than one data set?

COGROUP groups data sets. When there is more than one data set, COGROUP groups each data set and then joins them on a common field. So yes, COGROUP can be described as a group of more than one data set.

15. Differentiate between HiveQL & PigLatin.

* PigLatin is a procedural language, whereas HiveQL is declarative.
* In HiveQL it is necessary to specify the schema, whereas in PigLatin it is optional.
* PigLatin has a nested relational data model, whereas HiveQL has a flat data model.

16. What are the uses of Apache Pig?

Apache Pig is a big data tool used specifically for iterative processing, for traditional ETL data pipelines, and for research on raw data. Pig works in situations where the schema is unknown, incomplete, or inconsistent, and it is used by developers who want to work with the data before it is loaded into the data warehouse. For example, websites use it to build behavior prediction models that capture how visitors respond to different images, ads, articles, and so on.

17. Is PigLatin a strongly typed language?

In a strongly typed language, the user must declare the type of every variable up front. In Pig, when you describe the schema of the data, Pig expects the data to arrive in that format. If the schema is unknown, the script adapts to the actual data types at run time. So PigLatin is strongly typed in most cases, but in some situations it is gently typed: it keeps working with data that does not live up to its expectations.

18. Distinguish between the COGROUP and GROUP operators.

The GROUP and COGROUP operators are identical in function and can each work with one or more relations. By convention, for readability, GROUP is used to group the data in a single relation, while COGROUP is used to group the data of two or more relations. COGROUP is something of a combination of GROUP and JOIN: it groups the tables on a column and then joins them on the grouped columns. COGROUP can work with up to 127 relations at a time.
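The "group each relation, then line the groups up by key" behavior can be sketched in plain Java. This is only an illustration of the semantics, not Pig code: each input row is assumed to be a {key, value} pair, and CogroupSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of COGROUP on two relations of {key, value} rows.
final class CogroupSketch {

    // Result: key -> [bag of values from a, bag of values from b].
    static Map<String, List<List<String>>> cogroup(List<String[]> a, List<String[]> b) {
        Map<String, List<List<String>>> out = new TreeMap<>();
        collect(out, a, 0);
        collect(out, b, 1);
        return out;
    }

    // Adds each row's value to the bag reserved for its relation (slot 0 or 1).
    private static void collect(Map<String, List<List<String>>> out, List<String[]> relation, int slot) {
        for (String[] row : relation) {
            List<List<String>> bags = out.computeIfAbsent(row[0], k -> {
                List<List<String>> pair = new ArrayList<>();
                pair.add(new ArrayList<>());
                pair.add(new ArrayList<>());
                return pair;
            });
            bags.get(slot).add(row[1]);
        }
    }

    public static void main(String[] args) {
        List<String[]> owners = Arrays.asList(new String[]{"alice", "dog"}, new String[]{"bob", "cat"});
        List<String[]> friends = Arrays.asList(new String[]{"alice", "carol"});
        System.out.println(cogroup(owners, friends));
    }
}
```

A key that appears in only one relation still gets an entry, with an empty bag for the other relation, which is the main visible difference from an inner join.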

19. What do we understand by the outer bag and the inner bag in Pig?

An outer bag is simply a relation in Pig, whereas an inner bag is a relation inside another bag.

20. Differentiate between COUNT and COUNT_STAR functions in Pig.

The COUNT_STAR function includes NULL values in its count, whereas the COUNT function does not include NULL values when counting the number of elements in a bag.
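The difference can be illustrated with a plain-Java sketch. It is not Pig code, and it assumes the documented rule that COUNT skips a tuple when its first field is null; CountSketch and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of COUNT versus COUNT_STAR on a bag of tuples.
final class CountSketch {

    // COUNT-like: a tuple is counted only if its first field is not null (assumed rule).
    static long count(List<List<Object>> bag) {
        return bag.stream()
                .filter(t -> !t.isEmpty() && t.get(0) != null)
                .count();
    }

    // COUNT_STAR-like: every tuple is counted, nulls included.
    static long countStar(List<List<Object>> bag) {
        return bag.size();
    }

    public static void main(String[] args) {
        List<List<Object>> bag = Arrays.asList(
                Arrays.<Object>asList("a"),
                Arrays.<Object>asList((Object) null),
                Arrays.<Object>asList("b"));
        System.out.println("COUNT-like:      " + count(bag));
        System.out.println("COUNT_STAR-like: " + countStar(bag));
    }
}
```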

21. Does Pig support multi-line commands?

Pig supports both single-line and multi-line commands. With a single-line command, Pig executes the statement but does not store the result in the file system, whereas with multi-line commands it stores the data in HDFS.

22. If I have a relation R, how can I get the top 10 tuples from R?

The TOP() function returns the top N tuples from a relation or a bag of tuples. N is passed as a parameter to TOP(), along with the column whose values are to be compared and the relation R.
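The three parameters (N, the comparison column, and the relation) can be mirrored in a plain-Java sketch. This is an illustration of the idea rather than Pig code: tuples are assumed to be lists of integers, the column is a zero-based index, and TopSketch.top is a hypothetical helper.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java illustration of picking the top N tuples by one column.
final class TopSketch {

    // Returns the n tuples with the largest values in the given zero-based column.
    static List<List<Integer>> top(int n, int column, List<List<Integer>> relation) {
        return relation.stream()
                .sorted(Comparator.comparing((List<Integer> t) -> t.get(column)).reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<Integer>> r = Arrays.asList(
                Arrays.asList(1, 5),
                Arrays.asList(2, 9),
                Arrays.asList(3, 7));
        System.out.println(top(2, 1, r));
    }
}
```

The sketch sorts the whole input for simplicity; a bounded priority queue would avoid the full sort on large inputs.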

23. How can we combine the contents of two or more relations, and how can we divide a single relation into two or more relations?

This can be done with the UNION and SPLIT operators respectively.

24. What are the various types of Java UDFs supported by Apache Pig?

The types of User Defined Functions supported in Pig are Eval, Algebraic, and Filter functions.

25. What are the common features of Pig and Hive?

PigLatin and HiveQL both convert their commands into MapReduce jobs, and neither can be used for OLAP transactions, because executing low-latency queries with them is extremely difficult.

26. If we have a file employee.txt in an HDFS directory with at least 100 records and want to see only the first 25 records, how can we do this?

First, load the file employee.txt into a relation named employee. Then get the first 25 records from it with the LIMIT operator: Result = LIMIT employee 25;

27. What are the limitations of Pig scripts?

Following are some of the limitations of Apache Pig:

* Apache Pig isn't well suited to analyzing a single record in a huge data set.
* The Pig platform is designed for ETL-type use cases; it's not a good choice for synchronous or real-time scenarios.
* Apache Pig is built on top of MapReduce, which is itself oriented toward batch processing.

28. Can we join multiple fields in Apache Pig scripts?

Yes. The JOIN operator in Pig can join on multiple fields. It takes records from one input and joins them with another input; you specify the key (or keys) for each input, and two rows are joined when their keys are equal.

29. Why do we use filters in Apache Pig?

Like the WHERE clause in SQL, Apache Pig has a FILTER operator for extracting records based on a predicate or specified condition. A record is passed down the pipeline if the condition evaluates to true. The predicate can contain a variety of operators such as ==, <=, !=, and >=. For instance: X = LOAD 'inputs' AS (name, address); Y = FILTER X BY name matches 'Mr.*';
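As far as the Pig documentation describes it, the matches operator takes a Java-style regular expression that has to match the whole field value, not just a part of it. The plain-Java sketch below illustrates how the pattern Mr.* behaves under that assumption; MatchesSketch and filterMatches are hypothetical names, and the sample values are made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java illustration of a whole-string regular-expression filter.
final class MatchesSketch {

    // Keeps only the values that the regex matches in full (String.matches semantics).
    static List<String> filterMatches(List<String> values, String regex) {
        return values.stream()
                .filter(v -> v.matches(regex))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Mr. Smith", "Mrs. Jones", "Dr. Mr. Brown", "Ms. Lee");
        System.out.println(filterMatches(names, "Mr.*"));
    }
}
```

Note that the dot in Mr.* is a regex wildcard, so "Mrs. Jones" also passes; a literal dot would need to be escaped.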

30. What is UDF in Pig?

If the built-in operators do not provide some functionality, developers can implement it by writing User Defined Functions (UDFs) in languages such as Java, Python, or Ruby, and then embed them in the Pig Latin script.

Q. How to write a Java UDF?
A UDF can be developed by extending the EvalFunc class and overriding its exec method.
Example: This UDF replaces a given string with another string. Both strings are read from the job configuration.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.util.UDFContext;

    public class Transform extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            // Nothing to transform for an empty input row.
            if (input == null || input.size() == 0) {
                return null;
            }
            try {
                Configuration conf = UDFContext.getUDFContext().getJobConf();
                // String to search for.
                String from = conf.get("replace.string");
                if (from == null) {
                    throw new IOException("replace.string should not be null");
                }
                // Replacement string; the property key is blank in the original example.
                String to = conf.get("");
                if (to == null) {
                    throw new IOException(" should not be null");
                }
                String str = (String) input.get(0);
                return str.replace(from, to);
            } catch (Exception e) {
                throw new IOException("caught exception processing input row", e);
            }
        }
    }

Q. What is the Grunt shell?
Grunt is Pig's interactive shell: the output of each command is shown then and there, so you see immediately whether it succeeded or failed.

Q. What is PigStorage?
PigStorage loads or stores relations using a field-delimited text format.
Each line is broken into fields using a configurable field delimiter (a tab character by default), and the fields are stored in the fields of a tuple. It is the default storage function when none is specified.
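The line-to-fields step can be sketched in plain Java. This only illustrates the idea of splitting on a configurable delimiter and is not how PigStorage itself is implemented; DelimitedLineSketch and splitFields are hypothetical names, and empty fields are simply kept as empty strings here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration of breaking one text line into fields on a delimiter.
final class DelimitedLineSketch {

    // Splits the line on the delimiter, keeping empty fields (limit -1) so positions stay aligned.
    static List<String> splitFields(String line, char delimiter) {
        String regex = Pattern.quote(String.valueOf(delimiter));
        return Arrays.asList(line.split(regex, -1));
    }

    public static void main(String[] args) {
        System.out.println(splitFields("1\tJohn\tNY", '\t'));
        System.out.println(splitFields("a,,c", ','));
    }
}
```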

Q. Where does Pig live?
1. Pig is installed on the user's machine.
2. There is no need to install anything on the Hadoop cluster.
3. The Pig and Hadoop versions must be compatible.
4. Pig submits jobs to the Hadoop cluster, where they are executed.

Q. What types of applications is Hive used for?
1. Summarization
Ex: Daily/weekly aggregations of impression/click counts
2. Complex measures of user engagement
3. Ad hoc analysis
Ex: How many group admins, broken down by state/country
4. Data mining (assembling training data)
Ex: User engagement as a function of user attributes
5. Spam detection
6. Anomalous patterns for site integrity
7. Application API usage patterns
8. Ad optimization
9. Document indexing
10. Customer-facing business intelligence (Ex: Google Analytics)
11. Predictive modeling, hypothesis testing

Q. What is HiveQL?
1. Hive supports an SQL-like query language called HiveQL, with select, join, aggregate, union all, and subqueries in the FROM clause.
2. It supports DDL statements such as CREATE TABLE with a serialization format, partitioning, and bucketing columns.
3. It has commands to load data from external sources and INSERT it into Hive tables.
4. It does not support UPDATE and DELETE.
5. It supports multi-table INSERT.
6. It supports user-defined column transformation (UDF) and aggregation (UDAF) functions written in Java.

Q. What is the difference between Pig and SQL?

* Pig is procedural, whereas SQL is declarative.
* Pig has a nested relational data model, whereas SQL has a flat relational data model.
* In Pig the schema is optional, whereas in SQL the schema is required.
* Pig targets scan-centric analytic workloads, whereas SQL targets OLTP + OLAP workloads.
* Pig offers limited query optimization, whereas SQL offers significant opportunity for query optimization.

Q. What is the difference between Hive and Pig?

* Language: Hive uses SQL, whereas Pig uses Pig Latin.
* Schema: Hive uses table definitions that are stored in a metastore, whereas in Pig a schema is optionally defined at run time.
* Programmatic access: Hive is accessed through JDBC and ODBC, whereas Pig is accessed through PigServer.
* Partitions: Hive has partitions, whereas Pig has none.
* Server: in Hive a server is optional, whereas Pig has no server.
* Custom serializer/deserializer: supported by both.
* DFS direct access: in Hive at run time, in Pig by default.
* Join/order/set: possible in both.
* Shell command interface: available in both.
* Streaming: supported by both.
* Web interface: possible in Hive, whereas Pig has no web interface.

Q. What is the difference between MapReduce and Pig?

* MapReduce expects programming-language skills for writing the business logic, whereas Pig does not require much programming skill, because the whole logic is written using Pig transformations (operations).
* Any change to a MapReduce program means going through the entire process again: compiling the program, executing the program, packaging the program, and deploying it to the cluster environment. In Pig, the change is made with simple scripting and these other steps are avoided.
* Compared with MapReduce, Pig takes 5% of the code, 5% of the development time, and 25% of the execution time, which increases programmer productivity.
* As a general rule of thumb, a Hadoop MapReduce program that takes 200 lines of MapReduce code can be written in about 10 lines of Pig.
* MapReduce requires multiple stages, leading to long development life cycles, whereas Pig's rapid prototyping increases productivity. Pig supports log analysis and ad hoc queries across various large data sets.
