
How to Write Hive UDF (User-Defined Functions) - Hadoop

Hive UDF

Sometimes the query you want to write can't be expressed easily using the built-in functions that Hive provides.

By letting you write a UDF (user-defined function), Hive makes it easy to plug in your own processing code and invoke it from a Hive query.

UDFs have to be written in Java, the language that Hive itself is written in.

There are three types of UDFs in Hive:

1. UDFs (regular user-defined functions)
2. UDAFs (user-defined aggregate functions)
3. UDTFs (user-defined table-generating functions)

They differ in the number of rows they accept as input and produce as output.

1) UDF:-

A UDF operates on a single row and produces a single row as its output. Most functions, such as mathematical functions, are of this type.

2) UDAF:-

A UDAF works on multiple input rows and creates a single output row. Aggregate functions include functions such as COUNT and MAX.

3) UDTF:-

A UDTF operates on a single row and produces multiple rows (a table) as output.

Table-generating functions are less well known than the other two types.
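The difference between the three kinds is easiest to see as rows in versus rows out. The following plain-Java sketch does not use the Hive API at all. The class and method names (RowShapes, udf, udaf, udtf) are made up for illustration, and Java lists stand in for Hive rows.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: models row cardinality, not the Hive API.
class RowShapes {
    // UDF-like: one output value per input row (N rows in, N rows out).
    static List<Integer> udf(List<Integer> rows) {
        return rows.stream().map(x -> Math.abs(x)).collect(Collectors.toList());
    }

    // UDAF-like: many input rows collapse into a single value.
    // Assumes the input list is not empty.
    static int udaf(List<Integer> rows) {
        return rows.stream().mapToInt(Integer::intValue).max().getAsInt();
    }

    // UDTF-like: one input row may produce several output rows.
    static List<String> udtf(List<String> rows) {
        return rows.stream()
                   .flatMap(row -> Arrays.stream(row.split(",")))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(udf(Arrays.asList(-1, 2, -3)));   // three rows in, three rows out
        System.out.println(udaf(Arrays.asList(4, 9, 2)));    // three rows in, one value out
        System.out.println(udtf(Arrays.asList("a,b", "c"))); // two rows in, three rows out
    }
}
```

In Hive itself the engine drives these calls row by row. The sketch only mirrors the shape of the input and output.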

Ex:- Consider a table with a single column, x, which contains arrays of strings.

hive> CREATE TABLE arrays (x ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002';

After running a LOAD DATA command, the following query confirms that the data was loaded correctly:

hive> SELECT * FROM arrays;

["a","b"]

["c","d","e"]

Next, we can use the explode UDTF to transform this table.

This function emits a row for each entry in the array, so in this case the type of the output column y is STRING.

The result is that the table is flattened into five rows:

hive> SELECT explode(x) AS y FROM arrays;
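To make the flattening concrete, here is a small plain-Java sketch of what explode does to the two rows above. It is not Hive code: ExplodeSketch and its explode method are hypothetical names, and nested Java lists stand in for the ARRAY column x.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: mimics the effect of explode on the arrays table.
class ExplodeSketch {
    // Emit one output row per array element, preserving order.
    static List<String> explode(List<List<String>> column) {
        List<String> out = new ArrayList<>();
        for (List<String> array : column) {
            out.addAll(array);
        }
        return out;
    }

    public static void main(String[] args) {
        // The two rows of column x shown by SELECT * FROM arrays.
        List<List<String>> x = Arrays.asList(
            Arrays.asList("a", "b"),
            Arrays.asList("c", "d", "e"));
        for (String y : explode(x)) {
            System.out.println(y); // one line each for a, b, c, d, e
        }
    }
}
```

Two input rows holding two and three elements give the five output rows mentioned in the text.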

SELECT statements using UDTFs have some restrictions, such as not being able to retrieve additional column expressions.

Writing a UDF:-

To illustrate the process, we can write a simple UDF that trims characters from the ends of strings.

Hive already has a built-in function called trim, so we will call ours strip.

The code for the Strip Java class is shown below:

package com.hadoopbook.hive;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {
  // Reused across calls to avoid allocating a new Text for every row.
  private Text result = new Text();

  // strip(str): remove leading and trailing whitespace.
  public Text evaluate(Text str) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString()));
    return result;
  }

  // strip(str, stripChars): remove any of stripChars from both ends.
  public Text evaluate(Text str, String stripChars) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString(), stripChars));
    return result;
  }
}

A UDF must satisfy the following two properties:

1. A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
2. A UDF must implement at least one evaluate() method.

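The Strip class has two evaluate() methods, which are not defined by an interface: an evaluate() method may take any number of arguments, so Hive inspects the class to find the one that matches the call.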
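Because evaluate() is not declared in any interface, it has to be located at runtime. As a rough sketch of how such a lookup could work, the following uses plain Java reflection and matches only on the number of arguments. This is a simplification for illustration, not Hive's actual resolver, which also has to take argument types into account. EvaluateLookup and ToyUdf are hypothetical names, and String is used in place of Hadoop's Text so that the example runs without Hive on the classpath.

```java
import java.lang.reflect.Method;

// Illustration only: a toy lookup, far simpler than Hive's real resolution.
class EvaluateLookup {
    // Hypothetical stand-in for a UDF with two overloaded evaluate() methods.
    public static class ToyUdf {
        public String evaluate(String s) {
            return s.toUpperCase();
        }

        public String evaluate(String a, String b) {
            return a + b;
        }
    }

    // Find an evaluate() method with the right number of parameters and invoke it.
    static Object call(Object udf, Object... args) {
        for (Method m : udf.getClass().getMethods()) {
            if (m.getName().equals("evaluate") && m.getParameterCount() == args.length) {
                try {
                    return m.invoke(udf, args);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        throw new IllegalArgumentException(
            "no evaluate() taking " + args.length + " argument(s)");
    }

    public static void main(String[] args) {
        ToyUdf udf = new ToyUdf();
        System.out.println(call(udf, "ab"));       // uses the one-argument overload
        System.out.println(call(udf, "ab", "cd")); // uses the two-argument overload
    }
}
```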

The first strips leading and trailing whitespace from the input, while the second strips any of a supplied set of characters from the ends of the string.

To use the UDF in Hive, package the compiled Java class in a JAR file and register the file with Hive:

hive> ADD JAR /path/to/Hive-examples.jar;

We also need to create an alias for the Java class name:

hive> CREATE TEMPORARY FUNCTION strip AS 'com.hadoopbook.hive.Strip';

As an alternative to calling ADD JAR, you can specify at launch time a path where Hive looks for auxiliary JAR files to put on its classpath.

This technique is useful for automatically adding your own library of UDFs every time you run Hive.

There are two ways of specifying the path: either by passing the --auxpath option to the hive command, as below:

% hive --auxpath /path/to/Hive-examples.jar

or by setting the HIVE_AUX_JARS_PATH environment variable before invoking Hive.

The UDF is now ready to be used, just like a built-in function:

hive> SELECT EMPID, strip(EMPNAME), ESAL FROM Employee;

(Or)

hive> SELECT strip('banana', 'ab') FROM dummy;

Output is: nan
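The stripping behaviour can be checked outside Hive with a small plain-Java stand-in. The sketch below is not the Commons Lang implementation and does not use the Hive API. StripSketch is a hypothetical class that imitates the two evaluate() overloads on plain strings, assuming non-null arguments.

```java
// Illustration only: a hand-written stand-in for the two strip overloads.
class StripSketch {
    // Remove leading and trailing whitespace.
    static String strip(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && Character.isWhitespace(s.charAt(start))) {
            start++;
        }
        while (end > start && Character.isWhitespace(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    // Remove any character contained in stripChars from both ends.
    static String strip(String s, String stripChars) {
        int start = 0;
        int end = s.length();
        while (start < end && stripChars.indexOf(s.charAt(start)) >= 0) {
            start++;
        }
        while (end > start && stripChars.indexOf(s.charAt(end - 1)) >= 0) {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(strip("  bee  "));      // prints "bee"
        System.out.println(strip("banana", "ab")); // prints "nan"
    }
}
```

For 'banana' with the characters 'ab', the leading "ba" and the trailing "a" are removed, which leaves "nan". Stripping stops at the first character on each end that is not in the set, so the inner "a" is untouched.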

