Updated on 2022-09-14 GMT+08:00

User-defined Functions

When the built-in functions of Hive cannot meet your requirements, you can write user-defined functions (UDFs) and use them in queries.

By implementation method, UDFs are classified as follows:

  • Common UDFs: operate on a single data row and output a single data row.
  • User-defined aggregating functions (UDAFs): take multiple data rows as input and output a single data row.
  • User-defined table-generating functions (UDTFs): operate on a single data row and output multiple data rows.
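
The three input/output shapes can be illustrated with a plain-Java analogy. This is only a sketch: the class UdfShapes and its methods are hypothetical names chosen for illustration, and none of them use the Hive APIs.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy only: nothing here is part of Hive.
class UdfShapes {
    // Common UDF shape: one row in, one value out.
    static String upper(String row) {
        return row == null ? null : row.toUpperCase();
    }

    // UDAF shape: many rows in, one value out (null values are skipped).
    static double sum(List<Double> rows) {
        double total = 0.0;
        for (Double v : rows) {
            if (v != null) {
                total += v;
            }
        }
        return total;
    }

    // UDTF shape: one row in, many rows out.
    static List<String> explode(String row) {
        return Arrays.asList(row.split(","));
    }

    public static void main(String[] args) {
        System.out.println(upper("hive"));                      // HIVE
        System.out.println(sum(Arrays.asList(1.0, 2.0, 3.5)));  // 6.5
        System.out.println(explode("a,b,c"));                   // [a, b, c]
    }
}
```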

By usage, UDFs are classified as follows:

  • Temporary functions: available only in the current session. They must be re-created after the session restarts.
  • Permanent functions: available in multiple sessions. You do not need to re-create them each time a session restarts.

The following uses AddDoublesUDF as an example to describe how to write and use a UDF.

Function Description

AddDoublesUDF adds two or more floating-point numbers. A common UDF such as this one must meet the following requirements:

  • A common UDF must extend org.apache.hadoop.hive.ql.exec.UDF.
  • A common UDF must implement at least one evaluate() method. The evaluate() method can be overloaded.
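
As a rough illustration of overloading, the sketch below defines two evaluate() methods that differ only in their parameter types. The class OverloadedEvaluateSketch is hypothetical and deliberately does not extend UDF so that it compiles without Hive on the classpath. In a real UDF, the class would extend org.apache.hadoop.hive.ql.exec.UDF as required above.

```java
// Hypothetical class for illustration; a real UDF would extend
// org.apache.hadoop.hive.ql.exec.UDF.
class OverloadedEvaluateSketch {
    // Overload for integer arguments: returns the arithmetic sum.
    public Integer evaluate(Integer a, Integer b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }

    // Overload for string arguments: returns the concatenation.
    public String evaluate(String a, String b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }

    public static void main(String[] args) {
        OverloadedEvaluateSketch sketch = new OverloadedEvaluateSketch();
        System.out.println(sketch.evaluate(1, 2));     // 3
        System.out.println(sketch.evaluate("1", "2")); // 12
    }
}
```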

Sample Code

The following is a UDF code example.

package com.huawei.bigdata.hive.example.udf;
import org.apache.hadoop.hive.ql.exec.UDF;

public class AddDoublesUDF extends UDF {
  public Double evaluate(Double... a) {
    Double total = 0.0;
    // Add up all non-null arguments; null values are skipped.
    for (int i = 0; i < a.length; i++) {
      if (a[i] != null) {
        total += a[i];
      }
    }
    return total;
  }
}
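
To check the summing logic locally before packaging, the evaluate() body can be copied into a class that has no Hive dependency. The following is a minimal sketch under that assumption. AddDoublesLocalCheck is a hypothetical name, and the method is made static only for convenience.

```java
// Hypothetical local harness: same logic as AddDoublesUDF.evaluate(),
// but without the Hive base class so it runs with a plain JDK.
class AddDoublesLocalCheck {
    static Double evaluate(Double... a) {
        Double total = 0.0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != null) {
                total += a[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(1.0, 2.0, 3.0));  // 6.0
        System.out.println(evaluate(1.5, null, 2.5)); // 4.0 (null is skipped)
        System.out.println(evaluate());               // 0.0 (no arguments)
    }
}
```

Note that null arguments are skipped rather than making the result null, and a call with no arguments returns 0.0.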

How to Use

  1. Log in to MRS Manager and configure the Hive administrator permission for the Hive service user who uses UDFs.

    1. Log in to MRS Manager, choose System > Manage Role > Create Role, and create a role with the Hive Admin Privilege permission.
    2. On MRS Manager, choose System > Manage User.
    3. In the Operation column of the user, click Modify.
    4. Bind a role with the Hive Admin Privilege permission to the user and click OK.

  2. Create a udf package under the example directory of the project, write the AddDoublesUDF class in it, package the project as a JAR file (for example, AddDoublesUDF.jar), and upload the JAR file to a specified HDFS directory (for example, /user/hive_examples_jars/). Grant read permission on the file to the users who create and use the function. The following are example statements.

    hdfs dfs -put AddDoublesUDF.jar /user/hive_examples_jars

    hdfs dfs -chmod 777 /user/hive_examples_jars

  3. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. Otherwise, skip this step. The user is the development user added in Preparing a Development User.

    kinit Hive service user

    For example, kinit -kt '/opt/conf/user.keytab' hiveuser (set the user.keytab path based on the site requirements).

  4. Run the following command to start the Beeline client:

    beeline -n Hive service user

  5. In the Beeline session, run the set role admin; command to enable the administrator role for the current session.

  6. Define the function in HiveServer. Run the following SQL statement to create a permanent function:

    CREATE FUNCTION addDoubles AS 'com.huawei.bigdata.hive.example.udf.AddDoublesUDF' using jar 'hdfs://hacluster/user/hive_examples_jars/AddDoublesUDF.jar';

    addDoubles is the function alias used in SELECT queries.

    Run the following statement to create a temporary function:

    CREATE TEMPORARY FUNCTION addDoubles AS 'com.huawei.bigdata.hive.example.udf.AddDoublesUDF' using jar 'hdfs://hacluster/user/hive_examples_jars/AddDoublesUDF.jar';

    • addDoubles is the function alias used in SELECT queries.
    • TEMPORARY indicates that the function is available only in the current session with HiveServer.

  7. Run the following SQL statement to use the function on the HiveServer:

    SELECT addDoubles(1,2,3);

    If [Error 10011] is reported when you use the function after logging in to the client again, run the reload function; command and then use the function again.

  8. Run the following SQL statement to delete the function from the HiveServer:

    DROP FUNCTION addDoubles;
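
The permanent and temporary statements in step 6 differ only in the TEMPORARY keyword. The sketch below builds both statement strings from the same inputs to make that difference explicit. CreateFunctionSql and its build() method are hypothetical helpers, not part of Hive or MRS. The sketch only assembles text and does not connect to HiveServer, and it assumes the inputs are trusted constants because no escaping is performed.

```java
// Hypothetical helper that assembles the CREATE FUNCTION statements from step 6.
// It performs no escaping, so pass only trusted constant values.
class CreateFunctionSql {
    static String build(String alias, String className, String jarUri, boolean temporary) {
        return "CREATE " + (temporary ? "TEMPORARY " : "") + "FUNCTION " + alias
            + " AS '" + className + "'"
            + " USING JAR '" + jarUri + "'";
    }

    public static void main(String[] args) {
        String className = "com.huawei.bigdata.hive.example.udf.AddDoublesUDF";
        String jarUri = "hdfs://hacluster/user/hive_examples_jars/AddDoublesUDF.jar";
        // Permanent function: remains available in later sessions.
        System.out.println(build("addDoubles", className, jarUri, false) + ";");
        // Temporary function: available only in the current session.
        System.out.println(build("addDoubles", className, jarUri, true) + ";");
    }
}
```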