Using UDFs in SQL Queries

Scenario

SQL on Elasticsearch is enhanced based on the SQL plug-in of the open source Open Distro. This section describes how to implement the UDF feature in SQL based on the SQL plug-in of Open Distro and Painless of Elasticsearch.

The UDF feature is used to calculate and convert the value of a field in an index to obtain a custom field, for example:

Combines the firstname and lastname fields and returns a custom field name.
Multiplies the value of a numeric field by 2 and returns a new field.
Converts the value of a string field to uppercase letters.

Prerequisites

In security mode, you have configured the following permissions based on the current authentication mode:

In user- and role-based authentication mode, see Authentication Based on Users and Roles for configuration.
- To create a UDF using a script, you must belong to the elasticsearch user group.
- To use UDFs in SQL statements, you must belong to the supergroup user group.
In Ranger authentication mode, configure the permissions by referring to Adding a Ranger Access Permission Policy for Elasticsearch and additionally select the script and read permissions in Permissions.

UDF Compilation and Verification

In Elasticsearch, UDF compilation is to create a stored script.

Example 1:

# Create script calculate_multiplier.
POST _scripts/calculate_multiplier
{
  "script": {
    "lang": "painless",
    "source": "doc[params['field']].value * params['multiplier'] "
  }
}

# Delete script calculate_multiplier.
DELETE _scripts/calculate_multiplier

Example 2:

# Create script to_upper.
POST _scripts/to_upper
{
  "script": {
    "lang": "painless",
    "source": "params['_source'][params['field']].toUpperCase()"
  }
}
# Delete script to_upper.
DELETE _scripts/to_upper

lang: language supported by the script. When the script is used as a UDF, set lang to painless.
For details about the Painless syntax, visit the following website:

https://www.elastic.co/guide/en/elasticsearch/painless/7.10/painless-lang-spec.html
source: operations in a UDF
In example 1, source indicates that the content of the field parameter is obtained and the content is multiplied by the value of the multiplier parameter.

In example 2, source indicates that the content of the field parameter is obtained and then it is converted to uppercase letters.

For details about the operations that can be used in source, visit the following website:

https://www.elastic.co/guide/en/elasticsearch/painless/7.10/painless-api-reference-field-java-lang.html#painless-api-reference-field-String

https://www.elastic.co/guide/en/elasticsearch/painless/7.10/painless-api-reference-shared-java-lang.html#painless-api-reference-shared-Number
The parameter type can be int, long, string, or double.
script id: In the preceding examples, calculate_multiplier and to_upper are unique identifiers of the script, that is, script ID.
If the script is used as a UDF in SQL queries, the value of script id can contain only letters, numbers, hyphens (-), and underscores (_).
doc[]: Obtains the value of a specified field from doc_values. By default, the numeric type and keyword type support doc_values, but the text type does not support doc_values.
To obtain the content of a field (for example, of the text type) that does not support docvalue in a script, use_ params['_source'], that is, obtain the value of the field from _source. You should avoid using this method as it is inefficient to obtain values from _source.
UDFs need to be used in specific indexes. If a specific field like doc['my_field'] is used, ensure that the index that uses a UDF contains the my_field field. Otherwise, an exception occurs.

You can run commands similar to the following to check whether the preceding script can be executed as expected.

You are advised to verify the validity of the stored script after your compilation.

PUT accounts/_bulk?refresh
{"index":{"_id":"1"}}
{"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"amberduke@pyrami.com","city":"Brogan","state":"IL"}
{"index":{"_id":"2"}}
{"account_number":6,"balance":5686,"firstname":"Hattie","lastname":"Bond","age":36,"gender":"M","address":"671 Bristol Street","employer":"Netagy","email":"hattiebond@netagy.com","city":"Dante","state":"TN"}
{"index":{"_id":"3"}}
{"account_number":13,"balance":32838,"firstname":"Nanette","lastname":"Bates","age":28,"gender":"F","address":"789 Madison Street","employer":"Quility","email":"nanettebates@quility.com","city":"Nogal","state":"VA"}

GET accounts/_search
{
  "script_fields": {
    " script filled": {
      "script": { 
        "id": "to_upper", 
        "params": {
          "field": "firstname"
        }
      }
    }
  }
}

accounts: index name
to_upper: name of a stored script
The content in params is a parameter list.

Using UDFs in SQL Queries

Use a script as a UDF in SQL queries.

When a UDF is used in SQL queries, the format is as follows:

UDF_calculate_multiplier(field=age,multiplier=2)

UDF_: Fixed start of a UDF used in SQL queries
calculate_multiplier: name of the script created in Elasticsearch. The value can contain only letters, numbers, hyphens (-), and underscores (_).
The content in the brackets indicates the parameter list. Keys and values are separated by equal signs (=). Multiple parameters are separated by commas (,). Brackets must be added regardless of whether parameters are contained.
When UDFs are used in SQL queries, UDF_ is used as the keyword for identifying UDFs. Field names starting with UDF_ cannot be queried because ambiguity may occur, resulting in lexical and syntax analysis exceptions in SQL queries.

Supported scenarios

The UDF calculation result in a SELECT statement is used as a custom field.

POST _opendistro/_sql/
{
  "query":"SELECT UDF_calculate_multiplier(field=age,multiplier=2) FROM accounts"
}
 
POST _opendistro/_sql/
{
  "query":"SELECT UDF_to_upper(field=firstname) FROM accounts"
}

SELECT UDF is used together with other fields for query.

POST _opendistro/_sql/
{
  "query":"SELECT UDF_to_upper(field=firstname), firstname FROM accounts"
}

UDFs support aliases.

POST _opendistro/_sql/
{
  "query":"SELECT UDF_calculate_multiplier(field=age,multiplier=2) as doubleAge FROM accounts"
}

UDFs are used together with the WHERE clause.

POST _opendistro/_sql/
{
"query":"SELECT firstname, UDF_calculate_multiplier(field=age,multiplier=2)  as udfalias FROM accounts WHERE age > 10"
}

UDFs are used together with the LIMIT clause.

POST _opendistro/_sql/
{
  "query":"SELECT age, UDF_calculate_multiplier(field=age,multiplier=2)  as pv FROM accounts  limit 1"
}

Multiple UDFs are used together.

POST _opendistro/_sql/
{
  "query":" SELECT UDF_calculate_multiplier(field=age,multiplier=2),UDF_to_upper(field=firstname) FROM accounts"
}

UDFs are used together with the GROUP BY clause.

POST _opendistro/_sql/
{
  "query":" SELECT UDF_calculate_multiplier(field=age,multiplier=2) as pv FROM accounts WHERE age > 0 GROUP BY pv"
}

UDFs are used as a parameter for sum(), avg(), count(), min(), and max() on the condition that the return values of the UDFs are of the numeric type.
```
POST _opendistro/_sql/
{
  "query":" SELECT sum(UDF_calculate_multiplier(field=age,multiplier=2)) FROM accounts"
}
```

Using UDFs in SQL Subqueries

UDFs used in SQL queries support only one-layer subquery instead of multi-layer nesting. Query results of inner layers are used as UDF parameters in the outer layer.

If a UDF still needs to be used in indexes, add the index=Index name parameter to the outer query to specify the index on which the UDF is calculated. This parameter takes effect only in the UDF of the outer SQL query, if this parameter is configured in the inner UDF, it will be ignored.

index=Index name indicates all indexes by default. If nested SQL queries are used, this parameter must be configured. Otherwise, an exception occurs because an index may not contain fields defined in the UDF.

Example of using nested UDFs:

POST _opendistro/_sql/
{
  "query" : "SELECT UDF_calculate_multiplier(field=age,multiplier='num',index=accounts) FROM (SELECT account_number as num from accounts limit 3) as a"
}

Outer query supports only SELECT and FROM and cannot be used together with other keywords.
The inner query results and the inner query results used in outer UDFs must have aliases. Otherwise, the syntax will be abnormal.
The alias of the inner query results used in outer UDFs must be unique to avoid ambiguity.
When the inner query results are used as the parameter of the outer query, the alias of the subquery field can be directly used in the outer UDF. That is, num in the preceding example instead of a.num.

When UDFs are used in a SQL query that contains subqueries, each result of the subqueries needs to be interacted with Elasticsearch again because the subquery results are used as a parameter in the UDFs and cannot be pushed down to Elasticsearch. That is, queries are amplified in subqueries, and an SQL query is converted into several (value of size) inner queries. To ensure query performance, reduce the value of size returned by the queries as much as possible. This value is adjustable by configuring opendistro.query.size_limit. The default value is 200.

Parent topic: Using Elasticsearch

Previous topic: Index Lifecycle Management

Next topic: Connecting Elasticsearch to Other Components