User-defined functions (UDF)

๐Ÿ“˜

Currently available in the AWS US environment only

User-defined functions (UDF) are routines that accept parameters, perform actions like complex calculations, and return the result of that action as a value. The return value can either be a single scalar value or a result set.

ByteHouse supports the following types of UDF:

  1. Lambda UDF: User-defined lambda function
  2. Python UDF: Define functions in Python and return scalarย result

1. Lambda UDF

Creates a UDF from a lambda expression. The expression must contain function parameters, constants, operators, or other function calls.

Syntax

CREATE FUNCTION name AS (parameter0, ...) -> expression

For more general syntax instructions, please refer to the Python UDF syntax section

Restrictions

A function can have an arbitrary number of parameters, with a few restrictions:

  1. The name of a function must be unique, different from other user-defined functions and system functions.
  2. Recursive lambda functions are not allowed.
  3. All variables used by a function must be specified in its parameter list.

If any of the above rules are violated, an exception will be raised.

Examples

As a UDF is created in a specific database, you need to create and specify a database to start creating and using UDF.

CREATE DATABASE myudf;
USE myudf;

Example 1

CREATE FUNCTION linear_equation AS (x, k, b) -> k*x + b;
SELECT number, linear_equation(number, 2, 1) FROM numbers(3);
โ”Œโ”€numberโ”€โ”ฌโ”€plus(multiply(2, number), 1)โ”€โ”
โ”‚      0 โ”‚                            1 โ”‚
โ”‚      1 โ”‚                            3 โ”‚
โ”‚      2 โ”‚                            5 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Example 2

Aย condition function is called in a user-defined function in the following query:

CREATE FUNCTION parity_str AS (n) -> if(n % 2, 'odd', 'even');
SELECT number, parity_str(number) FROM numbers(3);
โ”Œโ”€numberโ”€โ”ฌโ”€if(modulo(number, 2), 'odd', 'even')โ”€โ”
โ”‚      0 โ”‚ even                                 โ”‚
โ”‚      1 โ”‚ odd                                  โ”‚
โ”‚      2 โ”‚ even                                 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

2. Python UDF

Introduction

A Python UDF is a python script that can be called by ByteHouse. ByteHouse now supports Python written UDF. A scalar Python UDF incorporates a Python program that runs when the function is called and returns a single value.

When a query calls a Python UDF, the following steps occur at runtime:

  1. The function converts the input arguments to Python Numpy data types.
  2. The function passes the converted input arguments then runs the Python program.
  3. The Python code returns a single value. The data type of the return value must correspond to the RETURNS data type specified by the function definition.
  4. The function converts Python's return value to the defined data type, then returns that value to the query.

System Architect

1554

ByteHouse Cloud UDF capabilities ensure isolation while ensuring high performance and security:

  • Each server and worker container will have its own UDF sidecar container
  • High performance: Extended Python to access ClickHouse's raw vectorized memory data format, batch processing and zero network IO cost
  • Security: Minimized external access for UDF container
    - Only shared volume access is supported and network access can be disabled
    - Process isolation inside the UDF container

Data Types

SQL TypePython TypeNotes
UInt8NPY_UINT8
UInt16NPY_UINT16
UInt32NPY_UINT32
UInt64NPY_UIN64
Int8NPY_INT8
Int16NPY_INT16
Int32NPY_INT32
Int64NPY_INT64
Float32NPY_FLOAT32
Float64NPY_FLOAT64
Decimal32Cast to Float
Decimal64Cast to Float
UUIDNPY_STRINGFixed length of 16 bytes
DateNPY_UINT16https://numpy.org/doc/stable/reference/arrays.datetime.html
DateTimeNPY_UINT32
StringNPY_STRING
FixedStringNPY_STRING
Nullablemasked arrayhttps://numpy.org/doc/stable/reference/maskedarray.html
ArrayndarrayOnly support input as Array. https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html
  • Date Type: Stored in two bytes as the number of days since 1970-01-01 (unsigned). Supported range of values: [1970-01-01, 2148-12-31].

    Allows storing values from just after the beginning of the Unix Epoch to the upper threshold defined by a constant at the compilation stage (currently, this is until the year 2149, but the final fully-supported year is 2148).

  • DateTime Type: Stored in four bytes as the number of seconds since 1970-01-01 00:00:00 (unsigned). Supported range of values: [1970-01-01 00:00:00, 2105-12-31 23:59:59].

Runtime Manual

ByteHouse provides standard libraries, and we can also help users to pre-install customized dependencies if needed.

Syntax

1. CREATE Function

CREATE FUNCTION [database_name.]function_name [TYPE udf_type] [RETURNS data_type] [FLAGS flags] LANGUAGE PYTHON AS
$custom_tag$

class function_name():
    def entry(self, output, inputs):
        expression
        
$custom_tag$

# database_name is optional, current database will be used if not set
# udf_type: SCALAR (default), AGGREGATE (Not supported for current release)
# data_type: use the type of the first input column if not set, check 'Data Types' section above for all supported types
  • Flags
    The following flags are stored in BitMap. These flags are defined to optimize queries as well as to define certain functionality. When creating a UDF, it is optional to specify the flags.
UseDefaultImplementationForLowCardinalityColumns, /* enable always. convert all low cardinality columsn to ordinary column */
UseDefaultImplementationForConstants,             /* convert to ordinary column if all arguments are constant */
UseDefaultImplementationForNulls,                 /* peel off nullable and use nested column */
SuitableForShortCircuitArgumentsExecution,        /* evaluate this function lazily in short-circuit function arguments */
SuitableForConstantFolding,                       /* enable early execution if all arguments are constant */
CanBeExecutedOnLowCardinalityDictionary,          /* disable always. not support for now */
CanBeExecutedOnDefaultArguments,                  /* conterexample: modulo(0,0) */
DeterministicInScopeOfQuery,                      /* return same result for same values during single query. example: now() */
Deterministic,                                    /* return same result for same values. counterexample: rand() */
AlwaysPositiveMonotonicity,                       /* Positive monotonicity */
AlwaysNegativeMonotonicity,                       /* Negative monotonicity */
Injective,                                        /* func should return unique result if inputs are different. optimization for group by */
Stateful,                                         /* for pushdown and merge optimization */
  • $custom_tag$
    $custom_tag$ is a dollar-quoted string. In between the $$ symbols, you can use any custom tag or leave it empty, just make sure to close it again with the same tag. (Refer to PostgreSQL 4.1.2.4. Dollar-quoted String Constants).

    For example, here are two different ways to specify the string "Dianne's horse" using dollar quoting:

$$Dianne's horse$$
$SomeTag$Dianne's horse$SomeTag$

-- Everything inside $tag$ is treated as python code. 
-- No python syntax checks will be done during creation of Python UDF.

2. DROP Function

DROP FUNCTION [IF EXISTS] [db_name].function_name;
  • IF EXISTS is optional, it prevents errors if the function doesn't exist. In a normal case, if the function doesn't exist and you try to drop the function, you will face an error.
  • db_name is optional if you use USE db_name to check the database initially.

3. SHOW Function

To see all UDF functions and specific UDF of the currently used DB function list

SHOW functions;  -- view all udf
DESCRIBE FUNCTION udf_name; --- view specific udf

4. SHOW CREATE Function

To see the definition of the specific UDF

SHOW CREATE FUNCTION udf_name;

5. SELECT Function

UDF works like any other function in ByteHouse, but it requires you to specify a database name to access it. For example, to use a UDF which calculates the area of a circle:

drop function if exists test.area_of_circle;

CREATE FUNCTION test.area_of_circle RETURNS Float64 LANGUAGE PYTHON AS

$pikachu$
from iudf import IUDF
from overload import overload
from math import pi

class area_of_circle(IUDF):
    @overload
    def process(radius):
        return radius*radius*pi
$pikachu$;

Examples

Example 1:

This UDF example will add the value of each row in this table and output the result.

  • Create the Database and Table for UDF
    Before creating a Python UDF, you must specify the database name and table.
CREATE DATABASE test;

CREATE TABLE test.test_udf_int
(
    a Int64,
    b Int64,
    c Int64,
    d Int64
)
ENGINE = CnchMergeTree()
PRIMARY KEY a
ORDER BY a;

INSERT INTO test.test_udf_int(a, b, c, d)
VALUES (0, 0, 0, 0 ) (1, 1, 1, 1) (2, 2, 2, 2);
  • Create the UDF
    Next, create the UDF function by using CREATE FUNCTION. This is an example of how to write the entry function in Python and uses [code] as the delimiter.

    As ByteHouse is a column-based database,  but generally you want to do the calculation for each row. In this example, the entry function uses an iterator to loop through each row and sum up the four input values in each row to get the output value.
    
CREATE FUNCTION test.f_plus_int RETURNS Int64 LANGUAGE PYTHON AS
$code$
import numpy as np

class f_plus_int:
    def entry(self, output, inputs):
        nargs = len(inputs)

        it = np.nditer([output] + inputs,
                ['refs_ok', 'buffered'], [['writeonly']] + [['readonly']]*nargs)

        with it:
            for out, *ins in it:
                out[...] = self.process(*[x.item() for x in ins])
                
    def process(self, x1, y1, x2, y2):
       return x1 + y2 + x2 + y2
$code$;

Alternatively, you can use the fully column-based method provided by Python Numpy as shown below. It does the same operation as test.f_plus_int.

Note: The performance of using the column-based method is better than using loops.

CREATE FUNCTION test.f_plus_int RETURNS Int64 LANGUAGE PYTHON AS
$code$
import numpy as np

class f_plus_int:
    def entry(self, output, inputs):
        np.add.reduce(inputs, out=output)
$code$;
  • Call the UDF
    After the Python UDF function is created, it can be called just like a general built-in function.
select test.f_plus_int(a,b,c,d) from test.test_udf_int;
โ”Œโ”€test.f_plus_int(a, b, c, d)โ”€โ”
โ”‚                           0 โ”‚
โ”‚                           4 โ”‚
โ”‚                           8 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
  • Drop the UDF
drop function test.f_plus_int;

Example 2:

  • Create the Database and Table for UDF
CREATE TABLE test.py_fixedstring
(
    a FixedString(2),
    b FixedString(3)
)
ENGINE = CnchMergeTree()
ORDER BY b;

INSERT INTO test.py_fixedstring(a, b)
VALUES ('aa', 'aaa') ('bb', 'bbb') ('cc', 'ccc') ('dd', 'ddd') ('ee', 'eee') ('ff', 'fff') ('gg', 'ggg') ('ab', 'abc') ('cb', 'cba') ('tt', 'zzz');
  • Create the UDF
    For return types like FixedString, you must specify a length for the returned string.
CREATE FUNCTION test.py_fixed_string
RETURNS FixedString(5)
LANGUAGE PYTHON AS
$code$
import numpy as np

class py_fixed_string:
    def entry(self, output, inputs):
        nargs = len(inputs)

        it = np.nditer([output] + inputs,
                ['refs_ok', 'buffered'], [['writeonly']] + [['readonly']]*nargs)

        with it:
            for out, *ins in it:
                out[...] = self.process(*[x.item() for x in ins])

    def process(self, a, b):
        return a + b
$code$;
  • Call the UDF
select test.py_fixed_string(a, b) from test.py_fixedstring;
โ”Œโ”€test.py_fixed_string(a, b)โ”€โ”
โ”‚ aaaaa                      โ”‚
โ”‚ ababc                      โ”‚
โ”‚ bbbbb                      โ”‚
โ”‚ cbcba                      โ”‚
โ”‚ ccccc                      โ”‚
โ”‚ ddddd                      โ”‚
โ”‚ eeeee                      โ”‚
โ”‚ fffff                      โ”‚
โ”‚ ggggg                      โ”‚
โ”‚ ttzzz                      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
  • Drop the UDF
drop function test.py_fixed_string;

Example 3:

  • Create the Database and Table you need for UDF
CREATE TABLE test.test_udf_array
(
    a Array(UInt8)
)
ENGINE = CnchMergeTree()
PRIMARY KEY a
ORDER BY a;

insert into test.test_udf_array values ([1,2,3]), ([6,7,8]);
  • Create the UDF
    In the entry function, this example just uses a simple for loop to do the calculation.
CREATE FUNCTION test.f_test_array RETURNS UInt64 LANGUAGE PYTHON AS
$code$
import numpy as np

class f_test_array:
    def entry(self, output, inputs):
        for i in range(0, len(output)):
            output[i] = np.sum(inputs[0][i])
$code$;
  • Call the UDF
select test.f_test_array(a) from test.test_udf_array;
โ”Œโ”€test.f_test_array(a)โ”€โ”
โ”‚                    6 โ”‚
โ”‚                   21 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
  • Drop the UDF
drop test.test_udf_array

Overload Function

In addition, ByteHouse provides overload.py which allows you to create multiple UDFs in a single create function statement. Each function can take in different parameters as input as shown below:

CREATE FUNCTION test.f_plus_int
RETURNS Date
LANGUAGE PYTHON AS
$code$
from iudf import IUDF
from overload import overload

class f_plus_int(IUDF):
    @overload
    def process(a): 
        return a + 1

    @overload
    def process(a, b):
        return a + b
$code$;

@overload is the decorator that wraps the function and returns a callable object of type Function. If you use this entry function, you must use the decorator @overload in the function. You need to call from entry to avoid errors of having mismatched parameter numbers.