User-defined functions (UDF)
Currently available in the AWS US environment only
User-defined functions (UDF) are routines that accept parameters, perform actions like complex calculations, and return the result of that action as a value. The return value can either be a single scalar value or a result set.
ByteHouse supports the following types of UDF:
- Lambda UDF: User-defined lambda function
- Python UDF: Define functions in Python and return scalar result
1. Lambda UDF
Creates a UDF from a lambda expression. The expression must contain function parameters, constants, operators, or other function calls.
Syntax
CREATE FUNCTION name AS (parameter0, ...) -> expression
For more general syntax instructions, please refer to the Python UDF syntax section
Restrictions
A function can have an arbitrary number of parameters, with a few restrictions:
- The name of a function must be unique, different from other user-defined functions and system functions.
- Recursive lambda functions are not allowed.
- All variables used by a function must be specified in its parameter list.
If any of the above rules are violated, an exception will be raised.
Examples
As a UDF is created in a specific database, you need to create and specify a database to start creating and using UDF.
CREATE DATABASE myudf;
USE myudf;
Example 1
CREATE FUNCTION linear_equation AS (x, k, b) -> k*x + b;
SELECT number, linear_equation(number, 2, 1) FROM numbers(3);
┌─number─┬─plus(multiply(2, number), 1)─┐
│ 0 │ 1 │
│ 1 │ 3 │
│ 2 │ 5 │
└────────┴──────────────────────────────┘
Example 2
A condition function is called in a user-defined function in the following query:
CREATE FUNCTION parity_str AS (n) -> if(n % 2, 'odd', 'even');
SELECT number, parity_str(number) FROM numbers(3);
┌─number─┬─if(modulo(number, 2), 'odd', 'even')─┐
│ 0 │ even │
│ 1 │ odd │
│ 2 │ even │
└────────┴──────────────────────────────────────┘
2. Python UDF
Introduction
A Python UDF is a python script that can be called by ByteHouse. ByteHouse now supports Python written UDF. A scalar Python UDF incorporates a Python program that runs when the function is called and returns a single value.
When a query calls a Python UDF, the following steps occur at runtime:
- The function converts the input arguments to Python Numpy data types.
- The function passes the converted input arguments then runs the Python program.
- The Python code returns a single value. The data type of the return value must correspond to the RETURNS data type specified by the function definition.
- The function converts Python's return value to the defined data type, then returns that value to the query.
System Architect

ByteHouse Cloud UDF capabilities ensure isolation while ensuring high performance and security:
- Each server and worker container will have its own UDF sidecar container
- High performance: Extended Python to access ClickHouse's raw vectorized memory data format, batch processing and zero network IO cost
- Security: Minimized external access for UDF container
- Only shared volume access is supported and network access can be disabled
- Process isolation inside the UDF container
Data Types
SQL Type | Python Type | Notes |
---|---|---|
UInt8 | NPY_UINT8 | |
UInt16 | NPY_UINT16 | |
UInt32 | NPY_UINT32 | |
UInt64 | NPY_UIN64 | |
Int8 | NPY_INT8 | |
Int16 | NPY_INT16 | |
Int32 | NPY_INT32 | |
Int64 | NPY_INT64 | |
Float32 | NPY_FLOAT32 | |
Float64 | NPY_FLOAT64 | |
Decimal32 | Cast to Float | |
Decimal64 | Cast to Float | |
UUID | NPY_STRING | Fixed length of 16 bytes |
Date | NPY_UINT16 | https://numpy.org/doc/stable/reference/arrays.datetime.html |
DateTime | NPY_UINT32 | |
String | NPY_STRING | |
FixedString | NPY_STRING | |
Nullable | masked array | https://numpy.org/doc/stable/reference/maskedarray.html |
Array | ndarray | Only support input as Array. https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html |
-
Date Type: Stored in two bytes as the number of days since 1970-01-01 (unsigned). Supported range of values: [1970-01-01, 2148-12-31].
Allows storing values from just after the beginning of the Unix Epoch to the upper threshold defined by a constant at the compilation stage (currently, this is until the year 2149, but the final fully-supported year is 2148).
-
DateTime Type: Stored in four bytes as the number of seconds since 1970-01-01 00:00:00 (unsigned). Supported range of values: [1970-01-01 00:00:00, 2105-12-31 23:59:59].
Runtime Manual
ByteHouse provides standard libraries, and we can also help users to pre-install customized dependencies if needed.
Syntax
1. CREATE Function
CREATE FUNCTION [database_name.]function_name [TYPE udf_type] [RETURNS data_type] [FLAGS flags] LANGUAGE PYTHON AS
$custom_tag$
class function_name():
def entry(self, output, inputs):
expression
$custom_tag$
# database_name is optional, current database will be used if not set
# udf_type: SCALAR (default), AGGREGATE (Not supported for current release)
# data_type: use the type of the first input column if not set, check 'Data Types' section above for all supported types
Flags
The following flags are stored in BitMap. These flags are defined to optimize queries as well as to define certain functionality. When creating a UDF, it is optional to specify the flags.
UseDefaultImplementationForLowCardinalityColumns, /* enable always. convert all low cardinality columsn to ordinary column */
UseDefaultImplementationForConstants, /* convert to ordinary column if all arguments are constant */
UseDefaultImplementationForNulls, /* peel off nullable and use nested column */
SuitableForShortCircuitArgumentsExecution, /* evaluate this function lazily in short-circuit function arguments */
SuitableForConstantFolding, /* enable early execution if all arguments are constant */
CanBeExecutedOnLowCardinalityDictionary, /* disable always. not support for now */
CanBeExecutedOnDefaultArguments, /* conterexample: modulo(0,0) */
DeterministicInScopeOfQuery, /* return same result for same values during single query. example: now() */
Deterministic, /* return same result for same values. counterexample: rand() */
AlwaysPositiveMonotonicity, /* Positive monotonicity */
AlwaysNegativeMonotonicity, /* Negative monotonicity */
Injective, /* func should return unique result if inputs are different. optimization for group by */
Stateful, /* for pushdown and merge optimization */
-
$custom_tag$
$custom_tag$
is a dollar-quoted string. In between the $$ symbols, you can use any custom tag or leave it empty, just make sure to close it again with the same tag. (Refer to PostgreSQL 4.1.2.4. Dollar-quoted String Constants).For example, here are two different ways to specify the string "Dianne's horse" using dollar quoting:
$$Dianne's horse$$
$SomeTag$Dianne's horse$SomeTag$
-- Everything inside $tag$ is treated as python code.
-- No python syntax checks will be done during creation of Python UDF.
2. DROP Function
DROP FUNCTION [IF EXISTS] [db_name].function_name;
IF EXISTS
is optional, it prevents errors if the function doesn't exist. In a normal case, if the function doesn't exist and you try to drop the function, you will face an error.db_name
is optional if you useUSE db_name
to check the database initially.
3. SHOW Function
To see all UDF functions and specific UDF of the currently used DB function list
SHOW functions; -- view all udf
DESCRIBE FUNCTION udf_name; --- view specific udf
4. SHOW CREATE Function
To see the definition of the specific UDF
SHOW CREATE FUNCTION udf_name;
5. SELECT Function
UDF works like any other function in ByteHouse, but it requires you to specify a database name to access it. For example, to use a UDF which calculates the area of a circle:
drop function if exists test.area_of_circle;
CREATE FUNCTION test.area_of_circle RETURNS Float64 LANGUAGE PYTHON AS
$pikachu$
from iudf import IUDF
from overload import overload
from math import pi
class area_of_circle(IUDF):
@overload
def process(radius):
return radius*radius*pi
$pikachu$;
Examples
Example 1:
This UDF example will add the value of each row in this table and output the result.
- Create the Database and Table for UDF
Before creating a Python UDF, you must specify the database name and table.
CREATE DATABASE test;
CREATE TABLE test.test_udf_int
(
a Int64,
b Int64,
c Int64,
d Int64
)
ENGINE = CnchMergeTree()
PRIMARY KEY a
ORDER BY a;
INSERT INTO test.test_udf_int(a, b, c, d)
VALUES (0, 0, 0, 0 ) (1, 1, 1, 1) (2, 2, 2, 2);
-
Create the UDF
Next, create the UDF function by usingCREATE FUNCTION
. This is an example of how to write the entry function in Python and uses[code]
as the delimiter.As ByteHouse is a column-based database, but generally you want to do the calculation for each row. In this example, the entry function uses an iterator to loop through each row and sum up the four input values in each row to get the output value.
CREATE FUNCTION test.f_plus_int RETURNS Int64 LANGUAGE PYTHON AS
$code$
import numpy as np
class f_plus_int:
def entry(self, output, inputs):
nargs = len(inputs)
it = np.nditer([output] + inputs,
['refs_ok', 'buffered'], [['writeonly']] + [['readonly']]*nargs)
with it:
for out, *ins in it:
out[...] = self.process(*[x.item() for x in ins])
def process(self, x1, y1, x2, y2):
return x1 + y2 + x2 + y2
$code$;
Alternatively, you can use the fully column-based method provided by Python Numpy as shown below. It does the same operation as test.f_plus_int
.
Note: The performance of using the column-based method is better than using loops.
CREATE FUNCTION test.f_plus_int RETURNS Int64 LANGUAGE PYTHON AS
$code$
import numpy as np
class f_plus_int:
def entry(self, output, inputs):
np.add.reduce(inputs, out=output)
$code$;
- Call the UDF
After the Python UDF function is created, it can be called just like a general built-in function.
select test.f_plus_int(a,b,c,d) from test.test_udf_int;
┌─test.f_plus_int(a, b, c, d)─┐
│ 0 │
│ 4 │
│ 8 │
└─────────────────────────────┘
- Drop the UDF
drop function test.f_plus_int;
Example 2:
- Create the Database and Table for UDF
CREATE TABLE test.py_fixedstring
(
a FixedString(2),
b FixedString(3)
)
ENGINE = CnchMergeTree()
ORDER BY b;
INSERT INTO test.py_fixedstring(a, b)
VALUES ('aa', 'aaa') ('bb', 'bbb') ('cc', 'ccc') ('dd', 'ddd') ('ee', 'eee') ('ff', 'fff') ('gg', 'ggg') ('ab', 'abc') ('cb', 'cba') ('tt', 'zzz');
- Create the UDF
For return types likeFixedString
, you must specify a length for the returned string.
CREATE FUNCTION test.py_fixed_string
RETURNS FixedString(5)
LANGUAGE PYTHON AS
$code$
import numpy as np
class py_fixed_string:
def entry(self, output, inputs):
nargs = len(inputs)
it = np.nditer([output] + inputs,
['refs_ok', 'buffered'], [['writeonly']] + [['readonly']]*nargs)
with it:
for out, *ins in it:
out[...] = self.process(*[x.item() for x in ins])
def process(self, a, b):
return a + b
$code$;
- Call the UDF
select test.py_fixed_string(a, b) from test.py_fixedstring;
┌─test.py_fixed_string(a, b)─┐
│ aaaaa │
│ ababc │
│ bbbbb │
│ cbcba │
│ ccccc │
│ ddddd │
│ eeeee │
│ fffff │
│ ggggg │
│ ttzzz │
└────────────────────────────┘
- Drop the UDF
drop function test.py_fixed_string;
Example 3:
- Create the Database and Table you need for UDF
CREATE TABLE test.test_udf_array
(
a Array(UInt8)
)
ENGINE = CnchMergeTree()
PRIMARY KEY a
ORDER BY a;
insert into test.test_udf_array values ([1,2,3]), ([6,7,8]);
- Create the UDF
In the entry function, this example just uses a simple for loop to do the calculation.
CREATE FUNCTION test.f_test_array RETURNS UInt64 LANGUAGE PYTHON AS
$code$
import numpy as np
class f_test_array:
def entry(self, output, inputs):
for i in range(0, len(output)):
output[i] = np.sum(inputs[0][i])
$code$;
- Call the UDF
select test.f_test_array(a) from test.test_udf_array;
┌─test.f_test_array(a)─┐
│ 6 │
│ 21 │
└──────────────────────┘
- Drop the UDF
drop test.test_udf_array
Overload Function
In addition, ByteHouse provides overload.py
which allows you to create multiple UDFs in a single create function statement. Each function can take in different parameters as input as shown below:
CREATE FUNCTION test.f_plus_int
RETURNS Date
LANGUAGE PYTHON AS
$code$
from iudf import IUDF
from overload import overload
class f_plus_int(IUDF):
@overload
def process(a):
return a + 1
@overload
def process(a, b):
return a + b
$code$;
@overload
is the decorator that wraps the function and returns a callable object of type Function. If you use this entry function, you must use the decorator@overload
in the function. You need to call from entry to avoid errors of having mismatched parameter numbers.
Updated 7 days ago