Bucket table
A bucket table is a performance optimisation feature that ByteHouse applies when a table is created. When used, the system organises table data by one or more columns or expressions specified in the CREATE TABLE statement. Rows with the same cluster key value are grouped under the same bucket number, which improves query performance.
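As a minimal sketch (the table and column names here are purely illustrative; the full syntax appears in the example at the end of this page), a bucket table is declared by adding a CLUSTER BY clause to the CREATE TABLE statement:
-- Illustrative only: organise the table into 8 buckets by the value of c1
CREATE TABLE demo_events (c1 string, c2 number) CLUSTER BY (c1) INTO 8 BUCKETS;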
Benefits of a bucket table
Bucketing or clustering data using a cluster key on large tables provides the following benefits:
- Point queries that filter on the cluster key skip most of the data, reducing IO volume, shortening execution time, and increasing concurrent QPS.
- Aggregations on the cluster key use less memory and run in less time, and can be further optimised with distributed_perfect_shard.
- Joins between two or more tables on the cluster key become co-located joins, greatly reducing the amount of shuffled data and shortening execution time (see the query sketches after this list).
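The query shapes below sketch these three cases. Table and column names (orders, users, user_id, amount) are hypothetical, both tables are assumed to be clustered by user_id with the same bucket number, and applying distributed_perfect_shard through a SETTINGS clause is an assumption about how that optimisation is enabled:
-- Point query filtering on the cluster key
SELECT * FROM orders WHERE user_id = 42;
-- Aggregation grouped by the cluster key, with distributed_perfect_shard enabled
SELECT user_id, count(*) FROM orders GROUP BY user_id SETTINGS distributed_perfect_shard = 1;
-- Co-located join on the cluster key
SELECT o.user_id, sum(o.amount)
FROM orders o JOIN users u ON o.user_id = u.user_id
GROUP BY o.user_id;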
When to use a bucket table
Using a bucket table is most beneficial when:
- The table is large enough, meaning that the number of parts in a partition is significantly greater than the number of workers (compute nodes).
- Query statements can benefit from the advantages listed in the section above.
Choosing a cluster key
A cluster key can be one or more columns or expressions. We recommend using no more than three fields; more fields increase write costs while benefiting a narrower range of statements.
Choosing the right cluster key has a significant impact on performance, so keep the following in mind:
- Columns that are frequently used in equality and IN filters.
- Frequently used aggregation columns.
- Join keys for multiple tables.
If the most common query pattern combines two columns, such as a = 1 AND b = 2, selecting both columns as the cluster key gives better results, as sketched below.
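A hypothetical sketch of that case (table and column names are made up for illustration):
-- Cluster on both columns that appear together in the most common filter
CREATE TABLE t_ab (a number, b number, v string) CLUSTER BY (a, b) INTO 12 BUCKETS;
SELECT * FROM t_ab WHERE a = 1 AND b = 2;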
Another dimension to consider is the number of distinct values in the column:
- The number of distinct values needs to be greater than the number of workers.
- If the number of distinct values is small, keep it a multiple of the number of workers to achieve a more balanced data distribution during queries (see the check sketched after this list).
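As a rough check before settling on a key (a sketch; table_01 is the example table defined at the end of this page, and the worker counts are illustrative):
-- With 6 workers, a key with 4 distinct values leaves workers idle,
-- while 12 or 18 distinct values can spread evenly across all 6 workers.
SELECT count(DISTINCT c2) FROM table_01;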
Choosing a bucket number
Within a partition:
- Data is assigned to part files with the corresponding bucket number based on the value of the cluster key.
- Part files with the same bucket number are sent to the same worker node for computation during a query.
Therefore, choosing a suitable bucket number has a significant impact on both storage and query performance. Generally, follow these principles:
- Make the bucket number a multiple of the number of workers so that buckets are allocated evenly during queries. It is generally recommended to set it to 1x or 2x the number of worker nodes (2x leaves headroom for expansion).
- Ensure that each bucket in a partition holds enough data, and do not generate parts that are too small. If the table is relatively small, make sure the part file for each bucket in a partition is larger than 1 GB, and do not set the bucket number higher than the number of workers (a sizing example follows this list).
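As a hypothetical sizing example: with 6 workers and a partition holding roughly 24 GB of data, 12 buckets give about 24 GB / 12 = 2 GB per bucket per partition, comfortably above the 1 GB guideline, whereas 48 buckets would leave only about 0.5 GB per bucket and produce parts that are too small.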
Deciding to modify the cluster
During operation, changes in data, query patterns, and the number of worker nodes may lead users to reset the cluster key and bucket number. The cost of making these modifications should be factored in when deciding whether and when they are necessary.
- Modifying the cluster requires reclustering the existing data, so estimate how long the recluster will take to execute.
- During the recluster, queries on existing data fall back to behaving like a normal (non-bucketed) table, and all benefits of the bucket table are temporarily lost.
- Reclustering consumes the write workers' resources, so evaluate whether the current CNCH cluster has independent write workers and what their current workloads are. This helps assess the impact on existing tasks, such as queries and merging.
There are two scenarios:
- Modifying the cluster key:
a. The current data can no longer benefit from the bucket table, so the loss of benefits during the recluster period need not be considered.
b. Evaluate the impact of the recluster task on existing tasks to determine whether it should be executed.
- Modifying the bucket number:
a. The benefits of the current bucket table are still present, so confirm with the business side how long a performance degradation is acceptable, then use the estimated recluster time to decide whether the recluster can be executed and when it should start.
b. Evaluate the impact of the recluster task on existing tasks to determine whether it should be executed (a hypothetical DDL sketch follows this list).
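If a recluster is ultimately needed, the change is expressed as DDL on the table. The statement below is only a hypothetical sketch; confirm the exact ALTER syntax supported by your ByteHouse version before relying on it:
-- Hypothetical: change the bucket number of table_01 from 12 to 24
ALTER TABLE table_01 CLUSTER BY (to_date(c1), c2) INTO 24 BUCKETS;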
Example
A ByteHouse tenant has six compute nodes. A single partition contains a large volume of data, exceeding 200 million records, and the application frequently performs aggregation and join operations on the c1 and c2 fields. The decision was therefore made to optimise with a bucket table. The design choices are as follows:
- Cluster key selection: Choose the c1 and c2 columns as the cluster key.
- Bucket number: Choose twice the number of nodes, 12.
The complete DDL is as follows:
CREATE OR REPLACE TABLE table_01 (c1 timestamp, c2 string, c3 number) CLUSTER BY (to_date(c1), c2) INTO 12 BUCKETS;
-- the cluster key can include expressions such as to_date(c1)
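Queries such as the following are the ones expected to benefit from this layout. These are sketches: table_02 is a hypothetical second table assumed to be clustered by (to_date(c1), c2) into 12 buckets as well:
-- Aggregation on the cluster key
SELECT to_date(c1), c2, count(*) FROM table_01 GROUP BY to_date(c1), c2;
-- Co-located join on the cluster key
SELECT t1.c2, sum(t1.c3)
FROM table_01 t1 JOIN table_02 t2 ON to_date(t1.c1) = to_date(t2.c1) AND t1.c2 = t2.c2
GROUP BY t1.c2;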