Unique Table
Introduction
Unique table, a feature developed in-house by ByteHouse, implements the upsert function. It supports primary key updates while maintaining efficient query performance, solving a major pain point of open-source ClickHouse: the inability to support efficient updates. This makes it easy for businesses to develop real-time analytics applications.
Unique table has the following characteristics:
- Users configure the unique key through UNIQUE KEY, which provides upsert write semantics; queries automatically return the latest value of each unique key.
- Real-time updates are supported while high query performance is maintained, with almost no loss.
- Rows can be deleted by setting a delete flag.
Real-time update use cases
- A business needs real-time analysis of transaction data. When synchronizing an OLTP database to the OLAP database, data such as orders may be updated at the source, so the OLAP database must also support real-time update and deletion.
- Deduplication is needed even though there are no updates. When developing real-time systems, it is difficult to guarantee that the data stream contains no duplicates, so the storage system should support idempotent writes.
The unique key upsert function supports both scenarios, whether the requirement is idempotence or updates; the idempotent case is sketched below.
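For illustration, replaying the same batch of rows leaves a unique table unchanged, because rows whose unique key already exists are overwritten rather than appended. A minimal sketch, using a hypothetical demo.events table (the database, table, and column names here are illustrative, not part of the example in the next section):
CREATE DATABASE IF NOT EXISTS demo;
CREATE TABLE IF NOT EXISTS demo.events
(
    `event_id` UInt64,
    `payload` String
)
ENGINE = CnchMergeTree()
ORDER BY event_id
UNIQUE KEY event_id;

-- Writing the same batch twice (e.g. after a producer retry) creates no duplicates:
INSERT INTO demo.events VALUES (1, 'a'), (2, 'b');
INSERT INTO demo.events VALUES (1, 'a'), (2, 'b');

SELECT count() FROM demo.events;
-- Returns 2, not 4: the second INSERT upserted over the first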
Upsert usage example
- Create a database and a corresponding unique table:
CREATE DATABASE upsertdb;
CREATE TABLE IF NOT EXISTS upsertdb.uniquetable
(
`event_time` DateTime,
`product_id` UInt64,
`city` String,
`category` String,
`amount` UInt32,
`revenue` UInt64
)
ENGINE = CnchMergeTree()
PARTITION BY toDate(event_time)
ORDER BY (city, category)
UNIQUE KEY (product_id, sipHash64(city));
-- UNIQUE KEY can contain multiple fields and expressions
- Insert the following rows in time order:
INSERT INTO upsertdb.uniquetable VALUES
('2020-10-29 23:40:00', 10001, 'Beijing', 'Coffee', 5, 500),
('2020-10-29 23:40:00', 10002, 'Beijing', 'Coffee', 2, 200),
('2020-10-29 23:40:00', 10003, 'Beijing', 'Coffee', 1, 100),
('2020-10-29 23:50:00', 10002, 'Shanghai', 'Coffee', 4, 400),
('2020-10-29 23:50:00', 10003, 'Beijing', 'Coffee', 2, 200),
('2020-10-29 23:50:00', 10004, 'Beijing', 'Coffee', 1, 100);
-- Upsert semantics: insert a row if its key does not exist, otherwise update the existing row
- Query the data in the table, which has been deduplicated:
SELECT * FROM upsertdb.uniquetable;
┌──────────event_time─┬─product_id─┬─city─────┬─category─┬─amount─┬─revenue─┐
│ 2020-10-29 23:40:00 │      10001 │ Beijing  │ Coffee   │      5 │     500 │
│ 2020-10-29 23:40:00 │      10002 │ Beijing  │ Coffee   │      2 │     200 │
│ 2020-10-29 23:50:00 │      10003 │ Beijing  │ Coffee   │      2 │     200 │
│ 2020-10-29 23:50:00 │      10004 │ Beijing  │ Coffee   │      1 │     100 │
│ 2020-10-29 23:50:00 │      10002 │ Shanghai │ Coffee   │      4 │     400 │
└─────────────────────┴────────────┴──────────┴──────────┴────────┴─────────┘
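To make the update path explicit, re-insert an existing unique key with new values; a subsequent query returns only the latest version of that key. A minimal sketch against the table above (the new timestamp, amount, and revenue are illustrative; the Delete Records example below deletes this key anyway, so its final result is unaffected):
-- (10001, 'Beijing') already exists, so this INSERT updates that row in place:
INSERT INTO upsertdb.uniquetable VALUES
('2020-10-29 23:55:00', 10001, 'Beijing', 'Coffee', 8, 800);

SELECT event_time, amount, revenue FROM upsertdb.uniquetable WHERE product_id = 10001;
-- Returns a single row: 2020-10-29 23:55:00, 8, 800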
Delete Records
- Delete a specified key by setting the virtual column _delete_flag_ to a non-zero value.
- Insert data with the following statement, specifying the delete flag:
-- Specify the _delete_flag_ field for data deletion: a non-zero value means deletion, and 0 means a normal upsert operation
INSERT INTO upsertdb.uniquetable (event_time, product_id, city, category, amount, revenue, _delete_flag_) VALUES
('2020-10-29 23:50:00', 10001, 'Beijing', 'Coffee', 4, 400, 5),
('2020-10-29 23:50:00', 10002, 'Beijing', 'Coffee', 2, 200, 1),
('2020-10-29 23:50:00', 10004, 'Beijing', 'Coffee', 3, 600, 0);
- Check the result after the deletion:
SELECT * FROM upsertdb.uniquetable ORDER BY toDate(event_time), product_id;
As shown below, one new row replaces the old data for product_id = 10004, while the two old rows where product_id is 10001 or 10002 and the city is 'Beijing' have been deleted.
┌──────────event_time─┬─product_id─┬─city─────┬─category─┬─amount─┬─revenue─┐
│ 2020-10-29 23:50:00 │      10002 │ Shanghai │ Coffee   │      4 │     400 │
│ 2020-10-29 23:50:00 │      10003 │ Beijing  │ Coffee   │      2 │     200 │
│ 2020-10-29 23:50:00 │      10004 │ Beijing  │ Coffee   │      3 │     600 │
└─────────────────────┴────────────┴──────────┴──────────┴────────┴─────────┘
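Because the unique key includes the expression sipHash64(city), the same product_id in different cities identifies different rows, so a delete only affects the matching (product_id, city) pair. A minimal sketch continuing from the state above (the non-key values in the inserted row are placeholders):
-- Delete the Shanghai row for product_id = 10002; a Beijing row with the same
-- product_id would be untouched, since its unique key differs:
INSERT INTO upsertdb.uniquetable (event_time, product_id, city, category, amount, revenue, _delete_flag_) VALUES
('2020-10-29 23:55:00', 10002, 'Shanghai', 'Coffee', 4, 400, 1);

SELECT count() FROM upsertdb.uniquetable WHERE product_id = 10002;
-- Returns 0: the only remaining row for product_id 10002 was the Shanghai one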
Unique table usage restrictions
- When importing Kafka data sources, ensure that data with the same unique key is written to the same topic partition, and that topic partition expansion is disabled.
- Full table fields cannot be updated after the unique table is built.
- A UNIQUE KEY can contain at most 10 fields.