Work with Unique Table

Introduction

The ByteHouse Unique table is mainly used to implement upsert functionality. This self-developed ByteHouse feature supports primary-key updates while maintaining efficient query performance, solving a pain point of open-source ClickHouse: its lack of efficient update operations. This makes it easier for businesses to build real-time analytics applications. Users obtain update-on-write semantics by specifying a UNIQUE KEY, and queries automatically return the latest value for each unique key.

Unique tables have the following main characteristics:

  1. Users configure a unique key via UNIQUE KEY, which provides upsert write semantics; queries automatically return the latest value for each unique key
  2. Real-time update capability comes with almost no loss of query performance
  3. Rows can be deleted by setting a delete flag

Real-time update scenarios

  1. A business needs to analyze transaction data in real time. When synchronizing an OLTP database to an OLAP database, data such as orders must support updates, so the OLAP database is also required to support real-time updates and deletions.
  2. Another type of scenario requires deduplication even though there are no updates. When building real-time data pipelines, it is difficult to guarantee that the stream contains no duplicates, so the storage system is usually required to support idempotent writes.

Both requirements, whether idempotent writes or updates, are supported by the unique-key upsert function.
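The idempotency guarantee can be sketched as follows, using the upsertdb.uniquetable table defined in the example below; replaying the same batch leaves only one copy of each unique key:

```sql
-- Sketch of idempotent writing: the second, replayed batch upserts
-- identical values over the same unique keys, so no duplicates appear.
INSERT INTO upsertdb.uniquetable VALUES
('2020-10-29 23:40:00', 10001, 'Beijing', 'Coffee', 5, 500);
INSERT INTO upsertdb.uniquetable VALUES
('2020-10-29 23:40:00', 10001, 'Beijing', 'Coffee', 5, 500);

-- The table still holds a single row for (10001, sipHash64('Beijing')).
SELECT count() FROM upsertdb.uniquetable WHERE product_id = 10001;
```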

Upsert Usage Example

Create a database and a corresponding Unique table:

CREATE DATABASE upsertdb;
CREATE TABLE IF NOT EXISTS upsertdb.uniquetable
(
  `event_time` DateTime,
  `product_id` UInt64,
  `city` String,
  `category` String,
  `amount` UInt32,
  `revenue` UInt64
)
ENGINE = CnchMergeTree()
PARTITION BY toDate(event_time)
ORDER BY (city, category)
UNIQUE KEY (product_id, sipHash64(city));
-- UNIQUE KEY can contain multiple fields and expressions

Insert the following rows in time order:

INSERT INTO upsertdb.uniquetable VALUES
('2020-10-29 23:40:00', 10001, 'Beijing', 'Coffee', 5, 500),
('2020-10-29 23:40:00', 10002, 'Beijing', 'Coffee', 2, 200),
('2020-10-29 23:40:00', 10003, 'Beijing', 'Coffee', 1, 100),
('2020-10-29 23:50:00', 10002, 'Shanghai', 'Coffee', 4, 400),
('2020-10-29 23:50:00', 10003, 'Beijing', 'Coffee', 2, 200),
('2020-10-29 23:50:00', 10004, 'Beijing', 'Coffee', 1, 100);
-- Upsert semantics: insert the row if the key does not exist, otherwise update it

Query the data in the table; it has been deduplicated by unique key:

select * from upsertdb.uniquetable;
┌──────────event_time─┬─product_id─┬─city─────┬─category─┬─amount─┬─revenue─┐
│ 2020-10-29 23:40:00 │      10001 │ Beijing  │ Coffee   │      5 │     500 │
│ 2020-10-29 23:40:00 │      10002 │ Beijing  │ Coffee   │      2 │     200 │
│ 2020-10-29 23:50:00 │      10003 │ Beijing  │ Coffee   │      2 │     200 │
│ 2020-10-29 23:50:00 │      10004 │ Beijing  │ Coffee   │      1 │     100 │
│ 2020-10-29 23:50:00 │      10002 │ Shanghai │ Coffee   │      4 │     400 │
└─────────────────────┴────────────┴──────────┴──────────┴────────┴─────────┘
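To see the update half of upsert in isolation, re-inserting an existing key with new values overwrites the stored row. A minimal sketch against the table above:

```sql
-- (10001, sipHash64('Beijing')) already exists, so this write replaces
-- the old row instead of adding a new one.
INSERT INTO upsertdb.uniquetable VALUES
('2020-10-29 23:55:00', 10001, 'Beijing', 'Coffee', 8, 800);

-- Returns the updated values (amount = 8, revenue = 800) for this key.
SELECT * FROM upsertdb.uniquetable WHERE product_id = 10001;
```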

Delete Records

Delete specified keys by setting the virtual column _delete_flag_ = 1.
Insert data with the following statement, specifying the delete flag:

-- Specify the delete field to delete data: a non-zero value means delete, while 0 means a normal upsert operation
INSERT INTO upsertdb.uniquetable (event_time, product_id, city, category, amount, revenue, _delete_flag_) VALUES
('2020-10-29 23:50:00', 10001, 'Beijing', 'Coffee', 4, 400, 5),
('2020-10-29 23:50:00', 10002, 'Beijing', 'Coffee', 2, 200, 1),
('2020-10-29 23:50:00', 10004, 'Beijing', 'Coffee', 3, 600, 0);

Check the result after the deletion:

select * from upsertdb.uniquetable order by toDate(event_time), product_id;

As shown below, the result contains one new row that replaces product_id = 10004, while the old rows with product_id = 10001 or 10002 and city = 'Beijing' have been deleted:

┌──────────event_time─┬─product_id─┬─city─────┬─category─┬─amount─┬─revenue─┐
│ 2020-10-29 23:50:00 │      10002 │ Shanghai │ Coffee   │      4 │     400 │
│ 2020-10-29 23:50:00 │      10003 │ Beijing  │ Coffee   │      2 │     200 │
│ 2020-10-29 23:50:00 │      10004 │ Beijing  │ Coffee   │      3 │     600 │
└─────────────────────┴────────────┴──────────┴──────────┴────────┴─────────┘

Unique table usage restrictions

  • When importing from Kafka data sources, users must ensure that data with the same unique key is written to the same Topic Partition, and that Topic expansion is disabled;
  • The full set of table fields cannot be updated after the Unique table is created;
  • UNIQUE KEY supports specifying at most 10 fields