Setting up a Proof of Concept (PoC)
This guide will take you through the steps to test ByteHouse performance with the Star Schema Benchmark (SSB) dataset.
Prerequisites
-
Make sure you have a ByteHouse account. Or, create one following the instructions in the quick start user guide.
-
Make sure you have the ByteHouse CLI downloaded. Get the latest version here. Follow this guide for CLI configuration.
Steps
1. Compile sample SSB dataset
The following commands generate about 67 GB of SSB dataset and may take some time to complete.
$ git clone https://github.com/ChrisZou/ssb-dbgen
$ cd ssb-dbgen
$ make
$ ./dbgen -s 100 -T c
$ ./dbgen -s 100 -T l
$ ./dbgen -s 100 -T p
$ ./dbgen -s 100 -T s
2. Connect to ByteHouse
Use the CLI to connect to ByteHouse
- Please replace the username, account, password and region as per your credentials.
- Include the timeout settings for long queries.
# Using Region
bytehouse-cli --user bob --account AWSXXXX --password coolbob --region AP-SOUTHEAST-1 --secure true --port 19000 --send_timeout=3600 --receive_timeout=3600
# For Host
bytehouse-cli --user bob --account AWSXXXX --password coolbob --host gateway.aws-us-east-1.bytehouse.cloud --secure true --port 19000 --send_timeout=3600 --receive_timeout=3600
3. Create database and tables
3.1 Create a database for SSB dataset
CREATE DATABASE ssb
3.2 Create four tables in ByteHouse for data
CREATE TABLE ssb.customer
(
C_CUSTKEY UInt32,
C_NAME String,
C_ADDRESS String,
C_CITY LowCardinality(String),
C_NATION LowCardinality(String),
C_REGION LowCardinality(String),
C_PHONE String,
C_MKTSEGMENT LowCardinality(String)
)
ENGINE = CnchMergeTree ORDER BY (C_CUSTKEY);
CREATE TABLE ssb.lineorder
(
LO_ORDERKEY UInt32,
LO_LINENUMBER UInt8,
LO_CUSTKEY UInt32,
LO_PARTKEY UInt32,
LO_SUPPKEY UInt32,
LO_ORDERDATE Date,
LO_ORDERPRIORITY LowCardinality(String),
LO_SHIPPRIORITY UInt8,
LO_QUANTITY UInt8,
LO_EXTENDEDPRICE UInt32,
LO_ORDTOTALPRICE UInt32,
LO_DISCOUNT UInt8,
LO_REVENUE UInt32,
LO_SUPPLYCOST UInt32,
LO_TAX UInt8,
LO_COMMITDATE Date,
LO_SHIPMODE LowCardinality(String)
)
ENGINE = CnchMergeTree PARTITION BY toYear(LO_ORDERDATE) ORDER BY (LO_ORDERDATE, LO_ORDERKEY);
CREATE TABLE ssb.part
(
P_PARTKEY UInt32,
P_NAME String,
P_MFGR LowCardinality(String),
P_CATEGORY LowCardinality(String),
P_BRAND LowCardinality(String),
P_COLOR LowCardinality(String),
P_TYPE LowCardinality(String),
P_SIZE UInt8,
P_CONTAINER LowCardinality(String)
)
ENGINE = CnchMergeTree ORDER BY P_PARTKEY;
CREATE TABLE ssb.supplier
(
S_SUPPKEY UInt32,
S_NAME String,
S_ADDRESS String,
S_CITY LowCardinality(String),
S_NATION LowCardinality(String),
S_REGION LowCardinality(String),
S_PHONE String
)
ENGINE = CnchMergeTree ORDER BY S_SUPPKEY;
4. Insert data
Insert the data into the four tables using the CLI, as shown in the example below.
Make sure you are in the ssb-dbgen
directory. Replace username, account name, password and region as well as the file name and table name accordingly.
#Connecting using Region
cat customer.tbl | sed 's/,*\r*$//' | bytehouse-cli --user username --account AWSXXXX --password password --region AP-SOUTHEAST-1 --secure -w default_vw --database ssb -q "INSERT INTO customer FORMAT CSV" --send_timeout=3600 --receive_timeout=3600
#Connecting using Host
cat customer.tbl | sed 's/,*\r*$//' | bytehouse-cli --user username --account AWSXXXX --password password --host gateway.aws-us-east-1.bytehouse.cloud --secure -w default_vw --database ssb -q "INSERT INTO customer FORMAT CSV" --send_timeout=3600 --receive_timeout=3600
File name | Table name | Dataset Size | Remarks |
---|---|---|---|
customer.tbl | customer | 3,000,000 rows, 328.53 MB | |
part.tbl | part | 1,310,720 rows, 137.16 MB | |
supplier.tbl | supplier | 131,072 rows, 17.06 MB | |
lineorder.tbl | lineorder | 600,037,902 total rows, 70.37 GB | Can take up to 40 min to run |
5. Transform the "star schema" to denormalised "flat schema"
5.1 Create a new table lineorder_flat
lineorder_flat
CREATE TABLE ssb.lineorder_flat
(
LO_ORDERKEY UInt32,
LO_LINENUMBER UInt8,
LO_CUSTKEY UInt32,
LO_PARTKEY UInt32,
LO_SUPPKEY UInt32,
LO_ORDERDATE Date,
LO_ORDERPRIORITY LowCardinality(String),
LO_SHIPPRIORITY UInt8,
LO_QUANTITY UInt8,
LO_EXTENDEDPRICE UInt32,
LO_ORDTOTALPRICE UInt32,
LO_DISCOUNT UInt8,
LO_REVENUE UInt32,
LO_SUPPLYCOST UInt32,
LO_TAX UInt8,
LO_COMMITDATE Date,
LO_SHIPMODE LowCardinality(String),
C_NAME String,
C_ADDRESS String,
C_CITY LowCardinality(String),
C_NATION LowCardinality(String),
C_REGION LowCardinality(String),
C_PHONE String,
C_MKTSEGMENT LowCardinality(String),
S_NAME String,
S_ADDRESS String,
S_CITY LowCardinality(String),
S_NATION LowCardinality(String),
S_REGION LowCardinality(String),
S_PHONE String,
P_NAME String,
P_MFGR LowCardinality(String),
P_CATEGORY LowCardinality(String),
P_BRAND LowCardinality(String),
P_COLOR LowCardinality(String),
P_TYPE LowCardinality(String),
P_SIZE UInt8,
P_CONTAINER LowCardinality(String)
)
ENGINE = CnchMergeTree
ORDER BY (LO_ORDERDATE, LO_ORDERKEY)
5.2 Log in to ByteHouse
5.3 Go to virtual warehouses and create a new virtual warehouse with size “M” and name “vw_test”
5.4 Use the CLI to convert the four tables into a flat table. This can take up to 40 minutes.
Make sure your CLI is connected to ByteHouse. If it isn't, you can refer this command.
Note: To set longer timeout \--send_timeout=10000000 --receive_timeout=10000000
bytehouse-cli --user bob --account AWSXXXX --password coolbob --region AP-SOUTHEAST-1 --secure true --port 19000 --send_timeout=10000000 --receive_timeout=10000000
Run the SQL below on CLI.
set warehouse ='vw_test';
INSERT INTO `ssb`.`lineorder_flat`
SELECT
l.LO_ORDERKEY AS LO_ORDERKEY,
l.LO_LINENUMBER AS LO_LINENUMBER,
l.LO_CUSTKEY AS LO_CUSTKEY,
l.LO_PARTKEY AS LO_PARTKEY,
l.LO_SUPPKEY AS LO_SUPPKEY,
l.LO_ORDERDATE AS LO_ORDERDATE,
l.LO_ORDERPRIORITY AS LO_ORDERPRIORITY,
l.LO_SHIPPRIORITY AS LO_SHIPPRIORITY,
l.LO_QUANTITY AS LO_QUANTITY,
l.LO_EXTENDEDPRICE AS LO_EXTENDEDPRICE,
l.LO_ORDTOTALPRICE AS LO_ORDTOTALPRICE,
l.LO_DISCOUNT AS LO_DISCOUNT,
l.LO_REVENUE AS LO_REVENUE,
l.LO_SUPPLYCOST AS LO_SUPPLYCOST,
l.LO_TAX AS LO_TAX,
l.LO_COMMITDATE AS LO_COMMITDATE,
l.LO_SHIPMODE AS LO_SHIPMODE,
c.C_NAME AS C_NAME,
c.C_ADDRESS AS C_ADDRESS,
c.C_CITY AS C_CITY,
c.C_NATION AS C_NATION,
c.C_REGION AS C_REGION,
c.C_PHONE AS C_PHONE,
c.C_MKTSEGMENT AS C_MKTSEGMENT,
s.S_NAME AS S_NAME,
s.S_ADDRESS AS S_ADDRESS,
s.S_CITY AS S_CITY,
s.S_NATION AS S_NATION,
s.S_REGION AS S_REGION,
s.S_PHONE AS S_PHONE,
p.P_NAME AS P_NAME,
p.P_MFGR AS P_MFGR,
p.P_CATEGORY AS P_CATEGORY,
p.P_BRAND AS P_BRAND,
p.P_COLOR AS P_COLOR,
p.P_TYPE AS P_TYPE,
p.P_SIZE AS P_SIZE,
p.P_CONTAINER AS P_CONTAINER
FROM ssb.lineorder AS l
INNER JOIN ssb.customer AS c ON c.C_CUSTKEY = l.LO_CUSTKEY
INNER JOIN ssb.supplier AS s ON s.S_SUPPKEY = l.LO_SUPPKEY
INNER JOIN ssb.part AS p ON p.P_PARTKEY = l.LO_PARTKEY
SETTINGS send_timeout = 10000000, receive_timeout = 10000000, max_execution_time=10000000, exchange_timeout = 1000000, enable_optimizer=0;
6. Running queries
Q1.1
SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM ssb.lineorder_flat
WHERE toYear(LO_ORDERDATE) = 1993 AND LO_DISCOUNT BETWEEN 1 AND 3 AND LO_QUANTITY < 25;
Q1.2
SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM ssb.lineorder_flat
WHERE toYYYYMM(LO_ORDERDATE) = 199401 AND LO_DISCOUNT BETWEEN 4 AND 6 AND LO_QUANTITY BETWEEN 26 AND 35;
Q1.3
SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM ssb.lineorder_flat
WHERE toISOWeek(LO_ORDERDATE) = 6 AND toYear(LO_ORDERDATE) = 1994
AND LO_DISCOUNT BETWEEN 5 AND 7 AND LO_QUANTITY BETWEEN 26 AND 35;
Q2.1
SELECT
sum(LO_REVENUE),
toYear(LO_ORDERDATE) AS year,
P_BRAND
FROM ssb.lineorder_flat
WHERE P_CATEGORY = 'MFGR#12' AND S_REGION = 'AMERICA'
GROUP BY
year,
P_BRAND
ORDER BY
year,
P_BRAND;
Q2.2
SELECT
sum(LO_REVENUE),
toYear(LO_ORDERDATE) AS year,
P_BRAND
FROM ssb.lineorder_flat
WHERE P_BRAND >= 'MFGR#2221' AND P_BRAND <= 'MFGR#2228' AND S_REGION = 'ASIA'
GROUP BY
year,
P_BRAND
ORDER BY
year,
P_BRAND;
Q2.3
SELECT
sum(LO_REVENUE),
toYear(LO_ORDERDATE) AS year,
P_BRAND
FROM ssb.lineorder_flat
WHERE P_BRAND = 'MFGR#2239' AND S_REGION = 'EUROPE'
GROUP BY
year,
P_BRAND
ORDER BY
year,
P_BRAND;
Q3.1
SELECT
C_NATION,
S_NATION,
toYear(LO_ORDERDATE) AS year,
sum(LO_REVENUE) AS revenue
FROM ssb.lineorder_flat
WHERE C_REGION = 'ASIA' AND S_REGION = 'ASIA' AND year >= 1992 AND year <= 1997
GROUP BY
C_NATION,
S_NATION,
year
ORDER BY
year ASC,
revenue DESC;
Q3.2
SELECT
C_CITY,
S_CITY,
toYear(LO_ORDERDATE) AS year,
sum(LO_REVENUE) AS revenue
FROM ssb.lineorder_flat
WHERE C_NATION = 'UNITED STATES' AND S_NATION = 'UNITED STATES' AND year >= 1992 AND year <= 1997
GROUP BY
C_CITY,
S_CITY,
year
ORDER BY
year ASC,
revenue DESC;
Q3.3
SELECT
C_CITY,
S_CITY,
toYear(LO_ORDERDATE) AS year,
sum(LO_REVENUE) AS revenue
FROM ssb.lineorder_flat
WHERE (C_CITY = 'UNITED KI1' OR C_CITY = 'UNITED KI5') AND (S_CITY = 'UNITED KI1' OR S_CITY = 'UNITED KI5') AND year >= 1992 AND year <= 1997
GROUP BY
C_CITY,
S_CITY,
year
ORDER BY
year ASC,
revenue DESC;
Q3.4
SELECT
C_CITY,
S_CITY,
toYear(LO_ORDERDATE) AS year,
sum(LO_REVENUE) AS revenue
FROM ssb.lineorder_flat
WHERE (C_CITY = 'UNITED KI1' OR C_CITY = 'UNITED KI5') AND (S_CITY = 'UNITED KI1' OR S_CITY = 'UNITED KI5') AND toYYYYMM(LO_ORDERDATE) = 199712
GROUP BY
C_CITY,
S_CITY,
year
ORDER BY
year ASC,
revenue DESC;
Q4.1
SELECT
toYear(LO_ORDERDATE) AS year,
C_NATION,
sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
FROM ssb.lineorder_flat
WHERE C_REGION = 'AMERICA' AND S_REGION = 'AMERICA' AND (P_MFGR = 'MFGR#1' OR P_MFGR = 'MFGR#2')
GROUP BY
year,
C_NATION
ORDER BY
year ASC,
C_NATION ASC;
Q4.2
SELECT
toYear(LO_ORDERDATE) AS year,
S_NATION,
P_CATEGORY,
sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
FROM ssb.lineorder_flat
WHERE C_REGION = 'AMERICA' AND S_REGION = 'AMERICA' AND (year = 1997 OR year = 1998) AND (P_MFGR = 'MFGR#1' OR P_MFGR = 'MFGR#2')
GROUP BY
year,
S_NATION,
P_CATEGORY
ORDER BY
year ASC,
S_NATION ASC,
P_CATEGORY ASC;
Q4.3
SELECT
toYear(LO_ORDERDATE) AS year,
S_CITY,
P_BRAND,
sum(LO_REVENUE - LO_SUPPLYCOST) AS profit
FROM ssb.lineorder_flat
WHERE S_NATION = 'UNITED STATES' AND (year = 1997 OR year = 1998) AND P_CATEGORY = 'MFGR#14'
GROUP BY
year,
S_CITY,
P_BRAND
ORDER BY
year ASC,
S_CITY ASC,
P_BRAND ASC;
VM Size M
Query | Time |
---|---|
Q1.1 | 942ms |
Q1.2 | 136ms |
Q1.3 | 79ms |
Q2.1 | 3s |
Q2.2 | 689ms |
Q2.3 | 616ms |
Q3.1 | 1s |
Q3.2 | 1s |
Q3.3 | 1s |
Q3.4 | 1s |
Q4.1 | 2s |
Q4.2 | 545ms |
Q4.3 | 483ms |
Updated 7 days ago