Importing data into ByteHouse

ByteHouse lets you load data from external sources into ByteHouse tables. Data loading logic is encapsulated in Loading Jobs. You can create a loading job in the web console and trigger it from the web or via an API. Each loading job is defined by a common set of properties: source location, loading mode, source format, sink table, refresh mode, and schema mappings. These properties let you tailor a loading job to your use case.
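For illustration, here is a minimal sketch of triggering an existing loading job over HTTP. The base URL, header names, and job identifier below are hypothetical placeholders, not the actual ByteHouse API contract; consult the API reference for the real endpoints.

```python
# Hypothetical sketch of triggering a loading job via an HTTP API.
# The endpoint path, auth header, and job ID are illustrative only.
import requests

API_BASE = "https://api.bytehouse.example.com"  # hypothetical base URL
API_KEY = "your-api-key"                        # hypothetical credential
JOB_ID = "my-s3-csv-load"                       # hypothetical job identifier

resp = requests.post(
    f"{API_BASE}/loading-jobs/{JOB_ID}/trigger",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
print(resp.json())  # e.g. the run ID and status of the triggered job
```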

Supported source locations

We currently support the following source locations, and are continuously expanding the ecosystem:

  • S3

  • Hive (1.0+)

  • Apache Kafka

  • Confluent Cloud

  • Local file system

Batch loading

Batch loading applies when you have chunks of data ready in the source location and want to load them into ByteHouse in a single run.

Two loading modes are provided, depending on whether the sink table is partitioned:

Full loading

Full loading replaces the entire contents of the sink table with the latest batch of source data.

Incremental loading

Incremental loading appends new data to the existing sink table partition by partition. If a partition in the incoming batch already exists in the table, ByteHouse replaces that partition rather than merging the rows, as the sketch below illustrates.
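The difference between the two modes can be stated as a small conceptual model. The sketch below represents a table as a dict of partitions; it illustrates the replace-not-merge semantics only and is not ByteHouse internals.

```python
# Conceptual model of the two batch loading modes, not ByteHouse internals.
# A table is modelled as {partition_key: rows}.

def full_load(table: dict, batch: dict) -> dict:
    """Replace the entire table with the latest batch."""
    return dict(batch)

def incremental_load(table: dict, batch: dict) -> dict:
    """Append new partitions; replace (not merge) partitions that already exist."""
    result = dict(table)
    for partition, rows in batch.items():
        result[partition] = rows  # an existing partition is overwritten wholesale
    return result

table = {"2024-01-01": ["a", "b"], "2024-01-02": ["c"]}
batch = {"2024-01-02": ["d"], "2024-01-03": ["e"]}
print(incremental_load(table, batch))
# {'2024-01-01': ['a', 'b'], '2024-01-02': ['d'], '2024-01-03': ['e']}
```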

Supported file formats

ByteHouse supports the following file formats for bulk loading:

  • CSV

  • JSON (multiline)

  • Avro

  • Parquet

  • Excel (.xls)
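To make the moving parts concrete, here is a hypothetical definition of a CSV batch load from S3, expressed as a Python dict. Every field name, the bucket, and the table are invented for the example and do not reflect the exact ByteHouse job schema.

```python
# Hypothetical batch loading job definition; all field names are illustrative.
job = {
    "name": "daily-events-load",
    "source": {
        "type": "s3",
        "uri": "s3://my-bucket/events/",  # hypothetical bucket
        "format": "csv",
        "csv_options": {"delimiter": ",", "skip_header": True},
    },
    "sink_table": "analytics.events",      # hypothetical sink table
    "loading_mode": "incremental",         # or "full"
    "schema_mapping": [
        {"source_column": "ts",      "sink_column": "event_time"},
        {"source_column": "user",    "sink_column": "user_id"},
        {"source_column": "payload", "sink_column": "payload"},
    ],
}
```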

Streaming (real-time) loading

ByteHouse can connect to your Kafka source and continuously stream data into your tables. Unlike batch loading, a Kafka job, once started, runs continuously. ByteHouse’s Kafka loading provides ‘exactly-once’ semantics: you can stop and resume the job, and ByteHouse keeps track of the consumed offsets so that no message is lost or loaded twice.
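Conceptually, exactly-once loading requires that writing a micro-batch and recording its source offsets succeed or fail together. The sketch below shows that general pattern with a generic Kafka consumer; it is not ByteHouse’s implementation, and insert_batch_with_offsets is a placeholder.

```python
# General pattern behind exactly-once Kafka loading: the data write and the
# offset record must form one atomic unit. Conceptual sketch only.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",  # hypothetical broker address
    "group.id": "bytehouse-loader",      # hypothetical consumer group
    "enable.auto.commit": False,         # offsets are tracked by the loader
})
consumer.subscribe(["events"])

def insert_batch_with_offsets(rows, offsets):
    """Placeholder: write the rows AND their source offsets to the sink in a
    single transaction, so a replay after failure can be detected."""
    ...

while True:
    msgs = consumer.consume(num_messages=500, timeout=1.0)
    if not msgs:
        continue
    rows = [m.value() for m in msgs if m.error() is None]
    offsets = {(m.topic(), m.partition()): m.offset() for m in msgs}
    insert_batch_with_offsets(rows, offsets)  # one atomic unit
    # No consumer.commit() here: the authoritative offsets live with the
    # data, and a resumed job reads them back and seeks the consumer.
```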

Supported message formats

ByteHouse supports the following message formats for streaming loading:

  • Protobuf

  • JSON
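As a quick illustration of a JSON message such a job could consume, the field names below are invented for the example; with a matching schema mapping, each message becomes one row in the sink table.

```python
# Hypothetical JSON message on the Kafka topic; field names are invented.
import json

message = json.dumps({
    "event_time": "2024-05-01T12:00:00Z",
    "user_id": 42,
    "action": "click",
})
# A schema mapping of event_time -> event_time, user_id -> user_id, and
# action -> action would turn this message into one row in the sink table.
print(json.loads(message))
```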