ByteHouse allows you to load data from external sources into ByteHouse tables. Data loading logic is represented by a concept called a Loading Job. You can create a loading job in the web console and trigger it from the web or via the API. Each loading job is defined by a set of common concepts: source location, loading mode, source format, sink table, refresh mode, and schema mappings. Together these let you customize loading jobs to best fit your scenario.
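To make the shape of a loading job concrete, here is a minimal sketch of the common concepts listed above as a configuration object, with a small validation helper. The field names and values are illustrative assumptions for this sketch, not the actual ByteHouse API schema:

```python
# Hypothetical loading-job definition tying together the common concepts:
# source location, source format, loading mode, sink table, and schema mappings.
loading_job = {
    "source": {"type": "s3", "uri": "s3://my-bucket/events/"},  # source location
    "format": "CSV",                                            # source format
    "loading_mode": "incremental",                              # or "full"
    "sink_table": "analytics.events",
    "schema_mapping": {"ts": "event_time", "uid": "user_id"},   # source -> sink columns
}

def validate_job(job):
    """Check that all the common required concepts are present."""
    required = {"source", "format", "loading_mode", "sink_table", "schema_mapping"}
    missing = required - job.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return True
```

A job missing any of these fields would be rejected up front, e.g. `validate_job({"source": {}})` raises a `ValueError`.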

Supported Source Location

We currently support the following source locations, and are continuously expanding the ecosystem:

  • S3
  • Hive (1.0+)
  • Apache Kafka
  • Confluent Cloud
  • Local file system

Batch Loading

Batch loading is applicable when you have chunks of data ready in the source location and wish to load them into ByteHouse in one shot.

Depending on whether the sink table is partitioned, different loading modes are provided:

Full Loading

Full loading replaces the entire sink table with the latest batch from the source.

Incremental Loading

Incremental loading appends new batches to the existing sink table according to its partitioning. When a batch touches an existing partition, ByteHouse replaces that partition rather than merging into it.
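The difference between the two modes can be sketched with a toy model of a partitioned table as a mapping from partition key to rows. This is an illustration of the replace-per-partition semantics described above, not ByteHouse internals:

```python
def full_load(table, batch):
    """Full loading: discard the whole table and keep only the new batch."""
    table.clear()
    table.update(batch)
    return table

def incremental_load(table, batch):
    """Incremental loading: each incoming partition replaces the existing
    partition of the same key; untouched partitions are kept as-is."""
    for partition, rows in batch.items():
        table[partition] = rows  # replace, never merge
    return table

# Example: partition "2024-01-02" is replaced, "2024-01-01" is untouched,
# and "2024-01-03" is appended as a new partition.
table = {"2024-01-01": ["a"], "2024-01-02": ["b"]}
incremental_load(table, {"2024-01-02": ["c"], "2024-01-03": ["d"]})
# table -> {"2024-01-01": ["a"], "2024-01-02": ["c"], "2024-01-03": ["d"]}
```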

Supported File Format

The following file formats are supported by ByteHouse for batch loading:

  • CSV
  • JSON (multiline)
  • Avro
  • Parquet
  • Excel (xls)

Streaming Loading

ByteHouse can connect to your Kafka source and continuously stream data into your tables. Unlike batch loading, a Kafka job, once started, runs continuously. ByteHouse Kafka loading provides exactly-once semantics: you can stop and resume the job, and ByteHouse keeps track of the consumed offsets.
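The stop/resume behavior follows from offset tracking: if the consumed offset is checkpointed together with each write, a restarted job resumes exactly where it left off and no message is applied twice. The sketch below models this idea in a few lines; it is a simplification under that assumption, not how ByteHouse implements exactly-once internally:

```python
class StreamLoader:
    """Toy consumer that checkpoints its offset alongside each write,
    so stopping and resuming never skips or duplicates a message."""

    def __init__(self):
        self.committed_offset = 0  # persisted durably in a real system
        self.sink = []             # stands in for the sink table

    def run(self, messages, stop_at=None):
        """Consume from the last committed offset up to stop_at (exclusive)."""
        end = len(messages) if stop_at is None else stop_at
        for offset in range(self.committed_offset, end):
            self.sink.append(messages[offset])
            self.committed_offset = offset + 1  # committed with the write

# Stop after two messages, then resume: each message lands exactly once.
msgs = ["m0", "m1", "m2", "m3"]
loader = StreamLoader()
loader.run(msgs, stop_at=2)  # job stopped here
loader.run(msgs)             # resumes from offset 2
# loader.sink -> ["m0", "m1", "m2", "m3"]
```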

Supported Message Format

The following message formats are supported by ByteHouse in streaming loading:

  • Protobuf
  • JSON