ByteHouse lets you load data from external sources into ByteHouse tables. Data loading is modeled by the concept of Loading Jobs. You can create a loading job in the web console and trigger it from the web or via API. Each loading job is defined by a set of common concepts: source location, loading mode, source format, sink table, refresh mode, and schema mappings. These let you customize loading jobs to best fit your scenarios.
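To make the common concepts concrete, here is a minimal sketch of what a loading-job definition could look like. All field names and values are illustrative assumptions, not the actual ByteHouse API; consult the API reference for the real request shape.

```python
# Hypothetical loading-job specification covering the common concepts
# listed above. Field names are illustrative, not the ByteHouse API.
loading_job = {
    "source_location": "kafka",      # e.g. hive | kafka | confluent_cloud | local_file
    "loading_mode": "incremental",   # full | incremental
    "source_format": "json",
    "sink_table": "analytics.page_views",
    "refresh_mode": "scheduled",
    "schema_mappings": [
        {"source_field": "ts", "sink_column": "event_time"},
        {"source_field": "uid", "sink_column": "user_id"},
    ],
}

# Every job supplies all of the common concepts described above.
required = {"source_location", "loading_mode", "source_format",
            "sink_table", "refresh_mode", "schema_mappings"}
assert required <= loading_job.keys()
```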
We currently support the following source locations and are continuously expanding the ecosystem:
- Hive (1.0+)
- Apache Kafka
- Confluent Cloud
- Local file system
Batch loading applies when you have chunks of data ready in the source location and want to load them into ByteHouse in one shot.
Depending on whether the sink table is partitioned, different loading modes are provided:
- Full loading replaces the entire table with the latest source batch.
- Incremental loading appends new batches to the sink table partition by partition. If a partition already exists, ByteHouse replaces it rather than merging.
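The two modes can be illustrated with a toy in-memory "table" keyed by partition. This only demonstrates the replace-not-merge semantics described above, not ByteHouse internals.

```python
def full_load(table, batch):
    """Full loading: replace the entire table with the latest batch."""
    return dict(batch)

def incremental_load(table, batch):
    """Incremental loading: append new partitions; an incoming partition
    that already exists replaces the old one instead of merging into it."""
    result = dict(table)
    result.update(batch)  # matching partitions are overwritten
    return result

table = {"2024-01-01": ["a", "b"], "2024-01-02": ["c"]}
batch = {"2024-01-02": ["d"], "2024-01-03": ["e"]}

assert incremental_load(table, batch) == {
    "2024-01-01": ["a", "b"],
    "2024-01-02": ["d"],  # replaced, not merged with ["c"]
    "2024-01-03": ["e"],
}
assert full_load(table, batch) == batch
```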
The following file formats are supported by ByteHouse for batch loading:
- JSON (multiline)
- Excel (xls)
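Assuming "JSON (multiline)" refers to records formatted across multiple lines (for example, a pretty-printed JSON array) rather than one record per line, a batch file could be prepared with the standard library. This interpretation is an assumption; check the format documentation for the exact definition.

```python
import json

# Illustrative records; the schema is hypothetical.
records = [
    {"user_id": 1, "event": "click"},
    {"user_id": 2, "event": "view"},
]

# Pretty-printing produces multiline JSON output.
payload = json.dumps(records, indent=2)
assert len(payload.splitlines()) > len(records)
assert json.loads(payload) == records
```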
ByteHouse can connect to your Kafka source and continuously stream data into your tables. Unlike batch loading, a Kafka job runs continuously once started. ByteHouse Kafka loading provides exactly-once semantics: you can stop and resume the job, and ByteHouse keeps track of the consumed offsets.
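The exactly-once guarantee rests on recording the consumed offset together with the written data, so that messages replayed after a stop/resume are not applied twice. The sketch below simulates that idea in memory; it is a conceptual illustration, not how ByteHouse is implemented.

```python
class Sink:
    """Toy sink that stores rows and the last committed Kafka offset
    in one atomic step (here, a single in-memory update)."""

    def __init__(self):
        self.rows = []
        self.committed_offset = -1

    def write_atomic(self, offset, row):
        if offset <= self.committed_offset:
            return  # already applied; skip the replayed message
        self.rows.append(row)
        self.committed_offset = offset

sink = Sink()
stream = [(0, "a"), (1, "b")]
for off, msg in stream:
    sink.write_atomic(off, msg)

# Simulate a job restart that replays earlier offsets plus a new message:
for off, msg in stream + [(2, "c")]:
    sink.write_atomic(off, msg)

assert sink.rows == ["a", "b", "c"]  # replays produced no duplicates
```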
The following message formats are supported by ByteHouse in streaming loading: