When batches of data are ready in the source location, ByteHouse lets you load them in one shot.
Currently, ByteHouse supports loading data from Amazon S3, Hive, and ad hoc files.
Supported file formats
- Excel (.xls)
You can upload a local file of size up to 200 MB.
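Checking the file size locally before uploading avoids a failed attempt. The sketch below is a minimal check, not part of ByteHouse: `MAX_UPLOAD_BYTES` and `can_upload` are hypothetical names, and because this page does not say whether MB is decimal or binary, the stricter decimal reading (200 × 1000 × 1000 bytes) is assumed.

```python
import os

# Assumption: "200 MB" is read as 200 * 1000 * 1000 bytes (the stricter reading).
MAX_UPLOAD_BYTES = 200 * 1000 * 1000


def can_upload(path: str, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True if the local file at `path` is within the upload limit."""
    return os.path.getsize(path) <= limit
```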
Creating a batch job
The exact details of each source type will be different, but batch job creation has 4 common steps:
To begin, from the data import landing page, select New Import Job, then choose the corresponding source storage.
Step 1: Select the source data
Source data refers to a table (Hive) or a file folder/path (Amazon S3) that contains the data you want to import. Storage systems usually require credentials and cluster addresses to connect to them. We use the term data source (aka connection) for the stored form of this information. We encrypt your connection’s information so that it is not readable, even by us. Once stored, you can’t retrieve this information, but you can update or delete it. Each connection requires a name that is unique across the account. Because all the details except the source type and connection name are hidden, give your connection a name that helps you recognise the source you need for the job.
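The connection rules above (account-wide unique names, secrets that can be updated or deleted but never read back, and only the name and source type visible) can be illustrated with a small in-memory model. This is a sketch under assumptions, not ByteHouse's implementation: `Connection` and `ConnectionRegistry` are hypothetical names, and encryption is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Connection:
    name: str
    source_type: str
    # Secrets are excluded from repr so they never show up in listings or logs.
    secrets: dict = field(default_factory=dict, repr=False)


class ConnectionRegistry:
    """Hypothetical in-memory model of the connections in one account."""

    def __init__(self) -> None:
        self._by_name: dict[str, Connection] = {}

    def create(self, name: str, source_type: str, secrets: dict) -> Connection:
        # Names must be unique across the whole account, regardless of source type.
        if name in self._by_name:
            raise ValueError(f"connection name already in use: {name!r}")
        conn = Connection(name, source_type, dict(secrets))
        self._by_name[name] = conn
        return conn

    def update(self, name: str, secrets: dict) -> None:
        # Secrets can be overwritten; there is no method that returns them.
        old = self._by_name[name]
        self._by_name[name] = Connection(old.name, old.source_type, dict(secrets))

    def delete(self, name: str) -> None:
        del self._by_name[name]

    def list_visible(self) -> list[tuple[str, str]]:
        # Only the name and source type are visible.
        return [(c.name, c.source_type) for c in self._by_name.values()]
```

Since the name is the only way to tell two connections of the same type apart, a descriptive name such as `s3-sales-prod` is easier to pick out later than `conn1`.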
The following sections give instructions for different source storage types.
ByteHouse supports object storage from Amazon S3 and Alibaba OSS.
- Example - Create S3 connection
Only the Access Key and Secret Key are needed to create a new S3 connection. From these, we can determine the buckets for which the credential has the required (read) access.
- Select S3 prefix
ByteHouse helps you select the S3 folder conveniently, but you can also enter the details manually (e.g. if you create the job before you have data).
Note: The file in File Name is used to extract the source schema. You'll be asked for the file you want to import when you start the ingestion.
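S3 has no real folders: a "folder" is a shared key prefix ending in a delimiter, usually `/`. The sketch below shows how a folder picker could derive the next level of folders from a list of object keys. It is a pure-Python illustration with a hypothetical `list_folders` helper and made-up keys; it does not call S3 and is not how ByteHouse necessarily does it.

```python
def list_folders(keys, prefix="", delimiter="/"):
    """Return the immediate 'folders' under `prefix`, as sorted key prefixes."""
    folders = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:  # at least one more path level exists below the prefix
            folders.add(prefix + head + delimiter)
    return sorted(folders)


# Hypothetical object keys, for illustration only.
keys = [
    "sales/2023/01/a.csv",
    "sales/2023/02/b.csv",
    "sales/readme.txt",
    "logs/x.json",
]
print(list_folders(keys))            # ['logs/', 'sales/']
print(list_folders(keys, "sales/"))  # ['sales/2023/']
```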
- Create Hive connection
|Supported auth modes||Supported transport modes|
- Select Hive table
Step 2: Select the target table
You can either import to an existing table or create a new one.
If you choose to create a new table, the user interface is like the create table interface in database management. In addition, there is a schema mapping setting where you specify the mapping from source columns to target columns, one by one. We pre-fill the mapping by comparing the column names, but we recommend that you review and customise it to suit your needs.
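Pre-filling a mapping by column name can be sketched as follows. The exact comparison ByteHouse uses is not documented here, so this example assumes a case-insensitive match after trimming whitespace; `prefill_mapping` is a hypothetical helper. Source columns with no match map to `None`, which is why the result should be reviewed.

```python
def prefill_mapping(source_columns, target_columns):
    """Map each source column to the same-named target column, else None."""
    # Assumption: names are compared case-insensitively after trimming whitespace.
    targets = {t.strip().lower(): t for t in target_columns}
    return {s: targets.get(s.strip().lower()) for s in source_columns}


mapping = prefill_mapping(["ID", "name", "ts"], ["id", "name", "created_at"])
print(mapping)  # {'ID': 'id', 'name': 'name', 'ts': None}
```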
Step 3: Analyse source schema
We provide a schema extraction feature that helps you retrieve the source schema. The feature works by reading the schema from the source metadata for formats that carry a schema (such as Avro, Parquet, Hive), or by inferring the schema from the first few hundred records for formats that have no schema (such as CSV, JSON). In case the source data is headerless, the column names are in
The feature is for your convenience only and may not be entirely accurate. We recommend that you review the schema before proceeding.
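To see why sample-based inference can be wrong, consider this minimal sketch for CSV with a header row. It is an illustration, not ByteHouse's algorithm: the type names (`int`, `float`, `string`), the default `sample_size` of 300, and the rule of widening to the broadest type seen are all assumptions.

```python
import csv
import io

# Wider types have a higher rank; a column keeps the widest type seen.
_RANK = {"int": 0, "float": 1, "string": 2}


def _type_of(value: str) -> str:
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            pass
    return "string"


def infer_csv_schema(text: str, sample_size: int = 300) -> dict:
    """Infer column types from the first `sample_size` records of CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    schema = {name: None for name in (reader.fieldnames or [])}
    for i, row in enumerate(reader):
        if i >= sample_size:
            break
        for column in schema:
            value = row.get(column)
            if value is None or value == "":
                continue  # empty cells carry no type information
            seen = _type_of(value)
            current = schema[column]
            if current is None or _RANK[seen] > _RANK[current]:
                schema[column] = seen
    # Columns with no non-empty sample fall back to string.
    return {column: kind or "string" for column, kind in schema.items()}
```

A value outside the sample is never examined, so a column that looks numeric in the first records but contains text later is inferred as numeric. That is the kind of mismatch a manual review catches.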
Step 4: Select the loading type, job name, and confirm
Each job requires a unique name across the account and an optional description.
There are 2 loading types, which tell us how to load data to the target table:
- Full loading: Replace the entire target table with the latest batch source.
- Incremental loading: Append new batches to the existing table according to its partitions. If a partition in the new batch already exists in the target table, ByteHouse replaces that partition instead of merging into it.
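The difference between the two loading types can be modelled with a table represented as a dictionary from partition key to rows. This is a conceptual sketch only: `full_load` and `incremental_load` are hypothetical functions and the partition keys are made up.

```python
def full_load(target: dict, batch: dict) -> dict:
    """Full loading: the batch replaces the entire target table."""
    return {partition: list(rows) for partition, rows in batch.items()}


def incremental_load(target: dict, batch: dict) -> dict:
    """Incremental loading: batch partitions replace same-keyed ones; others stay."""
    result = {partition: list(rows) for partition, rows in target.items()}
    for partition, rows in batch.items():
        result[partition] = list(rows)  # replaced, not merged
    return result


target = {"2023-01": ["a", "b"], "2023-02": ["c"]}
batch = {"2023-02": ["d"], "2023-03": ["e"]}

print(full_load(target, batch))
# {'2023-02': ['d'], '2023-03': ['e']}
print(incremental_load(target, batch))
# {'2023-01': ['a', 'b'], '2023-02': ['d'], '2023-03': ['e']}
```

In the incremental result, row `c` is gone because partition `2023-02` was replaced as a whole.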
Note: If you choose to create a new sink table, the table and the job are not created atomically, but sequentially. There could be a case where we cannot create the job, but we have already created the sink table. If you encounter this, you can go back to Step 2: Select the target table, change to Import to an existing table, and select the newly created table.
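The sequential behaviour described in the note, and the recovery path of reusing the already-created table, can be sketched like this. Everything here is hypothetical: `create_table` and `create_job` stand in for whatever performs the two steps, and `JobCreationError` is an illustrative exception, not a ByteHouse API.

```python
class JobCreationError(Exception):
    """The sink table was created, but the job was not."""

    def __init__(self, table_name: str) -> None:
        super().__init__(f"job not created; sink table {table_name!r} already exists")
        self.table_name = table_name


def create_table_then_job(create_table, create_job, table_name, job_name):
    """Run the two steps in sequence; they are not atomic."""
    create_table(table_name)  # step 1 can succeed on its own
    try:
        return create_job(job_name, table_name)  # step 2 can fail independently
    except Exception as exc:
        # Report which table was left behind so the caller can reuse it.
        raise JobCreationError(table_name) from exc
```

A caller that catches `JobCreationError` can retry only the job step against `err.table_name`, which mirrors switching to importing into an existing table.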
Viewing batch loading jobs
On the data import landing page, we display two lists: jobs and executions. They give you a high-level view of each job and execution, including its name, type, status, and the number of ingested records. You can click on any job or execution to go to the job details or execution details page, respectively.
Operating a batch loading job
After successfully creating a job, you will be redirected to the job details page. You can also go to that page by clicking on a job from the landing page. On this page, you can get insights about the job such as sync history, ingestion chart, and configuration. You can also start, stop, and edit the job here.
The job does not start automatically after being created. To start it, choose the Start button in the top-right corner.
If the job is running, the Start button will be replaced by a Stop button. Selecting Stop will stop the job.
Sometimes, you may want to edit an import job, for example to handle a change in the source data or sink table schema, update the schema mapping, switch to another sink table, or change the loading type. To edit a job, from the job details page, select Edit job in the more menu. The steps for editing are similar to those for creating a new job.