Batch loading

When you have chunks of data ready in a source location, ByteHouse allows you to load them in one shot.

Currently, ByteHouse supports loading data from S3, Hive, and Ad-hoc File Upload.

Supported File Formats

  • CSV
  • JSON
  • Avro
  • Parquet
  • Excel (xls)

Creating a Batch Job

Different source types differ in the details, but in general they all require four steps:

To begin, from the data import landing page, select New Import Job, then choose the source storage accordingly.

Step 1: Select the source data

Source data refers to a table (HIVE) or a file folder/path (S3) that contains the data you want to import. Storage systems usually require some information to connect to them, typically credentials and cluster addresses. We use the term data source (a.k.a. connection) to store such information. Your connection information is encrypted and not readable, even by us. Once stored, it cannot be retrieved; you can only update or delete it. Each connection requires a name that is unique across the account. Because all details other than the source type and connection name are hidden, give your connection a name that helps you recognize which source it points to.

Jump to one of the sections below according to your source storage.

AWS S3

Create S3 connection

Only the Access Key and Secret Key are needed to create a new S3 connection. From these, we determine which buckets the credential has the required (read) access to.
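
For illustration, the following is a minimal sketch of how read access could be probed per bucket with an Access Key / Secret Key pair, similar in spirit to what the connection form does. It uses boto3 (not part of ByteHouse), and the key values are placeholders.

```python
# A minimal sketch: probe which buckets a credential can read.
# boto3 is assumed; the key values below are placeholders.
import boto3
from botocore.exceptions import ClientError

session = boto3.Session(
    aws_access_key_id="YOUR_ACCESS_KEY",      # placeholder
    aws_secret_access_key="YOUR_SECRET_KEY",  # placeholder
)
s3 = session.client("s3")

readable = []
for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    try:
        # A cheap read probe: list at most one object in the bucket.
        s3.list_objects_v2(Bucket=name, MaxKeys=1)
        readable.append(name)
    except ClientError:
        # The credential lacks read permission on this bucket; skip it.
        pass

print("Buckets with read access:", readable)
```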

Select S3 prefix

We provide functionality that helps you select the S3 folder conveniently, but you can also enter the path manually if you want to (e.g. in case you create the job before the data exists).

Note: The file specified in File Name is used to extract the source schema; you will be asked which file you want to import when you start the ingestion.
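
If you prefer to look up the prefix yourself first, here is a quick sketch with boto3 (bucket name and prefix are placeholders) of listing the files under an S3 prefix, much like the folder picker does:

```python
# A minimal sketch: browse files under an S3 prefix.
# Bucket name and prefix are placeholders.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="exports/2023/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```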

HIVE

Create Hive connection

Supported Auth modes:

  • NOSASL
  • NONE
  • LDAP

Supported Transport modes:

  • BINARY
  • HTTP
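
For reference, here is a rough sketch of how these modes map onto a HiveServer2 client connection, using PyHive (which is not part of ByteHouse); host, port, and credentials are placeholders.

```python
# A rough sketch of the auth modes above on a HiveServer2 connection,
# using PyHive (not part of ByteHouse). Host, port, and credentials
# are placeholders.
from pyhive import hive

# NONE / NOSASL: no password, optionally a username.
conn = hive.Connection(host="hive.example.com", port=10000, auth="NOSASL")

# LDAP: username and password are required.
conn = hive.Connection(
    host="hive.example.com",
    port=10000,
    username="alice",
    password="secret",
    auth="LDAP",
)

# HTTP transport mode requires passing a pre-built thrift_transport
# to PyHive; that setup is omitted here.
```
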
Select Hive table

Step 2: Analyze source schema

We provide a schema extraction feature that helps you retrieve the source schema. It works by reading the schema from source metadata for formats that carry one (such as Avro, Parquet, Hive), or by inferring the schema from the first few hundred records for formats that have none (such as CSV, JSON). If the source data is headerless, the column names follow the _c0 to _cN format.
The feature is for your convenience only and might not be entirely precise; we recommend reviewing the schema before proceeding.
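
As an illustration of the inference half, here is a minimal sketch (plain Python, not the actual ByteHouse implementation) that samples the first few hundred rows of a CSV file and guesses a type per column, naming headerless columns _c0 to _cN. The path and type names are illustrative.

```python
# A minimal sketch of schema inference for CSV: sample a few hundred rows
# and guess a type per column. Not the actual ByteHouse implementation.
import csv
from itertools import islice

def infer_csv_schema(path, sample_rows=500, has_header=False):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader) if has_header else None
        sample = list(islice(reader, sample_rows))

    n_cols = len(sample[0])
    # Headerless files get _c0 .. _cN column names.
    names = header or [f"_c{i}" for i in range(n_cols)]

    def guess(values):
        try:
            for v in values:
                int(v)
            return "Int64"
        except ValueError:
            pass
        try:
            for v in values:
                float(v)
            return "Float64"
        except ValueError:
            return "String"

    return {names[i]: guess([row[i] for row in sample]) for i in range(n_cols)}
```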

Step 3: Select the target table

You can either import to an existing table or create a new one.

If you choose to create a new table, the user interface is similar to creating a table in database management. However, there is a schema mapping setting where you specify the mapping from source columns to target columns, one by one. We prefill the mapping by comparing column names, but we recommend reviewing and customizing it to your needs.
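
The prefill is essentially a name match. Here is a minimal sketch of the idea (plain Python, with illustrative column lists; unmatched source columns are left for you to map manually):

```python
# A minimal sketch of prefilling a schema mapping by comparing column names.
def prefill_mapping(source_columns, target_columns):
    targets = {c.lower(): c for c in target_columns}
    # Match source to target case-insensitively; None means "map manually".
    return {src: targets.get(src.lower()) for src in source_columns}

print(prefill_mapping(["_c0", "user_id", "Amount"], ["user_id", "amount", "ts"]))
# {'_c0': None, 'user_id': 'user_id', 'Amount': 'amount'}
```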

Step 4: Select the loading type and job name, and confirm

Each job requires a unique name across the account and an optional description.

There are two loading types, which tell us how to load data into the sink table:

  • Full loading: Replace the entire sink table with the latest source batch.
  • Incremental loading: Append new batches to the existing table according to its partitioning. Note: ByteHouse replaces existing partitions instead of merging them (illustrated in the sketch after this list).
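
Here is a minimal sketch of that partition-replace behaviour (plain Python; the dict-based "table" and partition key are purely illustrative):

```python
# A minimal sketch of incremental loading with partition replacement:
# every partition present in the incoming batch overwrites the matching
# partition in the sink table; untouched partitions stay as they are.
from collections import defaultdict

def incremental_load(sink_by_partition, batch_rows, partition_key):
    new_partitions = defaultdict(list)
    for row in batch_rows:
        new_partitions[row[partition_key]].append(row)
    # Replace, not merge: existing rows of an incoming partition are dropped.
    for part, rows in new_partitions.items():
        sink_by_partition[part] = rows
    return sink_by_partition

sink = {"2023-01-01": [{"date": "2023-01-01", "v": 1}]}
batch = [{"date": "2023-01-01", "v": 2}, {"date": "2023-01-02", "v": 3}]
print(incremental_load(sink, batch, "date"))
# {'2023-01-01': [{'date': '2023-01-01', 'v': 2}],
#  '2023-01-02': [{'date': '2023-01-02', 'v': 3}]}
```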

Note: If you choose to create a new sink table, the table and the job are created sequentially, not atomically. There can be a case where we fail to create the job even though the sink table has already been created. If you run into this, go back to Step 3: Select the target table, switch to importing into an existing table, and select the newly created table.

Viewing batch loading jobs

On the data import landing page, we display two lists: jobs and executions. They give you a high-level view, such as name, type, status, and the number of ingested records per job and per execution. You can click on any job or execution to go to the job details or execution details page respectively.

Operating a batch loading job

After creating a job successfully, you will be redirected to the job details page. You can also reach that page by clicking on a job from the landing page. On this page you can get insights into the job, such as its sync history, ingestion chart, and configuration. You can also operate the job here.

Start/Stop job

The job does not start automatically after being created; to start it, click the Start button at the top right.

If the job is running, the Start button is replaced by Stop, which, needless to say, stops the job.

Edit jobs

There are many cases where you may want to edit an import job: the source data or sink table schema changed, you need to remap the schema mapping, switch to another sink table, or change the loading mode. The editing steps are similar to creating a new job. To edit a job, from the job details page, select Edit job in the more menu.

