Updated on 2024-08-30 GMT+08:00

SparkSQL table creation parameter specifications

Rules

  • When creating a table, you must specify primaryKey and preCombineField.

    Hudi tables support data updates and idempotent writes. Both capabilities rely on a primary key that identifies duplicate records and update operations. If no primary key is specified, the table loses its update capability. If preCombineField is not specified, records with duplicate primary keys can occur.

    | Parameter | Description | Value | Remarks |
    |---|---|---|---|
    | primaryKey | Primary key of the Hudi table. | Set as required | Mandatory. It can be a composite primary key but must be globally unique. |
    | preCombineField | Pre-combine field. Multiple records with the same primary key are merged based on this field. | Set as required | Mandatory. Only one field can be specified. |

  • Do not set hoodie.datasource.hive_sync.enable to false during table creation.

    If this parameter is set to false, newly written partitions are not synchronized to Hive Metastore. Because the metastore then lacks the new partition information, query engines miss the data in those partitions when reading the table.

  • Do not set the Hudi index type to INMEMORY.

    This index is for test use only. Using the index in the production environment will cause duplicate data.
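The three rules above can be combined in a single statement. The following is a minimal sketch, not a reference configuration: the table name orders_demo and its columns are hypothetical, and it assumes that the Hudi version in use accepts hoodie.* write configurations in the options clause. BLOOM is used only as one example of an index type other than INMEMORY.

```sql
-- Hypothetical table and column names, for illustration only.
create table orders_demo(order_id int, item_id int, ts bigint, amount double, dt string)
using hudi
partitioned by(dt)
options(
  type='cow',
  primaryKey='order_id,item_id',  -- composite primary key, comma-separated
  preCombineField='ts',           -- exactly one pre-combine field
  'hoodie.datasource.hive_sync.enable'='true',  -- assumption: accepted here; keeps Hive Metastore sync on
  'hoodie.index.type'='BLOOM'     -- an index type other than INMEMORY
);
```

If the cluster already enables Hive synchronization by default, the hive_sync line is redundant; the point is only that it must not be set to false.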

Table creation example

create table data_partition(id int, comb int, col0 int, yy int, mm int, dd int)
using hudi --Specify the Hudi data source.
partitioned by(yy, mm, dd) --Specify the partition columns. Multi-level partitioning is supported.
location '/opt/log/data_partition' --Specify the path. If no path is specified, the table is created in the Hive warehouse.
options(
type='mor', --Table type: mor or cow.
primaryKey='id', --Primary key. It can be a composite primary key but must be globally unique.
preCombineField='comb' --Pre-combine field. Data with the same primary key is merged by this field. Currently, only one field can be specified.
)
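A short usage sketch can show why the primary key matters for updates. It assumes the data_partition table above has been created successfully; the inserted values are made up for illustration, and the columns are listed in the order of the table definition.

```sql
-- Write one record; id is the primary key, comb is the pre-combine field.
insert into data_partition values (1, 10, 100, 2024, 8, 30);

-- Update the record located by its primary key.
update data_partition set col0 = 200, comb = 11 where id = 1;

-- Check the current state of the record.
select id, comb, col0, yy, mm, dd from data_partition where id = 1;
```

With primaryKey set, the update statement can locate the existing record for id = 1 and rewrite it instead of adding a second record with the same key.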