Updated on 2025-04-21 GMT+08:00

API Syntax Description

Setting Write Modes

Hudi uses the hoodie.datasource.write.operation parameter to set the write mode.

  • insert: Writes data without querying the index for the file partitions that need updating, so it is faster than upsert. Use it when the data contains no updates; if update data is present, this operation may produce duplicate records.
  • bulk_insert: Sorts the data by primary key and writes it to the Hudi table as a regular Parquet table. It offers the highest performance but cannot control small files, whereas upsert and insert can.
  • upsert: The default operation. Hudi uses the primary key to determine whether the incoming data contains updates. Records with existing keys are updated; new records are inserted.
  • Because insert does not sort by primary key, you are advised to use bulk_insert instead of insert when initializing a dataset.
  • Use insert when you are sure that all data is new, and upsert when the data contains updates.

Example: Using bulk_insert to write a non-partitioned COW table

df.write.format("org.apache.hudi").
option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL). // copy-on-write table
option("hoodie.datasource.write.precombine.field", "update_time"). // field used to pick the latest record among duplicates
option("hoodie.datasource.write.recordkey.field", "id"). // primary key
option("hoodie.datasource.write.partitionpath.field", ""). // empty string: non-partitioned
option("hoodie.datasource.write.operation", "bulk_insert").
option("hoodie.table.name", tableName).
option("hoodie.write.lock.provider", "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider").
option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
option("hoodie.datasource.hive_sync.enable", "true"). // sync table metadata to the metastore
option("hoodie.datasource.hive_sync.partition_fields", "").
option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.NonPartitionedExtractor").
option("hoodie.datasource.hive_sync.database", databaseName).
option("hoodie.datasource.hive_sync.table", tableName).
option("hoodie.datasource.hive_sync.use_jdbc", "false").
option("hoodie.bulkinsert.shuffle.parallelism", 4). // number of shuffle partitions for bulk_insert
mode(SaveMode.Overwrite).
save(basePath)
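
For comparison, the following is a minimal sketch of an upsert write to the same non-partitioned COW table. It assumes the same df, tableName, and basePath values as the example above; it is an illustration of the upsert operation described earlier, not an additional example from this guide.

```scala
// Sketch: upsert into the same non-partitioned COW table.
// Hudi matches incoming rows on the record key field ("id"); on a key match,
// the row with the larger precombine field ("update_time") wins.
df.write.format("org.apache.hudi").
option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
option("hoodie.datasource.write.precombine.field", "update_time").
option("hoodie.datasource.write.recordkey.field", "id").
option("hoodie.datasource.write.partitionpath.field", "").
option("hoodie.datasource.write.operation", "upsert"). // default operation, shown explicitly
option("hoodie.table.name", tableName).
option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
mode(SaveMode.Append). // Append so the write merges into the existing table
save(basePath)
```

Note that SaveMode.Append is used here instead of SaveMode.Overwrite: an upsert is meaningful only against an existing table, whereas Overwrite recreates the table.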

Setting Partitions

  • Multi-level partitioning

    hoodie.datasource.write.partitionpath.field: Set it to multiple service fields separated by commas (,).

    hoodie.datasource.hive_sync.partition_fields: Set it to the same value as hoodie.datasource.write.partitionpath.field.

    hoodie.datasource.write.keygenerator.class: Set it to org.apache.hudi.keygen.ComplexKeyGenerator.

    hoodie.datasource.hive_sync.partition_extractor_class: Set it to org.apache.hudi.hive.MultiPartKeysValueExtractor.

    Example: Creating a multi-level partitioned COW table with partitions year/month/day

    df.write.format("org.apache.hudi").
    option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
    option("hoodie.datasource.write.precombine.field", "update_time").
    option("hoodie.datasource.write.recordkey.field", "id").
    option("hoodie.datasource.write.partitionpath.field", "year,month,day").
    option("hoodie.datasource.write.operation", "bulk_insert").
    option("hoodie.table.name", tableName).
    option("hoodie.write.lock.provider", "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider").
    option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator").
    option("hoodie.datasource.hive_sync.enable", "true").
    option("hoodie.datasource.hive_sync.partition_fields", "year,month,day").
    option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor").
    option("hoodie.datasource.hive_sync.database", databaseName).
    option("hoodie.datasource.hive_sync.table", tableName).
    option("hoodie.datasource.hive_sync.use_jdbc", "false").
    mode(SaveMode.Overwrite).
    save(basePath)
  • Single-level partitioning

    hoodie.datasource.write.partitionpath.field: Set it to a service field.

    hoodie.datasource.hive_sync.partition_fields: Set it to the same value as hoodie.datasource.write.partitionpath.field.

    hoodie.datasource.write.keygenerator.class: The default value is org.apache.hudi.keygen.ComplexKeyGenerator, which is used automatically if this parameter is left unspecified. You can also set it to org.apache.hudi.keygen.SimpleKeyGenerator.

    hoodie.datasource.hive_sync.partition_extractor_class: Set it to org.apache.hudi.hive.MultiPartKeysValueExtractor.

    Example: Creating a single-level partitioned MOR table partitioned by create_time

    df.write.format("org.apache.hudi").
    option("hoodie.datasource.write.table.type", MOR_TABLE_TYPE_OPT_VAL).
    option("hoodie.datasource.write.precombine.field", "update_time").
    option("hoodie.datasource.write.recordkey.field", "id").
    option("hoodie.datasource.write.partitionpath.field", "create_time").
    option("hoodie.datasource.write.operation", "bulk_insert").
    option("hoodie.table.name", tableName).
    option("hoodie.write.lock.provider", "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider").
    option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator").
    option("hoodie.datasource.hive_sync.enable", "true").
    option("hoodie.datasource.hive_sync.partition_fields", "create_time").
    option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor").
    option("hoodie.datasource.hive_sync.database", databaseName).
    option("hoodie.datasource.hive_sync.table", tableName).
    option("hoodie.datasource.hive_sync.use_jdbc", "false").
    mode(SaveMode.Overwrite).
    save(basePath)
  • No partitioning

    hoodie.datasource.write.partitionpath.field: Set it to an empty string.

    hoodie.datasource.hive_sync.partition_fields: Set it to an empty string.

    hoodie.datasource.write.keygenerator.class: Set it to org.apache.hudi.keygen.NonpartitionedKeyGenerator.

    hoodie.datasource.hive_sync.partition_extractor_class: Set it to org.apache.hudi.hive.NonPartitionedExtractor.

    Example: Creating a non-partitioned COW table

    df.write.format("org.apache.hudi").
    option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
    option("hoodie.datasource.write.precombine.field", "update_time").
    option("hoodie.datasource.write.recordkey.field", "id").
    option("hoodie.datasource.write.partitionpath.field", "").
    option("hoodie.datasource.write.operation", "bulk_insert").
    option("hoodie.table.name", tableName).
    option("hoodie.write.lock.provider", "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider").
    option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
    option("hoodie.datasource.hive_sync.enable", "true").
    option("hoodie.datasource.hive_sync.partition_fields", "").
    option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.NonPartitionedExtractor").
    option("hoodie.datasource.hive_sync.database", databaseName).
    option("hoodie.datasource.hive_sync.table", tableName).
    option("hoodie.datasource.hive_sync.use_jdbc", "false").
    mode(SaveMode.Overwrite).
    save(basePath)
  • Time-date partitioning

    hoodie.datasource.write.partitionpath.field: Set it to a date-type field whose values are in the format yyyy/mm/dd.

    hoodie.datasource.hive_sync.partition_fields: Set it to the same value as hoodie.datasource.write.partitionpath.field.

    hoodie.datasource.write.keygenerator.class: The default value is org.apache.hudi.keygen.SimpleKeyGenerator. You can also set it to org.apache.hudi.keygen.ComplexKeyGenerator.

    hoodie.datasource.hive_sync.partition_extractor_class: Set it to org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.

    Note that SlashEncodedDayPartitionValueExtractor requires the partition values to be in the yyyy/mm/dd format.
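
    The other partitioning modes above each close with a full write example, so a sketch of a date-partitioned write follows for symmetry. The field name create_date is an illustrative assumption (any date-type field formatted as yyyy/mm/dd would do), and the remaining options mirror the earlier examples rather than coming from this section.

    ```scala
    // Sketch: writing a COW table partitioned by a date field.
    // Assumes create_date holds values formatted as yyyy/mm/dd, as required
    // by SlashEncodedDayPartitionValueExtractor.
    df.write.format("org.apache.hudi").
    option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
    option("hoodie.datasource.write.precombine.field", "update_time").
    option("hoodie.datasource.write.recordkey.field", "id").
    option("hoodie.datasource.write.partitionpath.field", "create_date").
    option("hoodie.datasource.write.operation", "bulk_insert").
    option("hoodie.table.name", tableName).
    option("hoodie.write.lock.provider", "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider").
    option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.SimpleKeyGenerator").
    option("hoodie.datasource.hive_sync.enable", "true").
    option("hoodie.datasource.hive_sync.partition_fields", "create_date").
    option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor").
    option("hoodie.datasource.hive_sync.database", databaseName).
    option("hoodie.datasource.hive_sync.table", tableName).
    option("hoodie.datasource.hive_sync.use_jdbc", "false").
    mode(SaveMode.Overwrite).
    save(basePath)
    ```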