Updated on 2025-04-21 GMT+08:00

API Syntax Description

Setting Write Modes

Hudi uses the hoodie.datasource.write.operation parameter to set the write mode.

  • insert: Writes data without querying the index for the file partitions that need updating, so it is faster than upsert. Use it when the data contains no updates; if update data is present, this operation may produce duplicate records.
  • bulk_insert: Sorts the data by primary key and writes it to the Hudi table as a regular Parquet table. It offers the highest performance but cannot control small files, whereas upsert and insert can.
  • upsert: The default operation. Hudi uses the primary key to determine whether the incoming data contains updates. Records with existing keys are updated; new records are inserted.
  • Because insert does not sort by primary key, you are advised to use bulk_insert instead of insert when initializing a dataset.
  • Use insert when you are sure that all data is new, and upsert when the data contains updates.

Example: Using bulk_insert to write a non-partitioned COW table

df.write.format("org.apache.hudi").
option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL). // copy-on-write table
option("hoodie.datasource.write.precombine.field", "update_time"). // field used to pick the latest record among duplicates
option("hoodie.datasource.write.recordkey.field", "id"). // primary key
option("hoodie.datasource.write.partitionpath.field", ""). // empty string: non-partitioned
option("hoodie.datasource.write.operation", "bulk_insert").
option("hoodie.table.name", tableName).
option("hoodie.write.lock.provider", "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider").
option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
option("hoodie.datasource.hive_sync.enable", "true"). // sync table metadata to the metastore
option("hoodie.datasource.hive_sync.partition_fields", "").
option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.NonPartitionedExtractor").
option("hoodie.datasource.hive_sync.database", databaseName).
option("hoodie.datasource.hive_sync.table", tableName).
option("hoodie.datasource.hive_sync.use_jdbc", "false").
option("hoodie.bulkinsert.shuffle.parallelism", 4). // number of shuffle partitions for bulk_insert
mode(SaveMode.Overwrite).
save(basePath)
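
For comparison, the following is a minimal sketch of an upsert write to the same non-partitioned COW table. It assumes the same df, tableName, and basePath values as the example above; it is an illustration of the upsert operation described earlier, not an additional example from this guide.

```scala
// Sketch: upsert into the same non-partitioned COW table.
// Hudi matches incoming rows on the record key field ("id"); on a key match,
// the row with the larger precombine field ("update_time") wins.
df.write.format("org.apache.hudi").
option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
option("hoodie.datasource.write.precombine.field", "update_time").
option("hoodie.datasource.write.recordkey.field", "id").
option("hoodie.datasource.write.partitionpath.field", "").
option("hoodie.datasource.write.operation", "upsert"). // default operation, shown explicitly
option("hoodie.table.name", tableName).
option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
mode(SaveMode.Append). // Append so the write merges into the existing table
save(basePath)
```

Note that SaveMode.Append is used here instead of SaveMode.Overwrite: an upsert is meaningful only against an existing table, whereas Overwrite recreates the table.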

Setting Partitions

  • Multi-level partitioning

    hoodie.datasource.write.partitionpath.field: Set it to multiple service fields separated by commas (,).

    hoodie.datasource.hive_sync.partition_fields: Set it to the same value as hoodie.datasource.write.partitionpath.field.

    hoodie.datasource.write.keygenerator.class: Set it to org.apache.hudi.keygen.ComplexKeyGenerator.

    hoodie.datasource.hive_sync.partition_extractor_class: Set it to org.apache.hudi.hive.MultiPartKeysValueExtractor.

    Example: Creating a multi-level partitioned COW table with partitions year/month/day

    df.write.format("org.apache.hudi").
    option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
    option("hoodie.datasource.write.precombine.field", "update_time").
    option("hoodie.datasource.write.recordkey.field", "id").
    option("hoodie.datasource.write.partitionpath.field", "year,month,day").
    option("hoodie.datasource.write.operation", "bulk_insert").
    option("hoodie.table.name", tableName).
    option("hoodie.write.lock.provider", "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider").
    option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator").
    option("hoodie.datasource.hive_sync.enable", "true").
    option("hoodie.datasource.hive_sync.partition_fields", "year,month,day").
    option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor").
    option("hoodie.datasource.hive_sync.database", databaseName).
    option("hoodie.datasource.hive_sync.table", tableName).
    option("hoodie.datasource.hive_sync.use_jdbc", "false").
    mode(SaveMode.Overwrite).
    save(basePath)
  • Single-level partitioning

    hoodie.datasource.write.partitionpath.field: Set it to a service field.

    hoodie.datasource.hive_sync.partition_fields: Set it to the same value as hoodie.datasource.write.partitionpath.field.

    hoodie.datasource.write.keygenerator.class: The default value is org.apache.hudi.keygen.ComplexKeyGenerator, which is used automatically if this parameter is left unspecified. You can also set it to org.apache.hudi.keygen.SimpleKeyGenerator.

    hoodie.datasource.hive_sync.partition_extractor_class: Set it to org.apache.hudi.hive.MultiPartKeysValueExtractor.

    Example: Creating a single-level partitioned MOR table partitioned by create_time

    df.write.format("org.apache.hudi").
    option("hoodie.datasource.write.table.type", MOR_TABLE_TYPE_OPT_VAL).
    option("hoodie.datasource.write.precombine.field", "update_time").
    option("hoodie.datasource.write.recordkey.field", "id").
    option("hoodie.datasource.write.partitionpath.field", "create_time").
    option("hoodie.datasource.write.operation", "bulk_insert").
    option("hoodie.table.name", tableName).
    option("hoodie.write.lock.provider", "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider").
    option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator").
    option("hoodie.datasource.hive_sync.enable", "true").
    option("hoodie.datasource.hive_sync.partition_fields", "create_time").
    option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor").
    option("hoodie.datasource.hive_sync.database", databaseName).
    option("hoodie.datasource.hive_sync.table", tableName).
    option("hoodie.datasource.hive_sync.use_jdbc", "false").
    mode(SaveMode.Overwrite).
    save(basePath)
  • No partitioning

    hoodie.datasource.write.partitionpath.field: Set it to an empty string.

    hoodie.datasource.hive_sync.partition_fields: Set it to an empty string.

    hoodie.datasource.write.keygenerator.class: Set it to org.apache.hudi.keygen.NonpartitionedKeyGenerator.

    hoodie.datasource.hive_sync.partition_extractor_class: Set it to org.apache.hudi.hive.NonPartitionedExtractor.

    Example: Creating a non-partitioned COW table

    df.write.format("org.apache.hudi").
    option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
    option("hoodie.datasource.write.precombine.field", "update_time").
    option("hoodie.datasource.write.recordkey.field", "id").
    option("hoodie.datasource.write.partitionpath.field", "").
    option("hoodie.datasource.write.operation", "bulk_insert").
    option("hoodie.table.name", tableName).
    option("hoodie.write.lock.provider", "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider").
    option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
    option("hoodie.datasource.hive_sync.enable", "true").
    option("hoodie.datasource.hive_sync.partition_fields", "").
    option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.NonPartitionedExtractor").
    option("hoodie.datasource.hive_sync.database", databaseName).
    option("hoodie.datasource.hive_sync.table", tableName).
    option("hoodie.datasource.hive_sync.use_jdbc", "false").
    mode(SaveMode.Overwrite).
    save(basePath)
  • Time-date partitioning

    hoodie.datasource.write.partitionpath.field: Set it to a date-type field whose values are in the format yyyy/mm/dd.

    hoodie.datasource.hive_sync.partition_fields: Set it to the same value as hoodie.datasource.write.partitionpath.field.

    hoodie.datasource.write.keygenerator.class: The default value is org.apache.hudi.keygen.SimpleKeyGenerator. You can also set it to org.apache.hudi.keygen.ComplexKeyGenerator.

    hoodie.datasource.hive_sync.partition_extractor_class: Set it to org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.

    Note that SlashEncodedDayPartitionValueExtractor requires the partition values to be in the yyyy/mm/dd format.
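
    The other partitioning modes above each close with a full write example, so a sketch of a date-partitioned write follows for symmetry. The field name create_date is an illustrative assumption (any date-type field formatted as yyyy/mm/dd would do), and the remaining options mirror the earlier examples rather than coming from this section.

    ```scala
    // Sketch: writing a COW table partitioned by a date field.
    // Assumes create_date holds values formatted as yyyy/mm/dd, as required
    // by SlashEncodedDayPartitionValueExtractor.
    df.write.format("org.apache.hudi").
    option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
    option("hoodie.datasource.write.precombine.field", "update_time").
    option("hoodie.datasource.write.recordkey.field", "id").
    option("hoodie.datasource.write.partitionpath.field", "create_date").
    option("hoodie.datasource.write.operation", "bulk_insert").
    option("hoodie.table.name", tableName).
    option("hoodie.write.lock.provider", "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider").
    option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.SimpleKeyGenerator").
    option("hoodie.datasource.hive_sync.enable", "true").
    option("hoodie.datasource.hive_sync.partition_fields", "create_date").
    option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor").
    option("hoodie.datasource.hive_sync.database", databaseName).
    option("hoodie.datasource.hive_sync.table", tableName).
    option("hoodie.datasource.hive_sync.use_jdbc", "false").
    mode(SaveMode.Overwrite).
    save(basePath)
    ```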