API Syntax Description
Setting Write Modes
Hudi uses the hoodie.datasource.write.operation parameter to set the write mode.
- insert: Writes data without looking up the index to locate the files that would need updating, so it is faster than upsert. You are advised to use this operation when there is no update data; if update data is written with this operation, duplicate records may result.
- bulk_insert: Sorts the data by primary key and writes it to the Hudi table as regular Parquet files. It has the highest performance but does not control small files, whereas upsert and insert control small files well.
- upsert: The default operation type. Based on the primary key, Hudi determines whether each incoming record already exists in the table: existing records are updated and new records are inserted.

- Because insert does not sort primary keys, you are advised not to use it when initializing a dataset; use bulk_insert instead.
- Use insert when all incoming data is confirmed to be new, and upsert when the data contains updates.
Example: Writing a non-partitioned COW table with bulk_insert
df.write.format("org.apache.hudi").
  option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
  option("hoodie.datasource.write.precombine.field", "update_time").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.table.name", tableName).
  option("hoodie.write.lock.provider", "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
  option("hoodie.datasource.hive_sync.enable", "true").
  option("hoodie.datasource.hive_sync.partition_fields", "").
  option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.NonPartitionedExtractor").
  option("hoodie.datasource.hive_sync.database", databaseName).
  option("hoodie.datasource.hive_sync.table", tableName).
  option("hoodie.datasource.hive_sync.use_jdbc", "false").
  option("hoodie.bulkinsert.shuffle.parallelism", 4).
  mode(SaveMode.Overwrite).
  save(basePath)
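For comparison, when the table already exists and the incoming data contains updates, the write differs mainly in the operation type and save mode. The following is a minimal sketch, not taken from the original documentation: it reuses df, tableName, and basePath from the example above, omits the Hive sync options for brevity, and assumes SaveMode.Append so the existing table is updated rather than overwritten.

// Minimal upsert sketch (assumes the same variables as the bulk_insert example above).
df.write.format("org.apache.hudi").
  option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
  option("hoodie.datasource.write.precombine.field", "update_time").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "").
  // upsert merges records whose primary key already exists and inserts the rest.
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.table.name", tableName).
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
  // Append updates the existing table instead of replacing it.
  mode(SaveMode.Append).
  save(basePath)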
Setting Partitions
- Multi-level partitioning
  hoodie.datasource.write.partitionpath.field: Set it to multiple service fields separated by commas (,).
  hoodie.datasource.hive_sync.partition_fields: Set it to the same value as hoodie.datasource.write.partitionpath.field.
  hoodie.datasource.write.keygenerator.class: Set it to org.apache.hudi.keygen.ComplexKeyGenerator.
  hoodie.datasource.hive_sync.partition_extractor_class: Set it to org.apache.hudi.hive.MultiPartKeysValueExtractor.
Example: Creating a multi-level partitioned COW table partitioned by year/month/day
df.write.format("org.apache.hudi").
  option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
  option("hoodie.datasource.write.precombine.field", "update_time").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "year,month,day").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.table.name", tableName).
  option("hoodie.write.lock.provider", "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator").
  option("hoodie.datasource.hive_sync.enable", "true").
  option("hoodie.datasource.hive_sync.partition_fields", "year,month,day").
  option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor").
  option("hoodie.datasource.hive_sync.database", databaseName).
  option("hoodie.datasource.hive_sync.table", tableName).
  option("hoodie.datasource.hive_sync.use_jdbc", "false").
  mode(SaveMode.Overwrite).
  save(basePath)
- Single-level partitioning
  hoodie.datasource.write.partitionpath.field: Set it to a single service field.
  hoodie.datasource.hive_sync.partition_fields: Set it to the same value as hoodie.datasource.write.partitionpath.field.
  hoodie.datasource.write.keygenerator.class: Defaults to org.apache.hudi.keygen.ComplexKeyGenerator; you can also set it to org.apache.hudi.keygen.SimpleKeyGenerator. If left unspecified, the default value is used.
  hoodie.datasource.hive_sync.partition_extractor_class: Set it to org.apache.hudi.hive.MultiPartKeysValueExtractor.
Example: Creating a single-level partitioned MOR table partitioned by create_time
df.write.format("org.apache.hudi").
  option("hoodie.datasource.write.table.type", MOR_TABLE_TYPE_OPT_VAL).
  option("hoodie.datasource.write.precombine.field", "update_time").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "create_time").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.table.name", tableName).
  option("hoodie.write.lock.provider", "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator").
  option("hoodie.datasource.hive_sync.enable", "true").
  option("hoodie.datasource.hive_sync.partition_fields", "create_time").
  option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor").
  option("hoodie.datasource.hive_sync.database", databaseName).
  option("hoodie.datasource.hive_sync.table", tableName).
  option("hoodie.datasource.hive_sync.use_jdbc", "false").
  mode(SaveMode.Overwrite).
  save(basePath)
- Non-partitioning
  hoodie.datasource.write.partitionpath.field: Set it to an empty string.
  hoodie.datasource.hive_sync.partition_fields: Set it to an empty string.
  hoodie.datasource.write.keygenerator.class: Set it to org.apache.hudi.keygen.NonpartitionedKeyGenerator.
  hoodie.datasource.hive_sync.partition_extractor_class: Set it to org.apache.hudi.hive.NonPartitionedExtractor.
Example: Creating a non-partitioned COW table
df.write.format("org.apache.hudi").
  option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
  option("hoodie.datasource.write.precombine.field", "update_time").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.table.name", tableName).
  option("hoodie.write.lock.provider", "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
  option("hoodie.datasource.hive_sync.enable", "true").
  option("hoodie.datasource.hive_sync.partition_fields", "").
  option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.NonPartitionedExtractor").
  option("hoodie.datasource.hive_sync.database", databaseName).
  option("hoodie.datasource.hive_sync.table", tableName).
  option("hoodie.datasource.hive_sync.use_jdbc", "false").
  mode(SaveMode.Overwrite).
  save(basePath)
- Time-date partitioning
  hoodie.datasource.write.partitionpath.field: Set it to a date field whose values follow the yyyy/mm/dd format.
  hoodie.datasource.hive_sync.partition_fields: Set it to the same value as hoodie.datasource.write.partitionpath.field.
  hoodie.datasource.write.keygenerator.class: Defaults to org.apache.hudi.keygen.SimpleKeyGenerator; you can also set it to org.apache.hudi.keygen.ComplexKeyGenerator.
  hoodie.datasource.hive_sync.partition_extractor_class: Set it to org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor. Note that SlashEncodedDayPartitionValueExtractor requires the partition values to use the yyyy/mm/dd format.
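Unlike the other partitioning modes above, time-date partitioning has no example here, so the following is a minimal sketch modeled on the earlier bulk_insert examples. The create_time field name is illustrative and not from the original text; it stands for a date column whose values follow the yyyy/mm/dd format.

df.write.format("org.apache.hudi").
  option("hoodie.datasource.write.table.type", COW_TABLE_TYPE_OPT_VAL).
  option("hoodie.datasource.write.precombine.field", "update_time").
  option("hoodie.datasource.write.recordkey.field", "id").
  // Illustrative date field; values must be formatted as yyyy/mm/dd.
  option("hoodie.datasource.write.partitionpath.field", "create_time").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.table.name", tableName).
  option("hoodie.write.lock.provider", "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.SimpleKeyGenerator").
  option("hoodie.datasource.hive_sync.enable", "true").
  option("hoodie.datasource.hive_sync.partition_fields", "create_time").
  // SlashEncodedDayPartitionValueExtractor requires yyyy/mm/dd partition values.
  option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor").
  option("hoodie.datasource.hive_sync.database", databaseName).
  option("hoodie.datasource.hive_sync.table", tableName).
  option("hoodie.datasource.hive_sync.use_jdbc", "false").
  mode(SaveMode.Overwrite).
  save(basePath)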