Typical Hudi Configuration Parameters

Updated on 2025-02-22 GMT+08:00

This section describes important Hudi configurations. For details, visit the Hudi official website at https://hudi.apache.org/cn/docs/0.11.0/configurations/.

  • To set Hudi parameters when submitting a DLI Spark SQL job, open the SQL Editor page, click Settings in the upper right corner, and set the parameters in the Parameter Settings area.
  • When submitting a DLI Spark Jar job, you can configure Hudi parameters through the options of the Spark datasource API, as shown in the sketch after this list.

    Alternatively, you can configure them in Spark Arguments (--conf) when submitting the job. Note that each key configured here must carry the prefix spark.hadoop., for example, spark.hadoop.hoodie.compact.inline=true.
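The following is a minimal sketch in Scala of both approaches, assuming a Spark Jar job; the OBS paths, table name, and column names (id, ts) are placeholders, not values defined by this document.

import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiConfigDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hudi-config-demo").getOrCreate()
    // Hypothetical source data; replace with your own input.
    val df = spark.read.json("obs://your-bucket/input/")

    // Hudi parameters passed as Spark datasource options.
    df.write.format("hudi").
      option("hoodie.datasource.write.table.name", "demo_table").
      option("hoodie.datasource.write.operation", "upsert").
      option("hoodie.datasource.write.recordkey.field", "id").
      option("hoodie.datasource.write.precombine.field", "ts").
      mode(SaveMode.Append).
      save("obs://your-bucket/hudi/demo_table/")
  }
}

The same parameter could instead be supplied in Spark Arguments when submitting the job, for example --conf spark.hadoop.hoodie.compact.inline=true, where the spark.hadoop. prefix is required as noted above.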

Write Configuration

Table 1 Write configuration parameters

hoodie.datasource.write.table.name
Name of the Hudi table to write to.
Default value: None

hoodie.datasource.write.operation
Operation type for writing to the Hudi table. The supported values are upsert, delete, insert, bulk_insert, insert_overwrite, and insert_overwrite_table.

  • upsert: updates and inserts data.
  • delete: deletes data.
  • insert: inserts data.
  • bulk_insert: imports data when a table is first loaded. Do not use upsert or insert for the initial load.
  • insert_overwrite: inserts into and overwrites static partitions.
  • insert_overwrite_table: inserts into and overwrites dynamic partitions. It does not immediately delete or overwrite the entire table. Instead, it logically overwrites the metadata of the Hudi table, and the obsolete data is later removed by the clean mechanism. This is more efficient than bulk_insert plus overwrite.

Default value: upsert

hoodie.datasource.write.table.type
Type of the Hudi table. Once specified, this parameter cannot be changed. Options: COPY_ON_WRITE and MERGE_ON_READ.
Default value: COPY_ON_WRITE

hoodie.datasource.write.precombine.field
Field used to merge and deduplicate rows with the same key before they are written.
Default value: A specific table field

hoodie.datasource.write.payload.class
Class used to merge the existing record and the incoming record during an update. You can implement a custom payload class to apply your own merge logic.
Default value: org.apache.hudi.common.model.DefaultHoodieRecordPayload

hoodie.datasource.write.recordkey.field
Unique primary key of the Hudi table.
Default value: A specific table field

hoodie.datasource.write.partitionpath.field
Partition key. This parameter can be used together with hoodie.datasource.write.keygenerator.class to meet different partitioning needs.
Default value: None

hoodie.datasource.write.hive_style_partitioning
Whether to use the same partition path style as Hive. Set it to true.
Default value: true

hoodie.datasource.write.keygenerator.class
Used together with hoodie.datasource.write.partitionpath.field and hoodie.datasource.write.recordkey.field to generate the primary key and partition path.

NOTE:
If the value of this parameter differs from the value saved in the table, a message is displayed indicating that the two values must be the same.

Default value: org.apache.hudi.keygen.ComplexKeyGenerator
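As a hedged illustration of how the key and partitioning parameters in Table 1 combine, the sketch below performs an initial bulk_insert load of a partitioned table; the table name and the column names id, ts, and dt are assumptions made for the example.

import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch of an initial load that sets the record key, precombine field, partition path,
// and key generator together. Paths and names are placeholders.
def initialLoad(df: DataFrame, basePath: String): Unit = {
  df.write.format("hudi").
    option("hoodie.datasource.write.table.name", "orders_hudi").
    option("hoodie.datasource.write.operation", "bulk_insert").         // initial load only
    option("hoodie.datasource.write.recordkey.field", "id").            // unique primary key
    option("hoodie.datasource.write.precombine.field", "ts").           // merge/deduplication field
    option("hoodie.datasource.write.partitionpath.field", "dt").        // partition column
    option("hoodie.datasource.write.hive_style_partitioning", "true").  // Hive-style paths such as dt=2025-01-01
    option("hoodie.datasource.write.keygenerator.class",
      "org.apache.hudi.keygen.ComplexKeyGenerator").
    mode(SaveMode.Append).
    save(basePath)
}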

Configuration of Hive Table Synchronization

The metadata service provided by DLI is a Hive Metastore service (HMS), so the following parameters control how Hudi tables are synchronized to that metadata service.

Table 2 Parameters for synchronizing Hive tables

hoodie.datasource.hive_sync.enable
Whether to synchronize Hudi tables to Hive. When the metadata service provided by DLI is used, enabling this parameter synchronizes the tables to DLI metadata.

CAUTION:
You are advised to set it to true so that the metadata service manages the Hudi tables.

Default value: false

hoodie.datasource.hive_sync.database
Name of the database to synchronize to Hive.
Default value: default

hoodie.datasource.hive_sync.table
Name of the table to synchronize to Hive. Set it to the value of hoodie.datasource.write.table.name.
Default value: unknown

hoodie.datasource.hive_sync.partition_fields
Hive partition columns.
Default value: ""

hoodie.datasource.hive_sync.partition_extractor_class
Class used to extract Hudi partition column values and convert them into Hive partition columns.
Default value: org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor

hoodie.datasource.hive_sync.support_timestamp
If the Hudi table contains a field of the timestamp type, set this parameter to true so that the timestamp type is synchronized to the Hive metadata. If it is set to false, the timestamp type is converted to bigint during synchronization, and SQL queries on a Hudi table that contains a timestamp field may then fail.
Default value: true

hoodie.datasource.hive_sync.username
Username used when synchronizing to Hive over JDBC.
Default value: hive

hoodie.datasource.hive_sync.password
Password used when synchronizing to Hive over JDBC.
Default value: hive

hoodie.datasource.hive_sync.jdbcurl
JDBC URL used to connect to Hive.
Default value: ""

hoodie.datasource.hive_sync.use_jdbc
Whether to use Hive JDBC to synchronize Hudi table information to Hive. You are advised to set this parameter to false; when it is false, the JDBC connection settings are ignored.
Default value: true
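The sketch below shows how the Hive synchronization parameters might be added on top of the basic write options; the database and table names are placeholders, and use_jdbc is set to false as advised above.

import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: write a Hudi table and synchronize it to the metadata service.
def writeWithHiveSync(df: DataFrame, basePath: String): Unit = {
  val hiveSyncOptions = Map(
    "hoodie.datasource.hive_sync.enable" -> "true",
    "hoodie.datasource.hive_sync.database" -> "demo_db",       // placeholder database
    "hoodie.datasource.hive_sync.table" -> "orders_hudi",      // keep equal to write.table.name
    "hoodie.datasource.hive_sync.partition_fields" -> "dt",
    "hoodie.datasource.hive_sync.support_timestamp" -> "true",
    "hoodie.datasource.hive_sync.use_jdbc" -> "false"
  )
  df.write.format("hudi").
    option("hoodie.datasource.write.table.name", "orders_hudi").
    option("hoodie.datasource.write.recordkey.field", "id").
    option("hoodie.datasource.write.precombine.field", "ts").
    option("hoodie.datasource.write.partitionpath.field", "dt").
    options(hiveSyncOptions).
    mode(SaveMode.Append).
    save(basePath)
}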

Index Configuration

Table 3 Index parameters

hoodie.index.class
Full path of a user-defined index class, which must be a subclass of HoodieIndex. When this parameter is specified, it takes precedence over hoodie.index.type.
Default value: ""

hoodie.index.type
Index type. Possible options are BLOOM, GLOBAL_BLOOM, SIMPLE, and GLOBAL_SIMPLE. The BLOOM index removes the dependency on an external system: the Bloom filter is stored in the footer of each Parquet data file.
Default value: BLOOM

hoodie.index.bloom.num_entries
Number of entries stored in the Bloom filter. Assuming the maximum Parquet file size is 128 MB and the average record size is 1,024 bytes, a file holds roughly 130,000 records. The default value (60000) is about half of that approximation.

CAUTION:
Setting this value too low produces many false positives, and the index lookup has to scan more files than necessary. Setting it too high linearly increases the size of each data file (approximately 4 KB for every 50,000 entries).

Default value: 60000

hoodie.index.bloom.fpp
Allowed false positive rate for the configured number of entries. It is used to calculate the number of bits allocated to the Bloom filter and the number of hash functions. This value is typically set very low to trade disk space for a lower false positive rate.
Default value: 0.000000001

hoodie.bloom.index.parallelism
Parallelism of the index lookup, which involves a Spark shuffle. By default, it is calculated automatically based on the input workload characteristics.
Default value: 0

hoodie.bloom.index.prune.by.ranges
When set to true, file range information is used to speed up index lookups. This is particularly useful if the keys have a monotonically increasing prefix, such as a timestamp.
Default value: true

hoodie.bloom.index.use.caching
When set to true, the input RDD is cached to speed up index lookups by reducing the I/O needed to compute parallelism or the affected partitions.
Default value: true

hoodie.bloom.index.use.treebased.filter
When set to true, tree-based file filtering is enabled. Compared with brute-force filtering, this mode speeds up file filtering based on key ranges.
Default value: true

hoodie.bloom.index.bucketized.checking
When set to true, bucketized Bloom filtering is enabled. This reduces the skew seen in sort-based Bloom index lookups.
Default value: true

hoodie.bloom.index.keys.per.bucket
This parameter takes effect only when bloomIndexBucketizedChecking is enabled and the index type is BLOOM.

It controls the size of a "bucket", that is, the number of record key checks performed against a single file, and serves as the unit of work assigned to each partition that executes Bloom filter lookups. Higher values amortize the fixed cost of reading the Bloom filter into memory.

Default value: 10000000

hoodie.bloom.index.update.partition.path
This parameter applies only when the index type is GLOBAL_BLOOM.

When set to true, updating a record whose partition path has changed inserts the new record into the new partition and deletes the original record from the old partition. When set to false, the original record in the old partition is updated instead.

Default value: true
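A minimal sketch of the index parameters, assuming they are passed together with the other write options; GLOBAL_BLOOM here is only an example choice, not a recommendation from this document.

// Index-related options to merge into the write options.
val indexOptions = Map(
  "hoodie.index.type" -> "GLOBAL_BLOOM",
  "hoodie.bloom.index.update.partition.path" -> "true",  // move records whose partition changed
  "hoodie.index.bloom.num_entries" -> "60000",
  "hoodie.index.bloom.fpp" -> "0.000000001"
)
// For example: df.write.format("hudi").options(writeOptions ++ indexOptions). ... .save(basePath)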

Storage Configuration

Table 4 Storage parameter configuration

hoodie.parquet.max.file.size
Target size of the Parquet files produced during a Hudi write phase. For DFS, this should align with the underlying file system block size for optimal performance.
Default value: 120 * 1024 * 1024 bytes

hoodie.parquet.block.size
Parquet page size. A page is the unit of read within a Parquet file, and pages within a block are compressed separately.
Default value: 120 * 1024 * 1024 bytes

hoodie.parquet.compression.ratio
Compression ratio expected for Parquet data when Hudi sizes new Parquet files. Increase this value if the files generated by bulk_insert are smaller than expected.
Default value: 0.1

hoodie.parquet.compression.codec
Parquet compression codec. Possible options are gzip, snappy, uncompressed, and lzo.
Default value: snappy

hoodie.logfile.max.size
Maximum size of a log file. This is the maximum size allowed before rolling over to a new version of the log file.
Default value: 1 GB

hoodie.logfile.data.block.max.size
Maximum size of a log file data block. This is the maximum size of a single data block appended to a log file. It ensures that data appended to the log file is broken down into manageable blocks to prevent OOM errors. This size should be smaller than the available JVM memory.
Default value: 256 MB

hoodie.logfile.to.parquet.compression.ratio
Compression ratio expected as records move from log files to Parquet files. Used for MOR storage to control the size of the compacted Parquet files.
Default value: 0.35
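A hedged sketch of the storage parameters expressed as write options; the values simply restate the defaults in Table 4 and are not tuning advice.

// Storage sizing options to merge into the write options.
val storageOptions = Map(
  "hoodie.parquet.max.file.size" -> String.valueOf(120 * 1024 * 1024),       // target Parquet file size
  "hoodie.parquet.compression.codec" -> "snappy",
  "hoodie.logfile.max.size" -> String.valueOf(1024L * 1024 * 1024),          // 1 GB log file rollover
  "hoodie.logfile.data.block.max.size" -> String.valueOf(256 * 1024 * 1024)  // 256 MB log data block
)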

Compaction and Cleaning Configuration

Table 5 Compaction and cleaning parameters

hoodie.clean.automatic
Whether to perform automatic cleaning.
Default value: true

hoodie.cleaner.policy
Cleaning policy to use. Hudi removes old versions of Parquet files to reclaim space; any query or computation that still references a removed version will fail. Ensure that the data retention period exceeds the maximum query execution time.
Default value: KEEP_LATEST_COMMITS

hoodie.cleaner.commits.retained
Number of commits to retain. Data is retained for num_of_commits * time_between_commits (scheduled), which also determines how far back incremental pulls on this dataset can go.
Default value: 10

hoodie.keep.max.commits
Number of commits that triggers archiving.
Default value: 30

hoodie.keep.min.commits
Number of commits to retain when archiving.
Default value: 20

hoodie.commits.archival.batch
Number of commit instants read and archived together in one batch.
Default value: 10

hoodie.parquet.small.file.limit
Must be less than the value of maxFileSize. If set to 0, this function is disabled. Batch processing inserts large numbers of records into partitions, so small files always appear; Hudi addresses this by treating inserts into a partition as updates to existing small files. Files below this size are considered small files.
Default value: 104857600 bytes

hoodie.copyonwrite.insert.split.size
Insert write parallelism, that is, the number of inserts grouped for a single partition. Writing 100 MB files with records of at least 1 KB each means about 100,000 records per file; the default is over-provisioned to 500,000. To improve insert latency, adjust this value to match the number of records in a single file. A smaller value produces smaller files (especially when compactionSmallFileSize is 0).
Default value: 500000

hoodie.copyonwrite.insert.auto.split
Whether Hudi dynamically calculates insertSplitSize based on the metadata of the last 24 commits.
Default value: true

hoodie.copyonwrite.record.size.estimate
Average record size. If specified, Hudi uses this value instead of calculating it dynamically from the metadata of the last 24 commits. It is crucial for computing insert parallelism and packing inserts into small files.
Default value: 1024

hoodie.compact.inline
When set to true, compaction is triggered by the ingestion itself, immediately after an insert, upsert, bulk insert, or incremental commit operation.
Default value: true

hoodie.compact.inline.max.delta.commits
Maximum number of delta commits that can accumulate before inline compaction is triggered.
Default value: 5

hoodie.compaction.lazy.block.read
Controls whether CompactedLogScanner delays reading log blocks when merging all log files. Set it to true for I/O-intensive lazy block reading (low memory usage), or false for memory-intensive eager block reading (high memory usage).
Default value: true

hoodie.compaction.reverse.log.read
HoodieLogFormatReader reads log files forward, from pos=0 to pos=file_length. If this parameter is set to true, it reads log files backward, from pos=file_length to pos=0.
Default value: false

hoodie.cleaner.parallelism
Parallelism of the cleaning operation. Increase this value if cleaning is slow.
Default value: 200

hoodie.compaction.strategy
Strategy that determines which file groups to compact in each compaction run. By default, Hudi selects the log files with the most accumulated unmerged data.
Default value: org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy

hoodie.compaction.target.io
Amount of I/O, in MB, to spend in a compaction run of LogFileSizeBasedCompactionStrategy. This value helps bound ingestion latency when compaction runs in inline mode.
Default value: 500 * 1024 MB

hoodie.compaction.daybased.target.partitions
Used by org.apache.hudi.io.compact.strategy.DayBasedCompactionStrategy. Number of latest partitions to compact in a compaction run.
Default value: 10

hoodie.compaction.payload.class
Must be the same class used during insert/upsert operations. Like the write path, compaction uses the record payload class to merge log records with each other and then with the base file, producing the final records written after compaction.
Default value: org.apache.hudi.common.model.DefaultHoodieRecordPayload

hoodie.schedule.compact.only.inline
Whether write operations only generate a compaction plan without executing it. This parameter is valid only when hoodie.compact.inline is true.
Default value: false

hoodie.run.compact.only.inline
Whether only compaction is performed when the run compaction command is executed through SQL. If no compaction plan exists, the command exits directly.
Default value: false
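A minimal sketch of inline compaction and automatic cleaning options, assuming they are merged into the write options of a MERGE_ON_READ table; the values echo the defaults in Table 5.

// Compaction and cleaning options to merge into the write options.
val compactionOptions = Map(
  "hoodie.compact.inline" -> "true",                  // compact as part of ingestion
  "hoodie.compact.inline.max.delta.commits" -> "5",   // trigger after 5 delta commits
  "hoodie.clean.automatic" -> "true",
  "hoodie.cleaner.policy" -> "KEEP_LATEST_COMMITS",
  "hoodie.cleaner.commits.retained" -> "10"
)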

Single-Table Concurrency Control

Table 6 Single-table concurrency control configuration

hoodie.write.lock.provider
Lock provider. When metadata is managed by DLI, you are advised to set it to com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider.
Default value: Spark SQL and Flink SQL jobs switch to the corresponding implementation class based on the metadata service. When metadata is managed by DLI, use com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider.

hoodie.write.lock.hivemetastore.database
Database in the HMS service.
Default value: None

hoodie.write.lock.hivemetastore.table
Table name in the HMS service.
Default value: None

hoodie.write.lock.client.num_retries
Number of retries.
Default value: 10

hoodie.write.lock.client.wait_time_ms_between_retry
Retry interval, in milliseconds.
Default value: 10000

hoodie.write.lock.conflict.resolution.strategy
Conflict resolution strategy class, which must be a subclass of ConflictResolutionStrategy.
Default value: org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy
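A hedged sketch of the lock-related options for single-table concurrency control when metadata is managed by DLI; the database and table names are placeholders.

// Lock options to merge into the write options when concurrent writers are expected.
val lockOptions = Map(
  "hoodie.write.lock.provider" -> "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider",
  "hoodie.write.lock.hivemetastore.database" -> "demo_db",    // placeholder database
  "hoodie.write.lock.hivemetastore.table" -> "orders_hudi",   // placeholder table
  "hoodie.write.lock.client.num_retries" -> "10",
  "hoodie.write.lock.client.wait_time_ms_between_retry" -> "10000"
)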

Clustering Configuration

NOTE:

Clustering involves two strategy parameters: hoodie.clustering.plan.strategy.class and hoodie.clustering.execution.strategy.class. Typically, when hoodie.clustering.plan.strategy.class is set to SparkRecentDaysClusteringPlanStrategy or SparkSizeBasedClusteringPlanStrategy, there is no need to specify hoodie.clustering.execution.strategy.class. However, when hoodie.clustering.plan.strategy.class is set to SparkSingleFileSortPlanStrategy, set hoodie.clustering.execution.strategy.class to SparkSingleFileSortExecutionStrategy.

Table 7 Clustering parameters

hoodie.clustering.inline
Whether to execute clustering synchronously.
Default value: false

hoodie.clustering.inline.max.commits
Number of commits that triggers clustering.
Default value: 4

hoodie.clustering.async.enabled
Whether to enable asynchronous clustering.
Default value: false

hoodie.clustering.async.max.commits
Number of commits that triggers asynchronous clustering.
Default value: 4

hoodie.clustering.plan.strategy.target.file.max.bytes
Maximum file size after clustering.
Default value: 1024 * 1024 * 1024 bytes

hoodie.clustering.plan.strategy.small.file.limit
Files smaller than this size are clustered.
Default value: 300 * 1024 * 1024 bytes

hoodie.clustering.plan.strategy.sort.columns
Columns used for sorting during clustering.
Default value: None

hoodie.layout.optimize.strategy
Data layout strategy used when clustering is executed. The options are linear, z-order, and hilbert.
Default value: linear

hoodie.layout.optimize.enable
Set this parameter to true when z-order or hilbert is used.
Default value: false

hoodie.clustering.plan.strategy.class
Strategy class for selecting the file groups to cluster. By default, files smaller than the value of hoodie.clustering.plan.strategy.small.file.limit are selected.
Default value: org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy

hoodie.clustering.execution.strategy.class
Strategy class for executing clustering (a subclass of RunClusteringStrategy), which defines how a clustering plan is executed. The default class sorts the file groups in the plan by the specified columns while meeting the target file size configuration.
Default value: org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy

hoodie.clustering.plan.strategy.max.num.groups
Maximum number of file groups selected for a clustering operation. A larger value allows higher concurrency.
Default value: 30

hoodie.clustering.plan.strategy.max.bytes.per.group
Maximum amount of data per file group that participates in a clustering operation.
Default value: 2 * 1024 * 1024 * 1024 bytes
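A minimal sketch of inline clustering options, assuming they are merged into the write options; the sort columns are placeholders and the size limits echo the defaults in Table 7.

// Clustering options to merge into the write options.
val clusteringOptions = Map(
  "hoodie.clustering.inline" -> "true",                    // cluster synchronously with writes
  "hoodie.clustering.inline.max.commits" -> "4",
  "hoodie.clustering.plan.strategy.small.file.limit" -> String.valueOf(300 * 1024 * 1024),
  "hoodie.clustering.plan.strategy.target.file.max.bytes" -> String.valueOf(1024L * 1024 * 1024),
  "hoodie.clustering.plan.strategy.sort.columns" -> "dt,id"  // placeholder sort columns
)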
