Typical Hudi Configuration Parameters
This section describes important Hudi configurations. For details, visit the Hudi official website at https://hudi.apache.org/cn/docs/0.11.0/configurations/.
- To set Hudi parameters while submitting a DLI Spark SQL job, access the SQL Editor page and click Settings in the upper right corner. In the Parameter Settings area, set the parameters.
- When submitting DLI Spark jar jobs, Hudi parameters can be configured through the Spark datasource API's options.
   Alternatively, configure them in Spark Arguments (--conf) when submitting the job. In this case, each key must carry the prefix spark.hadoop., for example, spark.hadoop.hoodie.compact.inline=true.
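
The prefix rule for Spark Arguments can be sketched as a small helper. This is only an illustration: the function name to_spark_conf_args is hypothetical, and the single option shown is the one used as an example above.

```python
def to_spark_conf_args(hudi_options):
    """Build --conf arguments, adding the spark.hadoop. prefix to each Hudi key."""
    args = []
    for key, value in hudi_options.items():
        args.extend(["--conf", f"spark.hadoop.{key}={value}"])
    return args


# Hypothetical helper applied to the option from the text above.
conf_args = to_spark_conf_args({"hoodie.compact.inline": "true"})
print(" ".join(conf_args))
```

The resulting list can be pasted into the Spark Arguments field or appended to a submit command.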
Write Configuration
| Parameter | Description | Default Value | 
|---|---|---|
| hoodie.datasource.write.table.name | Name of the Hudi table to write to | None | 
| hoodie.datasource.write.operation | Operation type for writing to the Hudi table. Currently, upsert, delete, insert, and bulk_insert are supported. | upsert | 
| hoodie.datasource.write.table.type | Type of Hudi table. Once specified, this parameter cannot be modified later. Options: COPY_ON_WRITE and MERGE_ON_READ. | COPY_ON_WRITE | 
| hoodie.datasource.write.precombine.field | Field used to merge and deduplicate rows with the same key before writing. | A specific table field | 
| hoodie.datasource.write.payload.class | Class used to merge the existing records with the incoming records during an update. You can supply a custom class that implements your own merge logic. | org.apache.hudi.common.model.DefaultHoodieRecordPayload | 
| hoodie.datasource.write.recordkey.field | Unique primary key for the Hudi table | A specific table field | 
| hoodie.datasource.write.partitionpath.field | Partition key. This parameter can be used together with hoodie.datasource.write.keygenerator.class to meet different partition needs. | None | 
| hoodie.datasource.write.hive_style_partitioning | Whether to specify a partition mode that is the same as that of Hive. Set it to true. | true | 
| hoodie.datasource.write.keygenerator.class | Used with hoodie.datasource.write.partitionpath.field and hoodie.datasource.write.recordkey.field to generate the primary key and partition mode. NOTE: If the value of this parameter is different from that saved in the table, a message is displayed, indicating that the value must be the same. | org.apache.hudi.keygen.ComplexKeyGenerator | 
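
As a rough sketch of how the write parameters above fit together in a Spark jar job, the options can be collected in a dictionary and passed to the Spark datasource API. The table name, column names, and the write_hudi helper are hypothetical; the write call assumes a PySpark DataFrame and a Spark session with Hudi available.

```python
# Hypothetical table and column names, for illustration only.
write_options = {
    "hoodie.datasource.write.table.name": "orders_hudi",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
}


def write_hudi(df, path, options=write_options):
    """Write a Spark DataFrame to a Hudi table at the given path.

    Assumes df is a PySpark DataFrame; not executed in this sketch.
    """
    df.write.format("hudi").options(**options).mode("append").save(path)
```

Keeping the options in one dictionary makes it easier to reuse the same record key, precombine field, and key generator on every write to the table.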
Configuration of Hive Table Synchronization
The metadata service provided by DLI is a Hive Metastore service (HMS), so the following parameters are related to synchronizing the metadata service.
| Parameter | Description | Default Value | 
|---|---|---|
| hoodie.datasource.hive_sync.enable | Whether to synchronize Hudi tables to Hive. When using the metadata service provided by DLI, configuring this parameter means synchronizing to the metadata of DLI. CAUTION: You are advised to set it to true to use the metadata service to manage Hudi tables. | false | 
| hoodie.datasource.hive_sync.database | Name of the database to be synchronized to Hive | default | 
| hoodie.datasource.hive_sync.table | Name of the table to be synchronized to Hive. Set it to the value of hoodie.datasource.write.table.name. | unknown | 
| hoodie.datasource.hive_sync.partition_fields | Hive partition columns | "" | 
| hoodie.datasource.hive_sync.partition_extractor_class | Class used to extract Hudi partition column values and convert them into Hive partition columns. | org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor | 
| hoodie.datasource.hive_sync.support_timestamp | If the Hudi table contains a field of the timestamp type, set this parameter to true to synchronize the timestamp type to the Hive metadata. When set to false, the timestamp type is converted to bigint during synchronization, and an error may occur when you query a Hudi table that contains a field of the timestamp type using SQL statements. | true | 
| hoodie.datasource.hive_sync.username | Username specified when synchronizing Hive using JDBC | hive | 
| hoodie.datasource.hive_sync.password | Password specified when synchronizing Hive using JDBC | hive | 
| hoodie.datasource.hive_sync.jdbcurl | JDBC URL specified for connecting to Hive | "" | 
| hoodie.datasource.hive_sync.use_jdbc | Whether to use Hive JDBC to synchronize Hudi table information to Hive. You are advised to set this parameter to false. When set to false, the JDBC connection-related parameters (username, password, and jdbcurl) do not take effect. | true | 
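
The recommendations in this table (enable synchronization, reuse the write table name, and disable JDBC) might be combined as follows. The helper name hive_sync_options and the sample table and partition names are hypothetical.

```python
def hive_sync_options(table_name, database="default", partition_fields=""):
    """Collect the sync settings recommended in the table above."""
    return {
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": database,
        # Same value as hoodie.datasource.write.table.name.
        "hoodie.datasource.hive_sync.table": table_name,
        "hoodie.datasource.hive_sync.partition_fields": partition_fields,
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.support_timestamp": "true",
    }


# Hypothetical table and partition column names.
sync_options = hive_sync_options("orders_hudi", partition_fields="order_date")
```

These entries can be merged into the same options dictionary that carries the write parameters.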
Index Configuration
| Parameter | Description | Default Value | 
|---|---|---|
| hoodie.index.class | Full path of a user-defined index, which must be a subclass of HoodieIndex. When this parameter is specified, the configuration takes precedence over that of hoodie.index.type. | "" | 
| hoodie.index.type | Index type. The default value is BLOOM. Possible options are BLOOM, GLOBAL_BLOOM, SIMPLE, and GLOBAL_SIMPLE. The Bloom filter eliminates the dependency on an external system and is stored in the footer of a Parquet data file. | BLOOM | 
| hoodie.index.bloom.num_entries | Number of entries stored in the Bloom filter. Assuming maxParquetFileSize is 128 MB and averageRecordSize is 1024 bytes, the total number of records in a file is about 130,000. The default value (60000) is about half of this approximation. CAUTION: Setting this value too low will result in many false positives, and the index lookup will need to scan more files than necessary. Setting it too high will linearly increase the size of each data file (approximately 4 KB for every 50,000 entries). | 60000 | 
| hoodie.index.bloom.fpp | The allowed error rate based on the number of entries. Used to calculate the number of bits to allocate for the Bloom filter and the number of hash functions. This value is typically set very low (default: 0.000000001) to trade off disk space for a lower false positive rate. | 0.000000001 | 
| hoodie.bloom.index.parallelism | Parallelism for index lookup involving Spark Shuffle. By default, it is automatically calculated based on input workload characteristics. | 0 | 
| hoodie.bloom.index.prune.by.ranges | When set to true, file range information can speed up index lookups. This is particularly useful if the keys have a monotonically increasing prefix, such as timestamps. | true | 
| hoodie.bloom.index.use.caching | When set to true, the input RDD is cached to speed up index lookups by reducing IO required for calculating parallelism or affected partitions. | true | 
| hoodie.bloom.index.use.treebased.filter | When set to true, tree-based file filter optimization is enabled. Compared to brute force, this mode speeds up file filtering based on key ranges. | true | 
| hoodie.bloom.index.bucketized.checking | When set to true, bucketized Bloom filtering is enabled. This reduces bias seen in sort-based Bloom index lookups. | true | 
| hoodie.bloom.index.keys.per.bucket | This parameter is available only when hoodie.bloom.index.bucketized.checking is enabled and the index type is BLOOM. This configuration controls the size of the "bucket", which tracks the number of record key checks performed on a single file and serves as the work unit assigned to each partition executing Bloom filter lookups. Higher values will amortize the fixed cost of reading the Bloom filter into memory. | 10000000 | 
| hoodie.bloom.index.update.partition.path | This parameter is applicable only when the index type is GLOBAL_BLOOM. When set to true, updating a record that includes a partition path will insert the new record into the new partition and delete the original record from the old partition. When set to false, only the original record in the old partition is updated. | true | 
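
The sizing estimate given for hoodie.index.bloom.num_entries can be reproduced with a few lines of arithmetic. The figures are the ones stated in the table; the per-file overhead uses the table's rough rule of 4 KB per 50,000 entries and is only an approximation.

```python
# Figures taken from the hoodie.index.bloom.num_entries description.
max_parquet_file_size = 128 * 1024 * 1024  # bytes
average_record_size = 1024                 # bytes

# Estimated number of records in one full Parquet file.
records_per_file = max_parquet_file_size // average_record_size

default_num_entries = 60000
ratio = default_num_entries / records_per_file

# Approximate Bloom filter overhead per data file (4 KB per 50,000 entries).
overhead_kb = default_num_entries / 50000 * 4

print(records_per_file, round(ratio, 2), round(overhead_kb, 1))
```

If the tables hold much smaller or larger records, repeating this calculation with the actual average record size gives a starting point for tuning the parameter.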
Storage Configuration
| Parameter | Description | Default Value | 
|---|---|---|
| hoodie.parquet.max.file.size | Target size of the Parquet files generated during the Hudi write phase. For DFS, this should align with the underlying file system block size for optimal performance. | 120 * 1024 * 1024 bytes | 
| hoodie.parquet.block.size | Parquet block (row group) size. Pages, which are the read unit in a Parquet file, are compressed separately within a block. | 120 * 1024 * 1024 bytes | 
| hoodie.parquet.compression.ratio | Expected compression ratio for Parquet data when Hudi tries to size new parquet files. Increase this value if the files generated by bulk_insert are smaller than expected. | 0.1 | 
| hoodie.parquet.compression.codec | Name of the Parquet compression codec. Possible options are gzip, snappy, uncompressed, and lzo. | snappy | 
| hoodie.logfile.max.size | Maximum size of the LogFile. This is the maximum size allowed before rolling over to a new version of the log file. | 1 GB | 
| hoodie.logfile.data.block.max.size | Maximum size of the LogFile data block. This is the maximum size of a single data block appended to the log file. It helps ensure that data appended to the log file is broken down into manageable blocks to prevent OOM errors. This size should be less than the JVM memory. | 256 MB | 
| hoodie.logfile.to.parquet.compression.ratio | Expected compression ratio as records move from log files to Parquet. Used in MOR storage to control the size of compressed Parquet files. | 0.35 | 
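
The size defaults in this table are written as expressions or with units. A minimal sketch of converting them to plain byte counts is shown below; it assumes the size parameters are passed as byte values, which is not stated in the table.

```python
MB = 1024 * 1024

# Defaults from the table above, expressed in bytes (assumed unit).
storage_options = {
    "hoodie.parquet.max.file.size": str(120 * MB),
    "hoodie.parquet.block.size": str(120 * MB),
    "hoodie.parquet.compression.codec": "snappy",
    "hoodie.logfile.max.size": str(1024 * MB),            # 1 GB
    "hoodie.logfile.data.block.max.size": str(256 * MB),  # 256 MB
}

for key, value in storage_options.items():
    print(f"{key}={value}")
```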
Compaction and Cleaning Configuration
| Parameter | Description | Default Value | 
|---|---|---|
| hoodie.clean.automatic | Whether to perform automatic cleaning | true | 
| hoodie.cleaner.policy | Cleaning policy to use. Hudi will remove old versions of Parquet files to reclaim space. Any query or computation referencing this version will fail. Ensure data retention exceeds the maximum query execution time. | KEEP_LATEST_COMMITS | 
| hoodie.cleaner.commits.retained | Number of commits to retain. Data will be retained for num_of_commits * time_between_commits (planned), which directly translates to the number of incremental pulls on this dataset. | 10 | 
| hoodie.keep.max.commits | Threshold for the number of commits to trigger archival. | 30 | 
| hoodie.keep.min.commits | Number of commits to retain for archival | 20 | 
| hoodie.commits.archival.batch | Controls the number of commit instants to read and archive together in a batch. | 10 | 
| hoodie.parquet.small.file.limit | File size below which a file is considered a small file. Because batch processing inserts a large number of records into partitions, small files will always appear. Hudi addresses the small file problem by treating inserts into a partition as updates to existing small files. This value should be less than hoodie.parquet.max.file.size. If set to 0, this function is disabled. | 104857600 bytes | 
| hoodie.copyonwrite.insert.split.size | Parallelism for insert writes. Total number of inserts for a single partition. Writing out 100 MB files, with records at least 1 KB in size, means about 100,000 records per file. The default is over-provisioned to 500,000. Adjust this to match the number of records in a single file to improve insert latency. Setting this value smaller results in smaller files (especially if hoodie.parquet.small.file.limit is 0). | 500000 | 
| hoodie.copyonwrite.insert.auto.split | Whether Hudi should dynamically calculate the insert split size (hoodie.copyonwrite.insert.split.size) based on the last 24 commits' metadata. | true | 
| hoodie.copyonwrite.record.size.estimate | Average record size, in bytes. If specified, Hudi will use it instead of dynamically calculating the size based on the last 24 commits' metadata. Crucial for calculating insert parallelism and packing inserts into small files. | 1024 | 
| hoodie.compact.inline | When set to true, compaction is triggered by the ingestion itself immediately after the insert, upsert, bulk insert, or incremental commit operations. | true | 
| hoodie.compact.inline.max.delta.commits | Maximum number of delta commits to retain before triggering inline compaction. | 5 | 
| hoodie.compaction.lazy.block.read | Helps choose whether to delay reading log blocks when CompactedLogScanner merges all log files. Set it to true for I/O-intensive delayed block reading (low memory usage), or false for memory-intensive immediate block reading (high memory usage). | true | 
| hoodie.compaction.reverse.log.read | HoodieLogFormatReader reads log files forward from pos=0 to pos=file_length. If set to true, the reader reads log files backward from pos=file_length to pos=0. | false | 
| hoodie.cleaner.parallelism | Increase this value if cleaning is slow. | 200 | 
| hoodie.compaction.strategy | Strategy to determine which file groups to compact during each compaction run. By default, Hudi selects log files with the most unmerged data accumulated. | org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy | 
| hoodie.compaction.target.io | Amount of MB to spend during the compaction run in LogFileSizeBasedCompactionStrategy. This value helps limit ingestion delays when compaction runs in inline mode. | 500 * 1024 MB | 
| hoodie.compaction.daybased.target.partitions | Used by org.apache.hudi.io.compact.strategy.DayBasedCompactionStrategy, representing the latest number of partitions to compact during the compaction run. | 10 | 
| hoodie.compaction.payload.class | Needs to be the same class used during insert/upsert operations. Like writing, compaction uses the record payload class to merge records from the logs with each other, then with the base file again, and generate the final records to be written post-compaction. | org.apache.hudi.common.model.DefaultHoodieRecordPayload | 
| hoodie.schedule.compact.only.inline | Whether to only generate a compaction plan during write operations. Valid when hoodie.compact.inline is true. | false | 
| hoodie.run.compact.only.inline | Whether to only perform compaction operations when executing the run compaction command through SQL. If the compaction plan does not exist, it exits directly. | false | 
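
Two of the estimates in this table lend themselves to a quick calculation: the retention window implied by hoodie.cleaner.commits.retained, and the records-per-file figure behind hoodie.copyonwrite.insert.split.size. In the sketch below, the 30-minute commit interval and the 2-hour query time are hypothetical inputs, and the function names are illustrative.

```python
from datetime import timedelta


def retention_window(commits_retained, time_between_commits):
    """Retention = num_of_commits * time_between_commits, as described above."""
    return commits_retained * time_between_commits


def query_is_safe(max_query_time, window):
    """Data retention should exceed the maximum query execution time."""
    return max_query_time < window


# Default of 10 retained commits with a hypothetical 30-minute commit interval.
window = retention_window(10, timedelta(minutes=30))
safe = query_is_safe(timedelta(hours=2), window)

# 100 MB files with records of 1 KB each.
records_per_file = (100 * 1024 * 1024) // 1024

print(window, safe, records_per_file)
```

If the longest query can run longer than the computed window, increasing hoodie.cleaner.commits.retained is one way to widen it.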
Single-Table Concurrency Control
| Parameter | Description | Default Value | 
|---|---|---|
| hoodie.write.lock.provider | Lock provider. When metadata is hosted by DLI, the recommended value is com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider. | Spark SQL and Flink SQL jobs switch to the corresponding implementation class based on the metadata service. For scenarios where metadata is managed by DLI, use com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider. | 
| hoodie.write.lock.hivemetastore.database | Database in the HMS service | None | 
| hoodie.write.lock.hivemetastore.table | Table name in the HMS service | None | 
| hoodie.write.lock.client.num_retries | Number of retries | 10 | 
| hoodie.write.lock.client.wait_time_ms_between_retry | Retry interval, in milliseconds | 10000 | 
| hoodie.write.lock.conflict.resolution.strategy | Conflict resolution strategy class, which must be a subclass of ConflictResolutionStrategy. | org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy | 
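
A possible set of lock options for a table whose metadata is managed by DLI is sketched below, together with a rough upper bound on the time spent waiting between retries. The database and table names are hypothetical, and the bound assumes a fixed wait between retries and ignores the time taken by each attempt.

```python
lock_options = {
    "hoodie.write.lock.provider":
        "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider",
    "hoodie.write.lock.hivemetastore.database": "default",   # hypothetical
    "hoodie.write.lock.hivemetastore.table": "orders_hudi",  # hypothetical
    "hoodie.write.lock.client.num_retries": "10",
    "hoodie.write.lock.client.wait_time_ms_between_retry": "10000",
}

# Rough upper bound on waiting time: retries * wait interval.
max_wait_ms = (
    int(lock_options["hoodie.write.lock.client.num_retries"])
    * int(lock_options["hoodie.write.lock.client.wait_time_ms_between_retry"])
)
print(max_wait_ms)
```

With the defaults, a writer may wait on the order of 100 seconds before giving up on the lock.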
Clustering Configuration
 
 
Clustering uses two strategy parameters: hoodie.clustering.plan.strategy.class and hoodie.clustering.execution.strategy.class. Typically, when hoodie.clustering.plan.strategy.class is set to SparkRecentDaysClusteringPlanStrategy or SparkSizeBasedClusteringPlanStrategy, there is no need to specify hoodie.clustering.execution.strategy.class. However, when hoodie.clustering.plan.strategy.class is SparkSingleFileSortPlanStrategy, hoodie.clustering.execution.strategy.class should be set to SparkSingleFileSortExecutionStrategy.
| Parameter | Description | Default Value | 
|---|---|---|
| hoodie.clustering.inline | Whether to execute clustering synchronously | false | 
| hoodie.clustering.inline.max.commits | Number of commits to trigger clustering | 4 | 
| hoodie.clustering.async.enabled | Whether to enable asynchronous clustering | false | 
| hoodie.clustering.async.max.commits | Number of commits to trigger asynchronous clustering | 4 | 
| hoodie.clustering.plan.strategy.target.file.max.bytes | Maximum file size after clustering | 1024 * 1024 * 1024 bytes | 
| hoodie.clustering.plan.strategy.small.file.limit | Files smaller than this size will be clustered | 300 * 1024 * 1024 bytes | 
| hoodie.clustering.plan.strategy.sort.columns | Columns used for sorting in clustering | None | 
| hoodie.layout.optimize.strategy | Clustering execution strategy. The options are linear, z-order, and hilbert. | linear | 
| hoodie.layout.optimize.enable | Set this parameter to true when z-order or hilbert is used. | false | 
| hoodie.clustering.plan.strategy.class | Strategy class for filtering file groups for clustering. By default, files smaller than the value of hoodie.clustering.plan.strategy.small.file.limit are filtered. | org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy | 
| hoodie.clustering.execution.strategy.class | Strategy class for executing clustering (subclass of RunClusteringStrategy), which defines how to execute a clustering plan. The default class sorts the file groups in the plan by specified columns while meeting the target file size configuration. | org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy | 
| hoodie.clustering.plan.strategy.max.num.groups | Maximum number of FileGroups to select for clustering at execution. Higher values increase concurrency. | 30 | 
| hoodie.clustering.plan.strategy.max.bytes.per.group | Maximum data per FileGroup to participate in clustering at execution | 2 * 1024 * 1024 * 1024 bytes | 
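
The pairing rule between the plan strategy and the execution strategy can be expressed as a small helper. This is a sketch only: the function name is hypothetical, and the class names are the short names used in this section, whereas an actual configuration would likely need fully qualified class names, which are not all listed here.

```python
def clustering_strategy_options(plan_strategy):
    """Return clustering strategy options for the given plan strategy.

    The execution strategy is set explicitly only for the single-file sort
    plan; for other plan strategies the default is left in place.
    """
    options = {"hoodie.clustering.plan.strategy.class": plan_strategy}
    if plan_strategy.endswith("SparkSingleFileSortPlanStrategy"):
        options["hoodie.clustering.execution.strategy.class"] = (
            "SparkSingleFileSortExecutionStrategy"
        )
    return options


size_based = clustering_strategy_options("SparkSizeBasedClusteringPlanStrategy")
single_file = clustering_strategy_options("SparkSingleFileSortPlanStrategy")
print(size_based)
print(single_file)
```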