Typical Hudi Configuration Parameters
This section describes important Hudi configurations. For details, visit the Hudi official website at https://hudi.apache.org/docs/configurations/.
Write Configuration
| Parameter | Description | Default Value |
|---|---|---|
| hoodie.datasource.write.table.name | Name of the Hudi table to which data is written. | None |
| hoodie.datasource.write.operation | Type of the operation for writing data to the Hudi table. Common options include upsert, insert, and bulk_insert. | upsert |
| hoodie.datasource.write.table.type | Type of the Hudi table. The value can be COPY_ON_WRITE or MERGE_ON_READ. This parameter cannot be modified once specified. | COPY_ON_WRITE |
| hoodie.datasource.write.precombine.field | Field used to merge and deduplicate rows with the same key before the write. | A specific table field |
| hoodie.datasource.write.payload.class | Class used to merge the records to be updated and the updated records during an update. This parameter can be customized; you can implement your own merge logic in a custom payload class. | org.apache.hudi.common.model.DefaultHoodieRecordPayload |
| hoodie.datasource.write.recordkey.field | Unique primary key of the Hudi table. | A specific table field |
| hoodie.datasource.write.partitionpath.field | Partition key. This parameter can be used together with hoodie.datasource.write.keygenerator.class to meet different partitioning needs. | None |
| hoodie.datasource.write.hive_style_partitioning | Whether to use Hive-style partition paths (the same partition layout as Hive). Set this parameter to true. | true |
| hoodie.datasource.write.keygenerator.class | Used together with hoodie.datasource.write.partitionpath.field and hoodie.datasource.write.recordkey.field to generate the primary key and partition path. NOTE: If the value of this parameter differs from the one saved in the table, a message is displayed indicating that the values must be the same. | org.apache.hudi.keygen.ComplexKeyGenerator |
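For reference, the sketch below shows one common way to pass these write options through the Spark DataFrame writer in Scala. It is a minimal illustration only: the table name demo_table, the column names (id, name, ts, dt), and the storage path are assumptions, not values prescribed by this documentation.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiWriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-write-example")
      // Hudi's Spark integration is typically run with the Kryo serializer.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()
    import spark.implicits._

    // Illustrative records: id is the record key, ts the precombine field, dt the partition column.
    val df = Seq(
      (1, "alice", 1700000000L, "2024-01-01"),
      (2, "bob",   1700000100L, "2024-01-01")
    ).toDF("id", "name", "ts", "dt")

    df.write.format("hudi")
      .option("hoodie.datasource.write.table.name", "demo_table")
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      .option("hoodie.datasource.write.hive_style_partitioning", "true")
      .option("hoodie.datasource.write.keygenerator.class",
        "org.apache.hudi.keygen.ComplexKeyGenerator")
      .mode(SaveMode.Append)
      .save("/tmp/hudi/demo_table")

    spark.stop()
  }
}
```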
Configuration of Hive Table Synchronization
| Parameter | Description | Default Value |
|---|---|---|
| hoodie.datasource.hive_sync.enable | Whether to synchronize Hudi tables to Hive MetaStore. CAUTION: Set this parameter to true to use Hive to centrally manage Hudi tables. | false |
| hoodie.datasource.hive_sync.database | Name of the database to be synchronized to Hive. | default |
| hoodie.datasource.hive_sync.table | Name of the table to be synchronized to Hive. Set this parameter to the value of hoodie.datasource.write.table.name. | unknown |
| hoodie.datasource.hive_sync.username | Username used for Hive synchronization. | hive |
| hoodie.datasource.hive_sync.password | Password used for Hive synchronization. | hive |
| hoodie.datasource.hive_sync.jdbcurl | JDBC URL used to connect to Hive. | "" |
| hoodie.datasource.hive_sync.use_jdbc | Whether to use Hive JDBC to connect to Hive and synchronize Hudi table information. If this parameter is set to false, the JDBC connection configuration is ignored. | true |
| hoodie.datasource.hive_sync.partition_fields | Hive partition columns. | "" |
| hoodie.datasource.hive_sync.partition_extractor_class | Class used to extract Hudi partition column values and convert them into Hive partition columns. | org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor |
| hoodie.datasource.hive_sync.support_timestamp | If the Hudi table contains a field of the timestamp type, set this parameter to true to synchronize the timestamp type to the Hive metadata. If it is set to false, the timestamp type is converted to bigint during synchronization, in which case an error may occur when you query a Hudi table containing a timestamp field using SQL statements. | true |
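As an illustration, the following Scala map groups the Hive synchronization options from the table above so they can be merged into a Hudi write via .options(...). The database name, table name, credentials, JDBC URL, and partition column are placeholders, and the partition extractor shown (MultiPartKeysValueExtractor) is just one possible choice for Hive-style key=value partitions.

```scala
// Hive synchronization options, to be merged into a Hudi write with .options(hiveSyncOptions).
// All concrete values below are placeholders for illustration only.
val hiveSyncOptions: Map[String, String] = Map(
  "hoodie.datasource.hive_sync.enable"   -> "true",
  "hoodie.datasource.hive_sync.database" -> "default",
  "hoodie.datasource.hive_sync.table"    -> "demo_table", // same as hoodie.datasource.write.table.name
  "hoodie.datasource.hive_sync.username" -> "hive",
  "hoodie.datasource.hive_sync.password" -> "hive",
  "hoodie.datasource.hive_sync.use_jdbc" -> "true",
  "hoodie.datasource.hive_sync.jdbcurl"  -> "jdbc:hive2://hiveserver:10000",
  "hoodie.datasource.hive_sync.partition_fields" -> "dt",
  "hoodie.datasource.hive_sync.partition_extractor_class" ->
    "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.support_timestamp" -> "true"
)
```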
Index Configuration
| Parameter | Description | Default Value |
|---|---|---|
| hoodie.index.class | Full path of a user-defined index, which must be a subclass of HoodieIndex. When this parameter is specified, it takes precedence over hoodie.index.type. | "" |
| hoodie.index.type | Index type. Possible options are BLOOM, HBASE, GLOBAL_BLOOM, SIMPLE, and GLOBAL_SIMPLE. The Bloom filter eliminates the dependency on an external system and is stored in the footer of a Parquet data file. | BLOOM |
| hoodie.index.bloom.num_entries | Number of entries to be stored in the Bloom filter. Assume the maximum Parquet file size is 128 MB and the average record size is 1,024 bytes; a file then holds roughly 130K records. The default (60000) is roughly half of this approximation. CAUTION: Setting this very low will generate many false positives and index lookups will have to scan far more files than necessary; setting this very high will increase the size of every data file linearly (roughly 4 KB for every 50,000 entries). | 60000 |
| hoodie.index.bloom.fpp | Error rate allowed given the number of entries. This is used to calculate how many bits should be assigned to the Bloom filter and the number of hash functions. It is usually set very low (default: 0.000000001) to trade disk space for fewer false positives. | 0.000000001 |
| hoodie.bloom.index.parallelism | Parallelism for index lookup, which involves Spark shuffling. By default, it is automatically calculated based on input workload characteristics. | 0 |
| hoodie.bloom.index.prune.by.ranges | When set to true, range information from files is used to speed up index lookups. This is particularly helpful if the keys have monotonically increasing prefixes, such as timestamps. | true |
| hoodie.bloom.index.use.caching | When set to true, the input RDD is cached to speed up index lookup by reducing I/O for computing parallelism or affected partitions. | true |
| hoodie.bloom.index.use.treebased.filter | When set to true, interval-tree-based file pruning optimization is enabled. This mode speeds up file pruning based on key ranges compared with the brute-force mode. | true |
| hoodie.bloom.index.bucketized.checking | When set to true, bucketized Bloom filtering is enabled. This reduces the skew seen in sort-based Bloom index lookup. | true |
| hoodie.bloom.index.keys.per.bucket | Available only if hoodie.bloom.index.bucketized.checking is enabled and the index type is BLOOM. This configuration controls the "bucket" size, which tracks the number of record-key checks made against a single file and is the unit of work allocated to each partition performing the Bloom filter lookup. A higher value amortizes the fixed cost of reading the Bloom filter into memory. | 10000000 |
| hoodie.bloom.index.update.partition.path | Applicable only when the index type is GLOBAL_BLOOM. If set to true, an update for an existing record that arrives with a new partition path results in the incoming record being inserted into the new partition and the original record being deleted from the old partition. If set to false, the original record is only updated in the old partition. | true |
| hoodie.index.hbase.zkquorum | Mandatory. Available only when the index type is HBASE. HBase ZooKeeper quorum URL to connect to. | None |
| hoodie.index.hbase.zkport | Mandatory. Available only when the index type is HBASE. HBase ZooKeeper quorum port to connect to. | None |
| hoodie.index.hbase.zknode.path | Mandatory. Available only when the index type is HBASE. Root znode that contains all the znodes created and used by HBase. | None |
| hoodie.index.hbase.table | Mandatory. Available only when the index type is HBASE. Name of the HBase table used as the index. Hudi stores the row_key and [partition_path, fileID, commitTime] mapping in this table. | None |
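The sketch below illustrates how a few of these options might be set for the default BLOOM index and, alternatively, for the HBASE index whose four mandatory parameters are listed above. The ZooKeeper hosts, port, znode path, and HBase table name are placeholders.

```scala
// Options for the default BLOOM index (the values shown restate the documented defaults).
val bloomIndexOptions: Map[String, String] = Map(
  "hoodie.index.type"                      -> "BLOOM",
  "hoodie.index.bloom.num_entries"         -> "60000",
  "hoodie.index.bloom.fpp"                 -> "0.000000001",
  "hoodie.bloom.index.prune.by.ranges"     -> "true",
  "hoodie.bloom.index.bucketized.checking" -> "true"
)

// Options for an HBase-backed index; the four HBase parameters are mandatory for HBASE.
// Host names, port, znode path, and table name are placeholders.
val hbaseIndexOptions: Map[String, String] = Map(
  "hoodie.index.type"              -> "HBASE",
  "hoodie.index.hbase.zkquorum"    -> "zk-host1,zk-host2,zk-host3",
  "hoodie.index.hbase.zkport"      -> "2181",
  "hoodie.index.hbase.zknode.path" -> "/hbase",
  "hoodie.index.hbase.table"       -> "hudi_index_table"
)
```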
Storage Configuration
| Parameter | Description | Default Value |
|---|---|---|
| hoodie.parquet.max.file.size | Target size for Parquet files generated during Hudi write phases. For DFS, this parameter needs to be aligned with the underlying file system block size for optimal performance. | 120 * 1024 * 1024 bytes |
| hoodie.parquet.block.size | Parquet block (row group) size. It is generally kept the same as hoodie.parquet.max.file.size. | 120 * 1024 * 1024 bytes |
| hoodie.parquet.compression.ratio | Expected compression ratio of Parquet data when Hudi attempts to size a new Parquet file. If the files generated by bulk_insert are smaller than expected, increase this value. | 0.1 |
| hoodie.parquet.compression.codec | Parquet compression codec. Possible options are gzip, snappy, uncompressed, and lzo. | snappy |
| hoodie.logfile.max.size | Maximum size allowed for a log file before it is rolled over to the next version. | 1 GB |
| hoodie.logfile.data.block.max.size | Maximum size allowed for a single data block appended to a log file. It helps to ensure that the data appended to the log file is broken up into sizable blocks to prevent OOM errors. The size should be smaller than the available JVM memory. | 256 MB |
| hoodie.logfile.to.parquet.compression.ratio | Expected additional compression when records move from log files to Parquet files. Used for MOR tables to send inserted content into log files and control the size of compacted Parquet files. | 0.35 |
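For illustration, these storage sizes can be passed as plain string byte values in a write, as in the sketch below; the numeric values simply restate the documented defaults and are not a tuning recommendation.

```scala
// Storage sizing options; the byte values restate the documented defaults.
val storageOptions: Map[String, String] = Map(
  "hoodie.parquet.max.file.size"       -> (120L * 1024 * 1024).toString, // 120 MB
  "hoodie.parquet.block.size"          -> (120L * 1024 * 1024).toString, // 120 MB
  "hoodie.parquet.compression.ratio"   -> "0.1",
  "hoodie.parquet.compression.codec"   -> "snappy",
  "hoodie.logfile.max.size"            -> (1024L * 1024 * 1024).toString, // 1 GB
  "hoodie.logfile.data.block.max.size" -> (256L * 1024 * 1024).toString,  // 256 MB
  "hoodie.logfile.to.parquet.compression.ratio" -> "0.35"
)
```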
Compaction and Cleaning Configurations
| Parameter | Description | Default Value |
|---|---|---|
| hoodie.clean.automatic | Whether to perform automatic cleanup. | true |
| hoodie.cleaner.policy | Cleanup policy to be used. Hudi deletes Parquet files of old versions to reclaim space. Any query or computation referring to such a file version will fail, so you are advised to ensure that the data retention time exceeds the maximum query execution time. | KEEP_LATEST_COMMITS |
| hoodie.cleaner.commits.retained | Number of commits to retain. Data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how far back the table can be incrementally pulled. | 10 |
| hoodie.keep.max.commits | Number of commits that triggers the archiving operation. | 30 |
| hoodie.keep.min.commits | Minimum number of commits retained after the archiving operation. | 20 |
| hoodie.commits.archival.batch | Number of commit instants read into memory as a batch and archived together. | 10 |
| hoodie.parquet.small.file.limit | Files below this size are considered "small files". The value must be smaller than that of maxFileSize; if it is set to 0, this function is disabled. Small files always exist because of the large number of insert records in a partition of batch processing. Hudi provides an option to solve the small-file problem by masking inserts into such a partition as updates to existing small files. | 104857600 bytes |
| hoodie.copyonwrite.insert.split.size | Parallelism for inserting and writing data. It is the number of inserts grouped for a single partition. Writing out 100 MB files with records of at least 1 KB each means about 100K records per file; the default overprovisions to 500K. To improve insert latency, adjust the value to match the number of records in a single file. If it is set to a smaller value, the file size will shrink (especially when compactionSmallFileSize is set to 0). | 500000 |
| hoodie.copyonwrite.insert.auto.split | Whether Hudi dynamically computes insertSplitSize based on the metadata of the last 24 commits. | true |
| hoodie.copyonwrite.record.size.estimate | Average record size. If specified, Hudi uses this value instead of computing it dynamically from the metadata of the last 24 commits. This is critical for computing the insert parallelism and packing inserts into small files. | 1024 |
| hoodie.compact.inline | If set to true, compaction is triggered by ingestion itself, right after a commit or delta commit action as part of insert, upsert, or bulk_insert. | true |
| hoodie.compact.inline.max.delta.commits | Maximum number of delta commits accumulated before inline compaction is triggered. | 5 |
| hoodie.compaction.lazy.block.read | When CompactedLogScanner merges all log files, this parameter controls whether log blocks are read lazily. Set it to true for I/O-intensive lazy block reading (low memory usage) or false for memory-intensive immediate block reading (high memory usage). | true |
| hoodie.compaction.reverse.log.read | HoodieLogFormatReader reads a log file in the forward direction, from pos=0 to pos=file_length. If this parameter is set to true, the reader reads the log file in the reverse direction, from pos=file_length to pos=0. | false |
| hoodie.cleaner.parallelism | Parallelism of the cleaning operation. Increase this parameter if cleaning becomes slow. | 200 |
| hoodie.compaction.strategy | Determines which file groups are selected for compaction during each compaction run. By default, Hudi selects the file groups with the most accumulated unmerged log data. | org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy |
| hoodie.compaction.target.io | Amount of I/O, in MB, to spend during a compaction run for LogFileSizeBasedCompactionStrategy. This parameter can limit ingestion latency when compaction runs in inline mode. | 500 * 1024 MB |
| hoodie.compaction.daybased.target.partitions | Used by org.apache.hudi.io.compact.strategy.DayBasedCompactionStrategy to denote the number of latest partitions to compact during a compaction run. | 10 |
| hoodie.compaction.payload.class | Must be the same as the payload class used during insert or upsert. Like writing, compaction uses the record payload class to merge log records against each other, merge them with the base file, and produce the final record to be written after compaction. | org.apache.hudi.common.model.DefaultHoodieRecordPayload |
| hoodie.schedule.compact.only.inline | Whether to generate only a compaction plan during a write operation. This parameter is valid only when hoodie.compact.inline is set to true. | false |
| hoodie.run.compact.only.inline | Whether to perform only the compaction operation when the run compaction command is executed using SQL. If no compaction plan exists, no operation is performed. | false |
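A minimal sketch of how the cleaning, archiving, and inline compaction settings above are commonly combined is shown below; the values simply restate the documented defaults and are not a tuning recommendation.

```scala
// Cleaning, archiving, and inline compaction options (documented defaults, for illustration).
val compactionOptions: Map[String, String] = Map(
  "hoodie.clean.automatic"                  -> "true",
  "hoodie.cleaner.policy"                   -> "KEEP_LATEST_COMMITS",
  "hoodie.cleaner.commits.retained"         -> "10",
  "hoodie.keep.min.commits"                 -> "20", // typically kept above hoodie.cleaner.commits.retained
  "hoodie.keep.max.commits"                 -> "30",
  "hoodie.compact.inline"                   -> "true",
  "hoodie.compact.inline.max.delta.commits" -> "5"
)
```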
Single-Table Concurrency Control
| Parameter | Description | Default Value |
|---|---|---|
| hoodie.write.lock.provider | Lock provider. You are advised to set this parameter to org.apache.hudi.hive.HiveMetastoreBasedLockProvider. | org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider |
| hoodie.write.lock.hivemetastore.database | Hive database. | None |
| hoodie.write.lock.hivemetastore.table | Hive table name. | None |
| hoodie.write.lock.client.num_retries | Number of retries. | 10 |
| hoodie.write.lock.client.wait_time_ms_between_retry | Retry interval, in milliseconds. | 10000 |
| hoodie.write.lock.conflict.resolution.strategy | Conflict resolution strategy class, which must be a subclass of ConflictResolutionStrategy. | org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy |
| hoodie.write.lock.zookeeper.base_path | Path for storing ZNodes. This value must be the same in all concurrent write configurations of the same table. | None |
| hoodie.write.lock.zookeeper.lock_key | ZNode name. It is recommended that the ZNode name be the same as the Hudi table name. | None |
| hoodie.write.lock.zookeeper.connection_timeout_ms | ZooKeeper connection timeout interval, in milliseconds. | 15000 |
| hoodie.write.lock.zookeeper.port | ZooKeeper port. | None |
| hoodie.write.lock.zookeeper.url | ZooKeeper URL. | None |
| hoodie.write.lock.zookeeper.session_timeout_ms | ZooKeeper session expiration time, in milliseconds. | 60000 |
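The sketch below shows how the ZooKeeper-based lock provider options above might be filled in for a single table; the ZooKeeper hosts, base path, and lock key are placeholders.

```scala
// Lock options for the default ZooKeeper-based lock provider.
// ZooKeeper hosts, base path, and lock key below are placeholders.
val lockOptions: Map[String, String] = Map(
  "hoodie.write.lock.provider" ->
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
  "hoodie.write.lock.zookeeper.url"       -> "zk-host1,zk-host2,zk-host3",
  "hoodie.write.lock.zookeeper.port"      -> "2181",
  "hoodie.write.lock.zookeeper.base_path" -> "/hudi/locks",
  "hoodie.write.lock.zookeeper.lock_key"  -> "demo_table", // recommended: same as the Hudi table name
  "hoodie.write.lock.client.num_retries"  -> "10",
  "hoodie.write.lock.client.wait_time_ms_between_retry" -> "10000"
)
```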
Clustering Configuration
This topic is available for MRS 3.1.3 or later versions only.
Clustering is controlled by two strategy parameters: hoodie.clustering.plan.strategy.class and hoodie.clustering.execution.strategy.class. Typically, if hoodie.clustering.plan.strategy.class is set to SparkRecentDaysClusteringPlanStrategy or SparkSizeBasedClusteringPlanStrategy, hoodie.clustering.execution.strategy.class does not need to be specified. However, if hoodie.clustering.plan.strategy.class is set to SparkSingleFileSortPlanStrategy, hoodie.clustering.execution.strategy.class must be set to SparkSingleFileSortExecutionStrategy.
| Parameter | Description | Default Value |
|---|---|---|
| hoodie.clustering.inline | Whether to execute clustering synchronously. | false |
| hoodie.clustering.inline.max.commits | Number of commits that trigger clustering. | 4 |
| hoodie.clustering.async.enabled | Whether to enable asynchronous clustering. NOTE: This parameter is available in MRS 3.3.0-LTS and later versions only. | false |
| hoodie.clustering.async.max.commits | Number of commits that trigger clustering during asynchronous execution. NOTE: This parameter is available in MRS 3.3.0-LTS and later versions only. | 4 |
| hoodie.clustering.plan.strategy.target.file.max.bytes | Maximum size of each file after clustering. | 1024 * 1024 * 1024 bytes |
| hoodie.clustering.plan.strategy.small.file.limit | Files smaller than this value are clustered. | 300 * 1024 * 1024 bytes |
| hoodie.clustering.plan.strategy.sort.columns | Columns used by clustering for sorting. | None |
| hoodie.layout.optimize.strategy | Clustering execution policy. The value can be linear, z-order, or hilbert. | linear |
| hoodie.layout.optimize.enable | Must be enabled when z-order or hilbert is used. | false |
| hoodie.clustering.plan.strategy.class | Strategy class for filtering file groups for clustering. By default, files smaller than the value of hoodie.clustering.plan.strategy.small.file.limit are selected. | org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy |
| hoodie.clustering.execution.strategy.class | Strategy class for executing clustering (a subclass of RunClusteringStrategy), which defines how a clustering plan is executed. The default class sorts the file groups in the plan by the specified columns while meeting the configured target file size. | org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy |
| hoodie.clustering.plan.strategy.max.num.groups | Maximum number of file groups that can be selected during clustering. A larger value indicates higher concurrency. | 30 |
| hoodie.clustering.plan.strategy.max.bytes.per.group | Maximum amount of data in each file group involved in clustering. | 2 * 1024 * 1024 * 1024 bytes |
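The sketch below shows one way the inline clustering options above could be set together; the sort columns (id, name) are placeholders, and the byte values restate the documented defaults.

```scala
// Inline clustering options; sort columns are placeholders, sizes restate the documented defaults.
val clusteringOptions: Map[String, String] = Map(
  "hoodie.clustering.inline"             -> "true",
  "hoodie.clustering.inline.max.commits" -> "4",
  "hoodie.clustering.plan.strategy.target.file.max.bytes" -> (1024L * 1024 * 1024).toString, // 1 GB
  "hoodie.clustering.plan.strategy.small.file.limit"      -> (300L * 1024 * 1024).toString,  // 300 MB
  "hoodie.clustering.plan.strategy.sort.columns"          -> "id,name"
)
```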