Updated on 2024-11-29 GMT+08:00

Common Hudi Parameters

This section describes important Hudi configurations.

Write Configuration

Table 1 Write configuration parameters

Parameter

Description

Default Value

hoodie.datasource.write.table.name

Name of the Hudi table to which data is written

None

hoodie.datasource.write.operation

Type of the operation for writing data to the Hudi table. Value options are as follows:

  • upsert: updates and inserts data.
  • delete: deletes data.
  • insert: inserts data.
  • bulk_insert: imports data during initial table creation. Do not use upsert or insert during initial table creation.
  • insert_overwrite: performs insert and overwrite operations on static partitions.
  • insert_overwrite_table: performs insert and overwrite operations on dynamic partitions. It does not immediately delete the entire table or overwrite the table. Instead, it overwrites the metadata of the Hudi table logically, and Hudi deletes useless data through the clean mechanism. Its efficiency is higher than that of the combination of bulk_insert and overwrite.

upsert

hoodie.datasource.write.table.type

Type of the Hudi table. This parameter cannot be modified once specified. The value can be COPY_ON_WRITE or MERGE_ON_READ.

COPY_ON_WRITE

hoodie.datasource.write.precombine.field

Field used to merge and deduplicate rows with the same key before write.

ts

hoodie.datasource.write.payload.class

Class used to merge an existing record with the incoming record during an update. You can provide a custom class to implement your own merge logic.

org.apache.hudi.common.model.DefaultHoodieRecordPayload

hoodie.datasource.write.recordkey.field

Unique primary key of the Hudi table

uuid

hoodie.datasource.write.partitionpath.field

Partition key. This parameter can be used together with hoodie.datasource.write.keygenerator.class to meet different partition needs.

None

hoodie.datasource.write.hive_style_partitioning

Whether to use a Hive-style partition path format, that is, the same partition layout as Hive. You are advised to set this parameter to true.

true

hoodie.datasource.write.keygenerator.class

Used with hoodie.datasource.write.partitionpath.field and hoodie.datasource.write.recordkey.field to generate the primary key and partition mode.

NOTE:

If the value of this parameter is different from that saved in the table, a message is displayed, indicating that the value must be the same.

org.apache.hudi.keygen.ComplexKeyGenerator

Configuration of Hive Table Synchronization

Table 2 Parameters for synchronizing Hive tables

Parameter

Description

Default Value

hoodie.datasource.hive_sync.enable

Whether to synchronize Hudi tables to Hive MetaStore.

CAUTION:

Set this parameter to true to use Hive to centrally manage Hudi tables.

false

hoodie.datasource.hive_sync.database

Name of the database to be synchronized to Hive

default

hoodie.datasource.hive_sync.table

Name of the table to be synchronized to Hive. Set this parameter to the value of hoodie.datasource.write.table.name.

unknown

hoodie.datasource.hive_sync.username

Username used for Hive synchronization

hive

hoodie.datasource.hive_sync.password

Password used for Hive synchronization

hive

hoodie.datasource.hive_sync.jdbcurl

Hive JDBC URL for connection

""

hoodie.datasource.hive_sync.use_jdbc

Whether to use Hive JDBC to connect to Hive and synchronize Hudi table information. If this parameter is set to false, the JDBC connection configuration is ignored.

true

hoodie.datasource.hive_sync.partition_fields

Hive partition columns

""

hoodie.datasource.hive_sync.partition_extractor_class

Class used to extract Hudi partition column values and convert them into Hive partition columns.

org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor

hoodie.datasource.hive_sync.support_timestamp

If the Hudi table contains fields of the timestamp type, set this parameter to true to synchronize them to Hive metadata as the timestamp type. If this parameter is set to false, timestamp fields are converted to bigint during synchronization, and an error may occur when you use SQL statements to query a Hudi table that contains a timestamp field.

true

Index Configuration

Table 3 Index parameters

Parameter

Description

Default Value

hoodie.index.class

Full path of a user-defined index, which must be a subclass of HoodieIndex. When this parameter is specified, the configuration takes precedence over that of hoodie.index.type.

""

hoodie.index.type

Index type. The default value is BLOOM. The possible options are BLOOM, HBASE, GLOBAL_BLOOM, SIMPLE, and GLOBAL_SIMPLE. The Bloom filter eliminates the dependency on an external system and is stored in the footer of a Parquet data file.

BLOOM

hoodie.index.bloom.num_entries

Number of entries to be stored in the bloom filter. Assuming the maximum Parquet file size is 128 MB and the average record size is 1024 bytes, a file holds approximately 130K records. The default (60000) is roughly half of this approximation.

CAUTION:

Setting this value too low generates many false positives, and index lookups have to scan far more files than necessary. Setting it too high increases the size of every data file linearly (roughly 4 KB for every 50,000 entries).

60000

hoodie.index.bloom.fpp

Allowed false positive rate for the given number of entries. It is used to calculate how many bits should be assigned to the bloom filter and how many hash functions to use. It is usually set very low (default: 0.000000001), trading disk space for fewer false positives.

0.000000001

hoodie.bloom.index.parallelism

Parallelism for index lookup, which involves Spark shuffling. By default, it is automatically calculated based on input workload characteristics.

0

hoodie.bloom.index.prune.by.ranges

When true, range information from files is leveraged to speed up index lookups. This is particularly helpful if the key has a monotonically increasing prefix, such as a timestamp.

true

hoodie.bloom.index.use.caching

When true, the input RDD will be cached to speed up index lookup by reducing I/O for computing parallelism or affected partitions.

true

hoodie.bloom.index.use.treebased.filter

When true, interval tree based file pruning optimization is enabled. This mode speeds up file pruning based on key ranges when compared with the brute-force mode.

true

hoodie.bloom.index.bucketized.checking

When true, bucketized bloom filtering is enabled. This reduces skew seen in sort based bloom index lookup.

true

hoodie.bloom.index.keys.per.bucket

Only applies if hoodie.bloom.index.bucketized.checking is enabled and the index type is BLOOM.

This configuration controls the "bucket" size, which tracks the number of record-key checks made against a single file and is the unit of work allocated to each partition performing bloom filter lookups. A higher value amortizes the fixed cost of reading a bloom filter into memory.

10000000

hoodie.bloom.index.update.partition.path

This parameter is applicable only when the index type is GLOBAL_BLOOM.

If this parameter is set to true and an incoming update changes the partition path of an existing record, the incoming record is inserted into the new partition and the original record in the old partition is deleted. If this parameter is set to false, the original record is only updated in the old partition.

true

hoodie.index.hbase.zkquorum

Mandatory. This parameter is available only when the index type is HBASE. URL of the HBase ZooKeeper quorum to connect to.

None

hoodie.index.hbase.zkport

Mandatory. This parameter is available only when the index type is HBASE. Port of the HBase ZooKeeper quorum to connect to.

None

hoodie.index.hbase.zknode.path

Mandatory. This parameter is available only when the index type is HBASE. It is the root znode that will contain all the znodes created and used by HBase.

None

hoodie.index.hbase.table

Mandatory. This parameter is available only when the index type is HBASE. HBase table name to be used as an index. Hudi stores the row_key and [partition_path, fileID, commitTime] mapping in the table.

None

Storage Configuration

Table 4 Storage parameter configuration

Parameter

Description

Default Value

hoodie.parquet.max.file.size

Specifies the target size for Parquet files generated in Hudi write phases. For DFS, this parameter needs to be aligned with the underlying file system block size for optimal performance.

120 * 1024 * 1024 byte

hoodie.parquet.block.size

Specifies the Parquet page size. Page is the unit of read in a Parquet file. In a block, pages are compressed separately.

120 * 1024 * 1024 byte

hoodie.parquet.compression.ratio

Specifies the expected compression ratio of Parquet data when Hudi attempts to adjust the size of a new Parquet file. If the size of the file generated by bulk_insert is smaller than the expected size, increase the value.

0.1

hoodie.parquet.compression.codec

Specifies the Parquet compression codec. Possible options are gzip, snappy, uncompressed, and lzo.

snappy

hoodie.logfile.max.size

Specifies the maximum size of LogFile. It is the maximum size allowed for a log file before it is rolled over to the next version.

1GB

hoodie.logfile.data.block.max.size

Specifies the maximum size of a LogFile data block. It is the maximum size allowed for a single data block to be appended to a log file. It helps to ensure that the data appended to the log file is broken up into sizable blocks to prevent OOM errors. The size should be smaller than the available JVM memory.

256MB

hoodie.logfile.to.parquet.compression.ratio

Specifies the expected additional compression when records move from log files to Parquet files. It is used for MOR tables to send inserted content into log files and control the size of compacted Parquet files.

0.35

Compaction and Cleaning Configurations

Table 5 Compaction & cleaning parameter configuration

Parameter

Description

Default Value

hoodie.clean.automatic

Specifies whether to perform automatic cleanup.

true

hoodie.cleaner.policy

Specifies the cleaning policy to be used. Hudi will delete the Parquet file of an old version to reclaim space. Any query or computation referring to this version of the file will fail. You are advised to ensure that the data retention time exceeds the maximum query execution time.

KEEP_LATEST_COMMITS

hoodie.cleaner.commits.retained

Specifies the number of commits to retain. Data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how many dataset versions can be incrementally pulled.

10

hoodie.keep.max.commits

Number of commits that triggers the archiving operation.

30

hoodie.keep.min.commits

Number of commits reserved by the archiving operation.

20

hoodie.commits.archival.batch

This parameter controls the number of commit instants read in memory as a batch and archived together.

10

hoodie.parquet.small.file.limit

The value must be smaller than that of hoodie.parquet.max.file.size. Setting this parameter to 0 disables the function. Small files always exist because of the number of insert records written to a partition in a batch. Hudi provides an option to solve the small file problem by masking inserts into such a partition as updates to existing small files. The size here is the threshold below which a file is considered a "small file".

104857600 byte

hoodie.copyonwrite.insert.split.size

Specifies the insert write parallelism, that is, the number of inserts grouped for a single partition. Writing out 100 MB files with records of at least 1 KB each means about 100K records per file. By default, this is overprovisioned to 500K. To improve insert latency, adjust the value to match the number of records in a single file. If it is set to a smaller value, the file size will shrink (especially when compactionSmallFileSize is set to 0).

500000

hoodie.copyonwrite.insert.auto.split

Specifies whether Hudi dynamically computes the insertSplitSize based on the metadata of the last 24 commits.

true

hoodie.copyonwrite.record.size.estimate

Specifies the average record size. If specified, Hudi uses this value instead of computing it dynamically from the metadata of the last 24 commits. This is critical for computing the insert parallelism and for packing inserts into small files.

1024

hoodie.compact.inline

If this parameter is set to true, compaction is triggered by the ingestion itself right after a commit or delta commit action as part of insert, upsert, or bulk_insert.

true

hoodie.compact.inline.max.delta.commits

Specifies the maximum number of delta commits to be retained before inline compaction is triggered.

5

hoodie.compaction.lazy.block.read

When CompactedLogScanner merges all log files, this parameter helps to choose whether the log blocks should be read lazily. Set it to true to use I/O-intensive lazy block read (low memory usage) or false to use memory-intensive immediate block read (high memory usage).

true

hoodie.compaction.reverse.log.read

HoodieLogFormatReader reads a log file in the forward direction from pos=0 to pos=file_length. If this parameter is set to true, Reader reads a log file in reverse direction from pos=file_length to pos=0.

false

hoodie.cleaner.parallelism

Parallelism used for cleaning. Increase this value if cleaning becomes slow.

200

hoodie.compaction.strategy

Determines which file groups are selected for compaction during each compaction run. By default, Hudi selects the log file with most accumulated unmerged data.

org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy

hoodie.compaction.target.io

Specifies the amount of I/O, in MB, to spend during a compaction run for LogFileSizeBasedCompactionStrategy. This parameter can limit ingestion latency when compaction runs in inline mode.

500 * 1024 MB

hoodie.compaction.daybased.target.partitions

Used by org.apache.hudi.io.compact.strategy.DayBasedCompactionStrategy to denote the number of latest partitions to compact during a compaction run.

10

hoodie.compaction.payload.class

It must be the same as the class used during insert or upsert. Like the write path, compaction uses the record payload class to merge log records against each other, merge the result with the base file, and produce the final record to be written after compaction.

org.apache.hudi.common.model.DefaultHoodieRecordPayload

hoodie.schedule.compact.only.inline

Specifies whether to generate only a compaction plan during a write operation. This parameter is valid only when hoodie.compact.inline is set to true.

false

hoodie.run.compact.only.inline

Specifies whether to perform only the compaction operation when the run compaction command is executed using SQL. If no compaction plan exists, no action is taken.

false

Single-Table Concurrency Control Configuration

Table 6 Single-table concurrency control configuration

Parameter

Description

Default Value

hoodie.write.lock.provider

Specifies the lock provider. You are advised to set the parameter to org.apache.hudi.hive.HiveMetastoreBasedLockProvider.

org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider

hoodie.write.lock.hivemetastore.database

Specifies the Hive database.

None

hoodie.write.lock.hivemetastore.table

Specifies the Hive table name.

None

hoodie.write.lock.client.num_retries

Specifies the number of retries for acquiring the lock.

10

hoodie.write.lock.client.wait_time_ms_between_retry

Specifies the interval between retries, in milliseconds.

10000

hoodie.write.lock.conflict.resolution.strategy

Specifies the conflict resolution strategy class, which must be a subclass of ConflictResolutionStrategy.

org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy

hoodie.write.lock.zookeeper.base_path

Path for storing ZNodes. The parameter must be the same for all concurrent write configurations of the same table.

None

hoodie.write.lock.zookeeper.lock_key

ZNode name. It is recommended that the ZNode name be the same as the Hudi table name.

None

hoodie.write.lock.zookeeper.connection_timeout_ms

ZooKeeper connection timeout period.

15000

hoodie.write.lock.zookeeper.port

ZooKeeper port number.

None

hoodie.write.lock.zookeeper.url

URL of ZooKeeper.

None

hoodie.write.lock.zookeeper.session_timeout_ms

Session expiration time of ZooKeeper.

60000

Clustering Configuration

Clustering has two strategies: hoodie.clustering.plan.strategy.class and hoodie.clustering.execution.strategy.class. Typically, if hoodie.clustering.plan.strategy.class is set to SparkRecentDaysClusteringPlanStrategy or SparkSizeBasedClusteringPlanStrategy, hoodie.clustering.execution.strategy.class does not need to be specified. However, if hoodie.clustering.plan.strategy.class is set to SparkSingleFileSortPlanStrategy, hoodie.clustering.execution.strategy.class must be set to SparkSingleFileSortExecutionStrategy.

Table 7 Clustering parameter configuration

Parameter

Description

Default Value

hoodie.clustering.inline

Whether to execute clustering synchronously

false

hoodie.clustering.inline.max.commits

Number of commits that trigger clustering

4

hoodie.clustering.async.enabled

Whether to enable asynchronous clustering

false

hoodie.clustering.async.max.commits

Number of commits that trigger clustering during asynchronous execution

4

hoodie.clustering.plan.strategy.target.file.max.bytes

Maximum size of each file after clustering

1024 * 1024 * 1024 byte

hoodie.clustering.plan.strategy.small.file.limit

Files smaller than this size will be clustered.

300 * 1024 * 1024 byte

hoodie.clustering.plan.strategy.sort.columns

Columns used for sorting in clustering

None

hoodie.layout.optimize.strategy

Clustering execution strategy. Three sorting modes are available: linear, z-order, and hilbert.

linear

hoodie.layout.optimize.enable

Set this parameter to true when z-order or hilbert is used.

false

hoodie.clustering.plan.strategy.class

Strategy class for selecting file groups for clustering. By default, files whose size is less than the value of hoodie.clustering.plan.strategy.small.file.limit are selected for clustering.

org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy

hoodie.clustering.execution.strategy.class

Strategy class for executing clustering (a subclass of RunClusteringStrategy), which defines how a clustering plan is executed. The default class sorts the file groups in the plan by the specified columns while meeting the configured target file size.

org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy

hoodie.clustering.plan.strategy.max.num.groups

Maximum number of file groups that can be selected during clustering. A larger value indicates a higher concurrency.

30

hoodie.clustering.plan.strategy.max.bytes.per.group

Maximum amount of data, in bytes, in each file group involved in clustering

2 * 1024 * 1024 * 1024 byte