Updated on 2024-11-29 GMT+08:00

Common Hudi Parameters

This section describes important Hudi configurations.

Write Configuration

Table 1 Write configuration parameters

Parameter

Description

Default Value

hoodie.datasource.write.table.name

Name of the Hudi table to which data is written

None

hoodie.datasource.write.operation

Type of the operation for writing data to the Hudi table. Value options are as follows:

  • upsert: updates and inserts data.
  • delete: deletes data.
  • insert: inserts data.
  • bulk_insert: imports data during initial table creation. Do not use upsert or insert during initial table creation.
  • insert_overwrite: performs insert and overwrite operations on static partitions.
  • insert_overwrite_table: performs insert and overwrite operations on dynamic partitions. It does not immediately delete the entire table or overwrite the table. Instead, it overwrites the metadata of the Hudi table logically, and Hudi deletes useless data through the clean mechanism. Its efficiency is higher than that of the combination of bulk_insert and overwrite.

upsert

hoodie.datasource.write.table.type

Type of the Hudi table. This parameter cannot be modified once specified. The value can be COPY_ON_WRITE or MERGE_ON_READ.

COPY_ON_WRITE

hoodie.datasource.write.precombine.field

Field used to merge and deduplicate rows with the same key before write.

ts

hoodie.datasource.write.payload.class

Class used to merge an existing record with the incoming record during an update. You can provide a custom class to implement your own merge logic.

org.apache.hudi.common.model.DefaultHoodieRecordPayload

hoodie.datasource.write.recordkey.field

Unique primary key of the Hudi table

uuid

hoodie.datasource.write.partitionpath.field

Partition key. This parameter can be used together with hoodie.datasource.write.keygenerator.class to meet different partition needs.

None

hoodie.datasource.write.hive_style_partitioning

Whether to use a Hive-style partition path format, that is, the same partition layout as Hive. You are advised to set this parameter to true.

true

hoodie.datasource.write.keygenerator.class

Used with hoodie.datasource.write.partitionpath.field and hoodie.datasource.write.recordkey.field to generate the primary key and partition mode.

NOTE:

If the value of this parameter is different from that saved in the table, a message is displayed, indicating that the value must be the same.

org.apache.hudi.keygen.ComplexKeyGenerator

Configuration of Hive Table Synchronization

Table 2 Parameters for synchronizing Hive tables

Parameter

Description

Default Value

hoodie.datasource.hive_sync.enable

Whether to synchronize Hudi tables to Hive MetaStore.

CAUTION:

Set this parameter to true to use Hive to centrally manage Hudi tables.

false

hoodie.datasource.hive_sync.database

Name of the database to be synchronized to Hive

default

hoodie.datasource.hive_sync.table

Name of the table to be synchronized to Hive. Set this parameter to the value of hoodie.datasource.write.table.name.

unknown

hoodie.datasource.hive_sync.username

Username used for Hive synchronization

hive

hoodie.datasource.hive_sync.password

Password used for Hive synchronization

hive

hoodie.datasource.hive_sync.jdbcurl

Hive JDBC URL for connection

""

hoodie.datasource.hive_sync.use_jdbc

Whether to use Hive JDBC to connect to Hive and synchronize Hudi table information. If this parameter is set to false, the JDBC connection configuration is ignored.

true

hoodie.datasource.hive_sync.partition_fields

Hive partition columns

""

hoodie.datasource.hive_sync.partition_extractor_class

Class used to extract Hudi partition column values and convert them into Hive partition columns.

org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor

hoodie.datasource.hive_sync.support_timestamp

If the Hudi table contains fields of the timestamp type, set this parameter to true to synchronize them to Hive metadata as the timestamp type. If this parameter is set to false, timestamp fields are converted to bigint during synchronization, and an error may occur when you use SQL statements to query a Hudi table that contains a timestamp field.

true

Index Configuration

Table 3 Index parameters

Parameter

Description

Default Value

hoodie.index.class

Full path of a user-defined index, which must be a subclass of HoodieIndex. When this parameter is specified, the configuration takes precedence over that of hoodie.index.type.

""

hoodie.index.type

Index type. The default value is BLOOM. The possible options are BLOOM, HBASE, GLOBAL_BLOOM, SIMPLE, and GLOBAL_SIMPLE. The Bloom filter eliminates the dependency on an external system and is stored in the footer of a Parquet data file.

BLOOM

hoodie.index.bloom.num_entries

Number of entries to be stored in the bloom filter. Assuming the maximum Parquet file size is 128 MB and the average record size is 1024 bytes, a file holds approximately 130K records. The default (60000) is roughly half of this approximation.

CAUTION:

Setting this value too low generates many false positives, and index lookups have to scan far more files than necessary. Setting it too high increases the size of every data file linearly (roughly 4 KB for every 50,000 entries).

60000

hoodie.index.bloom.fpp

Allowed false positive rate for the given number of entries. It is used to calculate how many bits should be assigned to the bloom filter and how many hash functions to use. It is usually set very low (default: 0.000000001), trading disk space for fewer false positives.

0.000000001

hoodie.bloom.index.parallelism

Parallelism for index lookup, which involves Spark shuffling. By default, it is automatically calculated based on input workload characteristics.

0

hoodie.bloom.index.prune.by.ranges

When true, range information from files is leveraged to speed up index lookups. This is particularly helpful if the key has a monotonically increasing prefix, such as a timestamp.

true

hoodie.bloom.index.use.caching

When true, the input RDD will be cached to speed up index lookup by reducing I/O for computing parallelism or affected partitions.

true

hoodie.bloom.index.use.treebased.filter

When true, interval tree based file pruning optimization is enabled. This mode speeds up file pruning based on key ranges when compared with the brute-force mode.

true

hoodie.bloom.index.bucketized.checking

When true, bucketized bloom filtering is enabled. This reduces skew seen in sort based bloom index lookup.

true

hoodie.bloom.index.keys.per.bucket

Only applies if hoodie.bloom.index.bucketized.checking is enabled and the index type is BLOOM.

This configuration controls the "bucket" size, which tracks the number of record-key checks made against a single file and is the unit of work allocated to each partition performing bloom filter lookups. A higher value amortizes the fixed cost of reading a bloom filter into memory.

10000000

hoodie.bloom.index.update.partition.path

This parameter is applicable only when the index type is GLOBAL_BLOOM.

If this parameter is set to true and an incoming update changes the partition path of an existing record, the incoming record is inserted into the new partition and the original record in the old partition is deleted. If this parameter is set to false, the original record is only updated in the old partition.

true

hoodie.index.hbase.zkquorum

Mandatory. This parameter is available only when the index type is HBASE. URL of the HBase ZooKeeper quorum to connect to.

None

hoodie.index.hbase.zkport

Mandatory. This parameter is available only when the index type is HBASE. Port of the HBase ZooKeeper quorum to connect to.

None

hoodie.index.hbase.zknode.path

Mandatory. This parameter is available only when the index type is HBASE. It is the root znode that will contain all the znodes created and used by HBase.

None

hoodie.index.hbase.table

Mandatory. This parameter is available only when the index type is HBASE. HBase table name to be used as an index. Hudi stores the row_key and [partition_path, fileID, commitTime] mapping in the table.

None

Storage Configuration

Table 4 Storage parameter configuration

Parameter

Description

Default Value

hoodie.parquet.max.file.size

Specifies the target size for Parquet files generated in Hudi write phases. For DFS, this parameter needs to be aligned with the underlying file system block size for optimal performance.

120 * 1024 * 1024 byte

hoodie.parquet.block.size

Specifies the Parquet page size. Page is the unit of read in a Parquet file. In a block, pages are compressed separately.

120 * 1024 * 1024 byte

hoodie.parquet.compression.ratio

Specifies the expected compression ratio of Parquet data when Hudi attempts to adjust the size of a new Parquet file. If the size of the file generated by bulk_insert is smaller than the expected size, increase the value.

0.1

hoodie.parquet.compression.codec

Specifies the Parquet compression codec. Possible options are gzip, snappy, uncompressed, and lzo.

snappy

hoodie.logfile.max.size

Specifies the maximum size of LogFile. It is the maximum size allowed for a log file before it is rolled over to the next version.

1GB

hoodie.logfile.data.block.max.size

Specifies the maximum size of a LogFile data block. It is the maximum size allowed for a single data block to be appended to a log file. It helps to ensure that the data appended to the log file is broken up into sizable blocks to prevent OOM errors. The size should be smaller than the available JVM memory.

256MB

hoodie.logfile.to.parquet.compression.ratio

Specifies the expected additional compression when records move from log files to Parquet files. It is used for MOR tables to send inserted content into log files and control the size of compacted Parquet files.

0.35

Compaction and Cleaning Configurations

Table 5 Compaction & cleaning parameter configuration

Parameter

Description

Default Value

hoodie.clean.automatic

Specifies whether to perform automatic cleanup.

true

hoodie.cleaner.policy

Specifies the cleaning policy to be used. Hudi will delete the Parquet file of an old version to reclaim space. Any query or computation referring to this version of the file will fail. You are advised to ensure that the data retention time exceeds the maximum query execution time.

KEEP_LATEST_COMMITS

hoodie.cleaner.commits.retained

Specifies the number of commits to retain. Data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how many dataset versions can be incrementally pulled.

10

hoodie.keep.max.commits

Number of commits that triggers the archiving operation.

30

hoodie.keep.min.commits

Number of commits reserved by the archiving operation.

20

hoodie.commits.archival.batch

This parameter controls the number of commit instants read in memory as a batch and archived together.

10

hoodie.parquet.small.file.limit

The value must be smaller than that of hoodie.parquet.max.file.size. Setting this parameter to 0 disables the function. Small files always exist because of the number of insert records written to a partition in a batch. Hudi provides an option to solve the small file problem by masking inserts into such a partition as updates to existing small files. The size here is the threshold below which a file is considered a "small file".

104857600 byte

hoodie.copyonwrite.insert.split.size

Specifies the insert write parallelism, that is, the number of inserts grouped for a single partition. Writing out 100 MB files with records of at least 1 KB each means about 100K records per file. By default, this is overprovisioned to 500K. To improve insert latency, adjust the value to match the number of records in a single file. If it is set to a smaller value, the file size will shrink (especially when compactionSmallFileSize is set to 0).

500000

hoodie.copyonwrite.insert.auto.split

Specifies whether Hudi dynamically computes the insertSplitSize based on the metadata of the last 24 commits.

true

hoodie.copyonwrite.record.size.estimate

Specifies the average record size. If specified, Hudi uses this value instead of computing it dynamically from the metadata of the last 24 commits. This is critical for computing the insert parallelism and for packing inserts into small files.

1024

hoodie.compact.inline

If this parameter is set to true, compaction is triggered by the ingestion itself right after a commit or delta commit action as part of insert, upsert, or bulk_insert.

true

hoodie.compact.inline.max.delta.commits

Specifies the maximum number of delta commits to be retained before inline compaction is triggered.

5

hoodie.compaction.lazy.block.read

When CompactedLogScanner merges all log files, this parameter helps to choose whether the log blocks should be read lazily. Set it to true to use I/O-intensive lazy block read (low memory usage) or false to use memory-intensive immediate block read (high memory usage).

true

hoodie.compaction.reverse.log.read

HoodieLogFormatReader reads a log file in the forward direction from pos=0 to pos=file_length. If this parameter is set to true, Reader reads a log file in reverse direction from pos=file_length to pos=0.

false

hoodie.cleaner.parallelism

Parallelism used for cleaning. Increase this value if cleaning becomes slow.

200

hoodie.compaction.strategy

Determines which file groups are selected for compaction during each compaction run. By default, Hudi selects the log file with most accumulated unmerged data.

org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy

hoodie.compaction.target.io

Specifies the amount of I/O, in MB, to spend during a compaction run for LogFileSizeBasedCompactionStrategy. This parameter can limit ingestion latency when compaction runs in inline mode.

500 * 1024 MB

hoodie.compaction.daybased.target.partitions

Used by org.apache.hudi.io.compact.strategy.DayBasedCompactionStrategy to denote the number of latest partitions to compact during a compaction run.

10

hoodie.compaction.payload.class

It must be the same as the class used during insert or upsert. Like the write path, compaction uses the record payload class to merge log records against each other, merge the result with the base file, and produce the final record to be written after compaction.

org.apache.hudi.common.model.DefaultHoodieRecordPayload

hoodie.schedule.compact.only.inline

Specifies whether to generate only a compaction plan during a write operation. This parameter is valid only when hoodie.compact.inline is set to true.

false

hoodie.run.compact.only.inline

Specifies whether to perform only the compaction operation when the run compaction command is executed using SQL. If no compaction plan exists, no action is taken.

false

Single-Table Concurrency Control Configuration

Table 6 Single-table concurrency control configuration

Parameter

Description

Default Value

hoodie.write.lock.provider

Specifies the lock provider. You are advised to set the parameter to org.apache.hudi.hive.HiveMetastoreBasedLockProvider.

org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider

hoodie.write.lock.hivemetastore.database

Specifies the Hive database.

None

hoodie.write.lock.hivemetastore.table

Specifies the Hive table name.

None

hoodie.write.lock.client.num_retries

Specifies the number of retries for acquiring the lock.

10

hoodie.write.lock.client.wait_time_ms_between_retry

Specifies the interval between retries, in milliseconds.

10000

hoodie.write.lock.conflict.resolution.strategy

Specifies the conflict resolution strategy class, which must be a subclass of ConflictResolutionStrategy.

org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy

hoodie.write.lock.zookeeper.base_path

Path for storing ZNodes. The parameter must be the same for all concurrent write configurations of the same table.

None

hoodie.write.lock.zookeeper.lock_key

ZNode name. It is recommended that the ZNode name be the same as the Hudi table name.

None

hoodie.write.lock.zookeeper.connection_timeout_ms

ZooKeeper connection timeout period.

15000

hoodie.write.lock.zookeeper.port

ZooKeeper port number.

None

hoodie.write.lock.zookeeper.url

URL of ZooKeeper.

None

hoodie.write.lock.zookeeper.session_timeout_ms

Session expiration time of ZooKeeper.

60000

Clustering Configuration

Clustering has two strategies: hoodie.clustering.plan.strategy.class and hoodie.clustering.execution.strategy.class. Typically, if hoodie.clustering.plan.strategy.class is set to SparkRecentDaysClusteringPlanStrategy or SparkSizeBasedClusteringPlanStrategy, hoodie.clustering.execution.strategy.class does not need to be specified. However, if hoodie.clustering.plan.strategy.class is set to SparkSingleFileSortPlanStrategy, hoodie.clustering.execution.strategy.class must be set to SparkSingleFileSortExecutionStrategy.

Table 7 Clustering parameter configuration

Parameter

Description

Default Value

hoodie.clustering.inline

Whether to execute clustering synchronously

false

hoodie.clustering.inline.max.commits

Number of commits that trigger clustering

4

hoodie.clustering.async.enabled

Whether to enable asynchronous clustering

false

hoodie.clustering.async.max.commits

Number of commits that trigger clustering during asynchronous execution

4

hoodie.clustering.plan.strategy.target.file.max.bytes

Maximum size of each file after clustering

1024 * 1024 * 1024 byte

hoodie.clustering.plan.strategy.small.file.limit

Files smaller than this size will be clustered.

300 * 1024 * 1024 byte

hoodie.clustering.plan.strategy.sort.columns

Columns used for sorting in clustering

None

hoodie.layout.optimize.strategy

Clustering execution strategy. Three sorting modes are available: linear, z-order, and hilbert.

linear

hoodie.layout.optimize.enable

Set this parameter to true when z-order or hilbert is used.

false

hoodie.clustering.plan.strategy.class

Strategy class for selecting file groups for clustering. By default, files whose size is less than the value of hoodie.clustering.plan.strategy.small.file.limit are selected for clustering.

org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy

hoodie.clustering.execution.strategy.class

Strategy class for executing clustering (a subclass of RunClusteringStrategy), which defines how a clustering plan is executed. The default class sorts the file groups in the plan by the specified columns while meeting the configured target file size.

org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy

hoodie.clustering.plan.strategy.max.num.groups

Maximum number of file groups that can be selected during clustering. A larger value indicates a higher concurrency.

30

hoodie.clustering.plan.strategy.max.bytes.per.group

Maximum amount of data, in bytes, in each file group involved in clustering

2 * 1024 * 1024 * 1024 byte