Typical Hudi Configuration Parameters

Updated on 2025-02-22 GMT+08:00

This section describes important Hudi configurations. For details, visit the Hudi official website at https://hudi.apache.org/cn/docs/0.11.0/configurations/.

  • To set Hudi parameters when submitting a DLI Spark SQL job, open the SQL Editor page, click Settings in the upper right corner, and set the parameters in the Parameter Settings area.
  • When submitting a DLI Spark Jar job, you can configure Hudi parameters through the options of the Spark datasource API, as shown in the sketch after this list.

    Alternatively, you can configure them in Spark Arguments (--conf) when submitting the job. Note that each key configured here must carry the prefix spark.hadoop., for example, spark.hadoop.hoodie.compact.inline=true.
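The following is a minimal sketch in Scala of both approaches, assuming a Spark Jar job; the OBS paths, table name, and column names (id, ts) are placeholders, not values defined by this document.

import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiConfigDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hudi-config-demo").getOrCreate()
    // Hypothetical source data; replace with your own input.
    val df = spark.read.json("obs://your-bucket/input/")

    // Hudi parameters passed as Spark datasource options.
    df.write.format("hudi").
      option("hoodie.datasource.write.table.name", "demo_table").
      option("hoodie.datasource.write.operation", "upsert").
      option("hoodie.datasource.write.recordkey.field", "id").
      option("hoodie.datasource.write.precombine.field", "ts").
      mode(SaveMode.Append).
      save("obs://your-bucket/hudi/demo_table/")
  }
}

The same parameter could instead be supplied in Spark Arguments when submitting the job, for example --conf spark.hadoop.hoodie.compact.inline=true, where the spark.hadoop. prefix is required as noted above.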

Write Configuration

Table 1 Write configuration parameters

hoodie.datasource.write.table.name
Name of the Hudi table to write to.
Default value: None

hoodie.datasource.write.operation
Operation type for writing to the Hudi table. The supported values are upsert, delete, insert, bulk_insert, insert_overwrite, and insert_overwrite_table.

  • upsert: updates and inserts data.
  • delete: deletes data.
  • insert: inserts data.
  • bulk_insert: imports data when a table is first loaded. Do not use upsert or insert for the initial load.
  • insert_overwrite: inserts into and overwrites static partitions.
  • insert_overwrite_table: inserts into and overwrites dynamic partitions. It does not immediately delete or overwrite the entire table. Instead, it logically overwrites the metadata of the Hudi table, and the obsolete data is later removed by the clean mechanism. This is more efficient than bulk_insert plus overwrite.

Default value: upsert

hoodie.datasource.write.table.type
Type of the Hudi table. Once specified, this parameter cannot be changed. Options: COPY_ON_WRITE and MERGE_ON_READ.
Default value: COPY_ON_WRITE

hoodie.datasource.write.precombine.field
Field used to merge and deduplicate rows with the same key before they are written.
Default value: A specific table field

hoodie.datasource.write.payload.class
Class used to merge the existing record and the incoming record during an update. You can implement a custom payload class to apply your own merge logic.
Default value: org.apache.hudi.common.model.DefaultHoodieRecordPayload

hoodie.datasource.write.recordkey.field
Unique primary key of the Hudi table.
Default value: A specific table field

hoodie.datasource.write.partitionpath.field
Partition key. This parameter can be used together with hoodie.datasource.write.keygenerator.class to meet different partitioning needs.
Default value: None

hoodie.datasource.write.hive_style_partitioning
Whether to use the same partition path style as Hive. Set it to true.
Default value: true

hoodie.datasource.write.keygenerator.class
Used together with hoodie.datasource.write.partitionpath.field and hoodie.datasource.write.recordkey.field to generate the primary key and partition path.

NOTE:
If the value of this parameter differs from the value saved in the table, a message is displayed indicating that the two values must be the same.

Default value: org.apache.hudi.keygen.ComplexKeyGenerator
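As a hedged illustration of how the key and partitioning parameters in Table 1 combine, the sketch below performs an initial bulk_insert load of a partitioned table; the table name and the column names id, ts, and dt are assumptions made for the example.

import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch of an initial load that sets the record key, precombine field, partition path,
// and key generator together. Paths and names are placeholders.
def initialLoad(df: DataFrame, basePath: String): Unit = {
  df.write.format("hudi").
    option("hoodie.datasource.write.table.name", "orders_hudi").
    option("hoodie.datasource.write.operation", "bulk_insert").         // initial load only
    option("hoodie.datasource.write.recordkey.field", "id").            // unique primary key
    option("hoodie.datasource.write.precombine.field", "ts").           // merge/deduplication field
    option("hoodie.datasource.write.partitionpath.field", "dt").        // partition column
    option("hoodie.datasource.write.hive_style_partitioning", "true").  // Hive-style paths such as dt=2025-01-01
    option("hoodie.datasource.write.keygenerator.class",
      "org.apache.hudi.keygen.ComplexKeyGenerator").
    mode(SaveMode.Append).
    save(basePath)
}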

Configuration of Hive Table Synchronization

The metadata service provided by DLI is a Hive Metastore service (HMS), so the following parameters control how Hudi tables are synchronized to that metadata service.

Table 2 Parameters for synchronizing Hive tables

hoodie.datasource.hive_sync.enable
Whether to synchronize Hudi tables to Hive. When the metadata service provided by DLI is used, enabling this parameter synchronizes the tables to DLI metadata.

CAUTION:
You are advised to set it to true so that the metadata service manages the Hudi tables.

Default value: false

hoodie.datasource.hive_sync.database
Name of the database to synchronize to Hive.
Default value: default

hoodie.datasource.hive_sync.table
Name of the table to synchronize to Hive. Set it to the value of hoodie.datasource.write.table.name.
Default value: unknown

hoodie.datasource.hive_sync.partition_fields
Hive partition columns.
Default value: ""

hoodie.datasource.hive_sync.partition_extractor_class
Class used to extract Hudi partition column values and convert them into Hive partition columns.
Default value: org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor

hoodie.datasource.hive_sync.support_timestamp
If the Hudi table contains a field of the timestamp type, set this parameter to true so that the timestamp type is synchronized to the Hive metadata. If it is set to false, the timestamp type is converted to bigint during synchronization, and SQL queries on a Hudi table that contains a timestamp field may then fail.
Default value: true

hoodie.datasource.hive_sync.username
Username used when synchronizing to Hive over JDBC.
Default value: hive

hoodie.datasource.hive_sync.password
Password used when synchronizing to Hive over JDBC.
Default value: hive

hoodie.datasource.hive_sync.jdbcurl
JDBC URL used to connect to Hive.
Default value: ""

hoodie.datasource.hive_sync.use_jdbc
Whether to use Hive JDBC to synchronize Hudi table information to Hive. You are advised to set this parameter to false; when it is false, the JDBC connection settings are ignored.
Default value: true
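The sketch below shows how the Hive synchronization parameters might be added on top of the basic write options; the database and table names are placeholders, and use_jdbc is set to false as advised above.

import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: write a Hudi table and synchronize it to the metadata service.
def writeWithHiveSync(df: DataFrame, basePath: String): Unit = {
  val hiveSyncOptions = Map(
    "hoodie.datasource.hive_sync.enable" -> "true",
    "hoodie.datasource.hive_sync.database" -> "demo_db",       // placeholder database
    "hoodie.datasource.hive_sync.table" -> "orders_hudi",      // keep equal to write.table.name
    "hoodie.datasource.hive_sync.partition_fields" -> "dt",
    "hoodie.datasource.hive_sync.support_timestamp" -> "true",
    "hoodie.datasource.hive_sync.use_jdbc" -> "false"
  )
  df.write.format("hudi").
    option("hoodie.datasource.write.table.name", "orders_hudi").
    option("hoodie.datasource.write.recordkey.field", "id").
    option("hoodie.datasource.write.precombine.field", "ts").
    option("hoodie.datasource.write.partitionpath.field", "dt").
    options(hiveSyncOptions).
    mode(SaveMode.Append).
    save(basePath)
}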

Index Configuration

Table 3 Index parameters

hoodie.index.class
Full path of a user-defined index class, which must be a subclass of HoodieIndex. When this parameter is specified, it takes precedence over hoodie.index.type.
Default value: ""

hoodie.index.type
Index type. Possible options are BLOOM, GLOBAL_BLOOM, SIMPLE, and GLOBAL_SIMPLE. The BLOOM index removes the dependency on an external system: the Bloom filter is stored in the footer of each Parquet data file.
Default value: BLOOM

hoodie.index.bloom.num_entries
Number of entries stored in the Bloom filter. Assuming the maximum Parquet file size is 128 MB and the average record size is 1,024 bytes, a file holds roughly 130,000 records. The default value (60000) is about half of that approximation.

CAUTION:
Setting this value too low produces many false positives, and the index lookup has to scan more files than necessary. Setting it too high linearly increases the size of each data file (approximately 4 KB for every 50,000 entries).

Default value: 60000

hoodie.index.bloom.fpp
Allowed false positive rate for the configured number of entries. It is used to calculate the number of bits allocated to the Bloom filter and the number of hash functions. This value is typically set very low to trade disk space for a lower false positive rate.
Default value: 0.000000001

hoodie.bloom.index.parallelism
Parallelism of the index lookup, which involves a Spark shuffle. By default, it is calculated automatically based on the input workload characteristics.
Default value: 0

hoodie.bloom.index.prune.by.ranges
When set to true, file range information is used to speed up index lookups. This is particularly useful if the keys have a monotonically increasing prefix, such as a timestamp.
Default value: true

hoodie.bloom.index.use.caching
When set to true, the input RDD is cached to speed up index lookups by reducing the I/O needed to compute parallelism or the affected partitions.
Default value: true

hoodie.bloom.index.use.treebased.filter
When set to true, tree-based file filtering is enabled. Compared with brute-force filtering, this mode speeds up file filtering based on key ranges.
Default value: true

hoodie.bloom.index.bucketized.checking
When set to true, bucketized Bloom filtering is enabled. This reduces the skew seen in sort-based Bloom index lookups.
Default value: true

hoodie.bloom.index.keys.per.bucket
This parameter takes effect only when bloomIndexBucketizedChecking is enabled and the index type is BLOOM.

It controls the size of a "bucket", that is, the number of record key checks performed against a single file, and serves as the unit of work assigned to each partition that executes Bloom filter lookups. Higher values amortize the fixed cost of reading the Bloom filter into memory.

Default value: 10000000

hoodie.bloom.index.update.partition.path
This parameter applies only when the index type is GLOBAL_BLOOM.

When set to true, updating a record whose partition path has changed inserts the new record into the new partition and deletes the original record from the old partition. When set to false, the original record in the old partition is updated instead.

Default value: true
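A minimal sketch of the index parameters, assuming they are passed together with the other write options; GLOBAL_BLOOM here is only an example choice, not a recommendation from this document.

// Index-related options to merge into the write options.
val indexOptions = Map(
  "hoodie.index.type" -> "GLOBAL_BLOOM",
  "hoodie.bloom.index.update.partition.path" -> "true",  // move records whose partition changed
  "hoodie.index.bloom.num_entries" -> "60000",
  "hoodie.index.bloom.fpp" -> "0.000000001"
)
// For example: df.write.format("hudi").options(writeOptions ++ indexOptions). ... .save(basePath)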

Storage Configuration

Table 4 Storage parameter configuration

hoodie.parquet.max.file.size
Target size of the Parquet files produced during a Hudi write phase. For DFS, this should align with the underlying file system block size for optimal performance.
Default value: 120 * 1024 * 1024 bytes

hoodie.parquet.block.size
Parquet page size. A page is the unit of read within a Parquet file, and pages within a block are compressed separately.
Default value: 120 * 1024 * 1024 bytes

hoodie.parquet.compression.ratio
Compression ratio expected for Parquet data when Hudi sizes new Parquet files. Increase this value if the files generated by bulk_insert are smaller than expected.
Default value: 0.1

hoodie.parquet.compression.codec
Parquet compression codec. Possible options are gzip, snappy, uncompressed, and lzo.
Default value: snappy

hoodie.logfile.max.size
Maximum size of a log file. This is the maximum size allowed before rolling over to a new version of the log file.
Default value: 1 GB

hoodie.logfile.data.block.max.size
Maximum size of a log file data block. This is the maximum size of a single data block appended to a log file. It ensures that data appended to the log file is broken down into manageable blocks to prevent OOM errors. This size should be smaller than the available JVM memory.
Default value: 256 MB

hoodie.logfile.to.parquet.compression.ratio
Compression ratio expected as records move from log files to Parquet files. Used for MOR storage to control the size of the compacted Parquet files.
Default value: 0.35
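A hedged sketch of the storage parameters expressed as write options; the values simply restate the defaults in Table 4 and are not tuning advice.

// Storage sizing options to merge into the write options.
val storageOptions = Map(
  "hoodie.parquet.max.file.size" -> String.valueOf(120 * 1024 * 1024),       // target Parquet file size
  "hoodie.parquet.compression.codec" -> "snappy",
  "hoodie.logfile.max.size" -> String.valueOf(1024L * 1024 * 1024),          // 1 GB log file rollover
  "hoodie.logfile.data.block.max.size" -> String.valueOf(256 * 1024 * 1024)  // 256 MB log data block
)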

Compaction and Cleaning Configuration

Table 5 Compaction and cleaning parameters

hoodie.clean.automatic
Whether to perform automatic cleaning.
Default value: true

hoodie.cleaner.policy
Cleaning policy to use. Hudi removes old versions of Parquet files to reclaim space; any query or computation that still references a removed version will fail. Ensure that the data retention period exceeds the maximum query execution time.
Default value: KEEP_LATEST_COMMITS

hoodie.cleaner.commits.retained
Number of commits to retain. Data is retained for num_of_commits * time_between_commits (scheduled), which also determines how far back incremental pulls on this dataset can go.
Default value: 10

hoodie.keep.max.commits
Number of commits that triggers archiving.
Default value: 30

hoodie.keep.min.commits
Number of commits to retain when archiving.
Default value: 20

hoodie.commits.archival.batch
Number of commit instants read and archived together in one batch.
Default value: 10

hoodie.parquet.small.file.limit
Must be less than the value of maxFileSize. If set to 0, this function is disabled. Batch processing inserts large numbers of records into partitions, so small files always appear; Hudi addresses this by treating inserts into a partition as updates to existing small files. Files below this size are considered small files.
Default value: 104857600 bytes

hoodie.copyonwrite.insert.split.size
Insert write parallelism, that is, the number of inserts grouped for a single partition. Writing 100 MB files with records of at least 1 KB each means about 100,000 records per file; the default is over-provisioned to 500,000. To improve insert latency, adjust this value to match the number of records in a single file. A smaller value produces smaller files (especially when compactionSmallFileSize is 0).
Default value: 500000

hoodie.copyonwrite.insert.auto.split
Whether Hudi dynamically calculates insertSplitSize based on the metadata of the last 24 commits.
Default value: true

hoodie.copyonwrite.record.size.estimate
Average record size. If specified, Hudi uses this value instead of calculating it dynamically from the metadata of the last 24 commits. It is crucial for computing insert parallelism and packing inserts into small files.
Default value: 1024

hoodie.compact.inline
When set to true, compaction is triggered by the ingestion itself, immediately after an insert, upsert, bulk insert, or incremental commit operation.
Default value: true

hoodie.compact.inline.max.delta.commits
Maximum number of delta commits that can accumulate before inline compaction is triggered.
Default value: 5

hoodie.compaction.lazy.block.read
Controls whether CompactedLogScanner delays reading log blocks when merging all log files. Set it to true for I/O-intensive lazy block reading (low memory usage), or false for memory-intensive eager block reading (high memory usage).
Default value: true

hoodie.compaction.reverse.log.read
HoodieLogFormatReader reads log files forward, from pos=0 to pos=file_length. If this parameter is set to true, it reads log files backward, from pos=file_length to pos=0.
Default value: false

hoodie.cleaner.parallelism
Parallelism of the cleaning operation. Increase this value if cleaning is slow.
Default value: 200

hoodie.compaction.strategy
Strategy that determines which file groups to compact in each compaction run. By default, Hudi selects the log files with the most accumulated unmerged data.
Default value: org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy

hoodie.compaction.target.io
Amount of I/O, in MB, to spend in a compaction run of LogFileSizeBasedCompactionStrategy. This value helps bound ingestion latency when compaction runs in inline mode.
Default value: 500 * 1024 MB

hoodie.compaction.daybased.target.partitions
Used by org.apache.hudi.io.compact.strategy.DayBasedCompactionStrategy. Number of latest partitions to compact in a compaction run.
Default value: 10

hoodie.compaction.payload.class
Must be the same class used during insert/upsert operations. Like the write path, compaction uses the record payload class to merge log records with each other and then with the base file, producing the final records written after compaction.
Default value: org.apache.hudi.common.model.DefaultHoodieRecordPayload

hoodie.schedule.compact.only.inline
Whether write operations only generate a compaction plan without executing it. This parameter is valid only when hoodie.compact.inline is true.
Default value: false

hoodie.run.compact.only.inline
Whether only compaction is performed when the run compaction command is executed through SQL. If no compaction plan exists, the command exits directly.
Default value: false
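A minimal sketch of inline compaction and automatic cleaning options, assuming they are merged into the write options of a MERGE_ON_READ table; the values echo the defaults in Table 5.

// Compaction and cleaning options to merge into the write options.
val compactionOptions = Map(
  "hoodie.compact.inline" -> "true",                  // compact as part of ingestion
  "hoodie.compact.inline.max.delta.commits" -> "5",   // trigger after 5 delta commits
  "hoodie.clean.automatic" -> "true",
  "hoodie.cleaner.policy" -> "KEEP_LATEST_COMMITS",
  "hoodie.cleaner.commits.retained" -> "10"
)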

Single-Table Concurrency Control

Table 6 Single-table concurrency control configuration

hoodie.write.lock.provider
Lock provider. When metadata is managed by DLI, you are advised to set it to com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider.
Default value: Spark SQL and Flink SQL jobs switch to the corresponding implementation class based on the metadata service. When metadata is managed by DLI, use com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider.

hoodie.write.lock.hivemetastore.database
Database in the HMS service.
Default value: None

hoodie.write.lock.hivemetastore.table
Table name in the HMS service.
Default value: None

hoodie.write.lock.client.num_retries
Number of retries.
Default value: 10

hoodie.write.lock.client.wait_time_ms_between_retry
Retry interval, in milliseconds.
Default value: 10000

hoodie.write.lock.conflict.resolution.strategy
Conflict resolution strategy class, which must be a subclass of ConflictResolutionStrategy.
Default value: org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy
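A hedged sketch of the lock-related options for single-table concurrency control when metadata is managed by DLI; the database and table names are placeholders.

// Lock options to merge into the write options when concurrent writers are expected.
val lockOptions = Map(
  "hoodie.write.lock.provider" -> "com.huawei.luxor.hudi.util.DliCatalogBasedLockProvider",
  "hoodie.write.lock.hivemetastore.database" -> "demo_db",    // placeholder database
  "hoodie.write.lock.hivemetastore.table" -> "orders_hudi",   // placeholder table
  "hoodie.write.lock.client.num_retries" -> "10",
  "hoodie.write.lock.client.wait_time_ms_between_retry" -> "10000"
)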

Clustering Configuration

NOTE:

Clustering involves two strategy parameters: hoodie.clustering.plan.strategy.class and hoodie.clustering.execution.strategy.class. Typically, when hoodie.clustering.plan.strategy.class is set to SparkRecentDaysClusteringPlanStrategy or SparkSizeBasedClusteringPlanStrategy, there is no need to specify hoodie.clustering.execution.strategy.class. However, when hoodie.clustering.plan.strategy.class is set to SparkSingleFileSortPlanStrategy, set hoodie.clustering.execution.strategy.class to SparkSingleFileSortExecutionStrategy.

Table 7 Clustering parameters

hoodie.clustering.inline
Whether to execute clustering synchronously.
Default value: false

hoodie.clustering.inline.max.commits
Number of commits that triggers clustering.
Default value: 4

hoodie.clustering.async.enabled
Whether to enable asynchronous clustering.
Default value: false

hoodie.clustering.async.max.commits
Number of commits that triggers asynchronous clustering.
Default value: 4

hoodie.clustering.plan.strategy.target.file.max.bytes
Maximum file size after clustering.
Default value: 1024 * 1024 * 1024 bytes

hoodie.clustering.plan.strategy.small.file.limit
Files smaller than this size are clustered.
Default value: 300 * 1024 * 1024 bytes

hoodie.clustering.plan.strategy.sort.columns
Columns used for sorting during clustering.
Default value: None

hoodie.layout.optimize.strategy
Data layout strategy used when clustering is executed. The options are linear, z-order, and hilbert.
Default value: linear

hoodie.layout.optimize.enable
Set this parameter to true when z-order or hilbert is used.
Default value: false

hoodie.clustering.plan.strategy.class
Strategy class for selecting the file groups to cluster. By default, files smaller than the value of hoodie.clustering.plan.strategy.small.file.limit are selected.
Default value: org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy

hoodie.clustering.execution.strategy.class
Strategy class for executing clustering (a subclass of RunClusteringStrategy), which defines how a clustering plan is executed. The default class sorts the file groups in the plan by the specified columns while meeting the target file size configuration.
Default value: org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy

hoodie.clustering.plan.strategy.max.num.groups
Maximum number of file groups selected for a clustering operation. A larger value allows higher concurrency.
Default value: 30

hoodie.clustering.plan.strategy.max.bytes.per.group
Maximum amount of data per file group that participates in a clustering operation.
Default value: 2 * 1024 * 1024 * 1024 bytes
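A minimal sketch of inline clustering options, assuming they are merged into the write options; the sort columns are placeholders and the size limits echo the defaults in Table 7.

// Clustering options to merge into the write options.
val clusteringOptions = Map(
  "hoodie.clustering.inline" -> "true",                    // cluster synchronously with writes
  "hoodie.clustering.inline.max.commits" -> "4",
  "hoodie.clustering.plan.strategy.small.file.limit" -> String.valueOf(300 * 1024 * 1024),
  "hoodie.clustering.plan.strategy.target.file.max.bytes" -> String.valueOf(1024L * 1024 * 1024),
  "hoodie.clustering.plan.strategy.sort.columns" -> "dt,id"  // placeholder sort columns
)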
