Updated on 2024-11-29 GMT+08:00

Partition Concurrency Control

Each write task determines whether a write conflict occurs based on the modified-partition information recorded in inflight commit operations. Writes that modify disjoint sets of partitions can therefore proceed concurrently.
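The conflict rule above can be illustrated with a toy sketch (this is not Hudi's actual implementation, only the idea it describes): a commit conflicts with an inflight commit only when the two sets of modified partitions overlap.

```python
# Toy illustration of partition-level conflict detection:
# two concurrent commits conflict only if the partition sets
# they modify intersect.

def conflicts(inflight_partitions: set, current_partitions: set) -> bool:
    """Return True if the current commit touches any partition that an
    inflight commit is already modifying."""
    return bool(inflight_partitions & current_partitions)

# Disjoint partitions: both writes can proceed concurrently.
print(conflicts({"dt=2024-01-01"}, {"dt=2024-01-02"}))  # False
# Same partition: a write conflict is detected.
print(conflicts({"dt=2024-01-01"}, {"dt=2024-01-01"}))  # True
```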

Lock control for concurrent writes is implemented through ZooKeeper locks; no additional lock parameters need to be configured.

Precautions

Partition-level concurrency control is built on top of single-table concurrency control, so its constraints are essentially the same as those for concurrent writes to a single table.

Currently, concurrent partition writes are supported only in Spark.

To prevent a large number of concurrent requests from occupying too many ZooKeeper resources, Hudi adds a quota limit on its ZooKeeper usage. You can adjust Hudi's quota by modifying the zk.quota.number parameter of Spark on the server. The default value is 500000 and the minimum value is 5. Note that this parameter does not limit the number of concurrent tasks; it only controls the access pressure on ZooKeeper.
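As a sketch, the quota might appear as a server-side Spark configuration entry like the following. The parameter name and default come from the description above; the exact configuration file and location depend on the deployment.

```properties
# Quota of ZooKeeper requests allowed to Hudi (default 500000, minimum 5).
# Controls access pressure on ZooKeeper only; does not limit concurrent tasks.
zk.quota.number=500000
```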

Using Partition Concurrency

Set hoodie.support.partition.lock to true to enable concurrent partition writes.

Example:

Enable concurrent partition writes in Spark datasource mode:

upsert_data.write.format("hudi").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.write.precombine.field", "col2").
  option("hoodie.datasource.write.recordkey.field", "primary_key").
  option("hoodie.datasource.write.partitionpath.field", "col0").
  option("hoodie.upsert.shuffle.parallelism", 4).
  option("hoodie.datasource.write.hive_style_partitioning", "true").
  option("hoodie.support.partition.lock", "true").
  option("hoodie.table.name", "tb_test_cow").
  mode("Append").save("/tmp/huditest/tb_test_cow")

Enable concurrent partition writes in Spark SQL mode:

set hoodie.support.partition.lock=true;
insert into hudi_table1 select 1,1,1;