
Suggestions

Currently, Hudi is mainly used for real-time data ingestion into the data lake and for incremental data ETL. Historical data that is already stored can be imported into Hudi tables in batches.
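
The following Spark Scala sketch illustrates one way to batch-import stored historical data with a bulk_insert write. The paths, table name (orders_hudi), and field names (order_id, update_time, dt) are placeholders for illustration only, not values from this guide.

import org.apache.spark.sql.{SaveMode, SparkSession}

object HistoricalBulkImport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HudiHistoricalBulkImport")
      .getOrCreate()

    // Read the previously stored historical data; Parquet is only an example source format.
    val historyDf = spark.read.parquet("hdfs:///data/history/orders")

    // bulk_insert writes the whole historical dataset in one batch and avoids
    // the per-record index lookups that upsert would perform.
    historyDf.write.format("hudi")
      .option("hoodie.table.name", "orders_hudi")
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .option("hoodie.datasource.write.recordkey.field", "order_id")
      .option("hoodie.datasource.write.precombine.field", "update_time")
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      .mode(SaveMode.Overwrite)
      .save("hdfs:///warehouse/hudi/orders_hudi")

    spark.stop()
  }
}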

Copy on write (COW) tables are suitable for scenarios where the incremental data consists mostly of new records and read performance requirements are high.

Merge on read (MOR) tables are suitable for scenarios where data ingestion performance requirements are high and the incremental data contains a large amount of inserted and updated records.
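
The table type is chosen when the Hudi table is first written, through the hoodie.datasource.write.table.type option. The sketch below shows where COPY_ON_WRITE or MERGE_ON_READ would be set; the table name and field names are again placeholders.

import org.apache.spark.sql.{DataFrame, SaveMode}

// Upserts incremental data into a Hudi table of the chosen type.
// Pass "COPY_ON_WRITE" for read-heavy tables whose incremental data is mostly new,
// or "MERGE_ON_READ" for ingest-heavy tables with frequent updates.
def writeIncrement(df: DataFrame, tableType: String, basePath: String): Unit = {
  df.write.format("hudi")
    .option("hoodie.table.name", "orders_hudi")
    .option("hoodie.datasource.write.table.type", tableType)
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    .option("hoodie.datasource.write.precombine.field", "update_time")
    .option("hoodie.datasource.write.partitionpath.field", "dt")
    .mode(SaveMode.Append)
    .save(basePath)
}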

You are advised to use a date field as the partition path field in the hoodie keys.
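
For example, a day-level date column can be derived from an event timestamp and referenced as the partition path field. The column names (event_time, dt, order_id) below are illustrative assumptions.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, date_format}

// Derives a day-level date column from the event timestamp; this column is then
// referenced as the partition path field of the hoodie key.
def addDatePartitionColumn(sourceDf: DataFrame): DataFrame =
  sourceDf.withColumn("dt", date_format(col("event_time"), "yyyyMMdd"))

// Hudi options that build the hoodie key from the record key plus the date partition path.
val hoodieKeyOptions = Map(
  "hoodie.datasource.write.recordkey.field"     -> "order_id",
  "hoodie.datasource.write.partitionpath.field" -> "dt"
)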

Configure Hudi resources for real-time data ingestion into the data lake based on the number of Kafka partitions. One Kafka partition can be consumed by only one executor core, so configuring more executor cores than there are partitions wastes resources.
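
As an illustrative sizing calculation (all figures are assumptions, not recommendations): for a topic with 12 Kafka partitions, 12 executor cores in total are sufficient, because any additional cores would sit idle.

import org.apache.spark.SparkConf

// One Kafka partition is consumed by only one executor core, so the total number
// of cores should not exceed the partition count. All figures are examples.
val kafkaPartitions   = 12
val executorCores     = 3
val executorInstances = kafkaPartitions / executorCores   // 4 executors x 3 cores = 12 cores in total

val sparkConf = new SparkConf()
  .setAppName("HudiStreamingIngest")
  .set("spark.executor.cores", executorCores.toString)
  .set("spark.executor.instances", executorInstances.toString)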

Set the consumption batch parameters for Spark Streaming writes to the data lake based on site requirements. Ensure that the batch interval is slightly greater than the time required to consume a batch of messages and write it into the Hudi table; otherwise unfinished batches queue up and the scheduling delay keeps growing.
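
For example, if consuming one batch of Kafka messages and writing it into the Hudi table is assumed to take about 50 seconds, a 60-second batch interval leaves each batch enough time to finish before the next one starts. The figures are illustrative, and sparkConf refers to the SparkConf from the previous sketch.

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Writing one batch into the Hudi table is assumed to take about 50 s here,
// so a 60 s batch interval leaves headroom and prevents batches from queuing up.
val ssc = new StreamingContext(sparkConf, Seconds(60))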

The degree of parallelism (DOP) of Hudi write operations should not be set too high. A properly sized DOP shortens the processing time, whereas an excessively high DOP mainly adds shuffle overhead and produces small files.
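
The write DOP is controlled by Hudi's shuffle-parallelism options, as sketched below. The value 200 is only a placeholder to be tuned against the per-batch data volume.

// A moderate, explicitly set write parallelism; oversized values mainly add shuffle
// overhead and small files. The value 200 is a placeholder, not a recommendation.
val hudiParallelismOptions = Map(
  "hoodie.insert.shuffle.parallelism"     -> "200",
  "hoodie.upsert.shuffle.parallelism"     -> "200",
  "hoodie.bulkinsert.shuffle.parallelism" -> "200"
)

These options can be passed to the Hudi writer together with the other write options, for example via .options(hudiParallelismOptions).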