Hudi Data Table Clean Specifications
Clean is also one of the maintenance operations for Hudi tables and must be executed for both MOR and COW tables. The purpose of the clean operation is to remove old version files (data files no longer used by Hudi), which not only reduces the time spent listing files in the Hudi table but also relieves storage pressure.
Rules
Clean operations must be performed for Hudi tables.
For Hudi's MOR and COW tables, clean must be enabled.
- When data is written to a Hudi table, the system automatically determines whether to execute clean, because the clean function is enabled by default (hoodie.clean.automatic defaults to true).
- The clean operation is not triggered on every write; both of the following conditions must be met:
  - Old version files exist in the Hudi table. For COW tables, old version files exist as soon as data has been updated. For MOR tables, data must have been updated and compaction must have run before old version files exist.
  - The Hudi table reaches the threshold specified by hoodie.cleaner.commits.retained. For Flink writes to Hudi, the number of completed checkpoints must exceed this threshold; for batch writes to Hudi, the number of write batches must exceed this threshold.
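A minimal sketch of these settings, assuming the table is written with Flink SQL (the table name, schema, path, and threshold value are illustrative, and the hoodie.* keys are assumed to be passed through the WITH clause as write options):

```sql
CREATE TABLE hudi_sink (
  id   INT,
  name STRING,
  ts   TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi_sink',            -- illustrative path
  'table.type' = 'MERGE_ON_READ',
  -- Clean is evaluated automatically on every write (the default behavior).
  'hoodie.clean.automatic' = 'true',
  -- Clean actually runs only after the retained-commit threshold is exceeded:
  -- completed checkpoints for Flink writes, write batches for batch writes.
  'hoodie.cleaner.commits.retained' = '10'
);
```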
Recommendations
- For MOR tables whose downstream consumers use batch read mode, set the number of retained clean versions to the number of compaction versions plus 1.
MOR tables must ensure that the compaction plan can be executed successfully. The compaction plan only records which log files and which Parquet files in the Hudi table are to be merged, so the key point is to ensure that these files still exist when the compaction plan is executed. Because only the clean operation removes files from a Hudi table, it is recommended that the clean trigger threshold (the value of hoodie.cleaner.commits.retained) be greater than the compaction trigger threshold (the value of compaction.delta_commits for Flink jobs).
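A hedged Flink SQL sketch of this relationship, using the standard dynamic table options hint (the table names and values are illustrative):

```sql
-- The clean threshold (6) is one commit larger than the compaction threshold (5),
-- so the log and Parquet files referenced by a pending compaction plan still
-- exist when the plan is executed.
INSERT INTO mor_batch_read_sink /*+ OPTIONS(
  'compaction.delta_commits' = '5',
  'hoodie.cleaner.commits.retained' = '6'
) */
SELECT id, name, ts FROM source_table;
```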
- For MOR tables whose downstream consumers use stream computing, retain historical versions at an hourly granularity.
If the downstream of a MOR table is stream computing, such as a Flink streaming read, retain hourly historical versions as needed so that incremental data within the last few hours can be read from log files. If the retention period is too short and the downstream Flink job restarts or is interrupted, the incremental data it has not yet consumed may already have been cleaned, forcing Flink to read incremental data from Parquet files, which reduces performance. If the retention period is too long, historical data is stored redundantly in the log files.
To retain 2 hours of historical version data, use the following formula:
Number of versions = 3600 × 2 / version interval (in seconds), where the version interval is the Flink job's checkpoint interval or the upstream batch write interval. For example, with a 600-second checkpoint interval, 3600 × 2 / 600 = 12 versions should be retained.
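On the read side, a hedged Flink SQL sketch of the downstream streaming consumer (the option names are from the Hudi Flink connector; the table name, schema, and path are illustrative):

```sql
CREATE TABLE mor_stream_source (
  id   INT,
  name STRING,
  ts   TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/mor_table',            -- illustrative path
  'table.type' = 'MERGE_ON_READ',
  -- Streaming read consumes incremental data from log files, so the writer
  -- must retain enough versions (e.g. 12 for ~2 hours at a 600 s checkpoint
  -- interval) to cover a possible restart window.
  'read.streaming.enabled' = 'true',
  'read.streaming.check-interval' = '60'       -- seconds between incremental scans
);
```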
- For COW tables without special requirements for retaining historical version data, set the number of retained versions to 1.
Each version of a COW table contains the full table data, so retaining multiple versions causes redundant storage. If there is no need to query historical data, set the number of retained versions to 1 so that only the latest version is kept.
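A minimal Flink SQL sketch for such a COW table (the table name, schema, and path are illustrative; the same retention setting applies regardless of which engine writes the table):

```sql
CREATE TABLE cow_sink (
  id   INT,
  name STRING,
  ts   TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/cow_sink',             -- illustrative path
  'table.type' = 'COPY_ON_WRITE',
  -- Every COW version contains the full table data, so keep only the latest one.
  'hoodie.cleaner.commits.retained' = '1'
);
```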
- Execute clean at least once a day; running it every 2 to 4 hours is recommended.
Both MOR and COW Hudi tables require at least one clean operation per day. For MOR tables, clean can be executed asynchronously together with compaction, as described in section 2.2.1.6. For COW tables, whether to clean can be determined automatically during data writes.
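As a hedged sketch of running clean alongside the MOR write pipeline (the async-clean option name comes from the Hudi Flink connector and its default may differ by version; table names are illustrative):

```sql
INSERT INTO mor_sink /*+ OPTIONS(
  -- Let clean run asynchronously with writes and compaction so that old
  -- versions are removed regularly (at least once a day).
  'clean.async.enabled' = 'true',
  'hoodie.cleaner.commits.retained' = '6'
) */
SELECT id, name, ts FROM source_table;
```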