
Hudi Data Table Clean Specifications

Clean is one of the maintenance operations for Hudi tables and must be performed on both MOR and COW tables. The Clean operation removes old-version files (data files that Hudi no longer references). This both shortens the file listing process on the Hudi table and relieves storage pressure.

Rules

The Hudi table must be cleaned.

Clean must be enabled for the MOR and COW tables of Hudi.

  • When data is written to a Hudi table, the system automatically determines whether cleaning is required, because the clean function is enabled by default (hoodie.clean.automatic defaults to true).
  • The Clean operation is not triggered on every write. At least two conditions must be met (see the configuration sketch after this list):
    1. The Hudi table must contain old-version files. For a COW table, old-version files exist as soon as data is updated. For a MOR table, data must have been updated and compaction must have been performed before old-version files exist.
    2. The Hudi table must reach the threshold specified by hoodie.cleaner.commits.retained. For Flink writes to Hudi, the number of committed checkpoints must exceed this threshold; for batch writes to Hudi, the number of batch write operations must exceed it.
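
A minimal sketch of these settings in Spark/Scala (the DataFrame df, the table name, and the path basePath are placeholders, not from the original text); hoodie.clean.automatic and hoodie.cleaner.commits.retained are the Hudi options described above:

    // Write to a Hudi table with automatic cleaning configured explicitly.
    // Clean is evaluated on write; it triggers only when old-version files
    // exist and the retained-commits threshold is exceeded.
    df.write.format("hudi").
      option("hoodie.table.name", "demo_table").        // placeholder table name
      option("hoodie.clean.automatic", "true").         // default is already true
      option("hoodie.cleaner.commits.retained", "10").  // clean trigger threshold
      mode("append").
      save(basePath)                                    // placeholder path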

Suggestions

  • If the downstream of a MOR table uses batch read mode, set the number of retained clean versions to the number of compaction versions plus 1.

    A MOR table must ensure that its compaction plans can be executed successfully. A compaction plan only records the log files and the Parquet files in the Hudi table that are to be merged, so the most important point is that all files referenced by the plan still exist when the plan is executed. In a Hudi table, only the Clean operation removes files. Therefore, the Clean trigger threshold (the value of hoodie.cleaner.commits.retained) should be at least greater than the compaction trigger threshold (for a Flink job, the value of compaction.delta_commits), as in the sketch below.
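
    A sketch in Spark/Scala for a MOR table written by Spark (for Flink jobs, the analogous compaction threshold is compaction.delta_commits); the option names are real Hudi configurations, while df, basePath, and the values are illustrative:

        // Keep the clean threshold above the compaction threshold so that the
        // files referenced by a pending compaction plan still exist when the
        // plan is executed.
        df.write.format("hudi").
          option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
          option("hoodie.compact.inline", "true").
          option("hoodie.compact.inline.max.delta.commits", "5"). // compaction trigger
          option("hoodie.cleaner.commits.retained", "6").         // compaction trigger + 1
          mode("append").
          save(basePath)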

  • If the downstream of a MOR table uses streaming computing, retain historical versions at the hour level.

    If the downstream of the MOR table uses streaming computing, such as Flink streaming read, historical versions can be retained for several hours based on service requirements, so that incremental data for the last few hours can be read from log files. If the retention period is too short and a downstream Flink job is restarted or blocked by an abnormal interruption, the upstream incremental data may already have been cleaned; Flink then has to read the incremental data from Parquet files, and performance deteriorates. If the retention period is too long, historical data in logs is stored redundantly.

    For example, to retain historical version data for two hours, use the following formula:

    Number of retained versions = 3600 x 2 / version interval, where the version interval (in seconds) is the checkpoint period of the Flink job or the batch write period of the upstream job.
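
    A worked example in Scala, under the assumption of a 60-second Flink checkpoint period (the values are illustrative):

        // Retain 2 hours of history: 3600 x 2 / 60 = 120 versions.
        val retainSeconds = 3600 * 2       // history window: 2 hours
        val versionIntervalSeconds = 60    // checkpoint / batch write period (assumed)
        val retainedVersions = retainSeconds / versionIntervalSeconds  // = 120
        // Use this value for hoodie.cleaner.commits.retained.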

  • If the service has no special requirements for storing historical version data in a COW table, set the number of retained versions to 1.

    Each version of a COW table contains the full data of the table, so every additional retained version is redundant storage. Therefore, if the service does not require historical data backtracking, set the number of retained versions to 1, that is, retain only the latest version.
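
    A sketch in Spark/Scala (df and basePath are placeholders) for a COW table without history-backtracking requirements:

        // Retain a single version: Clean removes the previous full copies of
        // the table, keeping only the latest one.
        df.write.format("hudi").
          option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
          option("hoodie.cleaner.commits.retained", "1"). // keep only the latest version
          mode("append").
          save(basePath)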

  • The clean operation must be executed at least once a day; executing it every 2 to 4 hours is recommended.

    The MOR and COW tables of Hudi must be cleaned at least once a day. For details about how to clean MOR and COW tables, see section 2.2.1.6. For a COW table, the clean function can automatically determine whether to perform the Clean operation when data is written.
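
    One way to run such a scheduled clean job is a small Spark application submitted every 2 to 4 hours by an external scheduler. The following is a sketch in Spark/Scala, assuming the Hudi run_clean Spark SQL call procedure is available and using a placeholder table name:

        import org.apache.spark.sql.SparkSession

        // A standalone clean job intended to be scheduled every 2-4 hours.
        object ScheduledClean {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder().appName("hudi-clean").getOrCreate()
            // Trigger Clean on the target Hudi table (placeholder name).
            spark.sql("call run_clean(table => 'default.demo_table')")
            spark.stop()
          }
        }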