
Hudi Data Table Archive Specifications

Archive reduces the pressure of metadata reads and writes on Hudi. All metadata is stored under the Hudi table root directory in the .hoodie folder. If the number of files in the .hoodie directory exceeds 10,000, reads and writes on the Hudi table show noticeable latency.

Rules

Archive must be executed on Hudi tables.

The Archive function must be enabled for both MOR and COW Hudi tables.

  • When data is written to a Hudi table, the system automatically decides whether to perform the Archive operation, because archiving is enabled by default (hoodie.archive.automatic defaults to true).
  • The Archive operation is not triggered on every write. At least the following conditions must be met (see the configuration sketch after this list):
    1. The number of commits on the Hudi table exceeds the threshold specified by hoodie.keep.max.commits. For Flink writes, each successful checkpoint produces a commit, so the number of checkpoints must exceed the threshold; for Spark writes, each write operation produces a commit, so the number of writes must exceed the threshold.
    2. The Hudi table has been cleaned. If the table has not been cleaned, the Archive operation is not executed. (Ignore this condition in MRS 3.3.1-LTS and later versions.)
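The following is a minimal PySpark sketch of how these archive-related options can be set on a Hudi write. The configuration keys (hoodie.archive.automatic, hoodie.keep.min.commits, hoodie.keep.max.commits, hoodie.cleaner.commits.retained) are standard Hudi write configurations; the table name, path, columns, and threshold values are illustrative assumptions only.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-archive-config")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Sample data; schema is an illustrative assumption.
df = spark.createDataFrame(
    [(1, "a", 1000), (2, "b", 2000)],
    ["id", "name", "ts"],
)

hudi_options = {
    "hoodie.table.name": "demo_table",                  # assumed table name
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "name",
    # Archiving is decided automatically on write (default behavior).
    "hoodie.archive.automatic": "true",
    # Archiving triggers once active commits exceed the max threshold,
    # then trims the timeline down to the min threshold.
    "hoodie.keep.min.commits": "30",
    "hoodie.keep.max.commits": "40",
    # Cleaning must have run before archiving (in versions earlier than
    # MRS 3.3.1-LTS); min commits kept must exceed commits retained here.
    "hoodie.cleaner.commits.retained": "20",
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/demo_table")                            # assumed table path
)
```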

Suggestions

Execute the Archive job at least once a day. It can be executed every two to four hours.

MOR and COW Hudi tables must be archived at least once a day. For details about how to archive MOR and COW tables, see section 2.2.1.6. For COW tables, whether to perform the Archive operation is automatically determined when data is written.
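As a sketch of running Archive as an independent scheduled job, the snippet below invokes Hudi's archive_commits Spark SQL procedure, which ships with recent Hudi Spark bundles; the table name is an illustrative assumption, additional catalog settings may be needed depending on the Spark and Hudi versions, and the 2-to-4-hour cadence would be handled by an external scheduler.

```python
from pyspark.sql import SparkSession

# Minimal sketch of a standalone Archive job, assuming a Hudi Spark
# bundle on the classpath that provides the archive_commits procedure.
spark = (
    SparkSession.builder
    .appName("hudi-archive-job")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config(
        "spark.sql.extensions",
        "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    )
    .getOrCreate()
)

# 'demo_table' is an assumed table name; the archive thresholds come
# from the table's hoodie.keep.min.commits / hoodie.keep.max.commits.
spark.sql("CALL archive_commits(table => 'demo_table')").show(truncate=False)
```

Running Archive as its own scheduled job keeps the .hoodie directory small without adding archiving latency to the ingestion path.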