
Hudi Data Table Archive Specifications

Archive reduces the pressure of metadata reads and writes on Hudi. All metadata is stored under the Hudi table root directory in the .hoodie folder. If the number of files in the .hoodie directory exceeds 10,000, reads and writes on the Hudi table show noticeable latency.

Rules

Archive must be executed on Hudi tables.

The Archive function must be enabled for both MOR and COW Hudi tables.

  • When data is written to a Hudi table, the system automatically decides whether to perform the Archive operation, because archiving is enabled by default (hoodie.archive.automatic defaults to true).
  • The Archive operation is not triggered on every write. At least the following conditions must be met (see the configuration sketch after this list):
    1. The number of commits on the Hudi table exceeds the threshold specified by hoodie.keep.max.commits. For Flink writes, each successful checkpoint produces a commit, so the number of checkpoints must exceed the threshold; for Spark writes, each write operation produces a commit, so the number of writes must exceed the threshold.
    2. The Hudi table has been cleaned. If the table has not been cleaned, the Archive operation is not executed. (Ignore this condition in MRS 3.3.1-LTS and later versions.)
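The following is a minimal PySpark sketch of how these archive-related options can be set on a Hudi write. The configuration keys (hoodie.archive.automatic, hoodie.keep.min.commits, hoodie.keep.max.commits, hoodie.cleaner.commits.retained) are standard Hudi write configurations; the table name, path, columns, and threshold values are illustrative assumptions only.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-archive-config")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Sample data; schema is an illustrative assumption.
df = spark.createDataFrame(
    [(1, "a", 1000), (2, "b", 2000)],
    ["id", "name", "ts"],
)

hudi_options = {
    "hoodie.table.name": "demo_table",                  # assumed table name
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "name",
    # Archiving is decided automatically on write (default behavior).
    "hoodie.archive.automatic": "true",
    # Archiving triggers once active commits exceed the max threshold,
    # then trims the timeline down to the min threshold.
    "hoodie.keep.min.commits": "30",
    "hoodie.keep.max.commits": "40",
    # Cleaning must have run before archiving (in versions earlier than
    # MRS 3.3.1-LTS); min commits kept must exceed commits retained here.
    "hoodie.cleaner.commits.retained": "20",
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/demo_table")                            # assumed table path
)
```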

Suggestions

Execute the Archive job at least once a day. It can be executed every two to four hours.

MOR and COW Hudi tables must be archived at least once a day. For details about how to archive MOR and COW tables, see section 2.2.1.6. For COW tables, whether to perform the Archive operation is automatically determined when data is written.
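As a sketch of running Archive as an independent scheduled job, the snippet below invokes Hudi's archive_commits Spark SQL procedure, which ships with recent Hudi Spark bundles; the table name is an illustrative assumption, additional catalog settings may be needed depending on the Spark and Hudi versions, and the 2-to-4-hour cadence would be handled by an external scheduler.

```python
from pyspark.sql import SparkSession

# Minimal sketch of a standalone Archive job, assuming a Hudi Spark
# bundle on the classpath that provides the archive_commits procedure.
spark = (
    SparkSession.builder
    .appName("hudi-archive-job")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config(
        "spark.sql.extensions",
        "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    )
    .getOrCreate()
)

# 'demo_table' is an assumed table name; the archive thresholds come
# from the table's hoodie.keep.min.commits / hoodie.keep.max.commits.
spark.sql("CALL archive_commits(table => 'demo_table')").show(truncate=False)
```

Running Archive as its own scheduled job keeps the .hoodie directory small without adding archiving latency to the ingestion path.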