Hudi Data Table Archive Specifications
Archiving alleviates metadata read and write pressure on a Hudi table. All metadata is stored under the .hoodie directory in the root directory of the Hudi table. If the number of files in the .hoodie directory exceeds 10,000, reads and writes on the Hudi table become noticeably slow.
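As a quick health check, the file count under .hoodie can be inspected directly. The sketch below uses the Hadoop FileSystem API from Scala; the table path hdfs://hacluster/tmp/hudi_trips is a placeholder, not a path taken from this document.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HoodieMetaFileCount {
      def main(args: Array[String]): Unit = {
        // Placeholder table root; replace with the actual Hudi table path.
        val tableRoot = "hdfs://hacluster/tmp/hudi_trips"
        val metaDir = new Path(tableRoot, ".hoodie")

        val fs = FileSystem.get(metaDir.toUri, new Configuration())
        // Count the entries directly under .hoodie (timeline files live here).
        val fileCount = fs.listStatus(metaDir).count(_.isFile)
        println(s"Files under $metaDir: $fileCount")
        if (fileCount > 10000) {
          println("WARNING: .hoodie exceeds 10,000 files; archive the table.")
        }
      }
    }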
Rules
Archive operations must be performed on Hudi tables.
Archiving must be enabled for both Hudi MOR and COW tables.
- When data is written to a Hudi table, the system automatically determines whether to run archive, because archiving is enabled by default (hoodie.archive.automatic=true).
- The archive operation is not triggered on every write; both of the following conditions must be met (see the configuration sketch after this list):
- The number of commits on the Hudi table exceeds the threshold specified by hoodie.keep.max.commits. When writing to Hudi with Flink, the number of completed checkpoints must exceed this threshold; when writing with Spark, the number of writes to Hudi must exceed this threshold.
- A clean operation has been performed on the Hudi table; otherwise, archive will not run (this condition does not apply to MRS 3.3.1-LTS and later versions).
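The following Spark datasource write is a minimal sketch of the configurations named above; the table name, field names, paths, and threshold values are placeholders chosen for illustration, not recommended settings. Note that hoodie.keep.min.commits must be greater than hoodie.cleaner.commits.retained.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("hudi-archive-config").getOrCreate()
    val df = spark.read.parquet("/tmp/source_data")            // placeholder input

    df.write.format("hudi").
      option("hoodie.table.name", "hudi_trips").               // placeholder table name
      option("hoodie.datasource.write.recordkey.field", "id"). // placeholder key field
      option("hoodie.datasource.write.precombine.field", "ts").// placeholder precombine field
      // Archive runs automatically once the commit count crosses the thresholds.
      option("hoodie.archive.automatic", "true").
      option("hoodie.keep.min.commits", "20").                 // illustrative values
      option("hoodie.keep.max.commits", "30").
      // Clean must also run so that archive is executed (pre-3.3.1-LTS behavior).
      option("hoodie.clean.automatic", "true").
      option("hoodie.cleaner.commits.retained", "10").
      mode(SaveMode.Append).
      save("/tmp/hudi_trips")                                  // placeholder table path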
Recommendations
Run archive at least once a day; running it every 2 to 4 hours is recommended.
Both MOR and COW tables must be archived at least once a day. For MOR tables, archive can run asynchronously together with compaction, as described in section 2.2.1.6 (a scheduling sketch follows below). For COW tables, archive can be triggered automatically during data writes.
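Where archive is scheduled as its own job, newer Hudi releases (0.11 and later) expose a Spark SQL call procedure for it. The sketch below assumes such a release; the table name hudi_trips is a placeholder, and procedure parameters beyond the table name vary across Hudi versions.

    import org.apache.spark.sql.SparkSession

    // Hudi call procedures require the Hudi SQL extensions and Kryo serialization.
    val spark = SparkSession.builder().
      appName("hudi-archive-job").
      config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
      config("spark.sql.extensions",
        "org.apache.spark.sql.hudi.HoodieSparkSessionExtension").
      getOrCreate()

    // Trigger archiving of old timeline commits for the placeholder table.
    spark.sql("call archive_commits(table => 'hudi_trips')").show()

A job like this can be submitted from the scheduler every 2 to 4 hours, independent of the writer, so that MOR tables are archived even when writes are infrequent.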