Hudi Archive
What Is Archiving?
Archiving cleans up metadata files in Hudi tables. These files are located in the .hoodie directory and named ${timestamp}.${operation_type}.${operation_status}, for example, 20240622143023546.deltacommit.requested. Each operation on a Hudi table generates metadata files, and an excessive number of them degrades performance. It is therefore best to keep the number of metadata files under 1,000.
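For illustration, a timeline with one delta commit and one clean might contain files like the following (the timestamps are hypothetical; a completed instant has no status suffix):
.hoodie/
  20240622143023546.deltacommit.requested
  20240622143023546.deltacommit.inflight
  20240622143023546.deltacommit
  20240622150501213.clean.requested
  20240622150501213.clean.inflight
  20240622150501213.clean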
How to Execute Archiving?
- Archive after writing data.
- Spark SQL (Set the following parameters; archiving is triggered when data is written. See the example after the parameters.)
hoodie.archive.automatic=true
hoodie.keep.max.commits=30 // The default value is 30, but you can adjust it based on the service scenario.
hoodie.keep.min.commits=20 // The default value is 20, but you can adjust it based on the service scenario.
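For example, in a spark-sql session you can set these parameters before the write that should trigger archiving (the table names below are hypothetical):
set hoodie.archive.automatic=true;
set hoodie.keep.max.commits=30;
set hoodie.keep.min.commits=20;
insert into hudi_table select * from source_table;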
- SparkDataSource (Set the following parameters in option(); archiving is triggered when data is written. See the sketch after the parameters.)
hoodie.keep.max.commits=30 // The default value is 30, but you can adjust it based on the service scenario.
hoodie.keep.min.commits=20 // The default value is 20, but you can adjust it based on the service scenario.
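A minimal Scala sketch of passing these options on a DataFrame write; the record key, precombine field, table name, and path are assumptions for illustration:
// Sketch: archive settings passed as write options on the Hudi datasource.
df.write.format("hudi")
  .option("hoodie.datasource.write.recordkey.field", "id")   // assumed record key
  .option("hoodie.datasource.write.precombine.field", "ts")  // assumed precombine field
  .option("hoodie.table.name", "hudi_table")                 // assumed table name
  .option("hoodie.keep.max.commits", "30")
  .option("hoodie.keep.min.commits", "20")
  .mode("append")
  .save("/tmp/hudi_table")                                   // assumed table path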
- Flink (Set the following parameters in the WITH clause; archiving is triggered when data is written. See the example after the parameters.)
archive.max_commits=30 // The default value is 30, but you can adjust it based on the service scenario.
archive.min_commits=20 // The default value is 20, but you can adjust it based on the service scenario.
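For example, a Flink SQL sink table can carry these settings in its WITH clause; apart from the two archive options, the schema, path, and table type below are assumptions:
CREATE TABLE hudi_sink (
  id INT,
  name STRING,
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi_sink',   -- assumed table path
  'table.type' = 'MERGE_ON_READ',     -- assumed table type
  'archive.max_commits' = '30',
  'archive.min_commits' = '20'
);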
- Manually trigger archive once.
- Spark SQL (Set the following parameters, then manually trigger archiving once)
hoodie.archive.automatic=true
hoodie.keep.max.commits=30 // The default value is 30, but you can adjust it based on the service scenario.
hoodie.keep.min.commits=20 // The default value is 20, but you can adjust it based on the service scenario.
Execute the following SQL. Archiving is triggered when a clean operation has been performed, the timeline contains instants whose data files have already been cleaned, and the total number of instants exceeds 30 (the value of hoodie.keep.max.commits set above).
run archivelog on ${table_name}
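For example, for a hypothetical table named hudi_table:
run archivelog on hudi_table;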