Hudi Clean
What Is Clean?
Cleaning removes data file versions (Parquet or log files) in a Hudi table that are no longer needed. This reduces storage pressure and makes file listing operations more efficient.
How to Execute Clean?
- Clean automatically after writing data. (Example sketches for each engine follow this list.)
- Spark SQL (Set the following parameters; clean is triggered during the write once the retention condition is met.)
hoodie.clean.automatic=true
hoodie.cleaner.commits.retained=10 // The default value is 10; adjust it based on the service scenario.
- Spark DataSource (Set the following parameter in the write options; clean is triggered during the write.)
hoodie.cleaner.commits.retained=10 // The default value is 10; adjust it based on the service scenario.
- Flink (Set the following parameter in the WITH clause; clean is triggered during the write.)
clean.retain_commits=10 // The default value is 10; adjust it based on the service scenario.
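For the Spark SQL case, the following is a minimal PySpark sketch. It assumes a Hudi-enabled SparkSession and an existing Hudi table named hudi_tbl (both names are illustrative); the SET statements apply to writes issued in the same session.

from pyspark.sql import SparkSession

# Assumption: the session is started with the Hudi Spark bundle and the Hudi SQL
# extensions already configured (for example via --packages and --conf on spark-submit).
spark = SparkSession.builder.appName("hudi-auto-clean").getOrCreate()

# Enable automatic clean and retain the last 10 commits (the default).
spark.sql("set hoodie.clean.automatic=true")
spark.sql("set hoodie.cleaner.commits.retained=10")

# A subsequent write to the (hypothetical) table hudi_tbl can trigger clean
# once more than 10 commits have accumulated on the timeline.
spark.sql("insert into hudi_tbl values (1, 'a', 1000)")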
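For the Spark DataSource case, the parameter is passed as a write option. The sketch below is a minimal PySpark example; the schema, record key, precombine field, and storage path are assumptions for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-datasource-clean").getOrCreate()

# Hypothetical input data matching an assumed (id, name, ts) table layout.
df = spark.createDataFrame([(1, "a", 1000)], ["id", "name", "ts"])

(df.write.format("hudi")
    .option("hoodie.table.name", "hudi_tbl")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    # Retain the last 10 commits; older file versions become clean candidates.
    .option("hoodie.cleaner.commits.retained", "10")
    .mode("append")
    .save("/tmp/hudi_tbl"))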
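For Flink, clean.retain_commits goes into the WITH clause of the table DDL. To keep all examples in one language, the sketch below issues the DDL through PyFlink; the schema and the other connector options (path, table.type) are assumptions for illustration.

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical Hudi sink table; clean runs on the write path once more than
# the retained 10 commits exist on the timeline.
t_env.execute_sql("""
    CREATE TABLE hudi_tbl (
        id   INT,
        name STRING,
        ts   TIMESTAMP(3),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'hudi',
        'path' = 'hdfs:///tmp/hudi_tbl',
        'table.type' = 'MERGE_ON_READ',
        'clean.retain_commits' = '10'
    )
""")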
- Manually trigger clean once.
- Spark SQL (Set the following parameters, then trigger clean manually once; a sketch follows the SQL statement below.)
hoodie.clean.automatic=true
hoodie.cleaner.commits.retained=10 // The default value is 10; adjust it based on the service scenario.
Execute the following SQL. Clean is triggered when the timeline contains more than 10 instants:
run clean on ${table_name}
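The same manual trigger can be issued from PySpark, as sketched below. The session setup and the table name hudi_tbl are assumptions; the run clean statement itself is the one shown above.

from pyspark.sql import SparkSession

# Assumption: the session is Hudi-enabled (Hudi bundle and SQL extensions on the classpath).
spark = SparkSession.builder.appName("hudi-manual-clean").getOrCreate()

spark.sql("set hoodie.clean.automatic=true")
spark.sql("set hoodie.cleaner.commits.retained=10")

# Triggers clean once; file versions older than the retained commits are removed
# when the timeline holds more than 10 instants.
spark.sql("run clean on hudi_tbl")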