Hudi Compaction
What Is Compaction?
Compaction merges the base files and log files of merge-on-read (MOR) tables. It involves two processes: Schedule and Run. The Schedule process generates a compaction plan in the timeline, which records which Parquet base files will be merged with which log files; it only produces the plan and does not perform the merge. The Run process executes the compaction plans in the timeline one by one until all of them are completed.
In a MOR table, data is stored in columnar Parquet base files and row-based Avro log files. Updates are written to the incremental log files, and synchronous or asynchronous compaction later merges them into new versions of the columnar base files. Because MOR tables are designed to reduce data ingestion latency, asynchronous compaction that does not block ingestion is particularly useful.
How to Execute Compaction?
- Schedule only
- Spark SQL (Set the following parameters, trigger on data write)
hoodie.compact.inline=true
hoodie.schedule.compact.only.inline=true
hoodie.run.compact.only.inline=false
hoodie.compact.inline.max.delta.commits=5 // The default value is 5, but you can adjust it based on the service scenario.
After any write SQL is executed, a compaction plan is generated when the condition is met (for example, there are 5 delta log files under the same file slice).
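A minimal spark-shell sketch of this setup (Scala; the table name and values are placeholders, not part of the original example):
spark.sql("set hoodie.compact.inline=true")
spark.sql("set hoodie.schedule.compact.only.inline=true")
spark.sql("set hoodie.run.compact.only.inline=false")
spark.sql("set hoodie.compact.inline.max.delta.commits=5")
// Any subsequent write can generate a compaction plan once the delta-commit threshold is reached.
spark.sql("insert into my_mor_table values (1, 'a', 100)")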
- Spark SQL (Set the following parameters, manually trigger once)
hoodie.compact.inline=true
hoodie.schedule.compact.only.inline=true
hoodie.run.compact.only.inline=false
hoodie.compact.inline.max.delta.commits=5 // The default value is 5, but you can adjust it based on the service scenario.
Then manually execute the following SQL:
schedule compaction on ${table_name}
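A minimal spark-shell sketch of this manual scheduling flow (Scala; the table name is a placeholder):
spark.sql("set hoodie.compact.inline=true")
spark.sql("set hoodie.schedule.compact.only.inline=true")
spark.sql("set hoodie.run.compact.only.inline=false")
// Generate a compaction plan for the placeholder table on demand.
spark.sql("schedule compaction on my_mor_table")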
- SparkDataSource (Set the following parameters in the option, trigger on data write)
hoodie.schedule.compact.only.inline=true
hoodie.run.compact.only.inline=false
hoodie.compact.inline.max.delta.commits=5 // The default value is 5, but you can adjust it based on the service scenario.
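A spark-shell (Scala) sketch of passing these options through the DataSource writer; the DataFrame schema, key fields, table name, and path are placeholder assumptions, not part of the original example:
import org.apache.spark.sql.SaveMode
import spark.implicits._
// Placeholder data: id, name, ts
val df = Seq((1, "a", 100L)).toDF("id", "name", "ts")
df.write.format("hudi").
  option("hoodie.table.name", "my_mor_table").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  // Schedule-only compaction settings from this section.
  option("hoodie.schedule.compact.only.inline", "true").
  option("hoodie.run.compact.only.inline", "false").
  option("hoodie.compact.inline.max.delta.commits", "5").
  mode(SaveMode.Append).
  save("/tmp/hudi/my_mor_table")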
- Run only
- Spark SQL (Set the following parameters, manually trigger once)
hoodie.compact.inline=true
hoodie.schedule.compact.only.inline=false
hoodie.run.compact.only.inline=true
Then execute the following SQL:
run compaction on ${table_name}
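A minimal spark-shell sketch (Scala; the table name is a placeholder):
spark.sql("set hoodie.compact.inline=true")
spark.sql("set hoodie.schedule.compact.only.inline=false")
spark.sql("set hoodie.run.compact.only.inline=true")
// Execute all pending compaction plans recorded in the timeline of the placeholder table.
spark.sql("run compaction on my_mor_table")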
- Execute Schedule and Run together
If there is no compaction plan in the timeline, this mode attempts to generate a compaction plan and then execute it.
- Spark SQL (Set the following parameters, trigger on data write when the condition is met)
hoodie.compact.inline=true
hoodie.schedule.compact.only.inline=false
hoodie.run.compact.only.inline=false
hoodie.compact.inline.max.delta.commits=5 // The default value is 5, but you can adjust it based on the service scenario.
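A minimal spark-shell sketch (Scala; table name and values are placeholders). With these settings, a single write both generates and executes the compaction plan once the threshold is met:
spark.sql("set hoodie.compact.inline=true")
spark.sql("set hoodie.schedule.compact.only.inline=false")
spark.sql("set hoodie.run.compact.only.inline=false")
spark.sql("set hoodie.compact.inline.max.delta.commits=5")
// When the delta-commit threshold is reached, this write schedules and runs compaction inline.
spark.sql("insert into my_mor_table values (2, 'b', 200)")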
- SparkDataSource (Set the following parameters in the option, trigger on data write)
hoodie.schedule.compact.only.inline=false
hoodie.run.compact.only.inline=false
hoodie.compact.inline.max.delta.commits=5 // The default value is 5, but you can adjust it based on the service scenario.
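The corresponding DataSource write is the same as the earlier Scala sketch, with both flags set to false; a compact placeholder version:
import org.apache.spark.sql.SaveMode
import spark.implicits._
val df = Seq((2, "b", 200L)).toDF("id", "name", "ts")
df.write.format("hudi").
  option("hoodie.table.name", "my_mor_table").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  // Inline schedule-and-run settings from this section.
  option("hoodie.schedule.compact.only.inline", "false").
  option("hoodie.run.compact.only.inline", "false").
  option("hoodie.compact.inline.max.delta.commits", "5").
  mode(SaveMode.Append).
  save("/tmp/hudi/my_mor_table")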
- Flink (Set the following parameters in the with attribute, trigger on data write)
compaction.schedule.enabled=false
compaction.delta_commits=5 // The default value is 5, but you can adjust it based on the service scenario.
- Recommended approach
- Spark/Flink Streaming jobs: Execute only Schedule, and then set up a separate Spark SQL job to execute Run at regular intervals.
- Spark batch jobs: Execute Schedule and Run together directly.
To maximize data lake ingestion efficiency, you are advised to generate the compaction plan synchronously with ingestion and execute the plan asynchronously.