Updated on 2025-02-22 GMT+08:00

Hudi Compaction

What Is Compaction?

Compaction merges the base files and log files of merge-on-read (MOR) tables. It involves two processes: Schedule and Run. The Schedule process generates a compaction plan on the timeline, recording which Parquet base files will be merged with which log files; it only produces the plan and does not perform the merge. The Run process executes the compaction plans on the timeline one by one until all of them are completed.

In MOR tables, data is stored in columnar Parquet base files and row-based Avro log files. Updates are recorded in the incremental log files, and synchronous or asynchronous compaction later merges them into new versions of the columnar files. Because MOR tables are designed to reduce data ingestion latency, asynchronous compaction, which does not block ingestion, is especially valuable.

How to Execute Compaction?

  1. Schedule only
    • Spark SQL (Set the following parameters, trigger on data write)
      hoodie.compact.inline=true
      hoodie.schedule.compact.only.inline=true
      hoodie.run.compact.only.inline=false
      hoodie.compact.inline.max.delta.commits=5 // The default value is 5, but you can adjust it based on the service scenario.

      After any write SQL is executed, a compaction plan is generated once the trigger condition is met (for example, five delta commits have accumulated since the last compaction).
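
      A minimal spark-shell sketch of this flow is shown below. The table name hudi_mor_tbl and its columns (id, name, ts) are assumptions for illustration; replace them with your own table.

      // Run in spark-shell with the Hudi bundle on the classpath; `spark` is the built-in SparkSession.
      spark.sql("set hoodie.compact.inline=true")
      spark.sql("set hoodie.schedule.compact.only.inline=true")
      spark.sql("set hoodie.run.compact.only.inline=false")
      spark.sql("set hoodie.compact.inline.max.delta.commits=5")
      // Each write adds a delta commit; once the threshold is reached, the write also
      // generates a compaction plan on the timeline without executing it.
      spark.sql("insert into hudi_mor_tbl values (1, 'a1', 1000)")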

    • Spark SQL (Set the following parameters, manually trigger once)
      hoodie.compact.inline=true
      hoodie.schedule.compact.only.inline=true
      hoodie.run.compact.only.inline=false
      hoodie.compact.inline.max.delta.commits=5 // The default value is 5, but you can adjust it based on the service scenario.

      Then manually execute the following SQL:

      schedule compaction on ${table_name}
    • SparkDataSource (Set the following parameters in the option, trigger on data write)

      hoodie.compact.inline=true

      hoodie.schedule.compact.only.inline=true

      hoodie.run.compact.only.inline=false

      hoodie.compact.inline.max.delta.commits=5 // The default value is 5, but you can adjust it based on the service scenario.
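
      A minimal spark-shell sketch of such a DataSource write is shown below. The table name, columns, storage path, and the record key/precombine/partition fields are assumptions for illustration, not part of the original example.

      import org.apache.spark.sql.SaveMode
      import spark.implicits._  // `spark` is the spark-shell SparkSession

      // Sample input; replace with your own DataFrame.
      val df = Seq((1, "a1", 1000L, "2025-02-22")).toDF("id", "name", "ts", "dt")

      df.write.format("hudi").
        option("hoodie.table.name", "hudi_mor_tbl").
        option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
        option("hoodie.datasource.write.recordkey.field", "id").
        option("hoodie.datasource.write.precombine.field", "ts").
        option("hoodie.datasource.write.partitionpath.field", "dt").
        // Compaction parameters from this section (schedule only).
        option("hoodie.compact.inline", "true").
        option("hoodie.schedule.compact.only.inline", "true").
        option("hoodie.run.compact.only.inline", "false").
        option("hoodie.compact.inline.max.delta.commits", "5").
        mode(SaveMode.Append).
        save("hdfs:///tmp/hudi/hudi_mor_tbl")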

    • Flink (Set the following parameters in the with attribute, trigger on data write)

      compaction.async.enabled=false

      compaction.schedule.enabled=true

      compaction.delta_commits=5 // The default value is 5, but you can adjust it based on the service scenario.
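
      A minimal Flink Table API sketch (Scala) showing where these with-attributes go is given below. The table name, columns, and path are assumptions for illustration.

      import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

      // Requires the hudi-flink bundle on the classpath.
      val tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode())
      tableEnv.executeSql(
        """CREATE TABLE hudi_mor_tbl (
          |  id INT PRIMARY KEY NOT ENFORCED,
          |  name STRING,
          |  ts BIGINT
          |) WITH (
          |  'connector' = 'hudi',
          |  'path' = 'hdfs:///tmp/hudi/hudi_mor_tbl',
          |  'table.type' = 'MERGE_ON_READ',
          |  'compaction.async.enabled' = 'false',
          |  'compaction.schedule.enabled' = 'true',
          |  'compaction.delta_commits' = '5'
          |)""".stripMargin)
      // Writes into this table only generate compaction plans; execute them in a separate job.
      tableEnv.executeSql("INSERT INTO hudi_mor_tbl VALUES (1, 'a1', 1000)")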

  2. Run only
    • Spark SQL (Set the following parameters, manually trigger once)
      hoodie.compact.inline=true
      hoodie.schedule.compact.only.inline=false
      hoodie.run.compact.only.inline=true

      Then execute the following SQL:

      run compaction on ${table_name}
  3. Execute Schedule and Run together

    If there is no compaction plan on the timeline, Hudi attempts to generate a compaction plan and then executes it.

    • Spark SQL (Set the following parameters, trigger on data write when the condition is met)
      hoodie.compact.inline=true
      hoodie.schedule.compact.only.inline=false
      hoodie.run.compact.only.inline=false
      hoodie.compact.inline.max.delta.commits=5 // The default value is 5, but you can adjust it based on the service scenario.
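
      A minimal spark-shell sketch, mirroring the schedule-only sketch under "Schedule only" but with the flags set so that the triggering write also executes the plan; hudi_mor_tbl and its columns are placeholders.

      spark.sql("set hoodie.compact.inline=true")
      spark.sql("set hoodie.schedule.compact.only.inline=false")
      spark.sql("set hoodie.run.compact.only.inline=false")
      spark.sql("set hoodie.compact.inline.max.delta.commits=5")
      // Once the delta-commit threshold is reached, this write generates a compaction
      // plan and executes it inline, within the same write operation.
      spark.sql("insert into hudi_mor_tbl values (2, 'a2', 2000)")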
    • SparkDataSource (Set the following parameters in the option, trigger on data write)

      hoodie.compact.inline=true

      hoodie.schedule.compact.only.inline=false

      hoodie.run.compact.only.inline=false

      hoodie.compact.inline.max.delta.commits=5 // The default value is 5, but you can adjust it based on the service scenario.

    • Flink (Set the following parameters in the with attribute, trigger on data write)

      compaction.async.enabled=true

      compaction.schedule.enabled=false

      compaction.delta_commits=5 // The default value is 5, but you can adjust it based on the service scenario.

  4. Recommended approach
    • Spark/Flink Streaming jobs: Execute only Schedule, and then set up a separate Spark SQL job to execute Run at regular intervals.
    • Spark batch jobs: Execute Schedule and Run together directly.

      To keep data lake ingestion as efficient as possible, you are advised to generate the compaction plan synchronously with the write and execute the plan asynchronously in a separate job.
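
      For example, the separate Spark SQL job mentioned above, triggered at regular intervals by your own scheduler, could be as simple as the following sketch (hudi_mor_tbl is a placeholder table name):

      // Periodic maintenance job: executes all pending compaction plans left on the
      // timeline by writers that only run Schedule.
      spark.sql("set hoodie.compact.inline=true")
      spark.sql("set hoodie.schedule.compact.only.inline=false")
      spark.sql("set hoodie.run.compact.only.inline=true")
      spark.sql("run compaction on hudi_mor_tbl")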