Updated on 2025-02-22 GMT+08:00

Hudi Compaction

What Is Compaction?

Compaction merges the base files and log files of merge-on-read (MOR) tables. It involves two processes: Schedule and Run. The Schedule process generates a compaction plan on the timeline, recording which Parquet base files will be merged with which log files; it only produces the plan and does not perform the merge. The Run process executes the compaction plans on the timeline one by one until all of them are completed.

In MOR tables, data is stored in columnar Parquet base files and row-based Avro log files. Updates are recorded in the incremental log files, and synchronous or asynchronous compaction later merges them into new versions of the columnar files. Because MOR tables are designed to reduce data ingestion latency, asynchronous compaction, which does not block ingestion, is especially valuable.

How to Execute Compaction?

  1. Schedule only
    • Spark SQL (Set the following parameters, trigger on data write)
      hoodie.compact.inline=true
      hoodie.schedule.compact.only.inline=true
      hoodie.run.compact.only.inline=false
      hoodie.compact.inline.max.delta.commits=5 // The default value is 5, but you can adjust it based on the service scenario.

      After any write SQL is executed, a compaction plan is generated once the trigger condition is met (for example, five delta commits have accumulated since the last compaction).
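
      A minimal spark-shell sketch of this flow is shown below. The table name hudi_mor_tbl and its columns (id, name, ts) are assumptions for illustration; replace them with your own table.

      // Run in spark-shell with the Hudi bundle on the classpath; `spark` is the built-in SparkSession.
      spark.sql("set hoodie.compact.inline=true")
      spark.sql("set hoodie.schedule.compact.only.inline=true")
      spark.sql("set hoodie.run.compact.only.inline=false")
      spark.sql("set hoodie.compact.inline.max.delta.commits=5")
      // Each write adds a delta commit; once the threshold is reached, the write also
      // generates a compaction plan on the timeline without executing it.
      spark.sql("insert into hudi_mor_tbl values (1, 'a1', 1000)")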

    • Spark SQL (Set the following parameters, manually trigger once)
      hoodie.compact.inline=true
      hoodie.schedule.compact.only.inline=true
      hoodie.run.compact.only.inline=false
      hoodie.compact.inline.max.delta.commits=5 // The default value is 5, but you can adjust it based on the service scenario.

      Then manually execute the following SQL:

      schedule compaction on ${table_name}
    • SparkDataSource (Set the following parameters in the option, trigger on data write)

      hoodie.compact.inline=true

      hoodie.schedule.compact.only.inline=true

      hoodie.run.compact.only.inline=false

      hoodie.compact.inline.max.delta.commits=5 // The default value is 5, but you can adjust it based on the service scenario.
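
      A minimal spark-shell sketch of such a DataSource write is shown below. The table name, columns, storage path, and the record key/precombine/partition fields are assumptions for illustration, not part of the original example.

      import org.apache.spark.sql.SaveMode
      import spark.implicits._  // `spark` is the spark-shell SparkSession

      // Sample input; replace with your own DataFrame.
      val df = Seq((1, "a1", 1000L, "2025-02-22")).toDF("id", "name", "ts", "dt")

      df.write.format("hudi").
        option("hoodie.table.name", "hudi_mor_tbl").
        option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
        option("hoodie.datasource.write.recordkey.field", "id").
        option("hoodie.datasource.write.precombine.field", "ts").
        option("hoodie.datasource.write.partitionpath.field", "dt").
        // Compaction parameters from this section (schedule only).
        option("hoodie.compact.inline", "true").
        option("hoodie.schedule.compact.only.inline", "true").
        option("hoodie.run.compact.only.inline", "false").
        option("hoodie.compact.inline.max.delta.commits", "5").
        mode(SaveMode.Append).
        save("hdfs:///tmp/hudi/hudi_mor_tbl")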

    • Flink (Set the following parameters in the with attribute, trigger on data write)

      compaction.async.enabled=false

      compaction.schedule.enabled=true

      compaction.delta_commits=5 // The default value is 5, but you can adjust it based on the service scenario.
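
      A minimal Flink Table API sketch (Scala) showing where these with-attributes go is given below. The table name, columns, and path are assumptions for illustration.

      import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

      // Requires the hudi-flink bundle on the classpath.
      val tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode())
      tableEnv.executeSql(
        """CREATE TABLE hudi_mor_tbl (
          |  id INT PRIMARY KEY NOT ENFORCED,
          |  name STRING,
          |  ts BIGINT
          |) WITH (
          |  'connector' = 'hudi',
          |  'path' = 'hdfs:///tmp/hudi/hudi_mor_tbl',
          |  'table.type' = 'MERGE_ON_READ',
          |  'compaction.async.enabled' = 'false',
          |  'compaction.schedule.enabled' = 'true',
          |  'compaction.delta_commits' = '5'
          |)""".stripMargin)
      // Writes into this table only generate compaction plans; execute them in a separate job.
      tableEnv.executeSql("INSERT INTO hudi_mor_tbl VALUES (1, 'a1', 1000)")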

  2. Run only
    • Spark SQL (Set the following parameters, manually trigger once)
      hoodie.compact.inline=true
      hoodie.schedule.compact.only.inline=false
      hoodie.run.compact.only.inline=true

      Then execute the following SQL:

      run compaction on ${table_name}
  3. Execute Schedule and Run together

    If there is no compaction plan on the timeline, Hudi attempts to generate a compaction plan and then executes it.

    • Spark SQL (Set the following parameters, trigger on data write when the condition is met)
      hoodie.compact.inline=true
      hoodie.schedule.compact.only.inline=false
      hoodie.run.compact.only.inline=false
      hoodie.compact.inline.max.delta.commits=5 // The default value is 5, but you can adjust it based on the service scenario.
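
      A minimal spark-shell sketch, mirroring the schedule-only sketch under "Schedule only" but with the flags set so that the triggering write also executes the plan; hudi_mor_tbl and its columns are placeholders.

      spark.sql("set hoodie.compact.inline=true")
      spark.sql("set hoodie.schedule.compact.only.inline=false")
      spark.sql("set hoodie.run.compact.only.inline=false")
      spark.sql("set hoodie.compact.inline.max.delta.commits=5")
      // Once the delta-commit threshold is reached, this write generates a compaction
      // plan and executes it inline, within the same write operation.
      spark.sql("insert into hudi_mor_tbl values (2, 'a2', 2000)")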
    • SparkDataSource (Set the following parameters in the option, trigger on data write)

      hoodie.compact.inline=true

      hoodie.schedule.compact.only.inline=false

      hoodie.run.compact.only.inline=false

      hoodie.compact.inline.max.delta.commits=5 // The default value is 5, but you can adjust it based on the service scenario.

    • Flink (Set the following parameters in the with attribute, trigger on data write)

      compaction.async.enabled=true

      compaction.schedule.enabled=false

      compaction.delta_commits=5 // The default value is 5, but you can adjust it based on the service scenario.

  4. Recommended approach
    • Spark/Flink Streaming jobs: Execute only Schedule, and then set up a separate Spark SQL job to execute Run at regular intervals.
    • Spark batch jobs: Execute Schedule and Run together directly.

      To keep data lake ingestion as efficient as possible, you are advised to generate the compaction plan synchronously with the write and execute the plan asynchronously in a separate job.
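
      For example, the separate Spark SQL job mentioned above, triggered at regular intervals by your own scheduler, could be as simple as the following sketch (hudi_mor_tbl is a placeholder table name):

      // Periodic maintenance job: executes all pending compaction plans left on the
      // timeline by writers that only run Schedule.
      spark.sql("set hoodie.compact.inline=true")
      spark.sql("set hoodie.schedule.compact.only.inline=false")
      spark.sql("set hoodie.run.compact.only.inline=true")
      spark.sql("run compaction on hudi_mor_tbl")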