Updated on 2024-11-29 GMT+08:00

Compaction

Introduction to Compaction

A compaction merges base and log files of MOR tables.

In a MOR table, data is stored in columnar Parquet base files and row-based Avro log files. Updates are first recorded in the incremental log files, and a synchronous or asynchronous compaction then merges them to generate new versions of the columnar files. Because MOR tables are intended to reduce data ingestion latency, asynchronous compaction, which does not block ingestion, is particularly useful.
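The merge described above can be pictured with a small toy model. This is only an illustrative sketch, not Hudi's implementation: the function name compact_file_slice and the in-memory data layout are assumptions made for this example, and real compaction also handles deletes, ordering fields, and file formats.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def compact_file_slice(
    base: Dict[str, Record],
    log: List[Tuple[str, Record]],
) -> Dict[str, Record]:
    """Toy model: apply logged updates to a base file, keyed by record key.

    base stands in for the columnar base file, log for the row-based
    incremental file. Later log entries win over earlier ones.
    """
    merged = dict(base)          # new version of the base file; the input is left untouched
    for key, record in log:      # replay updates in the order they were written
        merged[key] = record     # update an existing key or insert a new one
    return merged


if __name__ == "__main__":
    base_file = {"k1": {"v": 1}, "k2": {"v": 2}}
    log_file = [("k1", {"v": 10}), ("k3", {"v": 3}), ("k1", {"v": 11})]
    print(compact_file_slice(base_file, log_file))
```

The point of the sketch is that writes only append to the log, while the more expensive rewrite of the base file is deferred to compaction.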

Using Compaction

Compaction consists of two steps:
  1. Generate a compaction scheduling plan. Hudi scans partitions, selects the file slices to be compacted, and writes the compaction plan to the Hudi timeline.
  2. Execute the compaction plan. Hudi reads the compaction plan and compacts the selected file slices.
Compaction can be performed synchronously or asynchronously, which is controlled by the hoodie.compact.inline parameter. The default value is true, that is, synchronous compaction.
  • In synchronous mode, a compaction scheduling plan is automatically generated and executed as part of the write. This behavior can be adjusted as follows:
    1. Disable synchronous compactions.

      When writing through a data source, set .option("hoodie.compact.inline", "false") to disable automatic compaction.

      When writing through spark-sql, run set hoodie.compact.inline=false; to disable automatic compaction.

    2. Generate the compaction scheduling plan synchronously without executing the compaction.
      • When writing through a data source, configure the following options:

        .option("hoodie.compact.inline", "true")

        .option("hoodie.schedule.compact.only.inline", "true")

        .option("hoodie.run.compact.only.inline", "false")

      • When writing through spark-sql, set the following parameters:

        set hoodie.compact.inline=true;

        set hoodie.schedule.compact.only.inline=true;

        set hoodie.run.compact.only.inline=false;

  • Asynchronous mode is implemented through spark-sql.

    To execute only a compaction scheduling plan that has already been generated, without creating a new one, set the following parameters:

    set hoodie.compact.inline=true;

    set hoodie.schedule.compact.only.inline=false;

    set hoodie.run.compact.only.inline=true;

    For more compaction parameters, see Compaction and Cleaning Configurations.

    To maximize the efficiency of data ingestion into the lake, you are advised to generate compaction scheduling plans synchronously and execute them asynchronously.
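The parameter combinations listed in this section can be collected in one place, as in the following minimal sketch. The helper names compaction_options and as_spark_sql and the mode labels are hypothetical and exist only in this example; the parameter keys and values are the ones given above.

```python
from typing import Dict, List

# Mode labels are hypothetical names used only in this sketch.
_PRESETS: Dict[str, Dict[str, str]] = {
    # Plan generation and execution both happen during the write.
    "inline": {"hoodie.compact.inline": "true"},
    # No compaction during the write.
    "disabled": {"hoodie.compact.inline": "false"},
    # The plan is generated during the write but not executed.
    "schedule_only": {
        "hoodie.compact.inline": "true",
        "hoodie.schedule.compact.only.inline": "true",
        "hoodie.run.compact.only.inline": "false",
    },
    # An existing plan is executed; no new plan is generated.
    "run_only": {
        "hoodie.compact.inline": "true",
        "hoodie.schedule.compact.only.inline": "false",
        "hoodie.run.compact.only.inline": "true",
    },
}


def compaction_options(mode: str) -> Dict[str, str]:
    """Return a copy of the option set for the given mode label."""
    if mode not in _PRESETS:
        raise ValueError(f"unknown compaction mode: {mode!r}")
    return dict(_PRESETS[mode])


def as_spark_sql(options: Dict[str, str]) -> List[str]:
    """Render options as spark-sql set statements, in insertion order."""
    return [f"set {key}={value};" for key, value in options.items()]


if __name__ == "__main__":
    # Recommended split: the ingestion job only schedules,
    # and a separate job executes the generated plans.
    for statement in as_spark_sql(compaction_options("schedule_only")):
        print(statement)
```

With PySpark, the returned dictionary could be passed to a data source writer, for example df.write.format("hudi").options(**compaction_options("schedule_only")), assuming the other required Hudi write options are also set.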