Updated on 2022-09-22 GMT+08:00

Compaction

A compaction merges base and log files of MOR tables.

For MOR tables, data is stored in columnar Parquet base files and row-based Avro log files. Updates are first recorded in the incremental log files, and a synchronous or asynchronous compaction then merges them to generate new versions of the columnar files. Because MOR tables are meant to reduce data ingestion latency, an asynchronous compaction that does not block ingestion is useful.
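The relationship between base files, log files, and compaction can be pictured with a small toy model. This is only a conceptual sketch in plain Python, not how Hudi is implemented: the FileSlice class, its fields, and its methods are hypothetical names introduced for illustration, with a dict standing in for a Parquet base file and a list standing in for the Avro log files.

```python
from dataclasses import dataclass, field


@dataclass
class FileSlice:
    """Toy model: one base file version plus the updates logged after it."""

    base: dict                       # record key -> row (stands in for a Parquet base file)
    base_version: int = 0
    log: list = field(default_factory=list)  # (key, row) updates (stands in for Avro log files)

    def upsert(self, key, row):
        # Ingestion only appends to the log, so writes stay cheap.
        self.log.append((key, row))

    def read(self):
        # In this toy model, a read merges the base with the logged updates on the fly.
        merged = dict(self.base)
        for key, row in self.log:
            merged[key] = row
        return merged

    def compact(self):
        # Compaction materializes the merged view as a new base version
        # and drops the log records that have been folded in.
        self.base = self.read()
        self.base_version += 1
        self.log.clear()
```

In this sketch, upsert never rewrites the base, which is why ingestion latency stays low, and compact is the only operation that produces a new base version.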

  • An asynchronous compaction is performed in the following two steps:
    1. Scheduling a compaction: This step is performed by the job that imports data into the data lake. Hudi scans partitions and selects the file slices to be compacted, and then writes a compaction plan to the Hudi timeline.
    2. Executing a compaction: A separate process or thread reads the compaction plan and performs the compaction of file slices.
  • Compaction can be synchronous or asynchronous.
    • The synchronous mode is controlled by the hoodie.compact.inline parameter. The default value is true, which means that a compaction plan is automatically generated and then executed.
      • Disable synchronous compaction.

        When writing through the DataSource API, set .option("hoodie.compact.inline", "false") to disable automatic compaction.

        When writing through spark-sql, run the set hoodie.compact.inline=false; command to disable automatic compaction.

      • Generate only the compaction plan synchronously, without executing the compaction.
        • When writing through the DataSource API, configure the following options:

          .option("hoodie.compact.inline", "true")

          .option("hoodie.schedule.compact.only.inline", "true")

          .option("hoodie.run.compact.only.inline", "false")

        • When writing through spark-sql, configure the following set parameters:

          set hoodie.compact.inline=true;

          set hoodie.schedule.compact.only.inline=true;

          set hoodie.run.compact.only.inline=false;

    • The asynchronous mode is implemented through spark-sql.

      To execute only a compaction plan that has already been generated, without creating a new plan, configure the following set parameters:

      set hoodie.compact.inline=true;

      set hoodie.schedule.compact.only.inline=false;

      set hoodie.run.compact.only.inline=true;

      For more compaction parameters, see Compaction and Cleaning Configurations.

      To maximize the efficiency of data import into the lake, you are advised to generate compaction plans synchronously and execute them asynchronously.
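The three parameter combinations described above can be collected in one place, as in the following minimal sketch. The parameter names and values come from this section. The mode names ("disabled", "schedule_only", "run_only") and the helper functions are hypothetical and exist only for this illustration. apply_to_writer assumes an object with a chainable option(key, value) method, such as a Spark DataFrameWriter; Spark itself is not imported here.

```python
# Parameter names and values are taken from this section.
# Mode names and helper functions are hypothetical, for illustration only.
COMPACTION_MODES = {
    # No compaction plan is generated and no compaction is executed inline.
    "disabled": {
        "hoodie.compact.inline": "false",
    },
    # A compaction plan is generated synchronously, but not executed.
    "schedule_only": {
        "hoodie.compact.inline": "true",
        "hoodie.schedule.compact.only.inline": "true",
        "hoodie.run.compact.only.inline": "false",
    },
    # An existing compaction plan is executed; no new plan is generated.
    "run_only": {
        "hoodie.compact.inline": "true",
        "hoodie.schedule.compact.only.inline": "false",
        "hoodie.run.compact.only.inline": "true",
    },
}


def compaction_options(mode):
    """Return a copy of the option set for the given (hypothetical) mode name."""
    if mode not in COMPACTION_MODES:
        raise ValueError(f"unknown compaction mode: {mode!r}")
    return dict(COMPACTION_MODES[mode])


def as_spark_sql(options):
    """Render options as spark-sql set statements, in insertion order."""
    return [f"set {key}={value};" for key, value in options.items()]


def apply_to_writer(writer, options):
    """Chain writer.option(key, value) for each option and return the result."""
    for key, value in options.items():
        writer = writer.option(key, value)
    return writer
```

Following the recommendation above, the ingestion job would use compaction_options("schedule_only"), and a separate spark-sql session would run the statements from as_spark_sql(compaction_options("run_only")).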