Offline Compaction Configuration
When MOR tables are written in real time, compaction plans are typically generated synchronously during the write but are not executed there. You therefore need to schedule a separate task, using DataArts Studio or scripts, that executes the already generated compaction plans with SparkSQL.
- Execution parameters
```sql
set hoodie.compact.inline = true;           -- Enable compaction.
set hoodie.run.compact.only.inline = true;  -- Only execute existing compaction plans; do not generate new ones.
set hoodie.cleaner.commits.retained = 120;  -- Retain 120 commits when cleaning.
set hoodie.keep.max.commits = 140;          -- Maximum of 140 commits before archiving.
set hoodie.keep.min.commits = 121;          -- Minimum of 121 commits to keep after archiving.
set hoodie.clean.async = false;             -- Disable asynchronous cleaning.
set hoodie.clean.automatic = false;         -- Disable automatic cleaning so that compaction does not trigger clean.
set hoodie.archive.async = false;           -- Disable asynchronous archiving.
set hoodie.archive.automatic = false;       -- Disable automatic archiving.
run compaction on $tablename;               -- Execute the compaction plan.
run clean on $tablename;                    -- Execute clean to remove redundant file versions.
run archivelog on $tablename;               -- Execute archivelog to merge and clean up metadata files.
```
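To run the script against a concrete table, substitute the table name for `$tablename`. A minimal sketch, assuming a MOR table named `db_demo.tbl_mor` (the name is illustrative only):

```sql
-- Execute the pending compaction plans for a hypothetical table db_demo.tbl_mor,
-- then clean redundant file versions and archive old timeline metadata.
run compaction on db_demo.tbl_mor;
run clean on db_demo.tbl_mor;
run archivelog on db_demo.tbl_mor;
```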
- The values of the clean and archive parameters should not be set too large, as this can degrade the performance of Hudi tables. The general recommendations are:
hoodie.cleaner.commits.retained = 2 x Number of commits required for compaction
hoodie.keep.min.commits = hoodie.cleaner.commits.retained + 1
hoodie.keep.max.commits = hoodie.keep.min.commits + 20
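As a worked example of these formulas (the trigger frequency is an assumption): if a compaction plan is generated every 60 delta commits, the recommended settings would be:

```sql
-- Assumption: a compaction plan is generated every 60 commits (illustrative value).
set hoodie.cleaner.commits.retained = 120;  -- 2 x 60
set hoodie.keep.min.commits = 121;          -- 120 + 1
set hoodie.keep.max.commits = 141;          -- 121 + 20
```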
- Execute clean and archive after compaction completes. Because clean and archivelog require far fewer resources than compaction, you can avoid wasting resources by using DataArts Studio to schedule compaction as one task and clean plus archive as a second task with a smaller resource configuration, as sketched below.
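A minimal sketch of the two-task split, reusing the illustrative table name from above:

```sql
-- Task 1 (larger resources): execute compaction plans only.
set hoodie.compact.inline = true;
set hoodie.run.compact.only.inline = true;
set hoodie.clean.automatic = false;    -- Keep clean out of this task.
set hoodie.archive.automatic = false;  -- Keep archive out of this task.
run compaction on db_demo.tbl_mor;

-- Task 2 (smaller resources): clean and archive only.
run clean on db_demo.tbl_mor;
run archivelog on db_demo.tbl_mor;
```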
- Execution resources
- The interval at which the compaction task is scheduled should be shorter than the interval at which compaction plans are generated. For example, if a compaction plan is generated approximately every hour, the task that executes the plans should be scheduled at least once every half hour.
- The resources configured for a compaction job should provide at least as many vCPUs as there are buckets in a single partition, with a vCPU-to-memory ratio of 1:4, that is, 4 GB of memory per vCPU. For example, a table with 100 buckets per partition needs at least 100 vCPUs and 400 GB of memory.