Updated on 2024-08-30 GMT+08:00

Offline Compaction Configuration

For real-time services on MOR tables, compaction plans are generated during the data write but are not executed inline. Therefore, DataArts or scripts must be used to schedule SparkSQL jobs that execute the generated compaction plans.

  • Execution parameters
    set hoodie.compact.inline = true; // Enable compaction.
    set hoodie.run.compact.only.inline = true; // Execute only the generated compaction plans; do not generate new plans.
    set hoodie.cleaner.commits.retained = 120; // Retain 120 commits during cleaning.
    set hoodie.keep.max.commits = 140; // Retain a maximum of 140 commit records in the archive.
    set hoodie.keep.min.commits = 121; // Retain a minimum of 121 commit records in the archive.
    set hoodie.clean.async = false; // Disable asynchronous cleanup.
    set hoodie.clean.automatic = false; // Disable automatic cleaning so that the compaction operation does not trigger a clean operation.
    run compaction on $tablename; // Execute the generated compaction plans.
    run clean on $tablename; // Run clean to delete redundant file versions.
    run archivelog on $tablename; // Run archivelog to merge and clean up metadata files.

    Do not set the cleaning and archiving parameters to large values; otherwise, Hudi table performance will be affected. The following settings are recommended:

    hoodie.cleaner.commits.retained = twice the number of commits required for one compaction

    hoodie.keep.min.commits = hoodie.cleaner.commits.retained + 1

    hoodie.keep.max.commits = hoodie.keep.min.commits + 20
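    The formulas above can be applied as a minimal sketch; the commit count per compaction (60) is a hypothetical example value, not one taken from this document:

    ```python
    # Sketch: derive the cleaning/archiving parameters from the recommended formulas.
    # commits_per_compaction = 60 is a hypothetical example value.
    commits_per_compaction = 60

    cleaner_commits_retained = 2 * commits_per_compaction  # twice the commits per compaction
    keep_min_commits = cleaner_commits_retained + 1        # minimum archive retention
    keep_max_commits = keep_min_commits + 20               # maximum archive retention

    print(cleaner_commits_retained, keep_min_commits, keep_max_commits)  # → 120 121 141
    ```

    With 60 commits per compaction, the formulas reproduce the retained/min values used in the example settings above.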

    Run the clean and archivelog commands after the compaction command finishes. Clean and archivelog have low resource requirements, so to avoid wasting resources when scheduling with DataArts, configure compaction as one task and clean/archivelog as a separate task, and assign different resources to each task.
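    The two-task split described above can be sketched as two SQL statement sequences, one per scheduled task; the table name db.mor_table is a hypothetical placeholder:

    ```python
    # Sketch: build the SQL statement lists for two separately scheduled tasks,
    # so that compaction and clean/archivelog can be given different resources.
    # "db.mor_table" is a hypothetical table name.
    table = "db.mor_table"

    # Task 1: execute only the already-generated compaction plans.
    compaction_task = [
        "set hoodie.compact.inline = true;",
        "set hoodie.run.compact.only.inline = true;",
        f"run compaction on {table};",
    ]

    # Task 2: lightweight cleanup, scheduled after the compaction task.
    clean_archive_task = [
        "set hoodie.clean.automatic = false;",
        f"run clean on {table};",
        f"run archivelog on {table};",
    ]

    print("\n".join(compaction_task + clean_archive_task))
    ```

    Each list would be submitted as one SparkSQL script in its own DataArts task, with the second task configured to run after the first.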

  • Execution resources
    1. The interval for scheduling the compaction-execution task must be shorter than the interval at which compaction plans are generated. For example, if a compaction plan is generated about every hour, the task that executes the plans must be scheduled at least every half hour.
    2. For the resources configured for the compaction job, the number of vcores must be at least equal to the number of buckets in a single partition. The ratio of vcores to memory must be 1:4, that is, 4 GB of memory per vcore.
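    The sizing rule above can be sketched as a small calculation; the bucket count (100) is a hypothetical example value:

    ```python
    # Sketch: size a compaction job from the bucket count of a single partition.
    # buckets_per_partition = 100 is a hypothetical example value.
    buckets_per_partition = 100

    vcores = buckets_per_partition  # at least one vcore per bucket in a partition
    memory_gb = 4 * vcores          # 1 vcore : 4 GB memory ratio

    print(vcores, memory_gb)  # → 100 400
    ```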