Updated on 2023-04-28 GMT+08:00

Cleaning

Cleaning deletes old versions of data that are no longer required.

Hudi runs a cleaner in the background that continuously deletes unneeded data of old versions. You can set hoodie.cleaner.policy to choose the cleaning policy and hoodie.cleaner.commits.retained to set the number of commits to retain.
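As a minimal sketch of how these two parameters fit together, the snippet below collects them into a plain dictionary of write options. The helper name cleaner_options and the value 20 are hypothetical and chosen only for illustration; KEEP_LATEST_COMMITS is assumed here as the policy value, so check the configuration reference for the values your Hudi version supports.

```python
# Hypothetical helper (not part of Hudi): builds the cleaner-related
# write options as a plain dict of string keys and string values.
def cleaner_options(policy="KEEP_LATEST_COMMITS", commits_retained=10):
    # Retaining fewer than one commit makes no sense, so reject it early.
    if commits_retained < 1:
        raise ValueError("commits_retained must be at least 1")
    return {
        # Which cleaning policy the cleaner applies (assumed value).
        "hoodie.cleaner.policy": policy,
        # How many commits the cleaner keeps; passed as a string option.
        "hoodie.cleaner.commits.retained": str(commits_retained),
    }


# Illustrative values only: keep the data of the latest 20 commits.
opts = cleaner_options(commits_retained=20)
print(opts)
```

In a PySpark job, a dictionary like this could be passed to the writer with df.write.format("hudi").options(**opts), alongside the table's other required options.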

You can use either of the following methods to perform cleaning:

  • Synchronous cleaning is controlled by the hoodie.clean.automatic parameter, which is enabled by default.

    Disable synchronous cleaning:

    When writing data through the DataSource API, use .option("hoodie.clean.automatic", "false") to disable automatic cleaning.

    When writing data through spark-sql, run set hoodie.clean.automatic=false; to disable automatic cleaning.

  • Asynchronous cleaning is performed with spark-sql. For details, see CLEAN.
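The two ways of turning synchronous cleaning on or off described above can be sketched as follows. Both helper names are hypothetical, and the table name in the example is a placeholder; only the hoodie.clean.automatic key and the set statement come from the text above.

```python
# Hypothetical helper: returns a copy of the DataSource write options
# with hoodie.clean.automatic switched on or off.
def with_automatic_clean(options, enabled):
    updated = dict(options)  # copy so the caller's dict is left untouched
    updated["hoodie.clean.automatic"] = "true" if enabled else "false"
    return updated


# Hypothetical helper: builds the equivalent spark-sql session statement.
def automatic_clean_statement(enabled):
    value = "true" if enabled else "false"
    return f"set hoodie.clean.automatic={value};"


# "demo_table" is a placeholder table name used only for illustration.
base_options = {"hoodie.table.name": "demo_table"}
write_options = with_automatic_clean(base_options, enabled=False)

print(write_options)
print(automatic_clean_statement(enabled=False))
```

Keeping the switch in one helper makes it harder for a job to disable automatic cleaning on some writes and forget it on others. If synchronous cleaning is disabled, old versions are removed only when cleaning is triggered another way, such as the asynchronous method.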

For more cleaning parameters, see Compaction and Cleaning Configurations.