Updated on 2025-02-22 GMT+08:00

Hudi Clean

What Is Clean?

Cleaning removes old versions of data files (Parquet files or log files) that a Hudi table no longer needs. This reduces storage pressure and makes file listing operations more efficient.

How to Execute Clean?

  1. Clean after writing data.
    • Spark SQL (set the following parameters; clean is triggered on data write once the condition is met)
      hoodie.clean.automatic=true
      hoodie.cleaner.commits.retained=10 // The default value is 10. Adjust it based on the service scenario.
    • SparkDataSource (set the following parameters in the option; clean is triggered on data write)
      hoodie.clean.automatic=true
      hoodie.cleaner.commits.retained=10 // The default value is 10. Adjust it based on the service scenario.
    • Flink (set the following parameters in the WITH clause of the table definition; clean is triggered on data write)
      clean.async.enabled=true
      clean.retain_commits=10 // The default value is 10. Adjust it based on the service scenario.
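
The SparkDataSource parameters listed above can be passed as write options. The following is a minimal PySpark sketch, not a definitive implementation: the helper names build_hudi_options and write_with_clean are hypothetical, and a real write usually needs further table options (for example, record key and precombine fields), which are omitted here and can be supplied through the extra argument.

```python
# Clean-related Hudi write options, copied from the parameters listed above.
CLEAN_OPTIONS = {
    "hoodie.clean.automatic": "true",
    "hoodie.cleaner.commits.retained": "10",
}


def build_hudi_options(table_name, extra=None):
    """Merge the clean options with the table name and any caller-supplied options.

    Hypothetical helper for illustration; values in `extra` override the defaults.
    """
    options = {"hoodie.table.name": table_name}
    options.update(CLEAN_OPTIONS)
    options.update(extra or {})
    return options


def write_with_clean(df, base_path, table_name, extra=None):
    """Append a Spark DataFrame to a Hudi table with automatic clean enabled.

    `df` is assumed to be a pyspark.sql.DataFrame created by the caller.
    """
    (
        df.write.format("hudi")
        .options(**build_hudi_options(table_name, extra))
        .mode("append")
        .save(base_path)
    )
```

Keeping the options in one dictionary makes it easy to change hoodie.cleaner.commits.retained in a single place when the service scenario calls for a different retention depth.
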

  2. Manually trigger clean once.
    • Spark SQL (set the following parameters, then trigger clean manually once)
      hoodie.clean.automatic=true
      hoodie.cleaner.commits.retained=10 // The default value is 10. Adjust it based on the service scenario.

      Then execute the following SQL statement. The clean operation is triggered only when the timeline contains more than 10 instant records:

      run clean on ${table_name}
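
The trigger condition described above (more than 10 instant records in the timeline, with 10 commits retained) can be illustrated with a small Python sketch. This is a simplified model for intuition only, not Hudi's actual cleaner implementation: the function names and the instant timestamps are hypothetical, and the real cleaner works on file versions rather than on a plain list of instants.

```python
def should_trigger_clean(instants, commits_retained=10):
    """Return True when the timeline holds more instants than are retained."""
    return len(instants) > commits_retained


def instants_outside_retention(instants, commits_retained=10):
    """Return the oldest instants that fall outside the retained window.

    Instants are assumed to be fixed-width timestamp strings, so sorting
    them lexicographically orders them from oldest to newest.
    """
    ordered = sorted(instants)
    cutoff = max(len(ordered) - commits_retained, 0)
    return ordered[:cutoff]


# Hypothetical timeline with 12 instants, one per minute.
timeline = [f"2025022210{minute:02d}00" for minute in range(12)]

print(should_trigger_clean(timeline))        # True
print(instants_outside_retention(timeline))  # ['20250222100000', '20250222100100']
```

In this model, a timeline with 10 or fewer instants yields no candidates, which matches the statement that clean is triggered only after the timeline grows beyond the retained count.
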