
Hudi Archive

What Is Archiving?

Archiving cleans up the metadata files of a Hudi table by moving completed instants from the active timeline into the archived timeline. These files are stored in the .hoodie directory and are named ${timestamp}.${operation_type}.${operation_status}, for example, 20240622143023546.deltacommit.requested. Every operation on a Hudi table generates metadata files, and an excessive number of them degrades performance, so it is recommended to keep the number of metadata files under 1,000.

How to Execute Archiving?

  1. Archive after writing data.
    • Spark SQL (Set the following parameters; archiving is then triggered automatically when data is written. See the Spark SQL sketch after this list.)
      hoodie.archive.automatic=true
      hoodie.keep.max.commits=30 // The default value is 30, but you can adjust it based on the service scenario.
      hoodie.keep.min.commits=20 // The default value is 20, but you can adjust it based on the service scenario.
    • SparkDataSource (Set the following parameters in the write options; archiving is then triggered automatically when data is written. See the SparkDataSource sketch after this list.)
      hoodie.archive.automatic=true
      hoodie.keep.max.commits=30 // The default value is 30, but you can adjust it based on the service scenario.
      hoodie.keep.min.commits=20 // The default value is 20, but you can adjust it based on the service scenario.

    • Flink (Set the following parameters in the with attribute; archiving is then triggered automatically when data is written. See the Flink sketch after this list.)
      hoodie.archive.automatic=true
      archive.max_commits=30 // The default value is 30, but you can adjust it based on the service scenario.
      archive.min_commits=20 // The default value is 20, but you can adjust it based on the service scenario.
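
    Spark SQL sketch. A minimal spark-shell example of the flow above; the target table hudi_db.hudi_table and the source table src are hypothetical names:

      // Session-level archive settings; any later write can trigger archiving.
      spark.sql("set hoodie.archive.automatic=true")
      spark.sql("set hoodie.keep.max.commits=30")
      spark.sql("set hoodie.keep.min.commits=20")
      // A normal write; after the commit, archiving runs once the active
      // timeline holds more instants than hoodie.keep.max.commits allows.
      spark.sql("insert into hudi_db.hudi_table select * from src")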
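
    SparkDataSource sketch. A minimal Scala example, assuming a source DataFrame read from a hypothetical path; the table name, base path, record key field (id), and precombine field (ts) are placeholders:

      import org.apache.spark.sql.{SaveMode, SparkSession}

      val spark = SparkSession.builder().appName("hudi-archive-demo").getOrCreate()
      val df = spark.read.parquet("/tmp/source_data") // hypothetical source path

      // Archive options ride along with the write; archiving runs after the commit.
      df.write.format("hudi")
        .option("hoodie.table.name", "hudi_table")                // hypothetical table name
        .option("hoodie.datasource.write.recordkey.field", "id")  // hypothetical key field
        .option("hoodie.datasource.write.precombine.field", "ts") // hypothetical precombine field
        .option("hoodie.archive.automatic", "true")
        .option("hoodie.keep.max.commits", "30")
        .option("hoodie.keep.min.commits", "20")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/hudi_table")                             // hypothetical base path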
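
    Flink sketch. A minimal Table API example in Scala; the schema, table name, and base path are hypothetical, and the archive settings are carried in the with attribute of the DDL:

      import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

      val tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode())

      // The WITH clause carries the archive settings; the path is a placeholder.
      tableEnv.executeSql(
        """
          |CREATE TABLE hudi_sink (
          |  id INT PRIMARY KEY NOT ENFORCED,
          |  name STRING,
          |  ts TIMESTAMP(3)
          |) WITH (
          |  'connector' = 'hudi',
          |  'path' = 'hdfs:///tmp/hudi/hudi_sink',
          |  'hoodie.archive.automatic' = 'true',
          |  'archive.max_commits' = '30',
          |  'archive.min_commits' = '20'
          |)
          |""".stripMargin)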

  2. Manually trigger archive once.
    • Spark SQL (Set the following parameters, then manually trigger archiving once.)
      hoodie.archive.automatic=true
      hoodie.keep.max.commits=30 // The default value is 30, but you can adjust it based on the service scenario.
      hoodie.keep.min.commits=20 // The default value is 20, but you can adjust it based on the service scenario.

      Then execute the following SQL statement. Archiving is triggered when a clean operation has already run, the timeline contains instants whose data files have been cleaned, and the total number of instants exceeds the hoodie.keep.max.commits value (30 in this example).

      run archivelog on ${table_name}
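
      Putting it together, a minimal spark-shell sketch of the manual flow; the table name hudi_db.hudi_table is a placeholder:

        // One-off archive run: set the retention thresholds, then trigger archiving.
        spark.sql("set hoodie.archive.automatic=true")
        spark.sql("set hoodie.keep.max.commits=30")
        spark.sql("set hoodie.keep.min.commits=20")
        // Archives eligible instants from the active timeline in a single pass.
        spark.sql("run archivelog on hudi_db.hudi_table")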