
Configuring Automatic Combination of Small Files

Scenario

After the automatic small file combination feature is enabled, Spark first writes data to a temporary directory and then checks whether the average file size of each partition is below the configured threshold (16 MB by default). If it is, the partition is considered to contain small files; Spark then starts a job to combine these small files and writes the combined larger files to the final table directory.

Constraints

  • Table types supported for the write: Hive and datasource tables
  • Data formats supported: Parquet and ORC (a sketch of tables that satisfy both constraints follows this list)
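
For illustration only, the following Spark SQL sketch creates one table of each supported type and performs a partitioned write that would be checked for small files once the feature is enabled. The table, column, and partition names (demo_ds_parquet, demo_hive_orc, source_table, dt) are hypothetical, and a Hive-enabled Spark SQL session is assumed.

-- Datasource table stored as Parquet (hypothetical names)
CREATE TABLE demo_ds_parquet (id INT, name STRING, dt STRING)
USING parquet
PARTITIONED BY (dt);

-- Hive table stored as ORC (hypothetical names)
CREATE TABLE demo_hive_orc (id INT, name STRING)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- A partitioned write that the feature would inspect for small files
INSERT INTO demo_ds_parquet PARTITION (dt = '2024-05-29')
SELECT id, name FROM source_table;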

Parameters

Modify the following parameters in the Client installation directory/Spark/spark/conf/spark-defaults.conf file on the Spark client.

  • spark.sql.mergeSmallFiles.enabled
    Description: If this parameter is set to true, Spark checks whether small files were written when writing data to the target table. If small files are found, Spark starts the job that merges them.
    Default value: false
  • spark.sql.mergeSmallFiles.threshold.avgSize
    Description: If the average file size of a partition is smaller than this value, small file merging is started.
    Default value: 16MB
  • spark.sql.mergeSmallFiles.maxSizePerTask
    Description: Target size of each file after merging.
    Default value: 256MB
  • spark.sql.mergeSmallFiles.moveParallelism
    Description: Degree of parallelism used to move temporary files to the final directory when small files do not need to be merged.
    Default value: 10000
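
As a minimal sketch, a spark-defaults.conf excerpt that enables the feature might look as follows. Only spark.sql.mergeSmallFiles.enabled needs to be changed; the other three lines simply restate the defaults listed above and can be tuned as needed.

spark.sql.mergeSmallFiles.enabled            true
spark.sql.mergeSmallFiles.threshold.avgSize  16MB
spark.sql.mergeSmallFiles.maxSizePerTask     256MB
spark.sql.mergeSmallFiles.moveParallelism    10000

Because spark-defaults.conf is read when an application starts, only Spark sessions launched after the change pick up the new values.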