
Optimizing Small Files

Scenarios

A Spark SQL table may contain many small files (each far smaller than an HDFS block), and by default each small file maps to one partition in Spark, that is, one task. Spark therefore has to start a large number of such tasks. If the SQL logic involves a shuffle operation, the number of hash buckets soars, severely hindering system performance.

When there is a massive number of small files, DataSource splits the small files in the Spark SQL table into PartitionedFiles while creating the RDD and then merges multiple PartitionedFiles into one partition, which prevents too many hash buckets from being generated during the shuffle operation. See Figure 1.

Figure 1 Merging small files
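
How many small files are merged into one partition is governed by the two parameters described in the procedure below: the maximum partition size and the estimated cost of opening a file. The following Scala sketch illustrates the packing idea under simplified assumptions (the FileInfo type, file names, and sizes are hypothetical, and this is not Spark's actual implementation): files are taken largest first and appended to the current partition until adding another file would exceed the size limit, with each file also charged the open cost.

    object SmallFilePackingSketch {

      // Hypothetical descriptor for an input file: name and size in bytes.
      case class FileInfo(name: String, length: Long)

      // Packs files into partitions, largest files first, so that no partition
      // exceeds maxPartitionBytes; each file is also charged openCostInBytes.
      def pack(files: Seq[FileInfo],
               maxPartitionBytes: Long,
               openCostInBytes: Long): Seq[Seq[FileInfo]] = {
        val partitions = scala.collection.mutable.ArrayBuffer.empty[Seq[FileInfo]]
        val current = scala.collection.mutable.ArrayBuffer.empty[FileInfo]
        var currentSize = 0L

        files.sortBy(-_.length).foreach { file =>
          // Close the current partition if this file no longer fits.
          if (current.nonEmpty && currentSize + file.length > maxPartitionBytes) {
            partitions += current.toSeq
            current.clear()
            currentSize = 0L
          }
          current += file
          // The file contributes its real size plus the estimated open cost.
          currentSize += file.length + openCostInBytes
        }
        if (current.nonEmpty) partitions += current.toSeq
        partitions.toSeq
      }

      def main(args: Array[String]): Unit = {
        // 100 hypothetical 1 MB files, packed with the example values from Table 1.
        val smallFiles = (1 to 100).map(i => FileInfo(s"part-$i", 1L * 1024 * 1024))
        val partitions = pack(smallFiles,
          maxPartitionBytes = 128L * 1024 * 1024,
          openCostInBytes = 4L * 1024 * 1024)
        println(s"${smallFiles.size} files packed into ${partitions.size} partitions")
      }
    }

With these example values, the sketch packs the 100 one-megabyte files into 4 partitions instead of 100, which is the effect the merge is meant to achieve.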

Procedure

  1. Install the Spark client.

    For details, see Installing a Client.

  2. Log in to the Spark client node as the client installation user.

    Modify the following parameters in the {Client installation directory}/Spark/spark/conf/spark-defaults.conf file on the Spark client.

    Table 1 Parameter description

    Parameter: spark.sql.files.maxPartitionBytes
    Description: The maximum number of bytes to pack into a single partition when reading files. Unit: byte
    Example Value: 134217728 (128 MB)

    Parameter: spark.files.openCostInBytes
    Description: The estimated cost to open a file, measured by the number of bytes that could be scanned in the same time. This is used when multiple files are put into a single partition. It is better to over-estimate; partitions with small files will then be processed faster than partitions with bigger files (which are scheduled first).
    Example Value: 4M
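
    For example, using the example values in Table 1, the following lines could be appended to spark-defaults.conf (tune both values to the file sizes and cluster resources of the actual workload):

      spark.sql.files.maxPartitionBytes  134217728
      spark.files.openCostInBytes        4M

    The modified values take effect for applications subsequently submitted from this client; they can also be overridden for a single job with --conf on spark-submit.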