Configuring Automatic Merging of Small Files in Spark
Scenarios
When automatic small file merging is enabled, Spark first writes data to a temporary directory and then checks whether the average file size in each partition is below a threshold (16 MB by default). A partition whose average file size is below the threshold is treated as containing small files: Spark starts a job to merge them and writes the resulting larger files to the final table directory.
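The per-partition check described above can be sketched in a few lines of Python. This is only an illustration of the decision rule, not Spark's actual implementation: the function name `partitions_to_merge`, the dictionary layout, and the partition names are hypothetical, and only the 16 MB default comes from this page.

```python
from typing import Dict, List

MB = 1024 * 1024
# Default threshold stated on this page (16 MB).
DEFAULT_AVG_SIZE_THRESHOLD = 16 * MB


def partitions_to_merge(
    partition_files: Dict[str, List[int]],
    threshold: int = DEFAULT_AVG_SIZE_THRESHOLD,
) -> List[str]:
    """Return the partitions whose average file size is below the threshold.

    partition_files maps a partition name to the sizes (in bytes) of the
    files written for it in the temporary directory (hypothetical layout).
    """
    selected = []
    for partition, sizes in partition_files.items():
        if not sizes:
            continue  # no files were written for this partition
        average = sum(sizes) / len(sizes)
        if average < threshold:  # strictly below the threshold
            selected.append(partition)
    return selected


# Hypothetical staged output: file sizes per partition.
staged = {
    "dt=2024-01-01": [2 * MB, 3 * MB, 1 * MB],   # average 2 MB
    "dt=2024-01-02": [200 * MB, 180 * MB],       # average 190 MB
    "dt=2024-01-03": [16 * MB],                  # average exactly 16 MB
}
print(partitions_to_merge(staged))
```

In this sketch only the first partition is selected, because the comparison is "less than" and a partition whose average equals the threshold is left alone. Note that the check uses the average, so a partition with one very large file and many tiny ones may not be selected.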
Constraints
- Data can be written only to Hive and DataSource tables.
- Parquet and ORC data formats are supported.
Parameters
- Install the Spark client.
For details, see Installing a Client.
- Log in to the Spark client node as the client installation user.
Modify the following parameters in the {Client installation directory}/Spark/spark/conf/spark-defaults.conf file on the Spark client.
| Parameter | Description | Example Value |
| --- | --- | --- |
| spark.sql.mergeSmallFiles.enabled | Whether Spark automatically merges small files when writing data to a table. true: Spark checks for small files while writing to the target table and, if any are detected, starts a merge job to consolidate them into larger files. false: Automatic small file merging is disabled. | true |
| spark.sql.mergeSmallFiles.threshold.avgSize | Threshold for the average file size within a partition. When automatic small file merging is enabled (spark.sql.mergeSmallFiles.enabled=true) and the average file size of a partition falls below this value, Spark starts a small file merge for that partition. | 16 MB |
| spark.sql.mergeSmallFiles.maxSizePerTask | Target size of each file after merging. When automatic small file merging is enabled (spark.sql.mergeSmallFiles.enabled=true), this parameter sets the upper limit on the size of each merged file that a single task produces, which helps keep merging efficient and stable. | 256 MB |
| spark.sql.mergeSmallFiles.moveParallelism | Maximum degree of parallelism for moving temporary files to the final directory during small file merging. If the number of temporary files exceeds the specified value, a file merging job is triggered. | 10000 |
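As a rough illustration, the parameters above might appear in spark-defaults.conf as follows. The parameter names and example values come from this page; the exact syntax for the size values (for example `16MB` rather than a byte count) is an assumption, so check the format your Spark version accepts before copying it.

```properties
# Enable automatic small file merging on table writes.
spark.sql.mergeSmallFiles.enabled true
# Merge a partition when its average file size is below this value
# (size syntax is an assumption).
spark.sql.mergeSmallFiles.threshold.avgSize 16MB
# Upper limit for the size of each merged file produced by a task
# (size syntax is an assumption).
spark.sql.mergeSmallFiles.maxSizePerTask 256MB
# Maximum parallelism for moving temporary files to the final directory.
spark.sql.mergeSmallFiles.moveParallelism 10000
```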