Configuring Automatic Combination of Small Files
Scenario
After the automatic small file combination feature is enabled, Spark writes data to a temporary directory and then checks whether the average file size of each partition is less than 16 MB (the default threshold). If the average file size is below the threshold, the partition contains small files, so Spark starts a job to combine them and writes the combined large files to the final table directory.
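The merge decision described above can be illustrated with a minimal sketch (this is not Spark's actual implementation; the threshold constant simply mirrors the default value of spark.sql.mergeSmallFiles.threshold.avgSize):

```python
# Hypothetical sketch of the per-partition merge check: a partition
# qualifies for merging when the average size of its output files
# falls below the threshold (16 MB by default).

THRESHOLD_AVG_SIZE = 16 * 1024 * 1024  # mirrors spark.sql.mergeSmallFiles.threshold.avgSize

def needs_merge(file_sizes_bytes):
    """Return True if the partition's average file size is below the threshold."""
    if not file_sizes_bytes:
        return False
    avg = sum(file_sizes_bytes) / len(file_sizes_bytes)
    return avg < THRESHOLD_AVG_SIZE

# A partition of many 2 MB files qualifies; one of 200 MB files does not.
print(needs_merge([2 * 1024 * 1024] * 50))    # True
print(needs_merge([200 * 1024 * 1024] * 4))   # False
```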
Constraints
- Supported table types for writing: Hive and Datasource
- Supported data formats: Parquet and ORC
Parameters
Parameter | Description | Default Value
---|---|---
spark.sql.mergeSmallFiles.enabled | If set to true, Spark checks whether small files were produced when writing data to the target table. If small files are found, Spark starts a job to merge them. | false
spark.sql.mergeSmallFiles.threshold.avgSize | If the average file size of a partition is smaller than this value, small file merging is triggered for that partition. | 16MB
spark.sql.mergeSmallFiles.maxSizePerTask | Target size of each file after merging. | 256MB
spark.sql.mergeSmallFiles.moveParallelism | Degree of parallelism for moving temporary files to the final directory when small files do not need to be merged. | 10000
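These parameters can be set per session, for example with SET statements in spark-sql or spark-beeline (a sketch; the values shown are the defaults and are illustrative):

```sql
-- Enable automatic small file merging for the current session
SET spark.sql.mergeSmallFiles.enabled=true;
-- Merge when a partition's average file size is below 16 MB
SET spark.sql.mergeSmallFiles.threshold.avgSize=16MB;
-- Aim for roughly 256 MB per merged file
SET spark.sql.mergeSmallFiles.maxSizePerTask=256MB;
```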