Configuring Automatic Merging of Small Files
Scenario
After automatic small file merging is enabled, Spark first writes data to a temporary directory and then checks whether the average file size in each partition is below a threshold (16 MB by default). A partition whose average file size is below the threshold is considered to contain small files. For such partitions, Spark starts a job that merges the small files and writes the resulting larger files to the final table directory.
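The per-partition check described above can be sketched in a few lines of Python. This is only an illustration of the averaging rule, not Spark's implementation: the function name `partition_has_small_files` is hypothetical, and treating 1 MB as 2**20 bytes is an assumption.

```python
MB = 1024 * 1024  # assumption: "MB" is interpreted as 2**20 bytes


def partition_has_small_files(file_sizes, avg_size_threshold=16 * MB):
    """Return True if the average file size of one partition is below the threshold.

    file_sizes: sizes in bytes of the files written for a single partition.
    avg_size_threshold: hypothetical stand-in for the 16 MB default.
    """
    if not file_sizes:
        return False  # nothing was written, so there is nothing to merge
    return sum(file_sizes) / len(file_sizes) < avg_size_threshold


small = [2 * MB, 4 * MB, 6 * MB]  # average 4 MB, below the threshold
large = [128 * MB, 200 * MB]      # average 164 MB, above the threshold
print(partition_has_small_files(small))  # True
print(partition_has_small_files(large))  # False
```

Because the rule uses the average, a partition holding one large file and several tiny ones may not qualify for merging.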
Constraints
- Data can be written only to Hive and DataSource tables.
- Only the Parquet and ORC data formats are supported.
Parameter Configuration
| Parameter | Description | Default Value |
|---|---|---|
| spark.sql.mergeSmallFiles.enabled | If this parameter is set to true, Spark checks whether small files are produced when writing data to the target table. If small files are found, Spark starts a file merging job. | false |
| spark.sql.mergeSmallFiles.threshold.avgSize | If the average file size of a partition is smaller than this value, small file merging is started. | 16 MB |
| spark.sql.mergeSmallFiles.maxSizePerTask | Target size of each file after merging. | 256 MB |
| spark.sql.mergeSmallFiles.moveParallelism | Maximum degree of parallelism for moving temporary files to the final directory. If the number of temporary files exceeds this value, a file merging job is triggered. | 10,000 |
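As one possible way to apply these settings, the sketch below renders a parameter dictionary as `--conf key=value` arguments for `spark-submit`. The helper `to_spark_submit_conf` is hypothetical. Only the enabling switch is set here, so the other parameters keep their defaults; the exact value format accepted for the size parameters is not stated in this section and should be checked before overriding them.

```python
def to_spark_submit_conf(settings):
    """Render a dict of Spark properties as a list of spark-submit arguments."""
    args = []
    for key, value in settings.items():
        if isinstance(value, bool):
            value = str(value).lower()  # Spark expects lowercase true/false
        args += ["--conf", f"{key}={value}"]
    return args


# Enable small file merging; all other parameters stay at their defaults.
merge_settings = {"spark.sql.mergeSmallFiles.enabled": True}
print(" ".join(to_spark_submit_conf(merge_settings)))
# --conf spark.sql.mergeSmallFiles.enabled=true
```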