Optimizing Spark SQL Performance in the Small File Scenario
Scenarios
A Spark SQL table may contain many small files (each far smaller than an HDFS block). By default, each small file maps to one partition in Spark, that is, one task. When the number of small files is large, Spark has to launch a correspondingly large number of tasks, and if the SQL statement involves shuffle operations, the number of hash buckets also increases, degrading performance.
In this scenario, you can manually specify the split size for each task to prevent an excessive number of tasks and improve performance.
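You can observe the effect directly from the number of partitions a table scan produces. The following is a minimal sketch, assuming a Hive table named small_file_table that is backed by many small files (the table name is a hypothetical placeholder); the reported partition count indicates how many scan tasks Spark will launch, and for a table made of tiny files it can be close to the file count.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: report how many input partitions (and therefore scan tasks)
// a query over a small-file table produces. "small_file_table" is a
// hypothetical placeholder for your own table.
object SmallFilePartitionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SmallFilePartitionCheck")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.sql("SELECT * FROM small_file_table")
    // Each partition corresponds to one scan task; when small files are not
    // packed together, this number tracks the number of files in the table.
    println(s"Scan partitions: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```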
Configuration
- Install the Spark client.
For details, see Installing a Client.
- Log in to the Spark client node as the client installation user.
Modify the following parameters in the Client installation directory/Spark/spark/conf/spark-defaults.conf file on the Spark client to enable small file optimization; an example is shown after Table 1.
If the SQL logic does not involve shuffle operations, this optimization does not improve performance.
Table 1 Parameter description

Parameter: spark.sql.files.maxPartitionBytes
Description: The maximum number of bytes that can be packed into a single partition when a file is read. Unit: byte
Example Value: 134217728 (128 MB)

Parameter: spark.files.openCostInBytes
Description: The estimated cost to open a file, measured by the number of bytes that could be scanned in the same time. This value is used when multiple files are packed into a partition. It is better to over-estimate; partitions with small files will then be faster than partitions with bigger files (which are scheduled first). Unit: byte
Example Value: 4 MB (4194304 bytes)
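For example, the two lines below could be appended to the spark-defaults.conf file described above. This is a sketch using the example values from Table 1 (4 MB expressed as 4194304 bytes); the appropriate values for your workload depend on the typical file size and cluster resources.

```properties
# Pack up to 128 MB of input data into a single partition when reading files.
spark.sql.files.maxPartitionBytes  134217728
# Estimated cost of opening a file, expressed in bytes (4 MB = 4194304 bytes).
spark.files.openCostInBytes        4194304
```

The file is read when an application starts, so the new values apply only to Spark applications launched after the change.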