Data Skewness Optimization
In Spark SQL multi-table joins, the join keys may be severely skewed. After the data is distributed by hash, some buckets then hold far more data than others. The tasks that process the large buckets are overloaded and run slowly, while the tasks that process the small buckets finish quickly. The slow, heavy tasks limit overall computing performance, and the CPUs that ran the light tasks sit idle, which wastes CPU resources.
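The effect of a skewed join key on hash distribution can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Spark code: the function name bucket_sizes, the key values, and the bucket count are all hypothetical and chosen only to make the imbalance visible.

```python
from collections import Counter

def bucket_sizes(keys, num_buckets):
    """Count how many rows land in each bucket under hash distribution."""
    counts = Counter(hash(k) % num_buckets for k in keys)
    return [counts.get(b, 0) for b in range(num_buckets)]

# Hypothetical join-key column: key 7 is "hot" and appears 9,000 times,
# while keys 0..999 appear once each.
keys = [7] * 9000 + list(range(1000))

sizes = bucket_sizes(keys, num_buckets=4)
print(sizes)  # → [250, 250, 250, 9250]
```

Every row with the hot key hashes to the same bucket, so one task receives 9,250 rows while the other three receive 250 each.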
If data skewness occurs, you are advised to configure the spark.sql.adaptive.skewjoin.threshold parameter to enable data skewness optimization. With the optimization enabled, the data volume of each bucket is checked. If the data volume of a bucket is too large, the bucket is considered skewed and is split so that multiple tasks process the skewed data. Each of these tasks processes part of the skewed bucket and pulls the full data of the corresponding bucket from the other join table, which improves CPU resource usage and overall performance.
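The splitting idea can be sketched as follows. This is an illustration in plain Python under simplifying assumptions, not Spark's implementation: the helper names plan_tasks and join are hypothetical, and the threshold is expressed as a row count purely to keep the example small.

```python
def plan_tasks(skewed_bucket, other_bucket, threshold):
    """Split an oversized bucket into chunks of at most `threshold` rows.

    Each resulting task gets one chunk of the skewed side plus the
    full matching bucket from the other join table.
    """
    if len(skewed_bucket) <= threshold:
        return [(skewed_bucket, other_bucket)]
    return [
        (skewed_bucket[i:i + threshold], other_bucket)
        for i in range(0, len(skewed_bucket), threshold)
    ]

def join(left, right):
    """Naive inner join on the first field of each row."""
    return [(k, lv, rv) for k, lv in left for rk, rv in right if k == rk]

# Hypothetical data: ten rows share the hot key 7 on the left side.
left = [(7, f"l{i}") for i in range(10)]
right = [(7, "r0"), (3, "r1")]

tasks = plan_tasks(left, right, threshold=4)
split_result = [row for l, r in tasks for row in join(l, r)]

print(len(tasks))                         # → 3
print(split_result == join(left, right))  # → True
```

Because every task sees the complete bucket from the other table, the combined output of the split tasks equals the output of the single oversized task, while the work is spread across several tasks.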