
Data Skewness Optimization

In Spark SQL multi-table join scenarios, the join keys may be severely skewed. After the data is distributed by hash, some buckets contain far more data than others, so the tasks that process them are overloaded and run slowly, while the remaining tasks are light and finish quickly. The heavy, slow tasks hold back overall computing performance, and the light tasks leave CPUs idle, wasting CPU resources.

If data skew occurs, you are advised to configure the spark.sql.adaptive.skewjoin.threshold parameter to enable data skew optimization, and to check the data volume in each bucket. If one bucket holds far more data than the others, split that bucket so that multiple tasks process the skewed data. Each of these tasks pulls the full data of the matching bucket from the other join table, which improves CPU utilization and overall performance.
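The following is a minimal Scala sketch of how such a session might be configured. It assumes spark.sql.adaptive.skewjoin.threshold accepts a size threshold, pairs it with the standard Apache Spark flag spark.sql.adaptive.enabled, and uses placeholder table names (orders, customers); the threshold value shown is illustrative, not a recommended setting.

import org.apache.spark.sql.SparkSession

object SkewJoinTuning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SkewJoinTuning")
      // Adaptive execution must be enabled for skew handling to take effect.
      .config("spark.sql.adaptive.enabled", "true")
      // Parameter named in this document; 10485760 (10 MB) is an assumed
      // example value. Check your cluster documentation for the unit and
      // a suitable default before using it.
      .config("spark.sql.adaptive.skewjoin.threshold", "10485760")
      .getOrCreate()

    // Joins run through this session become eligible for skew splitting
    // when a bucket exceeds the configured threshold. Table names below
    // are placeholders for illustration only.
    spark.sql(
      """SELECT o.order_id, c.customer_name
        |  FROM orders o
        |  JOIN customers c ON o.customer_id = c.customer_id
      """.stripMargin).show()

    spark.stop()
  }
}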

Figure 1 Converting skewed data join