Updated on 2023-03-21 GMT+08:00

How Do I Use JDBC to Set the spark.sql.shuffle.partitions Parameter to Improve the Task Concurrency?

Scenario

When statements that trigger a shuffle, such as GROUP BY and JOIN, are executed in Spark jobs, data skew may occur and slow down job execution.

To solve this problem, you can configure spark.sql.shuffle.partitions to improve the concurrency of shuffle read tasks.

Configuring spark.sql.shuffle.partitions

You can use the set clause to configure the spark.sql.shuffle.partitions parameter in JDBC. The statement is as follows:

// conn is an existing java.sql.Connection to DLI.
Statement st = conn.createStatement();
st.execute("set spark.sql.shuffle.partitions=20");
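The statement above can be wrapped in a small helper so that the partition count is validated before it is sent. The following is a minimal sketch only: the class name ShufflePartitionsExample and both method names are hypothetical and are not part of the DLI JDBC driver. It assumes the caller has already obtained a java.sql.Statement from a DLI connection and will reuse that Statement for the queries that follow.

```java
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper for illustration; not part of the DLI JDBC driver.
final class ShufflePartitionsExample {

    // Builds the SET statement text for the given number of shuffle partitions.
    static String buildSetStatement(int partitions) {
        if (partitions <= 0) {
            throw new IllegalArgumentException("partitions must be positive: " + partitions);
        }
        return "set spark.sql.shuffle.partitions=" + partitions;
    }

    // Runs the SET statement on a Statement owned by the caller.
    // The caller keeps using (and later closes) the same Statement.
    static void applyShufflePartitions(Statement st, int partitions) throws SQLException {
        st.execute(buildSetStatement(partitions));
    }
}
```

With such a helper, a caller would invoke ShufflePartitionsExample.applyShufflePartitions(st, 20) once and then run the GROUP BY or JOIN queries on the same Statement. The value 20 is taken from the example above; a suitable value depends on the data volume and the available resources.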