How Do I Use JDBC to Set the spark.sql.shuffle.partitions Parameter to Improve the Task Concurrency?
Scenario
When statements that trigger a shuffle, such as GROUP BY and JOIN, are executed in a Spark job, data skew can occur and slow down job execution.
To solve this problem, you can configure spark.sql.shuffle.partitions to increase the concurrency of shuffle read tasks.
Configuring spark.sql.shuffle.partitions
You can use a set statement in JDBC to configure the spark.sql.shuffle.partitions parameter. The statement is as follows:
Statement st = conn.createStatement();
st.execute("set spark.sql.shuffle.partitions=20");
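For context, the following is a minimal sketch of the same call inside a complete JDBC session. The connection URL format, the AK/SK placeholders, and the employees table are illustrative assumptions, not values confirmed by this page; replace them with your actual DLI endpoint, project ID, queue name, credentials, and table.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ShufflePartitionsExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical DLI JDBC URL; substitute your real endpoint,
        // project ID, and queue name.
        String url = "jdbc:dli://<endpoint>/<projectId>?queuename=<queueName>";

        try (Connection conn = DriverManager.getConnection(url, "<ak>", "<sk>");
             Statement st = conn.createStatement()) {
            // Raise the number of shuffle partitions before running the query
            // so that shuffle read tasks run with higher concurrency.
            st.execute("set spark.sql.shuffle.partitions=20");

            // The setting applies to subsequent statements on this connection,
            // for example a GROUP BY query that triggers a shuffle.
            // The employees table here is a placeholder.
            try (ResultSet rs = st.executeQuery(
                    "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + ": " + rs.getLong(2));
                }
            }
        }
    }
}

Because the parameter is set at the session level, it affects all shuffle-producing statements executed afterward on the same connection.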