Setting a Degree of Parallelism
Scenario
The degree of parallelism (DOP) is the number of tasks that are executed concurrently. It also determines the number of data blocks (partitions) produced by a shuffle operation. Tune the DOP to balance the number of tasks, the amount of data processed by each task, and the processing capability of the machines.
Check the CPU and memory usage of the cluster. If the tasks and data are not evenly distributed among nodes, increase the DOP. As a general rule, set the DOP to two to three times the total number of CPU cores in the cluster.
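The rule of thumb above can be sketched as a small calculation. This is only an illustration: the object name DopEstimate, the helper recommendedRange, and the cluster size of 4 executors with 3 cores each are hypothetical values chosen for the example, not settings taken from this guide.

```scala
// Minimal sketch: derive a candidate DOP range from the cluster's core count.
// All names and sizes here are hypothetical and only illustrate the 2x-3x rule.
object DopEstimate {
  // Returns (lower bound, upper bound) for the DOP, i.e. 2x and 3x total cores.
  def recommendedRange(executors: Int, coresPerExecutor: Int): (Int, Int) = {
    require(executors > 0 && coresPerExecutor > 0, "counts must be positive")
    val totalCores = executors * coresPerExecutor
    (2 * totalCores, 3 * totalCores)
  }

  def main(args: Array[String]): Unit = {
    // Assumed cluster: 4 executors with 3 cores each, 12 cores in total.
    val (low, high) = recommendedRange(executors = 4, coresPerExecutor = 3)
    println(s"Try a DOP between $low and $high") // Try a DOP between 24 and 36
  }
}
```

Treat the result as a starting point and adjust it after observing task duration and resource usage for the actual workload.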
Procedure
Set the DOP using any of the following methods, and tune the value according to the available memory and CPU, the data volume, and the application logic:
- Pass the DOP as an argument to the shuffle operation. This method has the highest priority.
testRDD.groupByKey(24)
- Set the spark.default.parallelism parameter in the code. This method has the second highest priority.
val conf = new SparkConf()
conf.set("spark.default.parallelism", "24")
- Set the spark.default.parallelism parameter in the $SPARK_HOME/conf/spark-defaults.conf file. This method has the lowest priority.
spark.default.parallelism 24
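The priority order of the methods above can be checked with a short program. The following is a minimal sketch, assuming Spark is on the classpath and the job runs in local mode; the object name DopPrecedence, the method partitionCounts, the local[4] master, and the sample data are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: compare the partition count produced by the configured
// default with the count produced by an explicit shuffle argument.
object DopPrecedence {
  // Returns (partitions from spark.default.parallelism, partitions from the argument).
  def partitionCounts(): (Int, Int) = {
    val conf = new SparkConf()
      .setAppName("dop-precedence")               // hypothetical application name
      .setMaster("local[4]")                      // assumption: local test run
      .set("spark.default.parallelism", "24")     // value must be a String
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))
      // No argument: the shuffle falls back to spark.default.parallelism.
      val viaDefault = pairs.groupByKey().getNumPartitions
      // Explicit argument: overrides the configured default for this shuffle.
      val viaArgument = pairs.groupByKey(48).getNumPartitions
      (viaDefault, viaArgument)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (viaDefault, viaArgument) = partitionCounts()
    println(s"default: $viaDefault, explicit argument: $viaArgument")
  }
}
```

In this sketch the first shuffle uses the 24 partitions set in SparkConf, while the second uses the 48 passed directly to groupByKey, which matches the priority order described in the procedure.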