Updated on 2022-06-01 GMT+08:00

Setting a Degree of Parallelism

Scenario

The degree of parallelism (DOP) specifies how many tasks are executed concurrently. It also determines the number of data blocks (partitions) produced by a shuffle operation. Adjust the DOP to balance the number of tasks, the amount of data each task processes, and the processing capability of the machines.

Check the CPU and memory usage. If tasks and data are not evenly distributed among nodes, increase the DOP. As a general rule, set the DOP to two to three times the total number of CPU cores in the cluster.
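
The rule of thumb above can be turned into a starting value with simple arithmetic. The following is a minimal sketch only: the helper name suggestedDop and the 12-core cluster size are assumptions made for illustration, not part of Spark or of this guide.

```scala
// Hypothetical helper: derive a starting DOP from the cluster's total CPU cores.
object DopEstimate {
  // factor is the multiplier from the rule of thumb (2 to 3).
  def suggestedDop(totalCores: Int, factor: Int = 2): Int = {
    require(totalCores > 0, "totalCores must be positive")
    require(factor > 0, "factor must be positive")
    totalCores * factor
  }

  def main(args: Array[String]): Unit = {
    val totalCores = 12 // assumed cluster size, for example 3 nodes with 4 cores each
    val low = suggestedDop(totalCores, 2)
    val high = suggestedDop(totalCores, 3)
    println(s"Suggested DOP range: $low to $high")
  }
}
```

Treat the result as a starting point and refine it after observing the actual CPU and memory usage.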

Procedure

Set the DOP using any of the following methods, and tune the value according to the actual memory, CPU, data volume, and application logic:

  • Pass the DOP as an argument to the shuffle operation. This method has the highest priority.
    testRDD.groupByKey(24)
  • Set the spark.default.parallelism parameter in the code. This method has the second highest priority.
    val conf = new SparkConf()
    // SparkConf.set expects the value as a string
    conf.set("spark.default.parallelism", "24")
  • Set the spark.default.parallelism parameter in the $SPARK_HOME/conf/spark-defaults.conf file. This method has the lowest priority.
    spark.default.parallelism    24
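
To check which setting takes effect, compare the partition count of a shuffle that receives an explicit DOP with one that does not. The following is a minimal sketch, not a definitive implementation: the object name DopPrecedence, the local[2] master, the sample data, and the value 48 are all assumptions chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DopPrecedence {
  // Returns (partitions without an explicit DOP, partitions with an explicit DOP).
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    val testRDD = sc.parallelize(1 to 1000).map(i => (i % 10, i))
    // No argument: the shuffle falls back to spark.default.parallelism.
    val byDefault = testRDD.groupByKey()
    // Explicit argument: overrides spark.default.parallelism for this shuffle.
    val explicit = testRDD.groupByKey(48)
    (byDefault.partitions.length, explicit.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("DopPrecedence") // hypothetical application name
      .setMaster("local[2]")       // assumed: local mode, for illustration only
      .set("spark.default.parallelism", "24")
    val sc = new SparkContext(conf)
    try {
      val (byDefault, explicit) = partitionCounts(sc)
      println(s"Without argument: $byDefault partitions, with argument: $explicit partitions")
    } finally {
      sc.stop()
    }
  }
}
```

Under these assumptions, the first count is expected to follow the configured spark.default.parallelism value and the second to follow the argument passed to groupByKey.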