Updated on 2024-10-09 GMT+08:00

Setting Spark Core DOP

Scenario

The degree of parallelism (DOP) specifies the number of tasks that run concurrently. It also determines the number of data blocks (partitions) produced by a shuffle operation. Tuning the DOP balances the number of tasks, the data volume handled by each task, and the processing capacity of each host.

Check the CPU and memory usage of the cluster. If data and tasks are not evenly distributed among nodes, increase the DOP to spread the load more evenly. As a general rule, set the DOP to two to three times the total number of CPU cores in the cluster.
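The rule of thumb above can be worked through as simple arithmetic. The following is a minimal sketch only: the object name DopEstimate, the helper recommendedDopRange, and the sample cluster size (3 nodes with 8 cores each) are hypothetical and are not part of Spark or of this guide.

```scala
// Hypothetical helper for illustration; not a Spark API.
object DopEstimate {
  /** Returns the (lower, upper) DOP range: 2x to 3x the total CPU cores. */
  def recommendedDopRange(nodes: Int, coresPerNode: Int): (Int, Int) = {
    require(nodes > 0 && coresPerNode > 0, "nodes and coresPerNode must be positive")
    val totalCores = nodes * coresPerNode
    (2 * totalCores, 3 * totalCores)
  }

  def main(args: Array[String]): Unit = {
    // Assumed sample cluster: 3 nodes x 8 cores = 24 cores in total.
    val (low, high) = recommendedDopRange(nodes = 3, coresPerNode = 8)
    println(s"Suggested DOP range: $low to $high") // prints "Suggested DOP range: 48 to 72"
  }
}
```

Treat the result as a starting point and adjust it after observing the actual CPU and memory usage.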

Procedure

You can use any of the following methods to set the DOP. Adjust the value according to the available memory and CPU resources, the data volume, and the application logic:

  • Pass the DOP as an argument to the shuffle operation. This method has the highest priority.
    testRDD.groupByKey(24)
  • Set the spark.default.parallelism parameter in the code. This method has the second highest priority.
    import org.apache.spark.SparkConf
    val conf = new SparkConf()
    conf.set("spark.default.parallelism", "24")  // SparkConf.set takes String values
  • Set the spark.default.parallelism parameter in the $SPARK_HOME/conf/spark-defaults.conf file. This method has the lowest priority.
    spark.default.parallelism    24
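
The priority order described above can be observed in a small application. The following is a minimal sketch, assuming Spark is on the classpath and the job runs in local mode. The object name DopPrecedenceSketch, the helper partitionCounts, the sample data, and the value 48 are hypothetical choices for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; names and values are illustrative only.
object DopPrecedenceSketch {
  /** Returns the partition counts of a default and an explicit groupByKey. */
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    val testRDD = sc.parallelize(1 to 1000).map(i => (i % 10, i))
    // No argument: the shuffle falls back to spark.default.parallelism.
    val byDefault = testRDD.groupByKey()
    // Explicit argument: overrides spark.default.parallelism for this shuffle.
    val explicit = testRDD.groupByKey(48)
    (byDefault.getNumPartitions, explicit.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("DopPrecedenceSketch")
      .setMaster("local[2]") // assumed local mode, for demonstration only
      .set("spark.default.parallelism", "24")
    val sc = new SparkContext(conf)
    try {
      val (defaultCount, explicitCount) = partitionCounts(sc)
      println(s"groupByKey():   $defaultCount partitions")
      println(s"groupByKey(48): $explicitCount partitions")
    } finally {
      sc.stop()
    }
  }
}
```

Comparing the two partition counts shows whether a given shuffle picked up the configured default or the value passed in the call.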