Updated on 2024-08-30 GMT+08:00

Suggestions on configuring resources for Spark reading and writing Hudi

  • According to the resource configuration rules for Spark tasks that read and write Hudi, the recommended ratio of heap memory to CPU cores is 2:1, and the ratio of off-heap memory to CPU cores is 0.5:1. That is, each core requires 2 GB of heap memory and 0.5 GB of off-heap memory.

    In the Spark initialization and import scenario, the amount of data to be processed is large, so the preceding ratios need to be adjusted: the recommended ratio of heap memory to cores is 4:1, and the ratio of off-heap memory to cores is 1:1.

    Example:

    # executor-cores: CPU cores; executor-memory: heap memory; spark.executor.memoryOverhead: off-heap memory (MB)
    spark-submit \
    --master yarn-cluster \
    --executor-cores 2 \
    --executor-memory 4g \
    --conf spark.executor.memoryOverhead=1024
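
    For the initialization and import scenario described above, the same executor size can be scaled up to the 4:1 heap and 1:1 off-heap ratios. The following is a sketch only; the 8g and 2048 values are illustrative, derived from 2 cores x 4 GB heap and 2 cores x 1 GB off-heap:

    # 2 cores per executor, 8 GB heap memory (4:1), 2048 MB off-heap memory (1:1)
    spark-submit \
    --master yarn-cluster \
    --executor-cores 2 \
    --executor-memory 8g \
    --conf spark.executor.memoryOverhead=2048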
  • Spark-based ETL computation: the ratio of CPU cores to memory should be greater than 1:2; a ratio between 1:4 and 1:8 is recommended.

    The preceding rule applies to pure read and write workloads. If a Spark job also performs business logic computation in addition to reading and writing, that computation increases memory usage, so the ratio of CPU cores to memory should be greater than 1:2. If the logic is complex, increase the memory further and adjust based on the actual situation; in general, a ratio between 1:4 and 1:8 is appropriate, as in the sketch below.
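
    A minimal sketch of an ETL job sized at a 1:4 core-to-memory ratio (the executor size of 4 cores and 16 GB is illustrative only; the off-heap setting follows the 0.5:1 ratio from the first rule):

    # 4 cores per executor, 16 GB heap memory (1:4), 2048 MB off-heap memory (0.5:1)
    spark-submit \
    --master yarn-cluster \
    --executor-cores 4 \
    --executor-memory 16g \
    --conf spark.executor.memoryOverhead=2048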

  • It is recommended that the number of CPU cores be greater than or equal to the number of buckets. (A partitioned table may be written to multiple partitions in each batch. Ideally, the recommended number of CPU cores = number of partitions written x number of buckets. If the actual number of cores is less than this value, write performance decreases linearly.)

    Example:

    The current table has three buckets, and each write goes to two partitions. Therefore, the number of cores configured for the Spark import task should be greater than or equal to 3 x 2 = 6.

    spark-submit \
    --master yarn-cluster \
    --executor-cores 2 \
    --executor-memory 4g \
    --num-executors 3

    The preceding configuration provides num-executors x executor-cores = 3 x 2 = 6 cores, which is greater than or equal to the number of write partitions multiplied by the number of buckets (2 x 3 = 6).
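
    Combining this rule with the memory ratios in the first item, the import job above can also be sized for memory. The following is a sketch only; the 8g heap and 2048 MB off-heap values are illustrative, derived from the 4:1 and 1:1 ratios recommended for the import scenario:

    # 3 executors x 2 cores = 6 cores >= 2 write partitions x 3 buckets
    # Per executor: 2 cores, 8 GB heap memory (4:1), 2048 MB off-heap memory (1:1)
    spark-submit \
    --master yarn-cluster \
    --num-executors 3 \
    --executor-cores 2 \
    --executor-memory 8g \
    --conf spark.executor.memoryOverhead=2048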