
Suggestions for Configuring Resources for Hudi Data Read and Write

  • Resource configuration rules for Spark tasks that read and write Hudi data: the ratio of memory to CPU cores should be 2:1, and the ratio of off-heap memory to CPU cores should be 0.5:1. That is, each core requires 2 GB of heap memory and 0.5 GB of off-heap memory.

    During Spark initialization operations, a large amount of data needs to be processed, so the preceding ratios should be increased: the recommended ratio of memory to cores is 4:1, and the ratio of off-heap memory to cores is 1:1. A sizing sketch under these increased ratios follows the example below.

    Example:

    # --executor-cores: number of cores per executor
    # --executor-memory: executor heap memory
    # spark.executor.memoryOverhead: off-heap memory size, in MB
    spark-submit \
    --master yarn-cluster \
    --executor-cores 2 \
    --executor-memory 4g \
    --conf spark.executor.memoryOverhead=1024
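
    For reference, the following is a minimal sketch of executor sizing under the increased initialization ratios described above (4 GB heap and 1 GB off-heap per core); the specific values are illustrative assumptions, not part of the original example.

    # Illustrative sizing: 2 cores per executor, 8 GB heap (4:1), 2048 MB off-heap (1:1)
    spark-submit \
    --master yarn-cluster \
    --executor-cores 2 \
    --executor-memory 8g \
    --conf spark.executor.memoryOverhead=2048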
  • For Spark-based ETL computation, it is recommended that the ratio of CPU cores to memory be greater than 1:2, preferably between 1:4 and 1:8.

    The preceding rule applies to pure read and write operations. When a Spark job performs business logic computation in addition to reading and writing, memory requirements increase, so the CPU core-to-memory ratio should exceed 1:2. If the logic is complex, increase the memory further based on the actual workload; a ratio between 1:4 and 1:8 is typically recommended, as illustrated in the sketch below.
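    As a minimal sketch (the values below are illustrative assumptions, not from the original text), an executor sized toward the upper end of that range might look as follows.

    # Illustrative sizing: 2 cores per executor with 16 GB heap memory (1:8 core-to-memory ratio)
    spark-submit \
    --master yarn-cluster \
    --executor-cores 2 \
    --executor-memory 16g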

  • For the write resource configuration of bucket tables, the number of CPU cores should be at least equal to the number of buckets. Ideally, the number of CPU cores should be calculated as follows: Number of CPU cores = Number of write partitions x Number of buckets. If the configured core count is less than this value, write performance decreases linearly.

    Example:

    The current table has 3 buckets, and 2 partitions are written simultaneously. Therefore, the number of cores configured for the Spark import task should be greater than or equal to 2 x 3 = 6.

    spark-submit \
    --master yarn-cluster \
    --executor-cores 2 \
    --executor-memory 4g \
    --num-executors 3

    With this configuration, num-executors x executor-cores = 3 x 2 = 6, which is greater than or equal to Number of write partitions x Number of buckets = 2 x 3 = 6.
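
    As a further illustration of the same formula, assume (hypothetically) a table with 100 buckets and 4 partitions written concurrently. The required core count is 4 x 100 = 400; with 4 cores per executor, that implies 100 executors.

    # Illustrative assumption: 100 buckets, 4 concurrent write partitions
    # Required cores = 4 x 100 = 400; 400 / 4 cores per executor = 100 executors
    # Heap memory follows the 2:1 memory-to-core rule (8 GB for 4 cores)
    spark-submit \
    --master yarn-cluster \
    --executor-cores 4 \
    --executor-memory 8g \
    --num-executors 100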