Configuration Optimization
This section explains standard optimizations and recommendations for various frameworks. Adjust settings as needed depending on your model, data, and available resources during implementation.
MindSpeed-LLM
1. Parallel policy optimization: Use model parallelism or long sequence parallelism when dealing with large model weights or lengthy sequences. This reduces video memory.
- Key Parameters
- tensor-model-parallel-size (TP): model parallel size. A larger value indicates lower video memory usage and higher communication overhead.
- pipeline-model-parallel-size (PP): pipeline parallel size. A larger value indicates lower video memory usage and higher communication overhead.
- context-parallel-size (CP): sequence parallel size. A larger value indicates lower video memory usage and higher communication overhead.
- expert-model-parallel-size (EP): expert parallel size. A larger value indicates lower video memory usage and higher communication overhead.
- data-parallel-size (DP): The value is calculated as follows: total number of training PUs / (tensor-model-parallel-size x pipeline-model-parallel-size x context-parallel-size). A short sketch of this calculation follows the list.
- Policy adjustment: Using model or long sequence parallelism adds extra communication costs and lowers training speed. Communication costs typically rise in this order: pipeline parallelism > data parallelism > expert parallelism > context parallelism > tensor parallelism. Adjust your parallel policy to find the best balance between video memory usage and training efficiency.
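A minimal Python sketch of the data-parallel-size calculation above. The PU count and parallel sizes are illustrative placeholders, not recommendations.

    # Illustrative only: derive the data-parallel size from the other parallel sizes.
    total_pus = 64            # total number of training PUs (assumed)
    tp, pp, cp = 4, 2, 1      # tensor / pipeline / context parallel sizes (assumed)

    model_parallel = tp * pp * cp
    assert total_pus % model_parallel == 0, "PU count must be divisible by TP x PP x CP"
    dp = total_pus // model_parallel
    print(dp)                 # data-parallel-size = 64 / (4 x 2 x 1) = 8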
2. Video memory optimization:
- use-distributed-optimizer: When enabled, the optimizer state is partitioned across the DP group to reduce video memory consumption.
- recompute-num-layers: Specifies how many layers should be fully recomputed. When video memory is low, this saves only the input activations of each Transformer layer or group and recalculates the rest.
- recompute-granularity: Set it to full to perform full activation recomputation. Use this parameter together with the recompute-num-layers parameter.
- recompute-method: When set to uniform, it divides the Transformer layers into equal-sized groups based on the recompute-num-layers value and stores only the input activations of each group. When set to block, it recomputes only the first recompute-num-layers Transformer layers, leaving the rest unchanged. Set this parameter to block if you have enough video memory, to boost training speed.
- recompute-activation-function: Enables activation function recomputation.
- recompute-activation-function-num-layers: Specifies the number of layers for activation function recomputation. Note: Activation function recomputation can be combined with full recomputation. If both are enabled, --recompute-method can only be set to block. In this case, full recomputation and activation function recomputation run independently per layer; a single layer cannot apply both types of recomputation simultaneously. (A combined example of these flags follows this list.)
- swap-attention: When enabled, the system keeps activation values in both device and CPU memory and prefetches them from the CPU during gradient backpropagation so they do not need to be recomputed. This makes effective use of the high host-to-device (H2D) bandwidth, overlaps data transfer with computation, increases MFU, and speeds up foundation model training.
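The memory-related switches above are usually passed together. The following is a hedged sketch of one possible combination (full recomputation with the block method plus the distributed optimizer); the values are placeholders to be tuned for the actual model, not recommended settings.

    # Illustrative only: one possible combination of the flags described above.
    memory_args = [
        "--use-distributed-optimizer",          # shard optimizer state across the DP group
        "--recompute-granularity", "full",      # full activation recomputation
        "--recompute-method", "block",          # recompute only the first N layers
        "--recompute-num-layers", "4",          # N = 4 here; tune per model and memory budget
        "--swap-attention",                     # optionally keep activations in CPU memory too
    ]
    print(" ".join(memory_args))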
3. Communication optimization:
- overlap-grad-reduce: When it is enabled, the system uses pipeline technology within the DP group to overlap both computations and communications.
- overlap-param-gather: Enabling it allows parameter gathering to overlap within the DP group. Ensure that both --use-distributed-optimizer and --overlap-grad-reduce are enabled (a small dependency check is sketched after this list).
- use-ascend-coc: When you enable it, it breaks down MatMul operations in ColumnParallelLinear and RowParallelLinear into smaller steps. It also splits nearby communication tasks—AllReduce when sequence parallelism is off, or AllGather and ReduceScatter when it is on—into finer parts within the TP group. This allows for better overlap.
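Because --overlap-param-gather depends on the other two switches, a small pre-launch check can catch misconfiguration early. The helper below is purely illustrative and not part of MindSpeed-LLM.

    # Illustrative only: validate the dependency described above before launching.
    def check_comm_overlap_flags(flags):
        if "--overlap-param-gather" in flags:
            required = {"--use-distributed-optimizer", "--overlap-grad-reduce"}
            missing = required - set(flags)
            if missing:
                raise ValueError(f"--overlap-param-gather also requires: {sorted(missing)}")

    check_comm_overlap_flags({"--use-distributed-optimizer",
                              "--overlap-grad-reduce",
                              "--overlap-param-gather"})   # passes silently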
LLaMA-Factory
- ZeRO parallelism (backend_config.training.deepspeed): Memory usage changes based on the number of parameters in a model. Moving from ZeRO-Stage 0 to ZeRO-Stage 3 and enabling offload reduces memory usage significantly. However, this often increases communication and computation costs, potentially slowing down training.
The bundled configurations ds_config_zero0.json, ds_config_zero1.json, ds_config_zero2.json, ds_config_zero2_offload.json, ds_config_zero3.json, and ds_config_zero3_offload.json are ordered from highest to lowest in both training performance and memory usage: ZeRO-Stage 0 is the fastest but uses the most memory, while ZeRO-Stage 3 with offload uses the least memory but is the slowest. (A simple fallback sketch based on this ordering follows this list.)
- Recomputation (backend_config.training.recompute_layers_ratio): Recomputation optimizes video memory by trading off performance.
If you have enough video memory and need better performance, set disable_gradient_checkpointing to true. If video memory is low, set it to false instead. Use recompute_layers_ratio to find the right balance between memory savings and computation costs. Higher values save more memory but reduce performance.
- Batch size of single-device training (backend_config.training.per_device_train_batch_size): Set per_device_train_batch_size to adjust training efficiency. Higher values boost performance but require more video memory. Setting it to 1 minimizes memory usage.
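The ordering above suggests a simple fallback strategy: start with the fastest configuration and step down only when device memory runs out. The helper below is purely illustrative and not part of LLaMA-Factory.

    # Illustrative only: DeepSpeed configs ordered from fastest / most memory-hungry
    # to slowest / most memory-frugal, as described above.
    ZERO_CONFIGS = [
        "ds_config_zero0.json",
        "ds_config_zero1.json",
        "ds_config_zero2.json",
        "ds_config_zero2_offload.json",
        "ds_config_zero3.json",
        "ds_config_zero3_offload.json",
    ]

    def next_config_after_oom(current):
        """Return the next, more memory-frugal config after an out-of-memory failure."""
        i = ZERO_CONFIGS.index(current)
        if i + 1 >= len(ZERO_CONFIGS):
            raise RuntimeError("Already at the most memory-frugal config; "
                               "reduce batch size or sequence length instead.")
        return ZERO_CONFIGS[i + 1]

    print(next_config_after_oom("ds_config_zero2.json"))   # ds_config_zero2_offload.json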
VeRL
- Long sequence optimization: For long sequence training, set data.max_prompt_length and data.max_response_length to large values. For example, if your model's maximum response length is 16K, set data.max_response_length to 16384. Note: Excessively large values can exhaust video memory, so adjust the sequence length carefully alongside related settings such as sequence parallelism.
- actor_rollout_ref.actor.ulysses_sequence_parallel_size splits long sequences into smaller blocks for parallel processing.
- Adjustment rule: Set this value higher for longer sequences. For example, if max_response_length is 16384, set ulysses_sequence_parallel_size to 8 to split the sequence into eight parts.
- Key parameter: To avoid exceeding video memory limits, set ppo_max_token_len_per_gpu to max_response_length divided by ulysses_sequence_parallel_size. This ensures each device handles only its assigned portion of the sequence tokens during a PPO batch (see the sketch below).
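A minimal sketch of the sizing rule above; the sequence length and parallel size mirror the example values and are not tuned recommendations.

    # Illustrative only: size the per-device PPO token budget to the sequence-parallel shard.
    max_response_length = 16384     # data.max_response_length
    ulysses_sp = 8                  # actor_rollout_ref.actor.ulysses_sequence_parallel_size
    assert max_response_length % ulysses_sp == 0, "sequence must split evenly across SP ranks"

    ppo_max_token_len_per_gpu = max_response_length // ulysses_sp
    print(ppo_max_token_len_per_gpu)   # 2048 tokens handled per device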
- Batch processing parameter optimization
- data.train_batch_size: Reduce it to fit within the video memory limits for longer sequences.
- actor_rollout_ref.rollout.n: number of responses generated for each prompt during inference. The larger the value, the higher the video memory usage.
- actor_rollout_ref.actor.ppo_mini_batch_size: In the minimal configuration, the product of actor_rollout_ref.rollout.n and ppo_mini_batch_size equals the number of PUs (a sanity-check sketch follows this list).
- actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu, actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu, and actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu: These parameters control the micro-batch size on a single device. The micro-batch size must be small enough to avoid video memory overflow. The smallest value allowed is 1.
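A hedged sanity-check sketch for the batch-related settings above; the concrete numbers are placeholders chosen only to satisfy the stated constraints.

    # Illustrative only: check the batch constraints described above.
    n_pus = 8                           # total devices
    rollout_n = 4                       # actor_rollout_ref.rollout.n
    ppo_mini_batch_size = 2             # actor_rollout_ref.actor.ppo_mini_batch_size
    micro_batch_size_per_gpu = 1        # smallest allowed value is 1

    # Minimal configuration: rollout.n x ppo_mini_batch_size covers the number of PUs.
    assert rollout_n * ppo_mini_batch_size >= n_pus
    assert micro_batch_size_per_gpu >= 1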
MindSpeed-RL
- Inference phase parameter configuration
- Parallelism parameters
- infer_tensor_parallel_size: tensor parallel size. A larger value indicates lower video memory usage and higher communication overhead.
- infer_pipeline_parallel_size: pipeline parallel size. A larger value indicates lower video memory usage and higher communication overhead.
- infer_expert_parallel_size: expert parallel size. This parameter is used for MoE models. For DeepSeek models, large-scale EP is recommended.
- VLLM engine parameters
- max_num_seqs: maximum number of sequences processed each time
- max_model_len: maximum model sequence length (prompt plus generated output)
- max_num_batched_tokens: maximum number of batched tokens in each iteration
- gpu_memory_utilization: fraction of video memory reserved for model weights and the KV cache (an illustration of these engine parameters follows this list)
- enforce_eager: Setting this parameter to True disables the graph mode.
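The keys above correspond to standard vLLM engine arguments. For orientation only, the snippet below shows the same knobs on a plain vLLM engine; the model name and values are placeholders, and MindSpeed-RL sets these through its own YAML configuration rather than this API.

    # Illustrative mapping to vanilla vLLM engine arguments (assumed setup, not MindSpeed-RL code).
    from vllm import LLM

    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct",   # placeholder model
        max_num_seqs=64,                    # max sequences processed per scheduling step
        max_model_len=4096,                 # max sequence length (prompt plus output)
        max_num_batched_tokens=8192,        # max tokens batched per iteration
        gpu_memory_utilization=0.9,         # fraction of device memory for weights and KV cache
        enforce_eager=True,                 # disable graph mode
    )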
- Training phase parameter configuration
- Parallelism parameters
- tensor_model_parallel_size: tensor parallel size. A larger value indicates lower video memory usage and higher communication overhead.
- pipeline_model_parallel_size: pipeline parallel size. A larger value indicates lower video memory usage and higher communication overhead.
- expert_model_parallel_size: expert parallel size
- context_parallel_size: sequence parallel size. A larger value indicates lower video memory usage and higher communication overhead.
- Video memory optimization:
- swap_optimizer: Moves the optimizer state to host memory during forward and backward propagation, keeping only its logical view on the device. It reloads the state to the device for updates, lowering peak video memory consumption.
- Batch processing parameters:
- global_batch_size: number of prompts processed in one training step
- mini_batch_size: The value of global_batch_size divided by mini_batch_size gives the number of times the actor model is updated in each step (a short sketch follows this list).
- micro_batch_size: micro batch size in the training phase. This parameter does not take effect when use_dynamic_bsz is enabled.
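A short sketch of the relationship above; the values are placeholders.

    # Illustrative only: number of actor updates per step.
    global_batch_size = 128     # prompts consumed per step
    mini_batch_size = 32
    assert global_batch_size % mini_batch_size == 0
    updates_per_step = global_batch_size // mini_batch_size
    print(updates_per_step)     # the actor model is updated 4 times per step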
- RL parameters:
- use_integrated_worker: Set this parameter to true to enable the integrated worker mode, in which training and inference share the same PUs.
- use_dynamic_bsz: Enable it to use dynamic batching to keep sample counts consistent across DP domains and align data lengths closely.
- max_packing_token_size: Specifies the maximum number of tokens in a batch when dynamic batching is used.
- blocking: Specifies whether processing is synchronous (blocking) or asynchronous. The default value is true.
- use_remove_padding: Removes padding tokens to avoid invalid calculations.
- n_samples_per_prompt: Specifies the number of responses generated for each prompt.
MindSpeed-MM
1. Parallel policy optimization: Use model parallelism or long sequence parallelism when dealing with large model weights or lengthy sequences. This reduces video memory.
- Key Parameters
- tensor-model-parallel-size (TP): model parallel size. A larger value indicates lower video memory usage and higher communication overhead.
- pipeline-model-parallel-size (PP): pipeline parallel size. A larger value indicates lower video memory usage and higher communication overhead.
- context-parallel-size (CP): sequence parallel size. A larger value indicates lower video memory usage and higher communication overhead.
- data-parallel-size (DP): The value is calculated as follows: total number of training PUs / (tensor-model-parallel-size x pipeline-model-parallel-size x context-parallel-size).
- Policy adjustment: Using model or long sequence parallelism adds extra communication costs and lowers training speed. Communication costs typically rise in this order: pipeline parallelism > data parallelism > context parallelism > tensor parallelism. Adjust your parallel policy to find the best balance between video memory usage and training efficiency.
2. Video memory optimization:
- use-distributed-optimizer: When enabled, the optimizer state is partitioned across the DP group to reduce video memory consumption.
- recompute-num-layers: Specifies how many layers should be fully recomputed. When video memory is low, this saves only the input activations of each Transformer layer or group and recalculates the rest.
- recompute-granularity: Set it to full to perform full activation recomputation. Use this parameter together with the recompute-num-layers parameter.
- recompute-method: When set to uniform, it divides the Transformer layers into equal-sized groups based on the recompute-num-layers value and stores only the input activations of each group. When set to block, it recomputes only the first recompute-num-layers Transformer layers, leaving the rest unchanged. Set this parameter to block if you have enough video memory, to boost training speed.
- recompute-activation-function: Enables activation function recomputation.
- recompute-activation-function-num-layers: Specifies the number of layers for activation function recomputation. Note: Activation function recomputation can be combined with full recomputation. If both are enabled, --recompute-method can only be set to block. In this case, full recomputation and activation function recomputation run independently per layer; a single layer cannot apply both types of recomputation simultaneously.
- swap-attention: When enabled, the system keeps activation values in both device and CPU memory and prefetches them from the CPU during gradient backpropagation so they do not need to be recomputed. This makes effective use of the high host-to-device (H2D) bandwidth, overlaps data transfer with computation, increases MFU, and speeds up foundation model training.
3. Communication optimization:
- overlap-grad-reduce: When it is enabled, the system uses pipeline technology within the DP group to overlap both computations and communications.
- overlap-param-gather: Enabling it allows parameter gathering to overlap within the DP group. Ensure that both --use-distributed-optimizer and --overlap-grad-reduce are enabled.
- use-ascend-coc: When you enable it, it breaks down MatMul operations in ColumnParallelLinear and RowParallelLinear into smaller steps. It also splits nearby communication tasks—AllReduce when sequence parallelism is off, or AllGather and ReduceScatter when it is on—into finer parts within the TP group. This allows for better overlap.