
VeRL

This section describes the YAML configuration file and parameters for training. You can choose parameters as required.

YAML File Configuration

Edit the YAML file according to the following instructions. A parameter written as aaa.bbb refers to the value of bbb nested under aaa. For example, backend_config.data.train_files refers to the data.train_files parameter under backend_config, as illustrated in the sketch below.
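The following is a minimal sketch of how the dotted parameter names in Table 1 map to nested YAML keys. The overall file layout is an assumption for illustration; the values shown are the example values from the table, not a complete or recommended configuration.

```
# Illustrative sketch only: shows how dotted names such as
# backend_config.data.train_files nest inside the YAML file.
af_output_dir: /home/ma-user/verl                   # training output directory

backend_config:
  data:
    train_files: /data/geometry3k/train.parquet     # backend_config.data.train_files
    val_files: /data/geometry3k/test.parquet        # backend_config.data.val_files
    train_batch_size: 32
  actor_rollout_ref:
    model:
      path: /model/Qwen2.5-VL-32B-Instruct          # backend_config.actor_rollout_ref.model.path
    rollout:
      name: vllm
      tensor_model_parallel_size: 4
  trainer:
    total_epochs: 5
```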

Table 1 Model training script parameters

| Parameter | Example Value | Description |
| --- | --- | --- |
| af_output_dir | /home/ma-user/verl | (Mandatory) Directory for the training output. |
| backend_config.data.train_files | /data/geometry3k/train.parquet | (Mandatory) Preprocessed training set. |
| backend_config.data.val_files | /data/geometry3k/test.parquet | (Mandatory) Preprocessed validation set. |
| backend_config.actor_rollout_ref.model.path | /model/Qwen2.5-VL-32B-Instruct | (Mandatory) Hugging Face model path, which can be a local path or an HDFS path. |
| backend_config.data.train_batch_size | 32 | Batch size sampled in one training step. |
| backend_config.actor_rollout_ref.actor.ppo_mini_batch_size | 8 | PPO mini-batch size. The training batch is split into sub-batches of this size for PPO updates. |
| backend_config.actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu | 1 | Number of samples processed per device in a single forward pass when training the actor. |
| backend_config.actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu | 1 | Number of samples processed per device in a single forward pass when calculating log probabilities during rollout. |
| backend_config.actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu | 1 | Number of samples processed per device in a single forward pass when calculating the reference policy's log probabilities. |
| backend_config.trainer.total_epochs | 5 | (Optional) Number of training epochs. Set it as required. |
| backend_config.actor_rollout_ref.rollout.tensor_model_parallel_size | 4 | Tensor parallel size used to shard the model during vLLM inference. |
| backend_config.data.image_key | images | (Multimodal models) Dataset field that contains the images. The default value is 'images'. |
| engine_kwargs.vllm.disable_mm_preprocessor_cache | True | (Multimodal models) Specifies whether to disable the preprocessor cache of multimodal models. The default value is False. |
| backend_config.data.max_prompt_length | 1024 | Maximum prompt length. All prompts are left-padded to this length. |
| backend_config.data.max_response_length | 1024 | Maximum response length, that is, the maximum generation length during the rollout phase of the RL algorithm. |
| backend_config.actor_rollout_ref.rollout.max_num_batched_tokens | 18432 | When max_prompt_length + max_response_length is greater than 8K, set this parameter to the sum of the two. The default value is 8192. |
| backend_config.actor_rollout_ref.actor.ulysses_sequence_parallel_size | 1 | Ulysses sequence parallel size, used for long sequences. The default value is 1. When max_response_length is greater than 8K, this is generally set to about max_response_length/2048 (rounded). |
| backend_config.data.shuffle | True | Specifies whether to shuffle the data in the dataloader. |
| backend_config.data.truncation | 'error' | How to handle input_ids or prompts that exceed max_prompt_length. The default value is 'error', which reports an error when the length exceeds max_prompt_length. |
| backend_config.actor_rollout_ref.actor.optim.lr | 1e-6 | Actor learning rate. |
| backend_config.actor_rollout_ref.model.use_remove_padding | True | Specifies whether to remove padding. True: yes; False: no. |
| backend_config.actor_rollout_ref.actor.use_kl_loss | True | Specifies whether to use KL loss in the actor. If it is used, KL is not applied in the reward function. The default value is True. |
| backend_config.actor_rollout_ref.actor.kl_loss_coef | 0.01 | KL loss coefficient. The default value is 0.001. |
| backend_config.actor_rollout_ref.actor.kl_loss_type | low_var_kl | Specifies how to calculate the KL divergence between the actor and the reference policy. The default value is low_var_kl. |
| backend_config.actor_rollout_ref.actor.entropy_coeff | 0 | Entropy coefficient used when calculating the PPO loss. Generally, set this parameter to 0. |
| backend_config.actor_rollout_ref.actor.use_torch_compile | False | Specifies whether to enable JIT compilation acceleration. It can be set to False. |
| backend_config.actor_rollout_ref.model.enable_gradient_checkpointing | True | Specifies whether to enable gradient checkpointing for the actor. |
| backend_config.actor_rollout_ref.rollout.name | vllm | Name of the inference framework used for rollout, that is, vllm. |
| backend_config.actor_rollout_ref.rollout.gpu_memory_utilization | 0.4 | Fraction of the total device memory allocated to the vLLM instance. |
| backend_config.actor_rollout_ref.rollout.n | 4 | Repeats each batch of data n times (interleaving is performed during the repetition). |
| backend_config.trainer.logger | ['console','tensorboard'] | Logging backends. The options are wandb, console, and tensorboard. |
| backend_config.trainer.val_before_train | False | Specifies whether to run validation before training. True: yes; False: no. |
| backend_config.trainer.resume_mode | auto | Resume mode. The default value is auto, which loads the most recent checkpoint from the default save location if training stops unexpectedly. To resume from a specific checkpoint, set this parameter to resume_path and specify resume_from_path. Set it to disable to disable resumable training. |
| backend_config.trainer.resume_from_path | null | Path from which weights are loaded for resumable training. The default value is null. |
| backend_config.trainer.save_freq | -1 | Frequency (in iterations) of saving model weights. The default value is -1, indicating that weights are not saved. |
| backend_config.actor_rollout_ref.actor.ppo_max_token_len_per_gpu | 2048 | Maximum number of tokens that a single device can process in one PPO micro-batch. Generally, set this parameter to n x (data.max_prompt_length + data.max_response_length). |
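As a worked example of the length-related rules above (max_num_batched_tokens, ulysses_sequence_parallel_size, and ppo_max_token_len_per_gpu), the following sketch applies them to the illustrative values max_prompt_length = 1024, max_response_length = 1024, and rollout.n = 4. The resulting numbers are an assumption for illustration, not tuned recommendations.

```
# Illustrative sketch: applying the rules of thumb from Table 1.
backend_config:
  data:
    max_prompt_length: 1024
    max_response_length: 1024     # 1024 + 1024 = 2048, below 8K
  actor_rollout_ref:
    rollout:
      n: 4
      # The sum of the lengths is below 8K, so the default
      # max_num_batched_tokens of 8192 is sufficient.
    actor:
      # max_response_length is below 8K, so sequence parallelism is not needed.
      ulysses_sequence_parallel_size: 1
      # n x (max_prompt_length + max_response_length) = 4 x 2048 = 8192
      ppo_max_token_len_per_gpu: 8192
```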