
VeRL

This section describes the YAML configuration file and parameters for training. You can choose parameters as required.

YAML File Configuration

Edit the YAML file according to the following instructions. A parameter written as aaa.bbb refers to the value of bbb nested under aaa. For example, backend_config.data.train_files refers to the data.train_files parameter under backend_config, as illustrated in the sketch below.
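The following is a minimal sketch of how the dotted parameter names in Table 1 map to nested YAML keys. The overall file layout is an assumption for illustration; the values shown are the example values from the table, not a complete or recommended configuration.

```
# Illustrative sketch only: shows how dotted names such as
# backend_config.data.train_files nest inside the YAML file.
af_output_dir: /home/ma-user/verl                   # training output directory

backend_config:
  data:
    train_files: /data/geometry3k/train.parquet     # backend_config.data.train_files
    val_files: /data/geometry3k/test.parquet        # backend_config.data.val_files
    train_batch_size: 32
  actor_rollout_ref:
    model:
      path: /model/Qwen2.5-VL-32B-Instruct          # backend_config.actor_rollout_ref.model.path
    rollout:
      name: vllm
      tensor_model_parallel_size: 4
  trainer:
    total_epochs: 5
```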

Table 1 Model training script parameters

| Parameter | Example Value | Description |
| --- | --- | --- |
| af_output_dir | /home/ma-user/verl | (Mandatory) Directory for the training output. |
| backend_config.data.train_files | /data/geometry3k/train.parquet | (Mandatory) Preprocessed training set. |
| backend_config.data.val_files | /data/geometry3k/test.parquet | (Mandatory) Preprocessed validation set. |
| backend_config.actor_rollout_ref.model.path | /model/Qwen2.5-VL-32B-Instruct | (Mandatory) Hugging Face model path, which can be a local path or an HDFS path. |
| backend_config.data.train_batch_size | 32 | Batch size sampled in one training step. |
| backend_config.actor_rollout_ref.actor.ppo_mini_batch_size | 8 | PPO mini-batch size. The training batch is split into sub-batches of this size for PPO updates. |
| backend_config.actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu | 1 | Number of samples processed per device in a single forward pass when training the actor. |
| backend_config.actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu | 1 | Number of samples processed per device in a single forward pass when calculating log probabilities during rollout. |
| backend_config.actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu | 1 | Number of samples processed per device in a single forward pass when calculating the reference policy's log probabilities. |
| backend_config.trainer.total_epochs | 5 | (Optional) Number of training epochs. Set it as required. |
| backend_config.actor_rollout_ref.rollout.tensor_model_parallel_size | 4 | Tensor parallel size used to shard the model during vLLM inference. |
| backend_config.data.image_key | images | (Multimodal models) Dataset field that contains the images. The default value is 'images'. |
| engine_kwargs.vllm.disable_mm_preprocessor_cache | True | (Multimodal models) Specifies whether to disable the preprocessor cache of multimodal models. The default value is False. |
| backend_config.data.max_prompt_length | 1024 | Maximum prompt length. All prompts are left-padded to this length. |
| backend_config.data.max_response_length | 1024 | Maximum response length, that is, the maximum generation length during the rollout phase of the RL algorithm. |
| backend_config.actor_rollout_ref.rollout.max_num_batched_tokens | 18432 | When max_prompt_length + max_response_length is greater than 8K, set this parameter to the sum of the two. The default value is 8192. |
| backend_config.actor_rollout_ref.actor.ulysses_sequence_parallel_size | 1 | Ulysses sequence parallel size, used for long sequences. The default value is 1. When max_response_length is greater than 8K, this is generally set to about max_response_length/2048 (rounded). |
| backend_config.data.shuffle | True | Specifies whether to shuffle the data in the dataloader. |
| backend_config.data.truncation | 'error' | How to handle input_ids or prompts that exceed max_prompt_length. The default value is 'error', which reports an error when the length exceeds max_prompt_length. |
| backend_config.actor_rollout_ref.actor.optim.lr | 1e-6 | Actor learning rate. |
| backend_config.actor_rollout_ref.model.use_remove_padding | True | Specifies whether to remove padding. True: yes; False: no. |
| backend_config.actor_rollout_ref.actor.use_kl_loss | True | Specifies whether to use KL loss in the actor. If it is used, KL is not applied in the reward function. The default value is True. |
| backend_config.actor_rollout_ref.actor.kl_loss_coef | 0.01 | KL loss coefficient. The default value is 0.001. |
| backend_config.actor_rollout_ref.actor.kl_loss_type | low_var_kl | Specifies how to calculate the KL divergence between the actor and the reference policy. The default value is low_var_kl. |
| backend_config.actor_rollout_ref.actor.entropy_coeff | 0 | Entropy coefficient used when calculating the PPO loss. Generally, set this parameter to 0. |
| backend_config.actor_rollout_ref.actor.use_torch_compile | False | Specifies whether to enable JIT compilation acceleration. It can be set to False. |
| backend_config.actor_rollout_ref.model.enable_gradient_checkpointing | True | Specifies whether to enable gradient checkpointing for the actor. |
| backend_config.actor_rollout_ref.rollout.name | vllm | Name of the inference framework used for rollout, that is, vllm. |
| backend_config.actor_rollout_ref.rollout.gpu_memory_utilization | 0.4 | Fraction of the total device memory allocated to the vLLM instance. |
| backend_config.actor_rollout_ref.rollout.n | 4 | Repeats each batch of data n times (interleaving is performed during the repetition). |
| backend_config.trainer.logger | ['console','tensorboard'] | Logging backends. The options are wandb, console, and tensorboard. |
| backend_config.trainer.val_before_train | False | Specifies whether to run validation before training. True: yes; False: no. |
| backend_config.trainer.resume_mode | auto | Resume mode. The default value is auto, which loads the most recent checkpoint from the default save location if training stops unexpectedly. To resume from a specific checkpoint, set this parameter to resume_path and specify resume_from_path. Set it to disable to disable resumable training. |
| backend_config.trainer.resume_from_path | null | Path from which weights are loaded for resumable training. The default value is null. |
| backend_config.trainer.save_freq | -1 | Frequency (in iterations) of saving model weights. The default value is -1, indicating that weights are not saved. |
| backend_config.actor_rollout_ref.actor.ppo_max_token_len_per_gpu | 2048 | Maximum number of tokens that a single device can process in one PPO micro-batch. Generally, set this parameter to n x (data.max_prompt_length + data.max_response_length). |
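As a worked example of the length-related rules above (max_num_batched_tokens, ulysses_sequence_parallel_size, and ppo_max_token_len_per_gpu), the following sketch applies them to the illustrative values max_prompt_length = 1024, max_response_length = 1024, and rollout.n = 4. The resulting numbers are an assumption for illustration, not tuned recommendations.

```
# Illustrative sketch: applying the rules of thumb from Table 1.
backend_config:
  data:
    max_prompt_length: 1024
    max_response_length: 1024     # 1024 + 1024 = 2048, below 8K
  actor_rollout_ref:
    rollout:
      n: 4
      # The sum of the lengths is below 8K, so the default
      # max_num_batched_tokens of 8192 is sufficient.
    actor:
      # max_response_length is below 8K, so sequence parallelism is not needed.
      ulysses_sequence_parallel_size: 1
      # n x (max_prompt_length + max_response_length) = 4 x 2048 = 8192
      ppo_max_token_len_per_gpu: 8192
```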