Updated on 2025-11-04 GMT+08:00

Viewing Training Output Results

Viewing Logs and Weights

The MindSpeed-LLM framework prints its training loss and performance logs on the last rank node, while the MindSpeed-RL, VeRL, and Llama-Factory frameworks print them on the first rank node. The training output directories are structured as follows:

  • MindSpeed-LLM
    |──{af_output_dir} # Value of the {af_output_dir} parameter, for example, as configured in the YAML file
        # Automatically generated data directory structure
        |──preprocess_data                        # Data preprocessing directory
        |──converted_weight_TP${TP}PP${PP}        # Directory of weights converted from the Hugging Face format to the Megatron format
        |──ckpt_converted_mg2hf                   # Weights converted from the Megatron format to the Hugging Face format after training
        |──saved_checkpoints                      # Megatron weights after training
        |──training_loss.png                      # Loss curve
        |──exp-config.yaml                        # Training script (generated by the ascendfactory-cli config tool)
        |──logs                                   # Training logs
          |──xx-xx-<Timestamp>-npu_info-R${RankID}.txt          # Training video memory monitoring logs
          |──xx-xx-<Timestamp>-run_log-${Nodes}-${RankID}.txt   # Training run logs
        |──......
    
  • Llama-Factory
    |──{output_dir} # Value of the {output_dir} parameter, for example, as configured in the YAML file
      # Automatically generated data directory structure
      |──model-000xx-of-000xx.safetensors      # Weight file
      |──trainer_log.jsonl                     # JSONL training log containing per-step loss values
      |──training_loss.png                     # Loss curve
      |──lora_merged                           # Default weight path after LoRA fine-tuning
      |──logs                                  # Training logs
        |──xx-xx-<Timestamp>-npu_info-R${RankID}_WS${world_size}.txt # Video memory monitoring logs
        |──xx-xx-<Timestamp>-run_log-R${RankID}_WS${world_size}.txt # Run logs
      |──......
    
  • VeRL
    |──{af_output_dir}                                   # Log output directory
        # Automatically generated data directory structure
        |──exp-config.yaml                               # Training configuration file
        |──training_loss.png                             # Reward and response curve image
        |──plog                                          # Model operator logs
          |──rank*
        |──logs                                          # Training log directory
          |──verl_grpo-{af_model_name}-<Sequence length-Device type-Timestamp>-run_log-N1-WS8.txt # Loss file
        |──saved_checkpoints                             # Training checkpoint directory
          |──global_step_{number_1}
            |──actor                                     # Weight file path. This directory is specified by the input parameter when weights are merged.
              |──huggingface                             # Output path for merged weights
          |──global_step_{number_N}
          |──latest_checkpointed_iteration.txt           # Iteration for saving the latest checkpoint
  • MindSpeed-RL
    |──{af_output_dir} # Value of the {af_output_dir} parameter, for example, as configured in the YAML file
        # Automatically generated data directory structure
        |──preprocessed_data                      # Data preprocessing directory
        |──converted_hf2mg_weight_TP${TP}PP${PP}  # Directory of weights converted from the Hugging Face format to the Megatron format
        |──converted_mg2hf_weight                 # Weights converted from the Megatron format to the Hugging Face format after training
        |──saved_checkpoints                      # Megatron weights after training
        |──logs                                   # Training logs
          |──grpo_metrics-<Time>.log                         # Training performance metric logs
          |──xx-xx-<Timestamp>-npu_info-R${RankID}.txt          # Training video memory monitoring logs
          |──xx-xx-<Timestamp>-run_log-${Nodes}-${RankID}.txt   # Training run logs
  • MindSpeed-MM
    |──{af_output_dir} # Value of the {af_output_dir} parameter, for example, as configured in the YAML file
        # Automatically generated data directory structure
        |──converted_weight_TP${TP}PP${PP}        # Directory of weights converted from the Hugging Face format to the Megatron format
        |──ckpt_converted_mg2hf                   # Weights converted from the Megatron format to the Hugging Face format after training
        |──saved_checkpoints                      # Megatron weights after training
        |──exp-config.yaml                        # Training script (generated by the ascendfactory-cli config tool)
        |──logs                                   # Training logs
          |──xx-xx-<Timestamp>-npu_info-R${RankID}.txt          # Training video memory monitoring logs
          |──xx-xx-<Timestamp>-run_log-${Nodes}-${RankID}.txt   # Training run logs
        |──......

Checking Performance

Training performance is mainly checked based on two metrics in the training logs: throughput and convergence.

  • MindSpeed-LLM
    1. Throughput (tokens/s/p): Global batch size x Sequence length/(Total number of PUs x Elapsed time per iteration in ms) x 1000. The global batch size (GBS) and sequence length (SEQ_LEN) are set during training and printed in the logs.
    2. Loss convergence: The log contains the lm loss field, whose value decreases continuously as training iterations progress and gradually stabilizes. You can also use the visualization tool TrainingLogParser to view the loss convergence.
  • Llama-Factory
    1. Throughput (tokens/s/p): The throughput is calculated from trainer_log.jsonl under the path specified by ${output_dir}. The value represents the average throughput across several intermediate steps and is computed as follows:

      delta_tokens = end_total_tokens - start_total_tokens

      delta_time = end_elapsed_time - start_elapsed_time

      Throughput (tps) = delta_tokens / delta_time / Number of training PUs


    2. Loss convergence: The loss convergence graph is saved as training_loss.png in your output directory (${output_dir}). To visualize the results, upload the trainer_log.jsonl file using the TrainingLogParser tool.
  • VeRL
    1. Throughput (tokens/s/p): The performance is calculated based on the verl_grpo-{af_model_name}-<Sequence length-Device type-Timestamp>-run_log-N1-WS8.txt file in the path specified by ${output_dir}. The perf/throughput field in the training log file indicates the throughput of a single-step calculation. You can use the visualization tool TrainingLogParser to view the average training throughput.
    2. Score convergence: The critic/score/mean value in the log approaches 1 and stabilizes over time during training iterations. You can also use the visualization tool TrainingLogParser to view the convergence of critic/score/mean.
  • MindSpeed-RL
    1. Throughput (tokens/s/p): The throughput is calculated from grpo_metrics-<Time>.log under the path specified by ${output_dir}. The last three metrics in the training log (e2e_tps, update_tps, and vllm_tps) show the end-to-end, training, and inference throughput for each step, respectively. Use the TrainingLogParser tool to view their overall average throughput.
    2. reward acc convergence
      • By default, Qwen2.5-7B and 32B use the DeepScaleR dataset, and their verify_function employs base_acc. This results in the presence of the grpo/base_acc_rewards/mean parameter in the logs. Its value steadily rises during training iterations before eventually stabilizing. You can also use the visualization tool TrainingLogParser to view the convergence of grpo/base_acc_rewards/mean.

      • The Qwen2.5-1.5B model performs relatively poorly, so it defaults to the math-17k dataset, with verify_function set to math_17k_acc. This produces the grpo/math_17k_acc_rewards/mean metric in the logs, which gradually stabilizes as training progresses. You can also use the visualization tool TrainingLogParser to view the convergence of grpo/math_17k_acc_rewards/mean.

  • MindSpeed-MM
    1. Throughput (samples/s): Global batch size/(Elapsed time per iteration in ms) x 1000. The global batch size (GBS) is set during training and printed in the logs.
    2. Loss convergence: The log contains the loss value, which decreases continuously as training iterations progress and gradually stabilizes. You can also use the visualization tool TrainingLogParser to view the loss convergence.
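
The MindSpeed-LLM and MindSpeed-MM throughput formulas above can be sketched as small helper functions. This is a minimal illustration only; the function names are hypothetical, and it assumes the elapsed time per iteration is reported in milliseconds (hence the factor of 1000), as in Megatron-style logs:

```python
def tokens_per_sec_per_pu(gbs, seq_len, num_pus, elapsed_ms):
    """MindSpeed-LLM throughput (tokens/s/p):
    GBS x SEQ_LEN / (number of PUs x elapsed time per iteration in ms) x 1000.
    """
    return gbs * seq_len / (num_pus * elapsed_ms) * 1000

def samples_per_sec(gbs, elapsed_ms):
    """MindSpeed-MM throughput (samples/s):
    GBS / (elapsed time per iteration in ms) x 1000.
    """
    return gbs / elapsed_ms * 1000

# Example values (hypothetical): GBS=64, SEQ_LEN=4096, 8 PUs, 2000 ms/iteration
print(tokens_per_sec_per_pu(64, 4096, 8, 2000))  # 16384.0 tokens/s/p
print(samples_per_sec(64, 2000))                 # 32.0 samples/s
```

Substitute the GBS, SEQ_LEN, PU count, and per-iteration elapsed time printed in your own training logs.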
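For Llama-Factory, the delta-based calculation above can be automated by parsing trainer_log.jsonl. The sketch below is an assumption-laden example: it presumes each JSON line carries numeric total_tokens and elapsed_time (in seconds) fields, and the function name is hypothetical; verify the actual key names in your log before use:

```python
import json

def average_throughput(jsonl_path, num_pus, start=0, end=-1):
    """Average tokens/s/p between two logged steps of trainer_log.jsonl.

    Assumes numeric 'total_tokens' and 'elapsed_time' (seconds) fields;
    adapt the key names to your log format.
    """
    with open(jsonl_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    # delta_tokens / delta_time / number of training PUs
    delta_tokens = records[end]["total_tokens"] - records[start]["total_tokens"]
    delta_time = records[end]["elapsed_time"] - records[start]["elapsed_time"]
    return delta_tokens / delta_time / num_pus
```

Picking start and end a few steps into training, as the document suggests, avoids skew from warmup iterations.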