Updated on 2025-11-04 GMT+08:00

MindSpeed-LLM

This section describes the YAML configuration file and parameters for training. You can choose parameters as required.

Configuring Parameters in the YAML File

Modify the YAML file.

  1. Choose either of the following dataset parameters (a configuration sketch follows this list).

    backend_config.preprocess_data.input
      Example value: relative or absolute path of the pre-training dataset (pre-training: pt) or of the fine-tuning dataset (fine-tuning: sft)
      Description: Input data path specified during training. Change it based on the actual situation.

    backend_config.training.data-path
      Example value: /home/ma-user/ws/xxx
      Description: Directory of the processed data. If the data has already been processed, set this parameter.
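
    A minimal sketch of the dataset part of the YAML file, assuming the dotted parameter names map to nested YAML keys (the paths are placeholders; set only one of the two options):

      # Option 1: raw dataset that still needs preprocessing
      backend_config:
        preprocess_data:
          input: /home/ma-user/ws/datasets/alpaca/train.json

      # Option 2: data that has already been processed (use instead of Option 1)
      # backend_config:
      #   training:
      #     data-path: /home/ma-user/ws/processed_data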

  2. Set the training scenario, weight file, output directory, and other important parameters (a configuration sketch follows this list).

    backend_config.training.tokenizer-name-or-path
      Example value: /home/ma-user/ws/llm_train/AscendFactory/model/llama2-70B
      Description: (Mandatory) Path for storing the tokenizer and Hugging Face weight files. Change it based on the actual situation.

    af_output_dir
      Example value: /home/ma-user/ws/save_dir
      Description: (Mandatory) Directory for storing the logs and weight files generated after training is complete.

    backend_config.preprocess_data.handler-name
      Example value: GeneralPretrainHandler, AlpacaStyleInstructionHandler, or SharegptStyleInstructionHandler
      Description: (Mandatory) Select a value based on the ${dataset}.
      • GeneralPretrainHandler: Alpaca-style dataset used for pre-training.
      • AlpacaStyleInstructionHandler: Alpaca-style dataset used for fine-tuning.
      • SharegptStyleInstructionHandler: ShareGPT-style dataset.

    backend_config.convert_ckpt_mg2hf
      Example value: null
      Description: Specifies whether to convert the Megatron-format weights to the Hugging Face format during training. The conversion is performed by default. If this parameter is set to null, the conversion is not performed.

    Resumable training

    backend_config.training.no-load-optim
    backend_config.training.no-load-rng
      Example value: false
      Description: Specifies whether to skip loading the optimizer and RNG states.
      • false: The optimizer and RNG states are loaded.
      • true: The optimizer and RNG states are not loaded.

    backend_config.training.finetune
      Example value: false
      Description: Specifies whether to reset the optimizer state and iteration count to 0.
      • true: Yes
      • false: No

    backend_config.training.load
      Example value: path/to/xxx
      Description: Path of the Megatron-format weights generated during training, which are loaded to resume training.

    Pre-training

    backend_config.training.stage
      Example value: false
      Description: Training type. Set this parameter to false for pre-training.
      • false: pre-training
      • sft: instruction fine-tuning

    backend_config.preprocess_data.handler-name
      Example value: GeneralPretrainHandler
      Description: Specifies the handler used to process the dataset.

    backend_config.training.is-instruction-dataset
      Example value: false
      Description: Specifies whether the dataset is an instruction dataset.
      • true: Yes
      • false: No

    backend_config.training.finetune
      Example value: false
      Description: Specifies whether to reset the optimizer state and iteration count to 0.
      • true: Yes
      • false: No

    reset-position-ids
      Example value: true
      Description: Resets the position IDs after each end-of-document token. Main scenario: when pack mode is used in dataset processing.
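
    A minimal sketch of these settings for a pre-training run, assuming the dotted parameter names map to nested YAML keys and that reset-position-ids also belongs under training (the paths are placeholders):

      af_output_dir: /home/ma-user/ws/save_dir
      backend_config:
        preprocess_data:
          handler-name: GeneralPretrainHandler
        training:
          tokenizer-name-or-path: /home/ma-user/ws/llm_train/AscendFactory/model/llama2-70B
          stage: false                  # pre-training
          is-instruction-dataset: false
          finetune: false
          reset-position-ids: true      # dataset processed in pack mode
          # Resumable training: uncomment to reload the previously saved Megatron
          # weights together with their optimizer and RNG states.
          # load: path/to/xxx
          # no-load-optim: false
          # no-load-rng: false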

  3. Set other parameters (a configuration sketch follows this list).

    backend_config.training.micro-batch-size
      Example value: 1
      Description: Number of samples processed per micro batch in pipeline parallelism, where the data of one step is divided into multiple micro batches to reduce the bubble time. The value is related to tensor-model-parallel-size, pipeline-model-parallel-size, and the model size; adjust it based on the site requirements. Short name: MBS.

    backend_config.training.global-batch-size
      Example value: 128
      Description: Number of samples processed by all servers in one training step, which affects the training iteration time. Short name: GBS.

    backend_config.training.tensor-model-parallel-size
      Example value: 8
      Description: Tensor parallelism size. Short name: TP.

    backend_config.training.pipeline-model-parallel-size
      Example value: 4
      Description: Pipeline parallelism size. Generally, the value is the number of training nodes and is the same as the value configured during weight conversion. Short name: PP.

    backend_config.training.context-parallel-size
      Example value: 1
      Description: Context parallelism size. The default value is 1. This parameter applies to training models with long sequences. If SEQ_LEN exceeds 32768 during training, a value greater than or equal to 2 is recommended. Short name: CP.

    backend_config.training.lr
      Example value: 2.5e-5
      Description: Learning rate.

    backend_config.training.min-lr
      Example value: 2.5e-6
      Description: Minimum learning rate.

    backend_config.training.train-iters
      Example value: 10
      Description: Number of training iterations. This parameter is optional and has a default value.

    backend_config.training.save-interval
      Example value: 1000
      Description: Model saving interval.
      • If the value is greater than or equal to TRAIN_ITERS, only the last version of the model, trained for TRAIN_ITERS iterations, is saved.
      • If the value is less than TRAIN_ITERS, a model version is saved every SAVE_INTERVAL iterations.
      Number of saved model versions = TRAIN_ITERS/SAVE_INTERVAL + 1

    backend_config.training.save-total-limit
      Example value: -1
      Description: Limits the number of saved weight versions.
      • It has no effect if left unset or set to 0 or less.
      • The value must be less than or equal to TRAIN_ITERS/SAVE_INTERVAL + 1.
      • If the value is greater than 1, the number of saved model versions equals SAVE_TOTAL_LIMIT.

    backend_config.training.load
      Example value: null
      Description: Weight loading path. By default, the weights are loaded. If the value is null, the weights are not loaded.
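
    A minimal sketch of these settings using the example values above (not a tuned configuration), again assuming the dotted parameter names map to nested YAML keys:

      backend_config:
        training:
          micro-batch-size: 1
          global-batch-size: 128
          tensor-model-parallel-size: 8
          pipeline-model-parallel-size: 4
          context-parallel-size: 1
          lr: 2.5e-5
          min-lr: 2.5e-6
          train-iters: 10
          save-interval: 1000
          save-total-limit: -1

    Note that with these example values save-interval (1000) is greater than train-iters (10), so only the final model version is saved.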

Model Parameter Constraints

  • Tensor parallelism, pipeline parallelism, and context parallelism: The number of NPUs (world_size) must be exactly divisible by TP x PP x CP.
  • num_attention_heads in the model parameters must be exactly divisible by TP x CP.
  • MBS (micro-batch-size) and GBS (global-batch-size): GBS must be exactly divisible by MBS x (number of NPUs/(TP x PP x CP)), that is, by MBS multiplied by the data parallel size.
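
  As a worked check, take the example values above (TP = 8, PP = 4, CP = 1, MBS = 1, GBS = 128) and an assumed cluster of 32 NPUs: 8 x 4 x 1 = 32 divides world_size (32), giving a data parallel size of 32/32 = 1; GBS = 128 is divisible by MBS x 1 = 1; and a model with 64 attention heads (such as Llama 2 70B) satisfies 64/(8 x 1) = 8 with no remainder.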