Executing a Training Job

Step 1: Generating a YAML File for Training Configuration

Create the model training configuration file in YAML format using either interactive mode or parameter input mode. In interactive mode, you select the required parameters after running the command. In parameter input mode, you provide all required parameters on the command line.
1. Interactive mode:
```
ascendfactory-cli config --output_file_path=<output_file_path>
```
2. Parameter input mode:
```
ascendfactory-cli config --backend=<backend> --af_model_name=<af_model_name> --exp_name=<exp_name> --output_file_path=<output_file_path>
```
  - <backend>: framework type. The options are mindspeed-llm, llamafactory, verl, mindspeed-rl, and mindspeed-mm.
  - <af_model_name>: name of the model to train.
  - <exp_name>: experiment type. For MindSpeed-LLM and LLaMA-Factory fine-tuning, the options include full-4k and lora-4k. For MindSpeed-LLM pre-training, the option is full-4k. (See MindSpeed-LLM for parameters.) For VeRL, the options are ppo, grpo, and dapo. For MindSpeed-RL, the option is grpo. For MindSpeed-MM, the option is full.
  - <output_file_path>: output directory and file name of the YAML file, for example, /path/to/xxx.yaml.
After the YAML file is generated, change the values of its key parameters as needed. For details about the parameters, see MindSpeed-LLM, LLaMA-Factory, VeRL, MindSpeed-RL, or MindSpeed-MM.
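
As an illustration of parameter input mode, the following shell sketch assembles the config command from its four parameters and prints it instead of running it. The helper names `is_valid_backend` and `build_config_cmd` and the model name `my-model` are hypothetical and exist only for this example; the backend options are the ones listed above.

```shell
#!/bin/sh
# Sketch: assemble the parameter-input-mode command and print it (dry run).
# Helper names and the model name are hypothetical, not part of ascendfactory-cli.

# Succeed only if the backend is one of the documented options.
is_valid_backend() {
    case "$1" in
        mindspeed-llm|llamafactory|verl|mindspeed-rl|mindspeed-mm) return 0 ;;
        *) return 1 ;;
    esac
}

# Arguments: <backend> <af_model_name> <exp_name> <output_file_path>
build_config_cmd() {
    if ! is_valid_backend "$1"; then
        echo "unsupported backend: $1" >&2
        return 1
    fi
    printf 'ascendfactory-cli config --backend=%s --af_model_name=%s --exp_name=%s --output_file_path=%s\n' \
        "$1" "$2" "$3" "$4"
}

# Hypothetical values; substitute a model name that your installation supports.
build_config_cmd mindspeed-llm my-model lora-4k /path/to/xxx.yaml
```

Printing the command first makes it easy to review the parameters before generating the file. Note that the sketch checks only the backend; it does not check whether the experiment type is valid for that backend.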

Step 2: Starting a Training Job

Run the training command in any directory, for example, the test_benchmark directory.

For details about the minimum number of PUs in the pre-training and fine-tuning phases, see Minimum Number of PUs and Sequence Length Supported by Each Model.

(Optional) Single node:

```
# Default: 8 PUs
ascendfactory-cli train <cfgs_yaml_file> --env.MASTER_ADDR=localhost --env.NNODES=1 --env.NODE_RANK=0
# Specify the devices to use, for example, 2 PUs.
ASCEND_RT_VISIBLE_DEVICES=0,1 ascendfactory-cli train <cfgs_yaml_file> --env.MASTER_ADDR=localhost --env.NNODES=1 --env.NODE_RANK=0
# Override a parameter in the YAML file, for example, af_output_dir, by passing it as a hyperparameter on the command line.
ASCEND_RT_VISIBLE_DEVICES=0,1 ascendfactory-cli train <cfgs_yaml_file> --af_output_dir=xxx --env.MASTER_ADDR=localhost --env.NNODES=1 --env.NODE_RANK=0
```
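
When more than a couple of devices are needed, the value of ASCEND_RT_VISIBLE_DEVICES can be generated rather than typed by hand. The sketch below assumes that the devices to use are numbered contiguously from 0; the helper name `device_list` is hypothetical. It prints the resulting single-node command instead of running it.

```shell
#!/bin/sh
# Sketch: build a comma-separated device list "0,1,...,N-1".
# Assumes the devices to use are numbered contiguously from 0.

# Argument: <number of PUs>
device_list() {
    count=$1
    list=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # Append a comma only when the list already has an entry.
        list="${list:+$list,}$i"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Dry run for 2 PUs. <cfgs_yaml_file> is kept as the placeholder used above.
echo "ASCEND_RT_VISIBLE_DEVICES=$(device_list 2) ascendfactory-cli train <cfgs_yaml_file> --env.MASTER_ADDR=localhost --env.NNODES=1 --env.NODE_RANK=0"
```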

(Optional) Multiple nodes: Run one of the following commands on every node at the same time:

```
# Use the updated YAML file as is, without overriding any parameter on the command line.
ascendfactory-cli train <cfgs_yaml_file> --env.MASTER_ADDR=<master_addr> --env.NNODES=<nnodes> --env.NODE_RANK=<rank>
# Override a parameter in the YAML file, for example, af_output_dir, by passing it as a hyperparameter on the command line.
ascendfactory-cli train <cfgs_yaml_file> --env.MASTER_ADDR=<master_addr> --env.NNODES=<nnodes> --env.NODE_RANK=<rank> --af_output_dir=xxx
```

- <cfgs_yaml_file>: relative or absolute path of the YAML configuration file.
- --env.MASTER_ADDR=<master_addr>: IP address of the master node. Generally, the node with rank 0 serves as the master node.
- --env.NNODES=<nnodes>: total number of training nodes.
- --env.NODE_RANK=<rank>: node ID, starting from 0.
- --<key>=<value>: hyperparameter that overrides the value of <key> in the YAML file, for example, --af_output_dir=xxx. For details about the available keys, see MindSpeed-LLM, LLaMA-Factory, VeRL, MindSpeed-RL, or MindSpeed-MM.
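
In a multi-node job, every node runs the same command and only the node rank differs. The following sketch prints the command for each node so that the values can be checked before launch. The helper name `print_node_cmds`, the address 192.0.2.10 (an address reserved for documentation), and the two-node setup are assumptions for illustration; the sketch does not start any training.

```shell
#!/bin/sh
# Sketch: print the training command for every node in a multi-node job.
# The helper name and example values are hypothetical.

# Arguments: <master_addr> <nnodes> <cfgs_yaml_file>
print_node_cmds() {
    master_addr=$1
    nnodes=$2
    cfg=$3
    rank=0
    while [ "$rank" -lt "$nnodes" ]; do
        # Ranks start from 0; only NODE_RANK changes between nodes.
        printf 'node %s: ascendfactory-cli train %s --env.MASTER_ADDR=%s --env.NNODES=%s --env.NODE_RANK=%s\n' \
            "$rank" "$cfg" "$master_addr" "$nnodes" "$rank"
        rank=$((rank + 1))
    done
}

# Hypothetical two-node job.
print_node_cmds 192.0.2.10 2 /path/to/xxx.yaml
```

Each printed line is then run on the corresponding node, with rank 0 on the node whose IP address is passed as the master address.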