Executing a Training Job
Step 1: Generating a YAML File for Training Configuration
- Create the model training configuration file in YAML format, using either interactive mode or parameter input mode. In parameter input mode, you provide all required parameters on the command line. In interactive mode, you select the required parameters after running the command.
- Interactive mode:
ascendfactory-cli config --output_file_path=<output_file_path>
- Parameter input mode:
ascendfactory-cli config --backend=<backend> --af_model_name=<af_model_name> --exp_name=<exp_name> --output_file_path=<output_file_path>
- <backend>: framework type. The options are mindspeed-llm, llamafactory, verl, mindspeed-rl, and mindspeed-mm.
- <af_model_name>: name of the model to train.
- <exp_name>: experiment type. For MindSpeed-LLM and LlaMA-Factory fine-tuning, the options are full-4k, lora-4k, and more. For MindSpeed-LLM pre-training, the option is full-4k. (See MindSpeed-LLM for parameters.) For VeRL, the options are ppo, grpo, and dapo. For MindSpeed-RL, the option is grpo. For MindSpeed-MM, the option is full.
- <output_file_path>: output directory and file name of the YAML file, for example, /path/to/xxx.yaml.
- Change the values of key parameters in the generated YAML file. For details about the parameters, see MindSpeed-LLM, LlaMA-Factory, VeRL, MindSpeed-RL, or MindSpeed-MM.
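The parameter input mode in Step 1 can be sketched as a small shell helper. This is only an illustration: it prints the command for review instead of running it, build_config_cmd is a hypothetical helper name, and <af_model_name> is left as a placeholder because the supported model names are not listed in this section.

```shell
#!/bin/sh
# Sketch: assemble the parameter-input-mode command from Step 1.
# Only flags named in this section are used. The values are illustrative.
BACKEND="mindspeed-llm"              # one of the documented backends
AF_MODEL_NAME="<af_model_name>"      # placeholder: replace with a supported model name
EXP_NAME="lora-4k"                   # documented fine-tuning experiment type
OUTPUT_FILE_PATH="/path/to/xxx.yaml" # output YAML path, as in the example above

# Hypothetical helper: prints the command line instead of executing it.
# $1: backend, $2: model name, $3: experiment type, $4: output file path
build_config_cmd() {
    printf 'ascendfactory-cli config --backend=%s --af_model_name=%s --exp_name=%s --output_file_path=%s\n' \
        "$1" "$2" "$3" "$4"
}

build_config_cmd "$BACKEND" "$AF_MODEL_NAME" "$EXP_NAME" "$OUTPUT_FILE_PATH"
```

After checking the printed line, replace the placeholder with a real model name and run the command in a shell where ascendfactory-cli is installed.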
Step 2: Starting a Training Job
- Run the training command in any directory, for example, the test_benchmark directory.
For details about the minimum number of PUs in the pre-training and fine-tuning phases, see Minimum Number of PUs and Sequence Length Supported by Each Model.
(Optional) Single node:
# Default: 8 PUs
ascendfactory-cli train <cfgs_yaml_file> --env.MASTER_ADDR=localhost --env.NNODES=1 --env.NODE_RANK=0
# Specify the number of devices, for example, 2 PUs.
ASCEND_RT_VISIBLE_DEVICES=0,1 ascendfactory-cli train <cfgs_yaml_file> --env.MASTER_ADDR=localhost --env.NNODES=1 --env.NODE_RANK=0
# Specify the value of a parameter in the YAML file, for example, af_output_dir. The parameter is input using hyperparameter commands.
ASCEND_RT_VISIBLE_DEVICES=0,1 ascendfactory-cli train <cfgs_yaml_file> --af_output_dir=xxx --env.MASTER_ADDR=localhost --env.NNODES=1 --env.NODE_RANK=0
(Optional) Multiple nodes: Run the following commands on all nodes at the same time:
# Use the updated YAML file directly, without passing parameters that override its values.
ascendfactory-cli train <cfgs_yaml_file> --env.MASTER_ADDR=<master_addr> --env.NNODES=<nnodes> --env.NODE_RANK=<rank>
# Specify the value of a parameter in the YAML file, for example, af_output_dir. The parameter is passed as a command-line hyperparameter.
ascendfactory-cli train <cfgs_yaml_file> --env.MASTER_ADDR=<master_addr> --env.NNODES=<nnodes> --env.NODE_RANK=<rank> --af_output_dir=xxx
- <cfgs_yaml_file>: relative or absolute path of the configuration YAML file.
- --env.MASTER_ADDR=<master_addr>: IP address of the active master node. Generally, rank 0 is selected as the active master node.
- --env.NNODES=<nnodes>: total number of training nodes.
- --env.NODE_RANK=<rank>: node ID, starting from 0. Generally, rank 0 is selected as the active master node.
- --<key>: hyperparameter that overrides the parameter of the same name in the YAML file, for example, --af_output_dir=xxx. For details about the parameter keys, see MindSpeed-LLM, LlaMA-Factory, VeRL, MindSpeed-RL, or MindSpeed-MM.
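The multi-node launch described above can be sketched as follows. The script only prints the command that each node would run, so the rank assignment can be checked before launching. build_train_cmd is a hypothetical helper name, and the master address 192.0.2.10 and the node count of 2 are assumed example values, not values from this guide.

```shell
#!/bin/sh
# Sketch: print the train command for every node of a multi-node job.
# Only flags named in this section are used.

# Hypothetical helper: prints the command line instead of executing it.
# $1: YAML path, $2: master node address, $3: total node count, $4: rank of this node
build_train_cmd() {
    printf 'ascendfactory-cli train %s --env.MASTER_ADDR=%s --env.NNODES=%s --env.NODE_RANK=%s\n' \
        "$1" "$2" "$3" "$4"
}

CFG="/path/to/xxx.yaml"   # configuration file generated in Step 1
MASTER_ADDR="192.0.2.10"  # assumed example address of the rank 0 node
NNODES=2                  # assumed example node count

# Ranks start from 0, and rank 0 is the master node.
rank=0
while [ "$rank" -lt "$NNODES" ]; do
    build_train_cmd "$CFG" "$MASTER_ADDR" "$NNODES" "$rank"
    rank=$((rank + 1))
done
```

Each printed line is meant to run on a different node. Only the --env.NODE_RANK value differs between nodes, while --env.MASTER_ADDR and --env.NNODES stay the same on all of them.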