Updated on 2025-11-04 GMT+08:00

Common Error Causes and Solutions

The following sections describe common errors that occur during training and their solutions.

Out of Memory

During training, the following out of memory error is reported:

RuntimeError: NPU out of memory. Tried to allocate 1.04 GiB (NPU 4; 60.97 GiB total capacity; 56.45 GiB already allocated; 56.45 GiB current active; 1017.81 MiB free; 56.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

[Solution] The solutions for MindSpeed-LLM and Llama-Factory are different. Select a solution based on the framework.

  • MindSpeed-LLM
    • Run the npu-smi info command to check whether residual processes are occupying NPU resources and causing the memory shortage during training. If so, kill the residual processes or wait for the resources to be released.
    • Adjust the values of tensor-model-parallel-size (TP) and pipeline-model-parallel-size (PP). Ensure that TP x PP does not exceed the number of NPUs and that the number of NPUs is exactly divisible by TP x PP.
    • Adjust the values of micro-batch-size (MBS, number of samples processed per micro batch) and global-batch-size (GBS, number of samples processed in an iteration). Set micro-batch-size to 1 and ensure that GBS/MBS is exactly divisible by NPUs/(TP x PP), that is, by the data-parallel size. A quick way to check these constraints is shown in the sketch after this list.
    • Reduce SEQ_LEN, which controls the maximum sequence length to be processed. An excessively high value can cause out-of-memory errors.
    • Add the recomputation parameters to the 3_training.sh file. Set recompute-num-layers to the value of num-layers in the model network.
      --recompute-granularity full \
      --recompute-method block \
      --recompute-num-layers {NUM_LAYERS} \
  • Llama-Factory
    • Set per_device_train_batch_size (number of samples processed per device in a batch) to 1.
    • Increase DeepSpeed's ZeRO stage step by step.
      • ZeRO-0: Data is distributed to different NPUs.
      • ZeRO-1: Optimizer states are distributed to different NPUs.
      • ZeRO-2: Optimizer states and gradients are distributed to different NPUs.
      • ZeRO-2-Offload: Optimizer states and gradients are distributed to different NPUs, and offloading is enabled.
      • ZeRO-3: Optimizer states, gradients, and model parameters are distributed to different NPUs.
      • ZeRO-3-Offload: Optimizer states, gradients, and model parameters are distributed to different NPUs, and offloading is enabled.
    • Gradually increase the number of NPUs used for training.
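
For reference, the following is a minimal sketch that checks the MindSpeed-LLM parallelism and batch-size constraints before launching 3_training.sh. The NPU count and the TP, PP, MBS, and GBS values are example assumptions, not values from any shipped script.

NPUS=8; TP=2; PP=2; MBS=1; GBS=64                 # example values; use the settings from your 3_training.sh
DP=$((NPUS / (TP * PP)))                          # data-parallel size
if [ $((NPUS % (TP * PP))) -ne 0 ]; then echo "Error: the number of NPUs must be divisible by TP x PP"; fi
if [ $(((GBS / MBS) % DP)) -ne 0 ]; then echo "Error: GBS/MBS must be divisible by NPUs/(TP x PP)"; fi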

Incorrect NIC Name

If the system displays a message indicating that the NIC name is incorrect or the communication times out when the training starts, run the ifconfig command to check whether the NIC name is correct.

For example, if the ifconfig command shows that the server's IP address is bound to the NIC named enp67s0f5, set this NIC name in the following environment variables.

Figure 1 Incorrect NIC name
export GLOO_SOCKET_IFNAME=enp67s0f5   # NIC used when multiple nodes communicate with each other over GLOO
export TP_SOCKET_IFNAME=enp67s0f5     # NIC used when multiple nodes communicate with each other over TP
export HCCL_SOCKET_IFNAME=enp67s0f5   # NIC used when multiple nodes communicate with each other over HCCL
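
If you are unsure which NIC carries the node's IP address, a quick check such as the following can help (the interface name is only illustrative):

ifconfig -a                              # list all NICs and the IP addresses bound to them
ip addr show enp67s0f5                   # or inspect a specific NIC, for example enp67s0f5
echo $GLOO_SOCKET_IFNAME $TP_SOCKET_IFNAME $HCCL_SOCKET_IFNAME   # confirm that the variables are set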

For details about environment variables, see Distributed communication package - torch.distributed.

Timeout Error for Checkpoint Saving

After multi-node cluster training finishes, only some nodes save the weights while the others wait at a communication barrier. If the wait exceeds 36 minutes, a timeout error is reported.

Figure 2 Error message

[Solution]

  1. Check whether the disk I/O bandwidth is sufficient to save the checkpoint in under 36 minutes. For a single node, the maximum checkpoint size is 60 GB; keeping it below 40 GB is recommended for better performance. As long as the file is written within 36 minutes, the timeout does not occur. A rough bandwidth check is sketched after this list.
  2. Alternatively, ignore the error; it does not affect the saved weights.
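
A minimal sketch of such a bandwidth check, assuming the checkpoint directory is /path/to/ckpt (the path and test size are illustrative):

dd if=/dev/zero of=/path/to/ckpt/iotest.bin bs=1M count=4096 oflag=direct   # write a 4 GB test file with direct I/O and note the reported MB/s
rm -f /path/to/ckpt/iotest.bin
# Rough estimate: a 60 GB checkpoint takes about 60*1024 / <measured MB/s> seconds; keep this under 36 minutes (2160 s)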

Installing Third-Party Dependency Packages Failed Using Dockerfile or install.sh

[Symptom]

Downloading and installing the third-party dependency packages listed in AscendFactory/dependences.yaml, such as Llama-Factory and MindSpeed-LLM, failed.

[Root Cause]

Fetching the repositories from Git failed because the server has no internet connection.

[Solution]

Set up a proxy or use a server with internet access to download third-party dependencies listed in AscendFactory/dependences.yaml. Ensure the package names and versions match the ${save_name} and ${version} entries in the file. Move these packages to the AscendFactory/third-party folder, then rerun the Dockerfile or install.sh script.
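
A possible workflow on a machine with internet access is sketched below. The repository URL and the <save_name> and <version> placeholders must be taken from dependences.yaml, and the exact packaging format expected under third-party is defined by that file; this is only an illustration.

git clone <repo_url> <save_name>                      # values in angle brackets come from dependences.yaml
cd <save_name> && git checkout <version> && cd ..
scp -r <save_name> user@train-server:/path/to/AscendFactory/third-party/   # copy to the offline server's third-party folder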

Timeout Occurs When the Llama-Factory Framework Preprocesses a Large Dataset

[Root Cause]

The Llama-Factory framework preprocesses the data on NPU 0 first and only then on NPUs 1 to 7. For a large dataset, this sequential processing takes too long and triggers a timeout.

[Solution]

  • Solution A: Change the LLaMA-Factory barrier policy so that NPUs 0 to 7 process the data simultaneously instead of NPU 0 first followed by NPUs 1 to 7. Execute this command before starting the training:
    export DISABLE_MAIN_PROCESS_FIRST=True
  • Solution B: Keep the default processing policy but set the training job's timeout to 2 hours. Execute this command before starting the training:
    export ACL_DEVICE_SYNC_TIMEOUT=7200

    Solution B is user-friendly but may time out after two hours with very large datasets. You can adjust the timeout setting as needed.

Llama-Factory-based Training Suspended at a Certain Step

[Symptom]

The multi-node training job hangs at a specific step for two hours, causing the job to time out.

[Root Cause]

The ascend_trace thread holds a lock while capturing the call stack. The dataloader_worker process is forked while the lock is held and can therefore never acquire it, causing the process to hang.

[Solution]

Before starting a training job, set the ASCEND_COREDUMP_SIGNAL=none environment variable to disable stack-trace capturing.

export ASCEND_COREDUMP_SIGNAL=none

Running setup.py in the Dockerfile or install.sh in the Llama-Factory Environment Failed

[Symptom]

Running setup.py in the Llama-Factory code directory failed. The error message "SetuptoolsDeprecationWarning: License classifiers are deprecated." is displayed.

Figure 3 Error message reported by setup.py

[Root Cause]

The installed pip version is too old, causing conflicts with other dependency packages.

[Solution]

  1. Add the pip install --upgrade pip command to the AscendFactory/install.sh file (see the excerpt after these steps).

  2. Run the Dockerfile or install.sh again.
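
A sketch of the change, assuming the upgrade is placed near the top of install.sh before the other installation steps (the surrounding lines are placeholders, not the actual file content):

# AscendFactory/install.sh (excerpt)
pip install --upgrade pip          # upgrade pip first to avoid the setuptools/classifier conflict
# ... existing dependency installation steps follow ...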

MindSpeed-LLM Distilled Model Training Precision Issues

[Root Cause]

The MindSpeed-LLM framework uses fixed parameter values that do not match the distilled model's setup, causing training precision issues.

[Solution]

Update the parameter values in scripts_modellink/{model}/3_training.sh to match those in the config.json file before starting training. Refer to the table below for details.

Table 1 Parameter values to be changed in the 3_training.sh file

Distilled Model                     Original Model      3_training.sh Parameter
DeepSeek-R1-Distill-Qwen-7B         qwen2.5-7b          --rotary-base 10000
DeepSeek-R1-Distill-Qwen-14B/32B    qwen2.5-14b/32b     --norm-epsilon 1e-5
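
For example, for DeepSeek-R1-Distill-Qwen-7B the corresponding line in 3_training.sh would read as follows (a sketch; --rotary-base corresponds to rope_theta in the distilled model's config.json, and the surrounding arguments stay unchanged):

    --rotary-base 10000 \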