
Updated on 2025-07-17 GMT+08:00


What Can I Do If There Is an Abnormal Pod and a Message Stating That the Device Files Can't Be Found?

Symptom

A pod fails to be created, and an error message similar to the following is displayed:

Error: failed to generate container "af736..." spec: failed to apply OCI options: lstat /dev/davinci4: no such file or directory

Run the following command to count the accelerator devices that are visible on the PCIe bus:

lspci | grep -i accelerator | wc -l

If the number returned is less than the number of NPUs installed on the node, at least one PCIe link is down.
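The comparison above can be scripted. The following is a minimal sketch, not part of the product tooling: the function names are hypothetical, and the expected NPU count must be supplied by you because it depends on the node flavor.

```shell
#!/bin/sh
# Sketch only. The function names and the expected count are assumptions
# made for illustration; they do not come from the CCE documentation.

# Reads lspci output on stdin and prints how many lines mention "accelerator"
# (the same filter as: lspci | grep -i accelerator | wc -l).
count_accelerators() {
    grep -i accelerator | wc -l | tr -d ' '
}

# check_accelerators EXPECTED
# Reads lspci output on stdin and compares the accelerator count with EXPECTED.
# Returns 1 and prints a warning if fewer devices are visible than expected.
check_accelerators() {
    expected="$1"
    found="$(count_accelerators)"
    if [ "$found" -lt "$expected" ]; then
        echo "WARN: found $found of $expected accelerators; possible PCIe link down"
        return 1
    fi
    echo "OK: found $found of $expected accelerators"
}
```

On a node that is expected to have, for example, eight NPUs, this could be invoked as: lspci | check_accelerators 8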

Possible Cause

When the PCIe link of an Ascend Snt9 device goes down, the device driver does not report the fault. Consequently, tasks continue to be scheduled to the disconnected NPU.

You can run the following command to check whether a PCIe link is down:

npu-smi info -m

Information similar to the following is displayed:

NPU ID    Chip ID     Chip Logic ID     Chip Name
0         0           0                 Ascend xxx
0         1           -                 Mcu
1         0           1                 Ascend xxx
1         1           -                 Mcu
2         0           2                 Ascend xxx
2         1           -                 Mcu
3         0           3                 Ascend xxx
3         1           -                 Mcu
5         0           4                 Ascend xxx
5         1           -                 Mcu
6         0           5                 Ascend xxx
6         1           -                 Mcu
7         0           6                 Ascend xxx
7         1           -                 Mcu

NPU ID 4 is missing from the output, which indicates that the PCIe link of that NPU is down. When the link went down, the device driver renumbered the chip logic IDs of the remaining NPUs (NPU IDs 5 to 7 now map to chip logic IDs 4 to 6), so chip logic ID 7 no longer exists. When the CCE AI Suite (Ascend NPU) add-on reported device information, only the chip logic IDs were updated, while the mapping between chip logic IDs and NPU IDs remained unchanged. Consequently, Kubernetes marked the NPU that originally corresponded to chip logic ID 7 as faulty and unavailable, while the lost NPU, which originally corresponded to chip logic ID 4, was still treated as an available resource by the scheduler. This is consistent with the symptom: a pod scheduled to the lost NPU fails because its device file (/dev/davinci4 in the error message above) cannot be found.
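Spotting the gap in the NPU ID column by eye is error-prone on large nodes. The following minimal sketch lists the NPU IDs that are absent from the npu-smi info -m output. The function name is hypothetical, and the parsing assumes the four-column layout shown above, in which NPU rows carry a numeric chip logic ID and Mcu rows carry a hyphen.

```shell
#!/bin/sh
# Sketch only. find_missing_npus is a hypothetical helper; the parsing
# assumes the column layout of the sample output shown in this article.

# find_missing_npus [EXPECTED_COUNT]
# Reads `npu-smi info -m` output on stdin and prints every NPU ID that is
# missing from the range 0..max. If EXPECTED_COUNT is given, the range is
# 0..EXPECTED_COUNT-1, which also catches a lost NPU with the highest ID.
find_missing_npus() {
    awk -v expected="${1:-0}" '
        BEGIN { max = -1 }
        # Keep only rows whose NPU ID and chip logic ID are both numeric.
        # This skips the header row and the Mcu rows.
        $1 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
            seen[$1] = 1
            if ($1 + 0 > max) max = $1 + 0
        }
        END {
            last = (expected + 0 > 0) ? expected - 1 : max
            for (id = 0; id <= last; id++)
                if (!(id in seen)) print id
        }
    '
}
```

On a node, this could be invoked as: npu-smi info -m | find_missing_npus 8 (where 8 is an assumed NPU count for the node). Without the argument, a lost NPU with the highest ID goes undetected, because the output alone does not reveal how many NPUs should be present.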

Solution

Upgrade the Snt9 device driver to version 24.1.rc2 or later. The new driver no longer rearranges the chip logic IDs after a PCIe link goes down, so accurate device information can be reported.

After upgrading the driver, run the following command to verify the result:

npu-smi info -m

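In output similar to the following, the new driver does not rearrange the chip logic IDs after a PCIe link goes down. NPU ID 4 is still missing, but the remaining NPUs keep their original chip logic IDs (for example, NPU ID 5 still maps to chip logic ID 5).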

NPU ID    Chip ID     Chip Logic ID     Chip Name
-1        0           -                 Mcu
0         0           0                 Ascend xxx
0         1           -                 Mcu
1         0           1                 Ascend xxx
1         1           -                 Mcu
2         0           2                 Ascend xxx
2         1           -                 Mcu
3         0           3                 Ascend xxx
3         1           -                 Mcu
5         0           5                 Ascend xxx
5         1           -                 Mcu
6         0           6                 Ascend xxx
6         1           -                 Mcu
7         0           7                 Ascend xxx
7         1           -                 Mcu
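The verification can also be automated. The following minimal sketch prints every NPU whose chip logic ID differs from its NPU ID. It relies on an assumption drawn only from the sample outputs in this article, namely that each NPU has a single Ascend chip whose chip logic ID equals the NPU ID when no renumbering has occurred. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch only. find_renumbered_npus is a hypothetical helper. It assumes
# that on this device type the chip logic ID equals the NPU ID unless the
# driver has renumbered the chips, as in the sample outputs of this article.

# Reads `npu-smi info -m` output on stdin and prints one line for each NPU
# whose chip logic ID does not match its NPU ID. No output means that no
# renumbering was detected.
find_renumbered_npus() {
    awk '
        # Consider only rows with a numeric NPU ID and a numeric chip logic ID,
        # which skips the header row and the Mcu rows.
        $1 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $1 != $3 {
            printf "NPU %s has chip logic ID %s\n", $1, $3
        }
    '
}
```

On a node, this could be invoked as: npu-smi info -m | find_renumbered_npus. Empty output after the upgrade suggests that the chip logic IDs were preserved.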
