更新时间:2025-07-29 GMT+08:00
分享

显存溢出错误

在训练过程中,常见显存溢出报错,示例如下:

RuntimeError: NPU out of memory. Tried to allocate 1.04 GiB (NPU 4; 60.97 GiB total capacity; 56.45 GiB already allocated; 56.45 GiB current active; 1017.81 MiB free; 56.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

解决方法:

通过npu-smi info查看是否有进程资源占用NPU,导致训练时显存不足。解决可通过kill掉残留的进程或等待资源释放。

相关文档