# Monitored BMS Events

## Description
Event monitoring provides event reporting, query, and alarm generation. You can create alarm rules for both system events and custom events. When a specific event occurs, Cloud Eye generates an alarm for you.
## Namespace
SYS.BMS
## Monitored Events
| Event Name | Event ID | Event Severity | Description | Handling Solution | Impact |
| --- | --- | --- | --- | --- | --- |
| ECC uncorrectable error alarm generated on GPU SRAM | SRAMUncorrectableEccError | Major | There are ECC uncorrectable errors on GPU SRAM. | If services are affected, submit a service ticket. | The GPU hardware may be faulty. As a result, the SRAM is faulty, and services exit abnormally. |
| BMS restarted | osReboot | Major | The BMS was restarted. | | Services are interrupted. |
| Unexpected restart | serverReboot | Major | The BMS was restarted unexpectedly. | | Services are interrupted. |
| BMS stopped | osShutdown | Major | The BMS was stopped. | | Services are interrupted. |
| BMS unexpected shutdown | serverShutdown | Major | The BMS was stopped unexpectedly. | | Services are interrupted. |
| Network disconnection | linkDown | Major | The BMS network was disconnected. | | Services are interrupted. |
| PCIe error | pcieError | Major | A PCIe device on the BMS was faulty. | | Network or disk read/write services are affected. |
| Disk fault | diskError | Major | Disks on the BMS were faulty. | | Data read/write services are affected, or the BMS cannot be started. |
| EVS error | storageError | Major | The BMS failed to connect to EVS disks. | | Data read/write services are affected, or the BMS cannot be started. |
| Inforom alarm generated on GPU | gpuInfoROMAlarm | Major | The infoROM of the GPU is abnormal. The infoROM is an important storage area of the GPU firmware and stores key data loaded during startup. | Non-critical services can continue to use the GPU. For critical services, submit a service ticket to resolve this issue. | Services will not be affected. If ECC errors are reported on the GPU, faulty pages may not be automatically retired, and services are affected. |
| Double-bit ECC alarm generated on GPU | doubleBitEccError | Major | A double-bit error occurred in the ECC memory of the GPU. ECC cannot correct the error, which may cause programs to crash. | | Services may be interrupted. After faulty pages are retired, the GPU can continue to be used. |
| Too many retired pages | gpuTooManyRetiredPagesAlarm | Major | An ECC page retirement error occurred on the GPU. When an uncorrectable ECC error occurs on a GPU memory page, the GPU marks the page as retired. | If services are affected, submit a service ticket. | If there are too many ECC errors, services may be affected. |
| ECC alarm generated on GPU A100 | gpuA100EccAlarm | Major | An ECC error occurred on the GPU. | | Services may be interrupted. After faulty pages are retired, the GPU can continue to be used. |
| ECC alarm generated on GPU Ant1 | gpuAnt1EccAlarm | Major | An ECC error occurred on the GPU. | | Services may be interrupted. After faulty pages are retired, the GPU can continue to be used. |
| GPU ECC memory page retirement failure | eccPageRetirementRecordingFailure | Major | Automatic page retirement failed due to ECC errors. | | Services may be interrupted, and memory page retirement fails. As a result, services can no longer use the GPU. |
| GPU ECC page retirement alarm generated | eccPageRetirementRecordingEvent | Minor | Memory pages are automatically retired due to ECC errors. | 1. If services are interrupted, restart them. 2. If services cannot be restarted, restart the VM where the services are running. 3. If services still cannot be restored, submit a service ticket. | Generally, this alarm is generated together with the ECC error alarm. If this alarm is generated independently, services are not affected. |
| Too many single-bit ECC errors on GPU | highSingleBitEccErrorRate | Major | There are too many single-bit errors in the ECC memory of the GPU. | | Single-bit errors are automatically rectified and generally do not affect GPU applications. |
| GPU card not found | gpuDriverLinkFailureAlarm | Major | The GPU link is normal, but the GPU cannot be found by the NVIDIA driver. | 1. Restart the VM to restore services. 2. If services still cannot be restored, submit a service ticket. | The GPU cannot be found. |
| GPU link faulty | gpuPcieLinkFailureAlarm | Major | GPU hardware information cannot be queried through lspci due to a GPU link fault. | If services are affected, submit a service ticket. | The driver cannot use the GPU. |
| VM GPU lost | vmLostGpuAlarm | Major | The number of GPUs on the VM is less than the number specified in the specifications. | If services are affected, submit a service ticket. | GPUs are lost. |
| GPU memory page faulty | gpuMemoryPageFault | Major | The GPU memory page is faulty, possibly caused by applications, drivers, or hardware. | If services are affected, submit a service ticket. | The GPU hardware may be faulty. As a result, the GPU memory is faulty, and services exit abnormally. |
| GPU image engine faulty | graphicsEngineException | Major | The GPU image engine is faulty, possibly caused by applications, drivers, or hardware. | If services are affected, submit a service ticket. | The GPU hardware may be faulty. As a result, the image engine is faulty, and services exit abnormally. |
| GPU temperature too high | highTemperatureEvent | Major | The GPU temperature is too high. | If services are affected, submit a service ticket. | If the GPU temperature exceeds the threshold, GPU performance may deteriorate. |
| GPU NVLink faulty | nvlinkError | Major | A hardware fault occurred on the NVLink. | If services are affected, submit a service ticket. | The NVLink is faulty and unavailable. |
| System maintenance inquiring | system_maintenance_inquiring | Major | Authorization for a scheduled BMS maintenance task is being requested. | Authorize the maintenance. | None |
| System maintenance waiting | system_maintenance_scheduled | Major | The scheduled BMS maintenance task is waiting to be executed. | Clarify the impact on services during the execution window. | None |
| System maintenance canceled | system_maintenance_canceled | Major | The scheduled BMS maintenance has been canceled. | None | None |
| System maintenance executing | system_maintenance_executing | Major | BMSs are being maintained as scheduled. | After the maintenance is complete, check whether services are affected. | Services are interrupted. |
| System maintenance completed | system_maintenance_completed | Major | The scheduled BMS maintenance is complete. | Wait until the BMSs become available, and check whether services recover. | None |
| System maintenance failure | system_maintenance_failed | Major | The scheduled BMS maintenance task failed. | Contact O&M personnel. | Services are interrupted. |
| GPU Xid error | commonXidError | Major | An Xid event alarm was generated on the GPU. | If services are affected, submit a service ticket. | Xid errors are caused by GPU hardware, driver, or application problems and may result in abnormal service exits. |
| NPU: device not found by npu-smi info | NPUSMICardNotFound | Major | The Ascend driver is faulty, or the NPU is disconnected. | Transfer this issue to the Ascend or hardware team. | The NPU cannot be used normally. |
| NPU: PCIe link error | PCIeErrorFound | Major | The lspci command returned "rev ff", indicating that the NPU is abnormal. | Restart the BMS. If the issue persists, transfer it to the hardware team. | The NPU cannot be used normally. |
| NPU: device not found by lspci | LspciCardNotFound | Major | The NPU is disconnected. | Transfer this issue to the hardware team. | The NPU cannot be used normally. |
| NPU: overtemperature | TemperatureOverUpperLimit | Major | The temperature of the DDR memory or software is too high. | Stop services, restart the BMS, check the heat dissipation system, and reset the devices. | The BMS may be powered off, and devices may not be found. |
| NPU: uncorrectable ECC error | UncorrectableEccErrorCount | Major | There are uncorrectable ECC errors on the NPU. | If services are affected, replace the NPU. | Services may be interrupted. |
| NPU: request for BMS restart | RebootVirtualMachine | Warning | The BMS needs to be restarted. | Collect the required information and restart the BMS. | Services may be interrupted. |
| NPU: request for SoC reset | ResetSOC | Warning | The SoC needs to be reset. | Collect the required information and reset the SoC. | Services may be interrupted. |
| NPU: request for AI process restart | RestartAIProcess | Warning | The AI process needs to be restarted. | Collect the required information and restart the AI process. | The current AI task will be interrupted. |
| NPU: error codes | NPUErrorCodeWarning | Major | There are a large number of NPU error codes indicating major or higher-level errors. You can locate the faults based on the error codes. | Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition. | Services may be interrupted. |
| nvidia-smi suspended | nvidiaSmiHangEvent | Major | nvidia-smi timed out. | If services are affected, submit a service ticket. | The driver may report an error during service running. |
| nv_peer_mem loading error | NvPeerMemException | Minor | The NVLink or nv_peer_mem module cannot be loaded. | Restore or reinstall the NVLink software. | nv_peer_mem cannot be used. |
| Fabric Manager error | NvFabricManagerException | Minor | The BMS meets the NVLink conditions and NVLink is installed, but Fabric Manager is abnormal. | Restore or reinstall the NVLink software. | NVLink cannot be used normally. |
| IB card error | InfinibandStatusException | Major | The IB card or its physical status is abnormal. | Transfer this issue to the hardware team. | The IB card cannot work normally. |
| GPU throttle alarm | gpuClocksThrottleReasonsAlarm | Warning | The GPU clock frequency was reduced (throttled). | Check whether the clock frequency decrease is caused by hardware faults. If so, transfer the issue to the hardware team. | The GPU slows down, providing less compute power. |
| Pending page retirement for GPU DRAM ECC | gpuRetiredPagesPendingAlarm | Major | GPU memory pages are pending retirement due to DRAM ECC errors. | | The GPU cannot work properly. |
| Pending row remapping for GPU DRAM ECC | gpuRemappedRowsAlarm | Major | Some rows in the GPU memory have errors and need to be remapped to spare resources. | | The GPU cannot work properly. |
| Insufficient resources for GPU DRAM ECC row remapping | gpuRowRemapperResourceAlarm | Major | Spare resources for remapping faulty GPU memory rows are insufficient. | Transfer the issue to the hardware team. | The GPU cannot work properly. |
| Correctable GPU DRAM ECC error | gpuDRAMCorrectableEccError | Major | A correctable ECC error occurred in the GPU DRAM. | | The GPU may not work properly. |
| Uncorrectable GPU DRAM ECC error | gpuDRAMUncorrectableEccError | Major | An uncorrectable ECC error occurred in the GPU DRAM. | | The GPU may not work properly. |
| Inconsistent GPU kernel versions | gpuKernelVersionInconsistencyAlarm | Major | The running kernel does not match the kernel the GPU driver was compiled against. During driver installation, the GPU driver is compiled against the kernel in use at that time. If the versions are inconsistent, the kernel was changed after the driver installation; the driver becomes unavailable and must be reinstalled. | | The GPU cannot work properly. |
| GPU monitoring dependency not met | gpuCheckEnvFailedAlarm | Major | The plug-in cannot identify the GPU driver library path. | | GPU metrics cannot be collected. |
| Initialization failure of the GPU monitoring driver library | gpuDriverInitFailedAlarm | Major | The GPU driver is unavailable. | Run nvidia-smi to check whether the driver is unavailable. If it is, reinstall it by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS. | GPU metrics cannot be collected. |
| Initialization timeout of the GPU monitoring driver library | gpuDriverInitTAlarm | Major | The GPU driver initialization timed out (exceeded 10s). | | GPU metrics cannot be collected. |
| GPU metric collection timeout | gpuCollectMetricTimeoutAlarm | Major | GPU metric collection timed out (exceeded 10s). | | GPU monitoring metric data is missing, and subsequent metrics may fail to be collected. |
| GPU handle lost | gpuDeviceHandleLost | Major | GPU metric information cannot be obtained, and the GPU may be lost. | | All metrics of the GPU are lost. |
| Failed to listen to the XID of the GPU | gpuDeviceXidLost | Major | Failed to listen on the XID metric. | | XID-related metrics of the GPU cannot be obtained. |
| Multiple NPU HBM ECC errors | NpuHbmMultiEccInfo | Warning | There are NPU HBM ECC errors. | This event is only a reference for other events and does not need to be handled separately. | The NPU may not work properly. |
| Read-only file system in OS | ReadOnlyFileSystem | Critical | The file system %s is read-only. | Check the disk health status. | Files cannot be written. |
| NPU: driver and firmware not matching | NpuDriverFirmwareMismatch | Major | The NPU driver and firmware do not match. | Obtain the matching versions from the Ascend official website and reinstall them. | NPUs cannot be used. |
| NPU: Docker container environment check | NpuContainerEnvSystem | Major | Docker was unavailable. | Check whether Docker is running normally. | Docker cannot be used. |
| | | Major | The container plug-in Ascend-Docker-Runtime was not installed. | Install the container plug-in Ascend-Docker-Runtime; otherwise, containers cannot use Ascend cards. | NPUs cannot be attached to Docker containers. |
| | | Major | IP forwarding was not enabled in the OS. | Check the net.ipv4.ip_forward setting in the /etc/sysctl.conf file. | Docker containers have network communication problems. |
| | | Major | The shared memory of the container was too small. | The default shared memory is 64 MB and can be increased as needed. Method 1: Modify the default-shm-size field in the /etc/docker/daemon.json configuration file. Method 2: Use the --shm-size parameter of the docker run command to set the shared memory size of a container. | Distributed training will fail due to insufficient shared memory. |
| NPU: RoCE NIC down | RoCELinkStatusDown | Major | The RoCE link of NPU card %d was down. | Check the NPU RoCE network port status. | The NPU NIC is unavailable. |
| NPU: RoCE NIC health status abnormal | RoCEHealthStatusError | Major | The RoCE network health status of NPU %d was abnormal. | Check the health status of the NPU RoCE NIC. | The NPU NIC is unavailable. |
| NPU: RoCE NIC configuration file /etc/hccn.conf not found | HccnConfNotExisted | Major | The RoCE NIC configuration file /etc/hccn.conf was not found. | Check whether the NIC configuration file /etc/hccn.conf exists. | The RoCE NIC is unavailable. |
| GPU: basic components abnormal | GpuEnvironmentSystem | Major | The nvidia-smi command was abnormal. | Check whether the GPU driver is normal. | The GPU driver is unavailable. |
| | | Major | The nvidia-fabricmanager version was inconsistent with the GPU driver version. | Check the GPU driver version and the nvidia-fabricmanager version. | nvidia-fabricmanager cannot work properly, affecting GPU usage. |
| | | Major | The container add-on nvidia-container-toolkit was not installed. | Install the container add-on nvidia-container-toolkit. | GPUs cannot be attached to Docker containers. |
| Local disk attachment inspection | MountDiskSystem | Major | The /etc/fstab file contains invalid UUIDs. | Ensure that the UUIDs in the /etc/fstab configuration file are correct; otherwise, the server may fail to restart. | The disk attachment process fails, preventing the server from restarting. |
| GPU: incorrectly configured dynamic route for Ant series server | GpuRouteConfigError | Major | The dynamic route of NIC %s of an Ant series server was not configured or was incorrectly configured. CMD [ip route]: %s \| CMD [ip route show table all]: %s. | Configure the RoCE NIC route correctly. | NPU network communication will be interrupted. |
| NPU: RoCE port not split | RoCEUdpConfigError | Major | The RoCE UDP port was not split. | Check the RoCE UDP port configuration on the NPU. | The communication performance of NPUs is affected. |
| Warning of automatic system kernel upgrade | KernelUpgradeWarning | Major | The system kernel was automatically upgraded. Old version: %s; new version: %s. | A system kernel upgrade may cause AI software exceptions. Check the system update logs and prevent the server from restarting. | The AI software may be unavailable. |
| NPU environment command detection | NpuToolsWarning | Major | The hccn_tool was unavailable. | Check whether the NPU driver is normal. | The IP address and gateway of the RoCE NIC cannot be configured. |
| | | Major | The npu-smi was unavailable. | Check whether the NPU driver is normal. | NPUs cannot be used. |
| | | Major | The ascend-dmi was unavailable. | Check whether the ToolBox is properly installed. | ascend-dmi cannot be used for performance analysis. |
| Warning of an NPU driver exception | NpuDriverAbnormalWarning | Major | The NPU driver was abnormal. | Reinstall the NPU driver. | NPUs cannot be used. |
| GPU: invalid RoCE NIC configuration | GpuRoceNicConfigIncorrect | Major | The RoCE NIC of the GPU is incorrectly configured. | Contact O&M personnel. | The parameter-plane network is abnormal, preventing multi-node tasks from running. |
| Local disk replacement to be authorized | localdisk_recovery_inquiring | Major | The local disk is faulty. Local disk replacement authorization is in progress. | Authorize the local disk replacement. | Local disks are unavailable. |
| Local disks being replaced | localdisk_recovery_executing | Major | Local disks are faulty and being replaced. | After the replacement is complete, check whether the local disks are available. | Local disks are unavailable. |
| Local disks replaced | localdisk_recovery_completed | Major | Local disks were faulty and have been replaced. | Wait until the services are running properly, and check whether the local disks are available. | None |
| Local disk replacement failed | localdisk_recovery_failed | Major | Local disks are faulty and failed to be replaced. | Contact O&M personnel. | Local disks are unavailable. |
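The NpuContainerEnvSystem checks for IP forwarding and container shared memory can be verified from the host. A minimal sketch, assuming a Linux host where Docker reads its global configuration from /etc/docker/daemon.json (the usual path, but distribution-dependent):

```shell
#!/bin/sh
# Sketch of two host-side checks behind the NpuContainerEnvSystem event.

# 1) IP forwarding must be enabled for Docker container networking.
#    Expect 1; if it is 0, set net.ipv4.ip_forward = 1 in /etc/sysctl.conf.
ip_forward=$(cat /proc/sys/net/ipv4/ip_forward 2>/dev/null || echo unknown)
echo "net.ipv4.ip_forward = ${ip_forward}"

# 2) Shared memory: the 64 MB container default is too small for
#    distributed training. Raise it per container:
#      docker run --shm-size=8g ...
#    or globally via /etc/docker/daemon.json:
#      { "default-shm-size": "8G" }
```

A container restart (or Docker daemon restart, for the global setting) is needed before the new shared-memory size takes effect.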
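For the MountDiskSystem inspection, invalid UUIDs in /etc/fstab can be spotted before a reboot with a read-only script. This is an illustrative sketch, not an official tool; it uses the standard util-linux blkid command and takes the file to check as an optional argument:

```shell
#!/bin/sh
# List each UUID referenced in an fstab-style file and report whether a
# block device with that UUID exists on the host (read-only, safe to run).
FSTAB="${1:-/etc/fstab}"
grep -o 'UUID=[^[:space:]]*' "$FSTAB" | cut -d= -f2- | while read -r uuid; do
    if blkid -U "$uuid" >/dev/null 2>&1; then
        echo "OK      $uuid"
    else
        echo "MISSING $uuid"   # a stale UUID here can hang the next boot
    fi
done
```

Any MISSING line means the corresponding /etc/fstab entry should be corrected or commented out before the server is restarted.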
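The gpuKernelVersionInconsistencyAlarm row notes that a GPU driver compiled against an older kernel stops working after the kernel changes. One way to sketch that comparison on the host (the "nvidia" module name and modinfo's vermagic field are common conventions, not guaranteed on every system):

```shell
#!/bin/sh
# Illustrative kernel-vs-driver comparison for gpuKernelVersionInconsistencyAlarm.
running=$(uname -r)
echo "running kernel: $running"
# modinfo reports the kernel a module was built for in its vermagic field;
# this fails harmlessly when no nvidia module is installed.
built_for=$(modinfo -F vermagic nvidia 2>/dev/null | awk '{print $1}')
if [ -z "$built_for" ]; then
    echo "nvidia module not found; nothing to compare"
elif [ "$built_for" = "$running" ]; then
    echo "kernel and driver match"
else
    echo "MISMATCH: driver built for $built_for; reinstall the driver"
fi
```

A mismatch means the driver must be reinstalled (recompiled) against the current kernel, as the table's description states.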