Updated on 2025-07-22 GMT+08:00

Monitored BMS Events

Description

Event monitoring provides event data reporting, query, and alarm reporting. You can create alarm rules for both system events and custom events. When specific events occur, Cloud Eye generates alarms for you.

Namespace

SYS.BMS

Monitored Events

Table 1 Monitored BMS events

Event Name

Event ID

Event Severity

Description

Handling Solution

Impact

ECC uncorrectable error alarm generated on GPU SRAM

SRAMUncorrectableEccError

Major

There are ECC uncorrectable errors generated on GPU SRAM.

If services are affected, submit a service ticket.

The GPU hardware may be faulty. As a result, the SRAM is faulty, and services exit abnormally.

BMS restarted

osReboot

Major

The BMS was restarted:

  • On the management console
  • By calling APIs

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

Services are interrupted.

Unexpected restart

serverReboot

Major

The BMS was restarted unexpectedly due to:

  • OS faults
  • Hardware faults

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

Services are interrupted.

BMS stopped

osShutdown

Major

The BMS was stopped:

  • On the management console
  • By calling APIs

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

Services are interrupted.

BMS unexpected shutdown

serverShutdown

Major

The BMS was stopped unexpectedly due to:

  • Unexpected power-off
  • Hardware faults

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

Services are interrupted.

Network disconnection

linkDown

Major

The BMS network was disconnected due to:

  • Unexpected BMS stop or restart
  • Switch faults
  • Gateway faults

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

Services are interrupted.

PCIe error

pcieError

Major

A PCIe device on the BMS was faulty. Possible causes include:

  • Mainboard faults
  • PCIe device faults

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

The network or disk read/write services are affected.

Disk fault

diskError

Major

A disk on the BMS was faulty. Possible causes include:

  • Disk backplane faults
  • Disk faults

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

Data read/write services are affected, or the BMS cannot be started.

EVS error

storageError

Major

The BMS failed to connect to EVS disks due to:

  • SDI card faults
  • Remote storage device faults

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

Data read/write services are affected, or the BMS cannot be started.

Inforom alarm generated on GPU

gpuInfoROMAlarm

Major

The infoROM of the GPU is abnormal. The infoROM is an important storage area of the GPU firmware and stores key data loaded during startup.

Non-critical services can continue to use the GPU. For critical services, submit a service ticket to resolve this issue.

  1. Restart the VM and check that the issue is not caused by a temporary cache or communication error.
  2. If the fault persists after the restart, the hardware may be faulty. Submit a service ticket to check whether the GPU needs to be replaced.

Services are not affected in most cases. However, if ECC errors are reported on the GPU, faulty pages may not be automatically retired, and services are affected.

Double-bit ECC alarm generated on GPU

doubleBitEccError

Major

A double-bit error occurs in the ECC memory of the GPU. ECC cannot correct double-bit errors, which may cause programs to crash.

  1. If services are interrupted, restart the services.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Services may be interrupted. After faulty pages are retired, the GPU can continue to be used.

Too many retired pages

gpuTooManyRetiredPagesAlarm

Major

An ECC page retirement error occurred on the GPU. When an uncorrectable ECC error occurs on a GPU memory page, the GPU marks the page as retired.

If services are affected, submit a service ticket.

If there are too many ECC errors, services may be affected. If too many pages are retired and the GPU memory capacity decreases significantly, system performance may deteriorate and the system may become unstable.
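
For reference, a minimal sketch for checking retired pages on pre-Ampere GPUs (assuming nvidia-smi is installed; query field names may vary with the driver version):

    # Pages retired due to single-bit and double-bit ECC errors, plus pending retirements
    nvidia-smi --query-gpu=index,retired_pages.sbe,retired_pages.dbe,retired_pages.pending --format=csv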

ECC alarm generated on GPU A100

gpuA100EccAlarm

Major

An ECC error occurred on the GPU.

  1. If services are interrupted, restart the services.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Services may be interrupted. After faulty pages are retired, the GPU can continue to be used.

ECC alarm generated on GPU Ant1

gpuAnt1EccAlarm

Major

An ECC error occurred on the GPU.

  1. If services are interrupted, restart the services.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Services may be interrupted. After faulty pages are retired, the GPU can continue to be used.

GPU ECC memory page retirement failure

eccPageRetirementRecordingFailure

Major

Automatic page retirement failed due to ECC errors.

  1. If services are interrupted, restart the services.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Services may be interrupted, and memory page retirement fails. As a result, services can no longer use the GPU.

GPU ECC page retirement alarm generated

eccPageRetirementRecordingEvent

Minor

Memory pages are automatically retired due to ECC errors.

  1. If services are interrupted, restart the services.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Generally, this alarm is generated together with the ECC error alarm. If this alarm is generated independently, services are not affected.

Too many single-bit ECC errors on GPU

highSingleBitEccErrorRate

Major

There are too many single-bit errors occurring in the ECC memory of the GPU.

  1. If services are interrupted, restart the services.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Single-bit errors can be automatically rectified. These errors generally do not affect GPU-related applications.

GPU card not found

gpuDriverLinkFailureAlarm

Major

The GPU link is normal, but the GPU cannot be found by the NVIDIA driver.

  1. Try restarting the VM to restore your services.
  2. If services still cannot be restored, submit a service ticket.

The GPU cannot be found.

GPU link faulty

gpuPcieLinkFailureAlarm

Major

GPU hardware information cannot be queried through lspci due to a GPU link fault.

If services are affected, submit a service ticket.

The driver cannot use the GPU.
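
For reference, a minimal sketch for checking whether the GPU is visible on the PCIe bus (assuming lspci is available):

    # List NVIDIA devices on the PCIe bus; a missing entry or "rev ff" indicates a link fault
    lspci | grep -i nvidia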

VM GPU lost

vmLostGpuAlarm

Major

The number of GPUs on the VM is less than the number specified in the specifications.

If services are affected, submit a service ticket.

GPUs are lost.
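
For reference, a minimal sketch for comparing the number of visible GPUs with the flavor specifications (assuming nvidia-smi is installed):

    # List the GPUs recognized by the driver and count them
    nvidia-smi -L
    nvidia-smi -L | wc -l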

GPU memory page faulty

gpuMemoryPageFault

Major

The GPU memory page is faulty, which may be caused by applications, drivers, or hardware.

If services are affected, submit a service ticket.

The GPU hardware may be faulty. As a result, the GPU memory is faulty, and services exit abnormally.

GPU image engine faulty

graphicsEngineException

Major

The GPU image engine is faulty, which may be caused by applications, drivers, or hardware.

If services are affected, submit a service ticket.

The GPU hardware may be faulty. As a result, the image engine is faulty, and services exit abnormally.

GPU temperature too high

highTemperatureEvent

Major

The GPU temperature is too high.

If services are affected, submit a service ticket.

If the GPU temperature exceeds the threshold, the GPU performance may deteriorate.

GPU NVLink faulty

nvlinkError

Major

A hardware fault occurs on the NVLink.

If services are affected, submit a service ticket.

The NVLink link is faulty and unavailable.

System maintenance inquiring

system_maintenance_inquiring

Major

The scheduled BMS maintenance task is being inquired.

Authorize the maintenance.

None

System maintenance waiting

system_maintenance_scheduled

Major

The scheduled BMS maintenance task is waiting to be executed.

Clarify the impact on services during the execution window.

None

System maintenance canceled

system_maintenance_canceled

Major

The scheduled BMS maintenance is canceled.

None

None

System maintenance executing

system_maintenance_executing

Major

BMSs are being maintained as scheduled.

After the maintenance is complete, check whether services are affected.

Services are interrupted.

System maintenance completed

system_maintenance_completed

Major

The scheduled BMS maintenance is completed.

Wait until the BMSs become available and check whether services recover.

None

System maintenance failure

system_maintenance_failed

Major

The scheduled BMS maintenance task failed.

Contact O&M personnel.

Services are interrupted.

GPU Xid error

commonXidError

Major

An Xid event alarm was generated on the GPU.

If services are affected, submit a service ticket.

An Xid error is caused by GPU hardware, driver, or application problems, which may result in abnormal service exit.

NPU: device not found by npu-smi info

NPUSMICardNotFound

Major

The Ascend driver is faulty, or the NPU is disconnected.

Transfer this issue to the Ascend or hardware team for handling.

The NPU cannot be used normally.

NPU: PCIe link error

PCIeErrorFound

Major

The lspci command returns rev ff, indicating that the NPU is abnormal.

Restart the BMS. If the issue persists, transfer it to the hardware team for processing.

The NPU cannot be used normally.
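
For reference, a minimal sketch for checking the NPU PCIe status described above (assuming lspci and the Ascend tools are available; the grep keyword is an assumption and may differ for your devices):

    # NPU devices reporting "rev ff" indicate an abnormal PCIe link
    lspci | grep -i huawei
    # Cross-check with the NPU management tool
    npu-smi info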

NPU: device not found by lspci

LspciCardNotFound

Major

The NPU is disconnected.

Transfer this issue to the hardware team for handling.

The NPU cannot be used normally.

NPU: overtemperature

TemperatureOverUpperLimit

Major

The temperature of DDR or software is too high.

Stop services, restart the BMS, check the heat dissipation system, and reset the devices.

The BMS may be powered off, and devices may not be found.

NPU: uncorrectable ECC error

UncorrectableEccErrorCount

Major

There are uncorrectable ECC errors on the NPU.

If services are affected, replace the NPU with another one.

Services may be interrupted.

NPU: request for BMS restart

RebootVirtualMachine

Warning

The BMS needs to be restarted.

Collect the required information and restart the BMS.

Services may be interrupted.

NPU: request for SoC reset

ResetSOC

Warning

The SoC needs to be reset.

Collect the required information and reset the SoC.

Services may be interrupted.

NPU: request for restart AI process

RestartAIProcess

Warning

The AI process needs to be restarted.

Collect the required information and restart the AI process.

The current AI task will be interrupted.

NPU: error codes

NPUErrorCodeWarning

Major

There are a large number of NPU error codes indicating major or higher-level errors. You can further locate the faults based on the error codes.

Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition.

Services may be interrupted.

nvidia-smi suspended

nvidiaSmiHangEvent

Major

nvidia-smi timed out.

If services are affected, submit a service ticket.

The driver may report an error during service running.

nv_peer_mem loading error

NvPeerMemException

Minor

The NVLink or nv_peer_mem cannot be loaded.

Restore or reinstall the NVLink.

nv_peer_mem cannot be used.

Fabric Manager error

NvFabricManagerException

Minor

The BMS meets the NVLink conditions and NVLink is installed, but Fabric Manager is abnormal.

Restore or reinstall the NVLink.

NVLink cannot be used normally.

IB card error

InfinibandStatusException

Major

The IB card or its physical status is abnormal.

Transfer this issue to the hardware team for handling.

The IB card cannot work normally.

GPU throttle alarm

gpuClocksThrottleReasonsAlarm

Warning

  1. The GPU power may exceed the maximum operating power threshold (continuous full load). The clock frequency automatically decreases to prevent the GPU from being damaged.
  2. The GPU temperature may exceed the maximum operating temperature threshold (continuous full load). The clock frequency automatically decreases to reduce heat.
  3. The GPU may remain idle, with the clock frequency automatically decreasing to reduce power consumption.
  4. Hardware faults may cause a decrease in clock frequency.

Check whether the clock frequency decrease is caused by hardware faults. If yes, transfer it to the hardware team.

The GPU clock frequency decreases, reducing compute performance.

Pending page retirement for GPU DRAM ECC

gpuRetiredPagesPendingAlarm

Major

  1. An ECC error occurred on the hardware. DRAM pages need to be retired.
  2. An uncorrectable ECC error occurred on a GPU memory page, and the page needs to be retired. However, the retirement is pending and has not been completed yet.

  1. View the event details and check whether the value of retired_pages.pending is yes.
  2. Restart the GPU to complete the retirement automatically.

The GPU cannot work properly.
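
For reference, a minimal sketch for checking the pending-retirement flag mentioned above (assuming nvidia-smi is installed):

    # "yes" means pages are waiting to be retired; restarting the GPU completes the retirement
    nvidia-smi --query-gpu=index,retired_pages.pending --format=csv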

Pending row remapping for GPU DRAM ECC

gpuRemappedRowsAlarm

Major

Some rows in the GPU memory have errors and need to be remapped. The faulty rows must be mapped to standby resources.

  1. View the event metric "RemappedRow" to check whether any rows have been remapped.
  2. Restart the GPU to complete the remapping automatically.

The GPU cannot work properly.
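
For reference, a minimal sketch for checking row-remapping counters (assuming an Ampere or later GPU and a driver whose nvidia-smi supports the remapped-rows query):

    # Correctable/uncorrectable remapped rows, pending remappings, and remapping failures per GPU
    nvidia-smi --query-remapped-rows=gpu_bus_id,remapped_rows.correctable,remapped_rows.uncorrectable,remapped_rows.pending,remapped_rows.failure --format=csv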

Insufficient resources for GPU DRAM ECC row remapping

gpuRowRemapperResourceAlarm

Major

  1. This event occurs on GPUs (Ampere and later architectures).
  2. The standby GPU memory row resources are exhausted, so row remapping cannot be continued.

Transfer the issue to the hardware team.

The GPU cannot work properly.

Correctable GPU DRAM ECC error

gpuDRAMCorrectableEccError

Major

  1. This event occurs on GPUs of the Ampere or later architectures.
  2. A correctable ECC error occurs in the DRAM of the GPU. The ECC mechanism automatically rectifies the error, and programs are not affected.

  1. View the event metric "ecc.errors.corrected.volatile" to check whether there are any correctable ECC error values.
  2. Restart the GPU to retire the faulty pages automatically.

The GPU may not work properly.

Uncorrectable GPU DRAM ECC error

gpuDRAMUncorrectableEccError

Major

  1. This event occurs on GPUs of the Ampere or later architectures.
  2. An uncorrectable ECC error occurs in the DRAM of the GPU. The error cannot be corrected automatically by the ECC mechanism, affects system stability, and may cause programs to crash.

  1. View the event metric "ecc.errors.uncorrected.volatile" to check whether there are any uncorrectable ECC error values.
  2. Restart the GPU to retire the faulty pages automatically.

The GPU may not work properly.
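
For reference, a minimal sketch for checking the volatile ECC counters referenced in this event and in the correctable-error event above (assuming nvidia-smi is installed):

    # Volatile (since last driver load) corrected and uncorrected ECC error counts per GPU
    nvidia-smi --query-gpu=index,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv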

Inconsistent GPU kernel versions

gpuKernelVersionInconsistencyAlarm

Major

Inconsistent GPU kernel versions.

During driver installation, the GPU driver is compiled against the kernel running at that time. If the kernel versions are detected as inconsistent, the kernel has been changed after the driver was installed. In this case, the driver may become unavailable and needs to be reinstalled.

  1. Run the following commands to rectify the issue:

    rmmod nvidia_drm
    rmmod nvidia_modeset
    rmmod nvidia

    Then, run nvidia-smi. If the command output is normal, the issue has been rectified.

  2. If the preceding solution does not work, rectify the fault by referring to Why Is the GPU Driver Unavailable?

The GPU cannot work properly.

GPU monitoring dependency not met

gpuCheckEnvFailedAlarm

Major

The plug-in cannot identify the GPU driver library path.

  1. Check whether the driver is installed.
  2. Check whether the driver installation directory has been customized. The driver needs to be installed in the default installation directory /usr/bin/.

The GPU metrics cannot be collected.

Initialization failure of the GPU monitoring driver library

gpuDriverInitFailedAlarm

Major

The GPU driver is unavailable.

Run nvidia-smi to check whether the driver is available. If the driver is unavailable, reinstall it by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.

The GPU metrics cannot be collected.

Initialization timeout of the GPU monitoring driver library

gpuDriverInitTAlarm

Major

The GPU driver initialization timed out (exceeding 10s).

  1. If the driver is not installed, install it by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.
  2. If the driver is installed, run nvidia-smi to check whether the driver is available. If the driver is unavailable, reinstall the driver by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.
  3. If the driver is properly installed, check whether the high-performance mode is enabled. If not, run nvidia-smi -pm 1 to enable it. P0 indicates the high-performance mode.

The GPU metrics cannot be collected.
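
For reference, a minimal sketch for checking and enabling the mode mentioned in step 3 (assuming nvidia-smi is installed; nvidia-smi -pm controls persistence mode, which this document refers to as the high-performance mode):

    # Check the current persistence mode and performance state of each GPU
    nvidia-smi --query-gpu=index,persistence_mode,pstate --format=csv
    # Enable persistence mode on all GPUs
    nvidia-smi -pm 1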

GPU metric collection timeout

gpuCollectMetricTimeoutAlarm

Major

The GPU metric collection timed out (exceeding 10s).

  1. If the library API timed out, run nvidia-smi to check whether the driver is available. If the driver is unavailable, reinstall the driver by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.
  2. If the command execution timed out, check the system logs and determine whether there is an issue with the system.

GPU monitoring metric data is missing. As a result, subsequent metrics may fail to be collected.

GPU handle lost

gpuDeviceHandleLost

Major

The GPU metric information cannot be obtained, and the GPU may be lost.

  1. Run nvidia-smi to check whether there are any errors reported.
  2. Run nvidia-smi -L to check whether the number of GPUs is the same as the server specifications.
  3. Submit a service ticket to contact on-call support.

All metrics of the GPU are lost.

Failed to listen to the XID of the GPU

gpuDeviceXidLost

Major

Failed to listen to the XID metric.

  1. Check whether the GPU is lost or damaged.
  2. Submit a service ticket to contact on-call support.

Failed to obtain XID-related metrics of the GPU.

Multiple NPU HBM ECC errors

NpuHbmMultiEccInfo

Warning

There are NPU HBM ECC errors.

This event is only a reference for other events. You do not need to handle it separately.

The NPU may not work properly.

ReadOnly issues in OS

ReadOnlyFileSystem

Critical

The file system %s is read-only.

Check the disk health status.

The files cannot be written.

NPU: driver and firmware not matching

NpuDriverFirmwareMismatch

Major

The NPU's driver and firmware do not match.

Obtain the matched version from the Ascend official website and reinstall it.

NPUs cannot be used.

NPU: Docker container environment check

NpuContainerEnvSystem

Major

Docker was unavailable.

Check whether Docker is running properly.

Docker cannot be used.

Major

The container plug-in Ascend-Docker-Runtime was not installed.

Install the container plug-in Ascend-Docker-Runtime. Otherwise, containers cannot use Ascend cards.

NPUs cannot be attached to Docker containers.

Major

IP forwarding was not enabled in the OS.

Check the net.ipv4.ip_forward configuration in the /etc/sysctl.conf file.

Docker containers have network communication problems.
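
For reference, a minimal sketch for checking and enabling IP forwarding as described above (assuming root privileges):

    # Check the current value; 1 means IP forwarding is enabled
    sysctl net.ipv4.ip_forward
    # Enable it persistently, then reload the configuration
    echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf
    sysctl -p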

Major

The shared memory of the container was too small.

The default shared memory is 64 MB, which can be modified as needed.

Method 1

Modify the default-shm-size field in the /etc/docker/daemon.json configuration file.

Method 2

Use the --shm-size parameter in the docker run command to set the shared memory size of a container.

Distributed training will fail due to insufficient shared memory.
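
For reference, a minimal sketch of the two methods above (the 2g value is only an example; size the shared memory according to your workload):

    # Method 1: set the daemon-wide default in /etc/docker/daemon.json, then restart Docker
    #   { "default-shm-size": "2g" }
    systemctl restart docker

    # Method 2: set the shared memory size for a single container at run time
    docker run --shm-size=2g <image>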

NPU: RoCE NIC down

RoCELinkStatusDown

Major

The RoCE link of NPU card %d was down.

Check the NPU RoCE network port status.

The NPU NIC is unavailable.

NPU: RoCE NIC health status abnormal

RoCEHealthStatusError

Major

The RoCE network health status of NPU %d was abnormal.

Check the health status of the NPU RoCE NIC.

The NPU NIC is unavailable.

NPU: RoCE NIC configuration file /etc/hccn.conf not found

HccnConfNotExisted

Major

The RoCE NIC configuration file /etc/hccn.conf was not found.

Check whether the NIC configuration file /etc/hccn.conf can be found.

The RoCE NIC is unavailable.

GPU: basic components abnormal

GpuEnvironmentSystem

Major

The nvidia-smi command was abnormal.

Check whether the GPU driver is normal.

The GPU driver is unavailable.

Major

The nvidia-fabricmanager version was inconsistent with the GPU driver version.

Check the GPU driver version and nvidia-fabricmanager version.

The nvidia-fabricmanager cannot work properly, affecting GPU usage.

Major

The container add-on nvidia-container-toolkit was not installed.

Install the container add-on nvidia-container-toolkit.

GPUs cannot be attached to Docker containers.

Local disk attachment inspection

MountDiskSystem

Major

The /etc/fstab file contains invalid UUIDs.

Ensure that the UUIDs in the /etc/fstab configuration file are correct. Otherwise, the server may fail to restart.

The disk attachment process fails, preventing the server from restarting.
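
For reference, a minimal sketch for cross-checking the UUIDs in /etc/fstab against the actual block devices (assuming blkid is available):

    # UUIDs of the block devices currently present on the server
    blkid
    # UUID entries configured for attachment at startup
    grep UUID /etc/fstab
    # Verify that all fstab entries can be mounted without errors
    mount -a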

GPU: incorrectly configured dynamic route for Ant series server

GpuRouteConfigError

Major

The dynamic route of the NIC %s of an Ant series server was not configured or was incorrectly configured. CMD [ip route]: %s | CMD [ip route show table all]: %s.

Configure the RoCE NIC route correctly.

The NPU network communication will be interrupted.

NPU: RoCE port not split

RoCEUdpConfigError

Major

The RoCE UDP port was not split.

Check the RoCE UDP port configuration on the NPU.

The communication performance of NPUs is affected.

Warning of automatic system kernel upgrade

KernelUpgradeWarning

Major

Warning of automatic system kernel upgrade. Old version: %s; new version: %s.

A system kernel upgrade may cause AI software exceptions. Check the system update logs and prevent the server from being restarted.

The AI software may be unavailable.
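
For reference, a minimal sketch for checking whether the running kernel matches the installed kernels (the package commands are assumptions for RPM-based images such as EulerOS or CentOS; dpkg/apt-based images differ):

    # Kernel currently running
    uname -r
    # Kernels installed on the system (RPM-based OS)
    rpm -q kernel
    # Recent package update history (RPM-based OS)
    cat /var/log/yum.log 2>/dev/null || dnf history 2>/dev/null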

NPU environment command detection

NpuToolsWarning

Major

The hccn_tool was unavailable.

Check whether the NPU driver is normal.

The IP address and gateway of the RoCE NIC cannot be configured.

Major

The npu-smi was unavailable.

Check whether the NPU driver is normal.

NPUs cannot be used.

Major

The ascend-dmi was unavailable.

Check whether ToolBox is properly installed.

ascend-dmi cannot be used for performance analysis.

Warning of an NPU driver exception

NpuDriverAbnormalWarning

Major

The NPU driver was abnormal.

Reinstall the NPU driver.

NPUs cannot be used.

GPU: invalid RoCE NIC configuration

GpuRoceNicConfigIncorrect

Major

The RoCE NIC of the GPU is incorrectly configured.

Contact O&M personnel.

The parameter-plane network is abnormal, preventing multi-node tasks from being executed.

Local disk replacement to be authorized

localdisk_recovery_inquiring

Major

The local disk is faulty. Local disk replacement authorization is in progress.

Authorize local disk replacement.

Local disks are unavailable.

Local disks being replaced

localdisk_recovery_executing

Major

Local disks are faulty and being replaced.

When the replacement is complete, check whether the local disks are available.

Local disks are unavailable.

Local disks replaced

localdisk_recovery_completed

Major

Local disks are faulty and have been replaced.

Wait until the services are running properly and check whether local disks are available.

None

Local disk replacement failed

localdisk_recovery_failed

Major

Local disks are faulty and fail to be replaced.

Contact O&M personnel.

Local disks are unavailable.