Updated on 2025-07-22 GMT+08:00

Monitored BMS Events

Description

Event monitoring provides event data reporting, query, and alarm reporting. You can create alarm rules for both system events and custom events. When specific events occur, Cloud Eye generates alarms for you.

Namespace

SYS.BMS

Monitored Events

Table 1 Monitored BMS events

Event Name

Event ID

Event Severity

Description

Handling Solution

Impact

ECC uncorrectable error alarm generated on GPU SRAM

SRAMUncorrectableEccError

Major

There are ECC uncorrectable errors generated on GPU SRAM.

If services are affected, submit a service ticket.

The GPU hardware may be faulty. As a result, the SRAM is faulty, and services exit abnormally.

BMS restarted

osReboot

Major

The BMS was restarted:

  • On the management console
  • By calling APIs

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

Services are interrupted.

Unexpected restart

serverReboot

Major

The BMS was restarted unexpectedly due to:

  • OS faults
  • Hardware faults

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

Services are interrupted.

BMS stopped

osShutdown

Major

The BMS was stopped:

  • On the management console
  • By calling APIs

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

Services are interrupted.

BMS unexpected shutdown

serverShutdown

Major

The BMS was stopped unexpectedly due to:

  • Unexpected power-off
  • Hardware faults

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

Services are interrupted.

Network disconnection

linkDown

Major

The BMS network was disconnected due to:

  • Unexpected BMS stop or restart
  • Switch faults
  • Gateway faults

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

Services are interrupted.

PCIe error

pcieError

Major

A PCIe device on the BMS was faulty. Possible causes include:

  • Mainboard faults
  • PCIe device faults

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

The network or disk read/write services are affected.

Disk fault

diskError

Major

A disk on the BMS was faulty. Possible causes include:

  • Disk backplane faults
  • Disk faults

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

Data read/write services are affected, or the BMS cannot be started.

EVS error

storageError

Major

The BMS failed to connect to EVS disks due to:

  • SDI card faults
  • Remote storage device faults

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

Data read/write services are affected, or the BMS cannot be started.

Inforom alarm generated on GPU

gpuInfoROMAlarm

Major

The infoROM of the GPU is abnormal. The infoROM is an important storage area of the GPU firmware and stores key data loaded during startup.

Non-critical services can continue to use the GPU. For critical services, submit a service ticket to resolve this issue.

  1. Restart the VM and check that the issue is not caused by a temporary cache or communication error.
  2. If the fault persists after the restart, the hardware may be faulty. Submit a service ticket to check whether the GPU needs to be replaced.

Services are not affected in most cases. However, if ECC errors are reported on the GPU, faulty pages may not be automatically retired, and services are affected.

Double-bit ECC alarm generated on GPU

doubleBitEccError

Major

A double-bit error occurs in the ECC memory of the GPU. ECC cannot correct double-bit errors, which may cause programs to crash.

  1. If services are interrupted, restart the services.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Services may be interrupted. After faulty pages are retired, the GPU can continue to be used.

Too many retired pages

gpuTooManyRetiredPagesAlarm

Major

An ECC page retirement error occurred on the GPU. When an uncorrectable ECC error occurs on a GPU memory page, the GPU marks the page as retired.

If services are affected, submit a service ticket.

If there are too many ECC errors, services may be affected. If too many pages are retired and the GPU memory capacity decreases significantly, system performance may deteriorate and the system may become unstable.
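
For reference, a minimal sketch for checking retired pages on pre-Ampere GPUs (assuming nvidia-smi is installed; query field names may vary with the driver version):

    # Pages retired due to single-bit and double-bit ECC errors, plus pending retirements
    nvidia-smi --query-gpu=index,retired_pages.sbe,retired_pages.dbe,retired_pages.pending --format=csv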

ECC alarm generated on GPU A100

gpuA100EccAlarm

Major

An ECC error occurred on the GPU.

  1. If services are interrupted, restart the services.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Services may be interrupted. After faulty pages are retired, the GPU can continue to be used.

ECC alarm generated on GPU Ant1

gpuAnt1EccAlarm

Major

An ECC error occurred on the GPU.

  1. If services are interrupted, restart the services.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Services may be interrupted. After faulty pages are retired, the GPU can continue to be used.

GPU ECC memory page retirement failure

eccPageRetirementRecordingFailure

Major

Automatic page retirement failed due to ECC errors.

  1. If services are interrupted, restart the services.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Services may be interrupted, and memory page retirement fails. As a result, services can no longer use the GPU.

GPU ECC page retirement alarm generated

eccPageRetirementRecordingEvent

Minor

Memory pages are automatically retired due to ECC errors.

  1. If services are interrupted, restart the services.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Generally, this alarm is generated together with the ECC error alarm. If this alarm is generated independently, services are not affected.

Too many single-bit ECC errors on GPU

highSingleBitEccErrorRate

Major

There are too many single-bit errors occurring in the ECC memory of the GPU.

  1. If services are interrupted, restart the services.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Single-bit errors can be automatically rectified. These errors generally do not affect GPU-related applications.

GPU card not found

gpuDriverLinkFailureAlarm

Major

The GPU link is normal, but the GPU cannot be found by the NVIDIA driver.

  1. Try restarting the VM to restore your services.
  2. If services still cannot be restored, submit a service ticket.

The GPU cannot be found.

GPU link faulty

gpuPcieLinkFailureAlarm

Major

GPU hardware information cannot be queried through lspci due to a GPU link fault.

If services are affected, submit a service ticket.

The driver cannot use the GPU.
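
For reference, a minimal sketch for checking whether the GPU is visible on the PCIe bus (assuming lspci is available):

    # List NVIDIA devices on the PCIe bus; a missing entry or "rev ff" indicates a link fault
    lspci | grep -i nvidia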

VM GPU lost

vmLostGpuAlarm

Major

The number of GPUs on the VM is less than the number specified in the specifications.

If services are affected, submit a service ticket.

GPUs are lost.
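
For reference, a minimal sketch for comparing the number of visible GPUs with the flavor specifications (assuming nvidia-smi is installed):

    # List the GPUs recognized by the driver and count them
    nvidia-smi -L
    nvidia-smi -L | wc -l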

GPU memory page faulty

gpuMemoryPageFault

Major

The GPU memory page is faulty, which may be caused by applications, drivers, or hardware.

If services are affected, submit a service ticket.

The GPU hardware may be faulty. As a result, the GPU memory is faulty, and services exit abnormally.

GPU image engine faulty

graphicsEngineException

Major

The GPU image engine is faulty, which may be caused by applications, drivers, or hardware.

If services are affected, submit a service ticket.

The GPU hardware may be faulty. As a result, the image engine is faulty, and services exit abnormally.

GPU temperature too high

highTemperatureEvent

Major

The GPU temperature is too high.

If services are affected, submit a service ticket.

If the GPU temperature exceeds the threshold, the GPU performance may deteriorate.

GPU NVLink faulty

nvlinkError

Major

A hardware fault occurs on the NVLink.

If services are affected, submit a service ticket.

The NVLink link is faulty and unavailable.

System maintenance inquiring

system_maintenance_inquiring

Major

The scheduled BMS maintenance task is being inquired.

Authorize the maintenance.

None

System maintenance waiting

system_maintenance_scheduled

Major

The scheduled BMS maintenance task is waiting to be executed.

Clarify the impact on services during the execution window.

None

System maintenance canceled

system_maintenance_canceled

Major

The scheduled BMS maintenance is canceled.

None

None

System maintenance executing

system_maintenance_executing

Major

BMSs are being maintained as scheduled.

After the maintenance is complete, check whether services are affected.

Services are interrupted.

System maintenance completed

system_maintenance_completed

Major

The scheduled BMS maintenance is completed.

Wait until the BMSs become available and check whether services recover.

None

System maintenance failure

system_maintenance_failed

Major

The scheduled BMS maintenance task failed.

Contact O&M personnel.

Services are interrupted.

GPU Xid error

commonXidError

Major

An Xid event alarm was generated on the GPU.

If services are affected, submit a service ticket.

An Xid error is caused by GPU hardware, driver, or application problems, which may result in abnormal service exit.

NPU: device not found by npu-smi info

NPUSMICardNotFound

Major

The Ascend driver is faulty, or the NPU is disconnected.

Transfer this issue to the Ascend or hardware team for handling.

The NPU cannot be used normally.

NPU: PCIe link error

PCIeErrorFound

Major

The lspci command returns rev ff, indicating that the NPU is abnormal.

Restart the BMS. If the issue persists, transfer it to the hardware team for processing.

The NPU cannot be used normally.
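
For reference, a minimal sketch for checking the NPU PCIe status described above (assuming lspci and the Ascend tools are available; the grep keyword is an assumption and may differ for your devices):

    # NPU devices reporting "rev ff" indicate an abnormal PCIe link
    lspci | grep -i huawei
    # Cross-check with the NPU management tool
    npu-smi info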

NPU: device not found by lspci

LspciCardNotFound

Major

The NPU is disconnected.

Transfer this issue to the hardware team for handling.

The NPU cannot be used normally.

NPU: overtemperature

TemperatureOverUpperLimit

Major

The temperature of DDR or software is too high.

Stop services, restart the BMS, check the heat dissipation system, and reset the devices.

The BMS may be powered off, and devices may not be found.

NPU: uncorrectable ECC error

UncorrectableEccErrorCount

Major

There are uncorrectable ECC errors on the NPU.

If services are affected, replace the NPU with another one.

Services may be interrupted.

NPU: request for BMS restart

RebootVirtualMachine

Warning

The BMS needs to be restarted.

Collect the required information and restart the BMS.

Services may be interrupted.

NPU: request for SoC reset

ResetSOC

Warning

The SoC needs to be reset.

Collect the required information and reset the SoC.

Services may be interrupted.

NPU: request for restart AI process

RestartAIProcess

Warning

The AI process needs to be restarted.

Collect the required information and restart the AI process.

The current AI task will be interrupted.

NPU: error codes

NPUErrorCodeWarning

Major

There are a large number of NPU error codes indicating major or higher-level errors. You can further locate the faults based on the error codes.

Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition.

Services may be interrupted.

nvidia-smi suspended

nvidiaSmiHangEvent

Major

nvidia-smi timed out.

If services are affected, submit a service ticket.

The driver may report an error during service running.

nv_peer_mem loading error

NvPeerMemException

Minor

The NVLink or nv_peer_mem cannot be loaded.

Restore or reinstall the NVLink.

nv_peer_mem cannot be used.

Fabric Manager error

NvFabricManagerException

Minor

The BMS meets the NVLink conditions and NVLink is installed, but Fabric Manager is abnormal.

Restore or reinstall the NVLink.

NVLink cannot be used normally.

IB card error

InfinibandStatusException

Major

The IB card or its physical status is abnormal.

Transfer this issue to the hardware team for handling.

The IB card cannot work normally.

GPU throttle alarm

gpuClocksThrottleReasonsAlarm

Warning

  1. The GPU power may exceed the maximum operating power threshold (continuous full load). The clock frequency automatically decreases to prevent the GPU from being damaged.
  2. The GPU temperature may exceed the maximum operating temperature threshold (continuous full load). The clock frequency automatically decreases to reduce heat.
  3. The GPU may remain idle, with the clock frequency automatically decreasing to reduce power consumption.
  4. Hardware faults may cause a decrease in clock frequency.

Check whether the clock frequency decrease is caused by hardware faults. If yes, transfer it to the hardware team.

The GPU clock frequency decreases, reducing compute performance.

Pending page retirement for GPU DRAM ECC

gpuRetiredPagesPendingAlarm

Major

  1. An ECC error occurred on the hardware. DRAM pages need to be retired.
  2. An uncorrectable ECC error occurred on a GPU memory page, and the page needs to be retired. However, the retirement is pending and has not been completed yet.

  1. View the event details and check whether the value of retired_pages.pending is yes.
  2. Restart the GPU to complete the retirement automatically.

The GPU cannot work properly.
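
For reference, a minimal sketch for checking the pending-retirement flag mentioned above (assuming nvidia-smi is installed):

    # "yes" means pages are waiting to be retired; restarting the GPU completes the retirement
    nvidia-smi --query-gpu=index,retired_pages.pending --format=csv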

Pending row remapping for GPU DRAM ECC

gpuRemappedRowsAlarm

Major

Some rows in the GPU memory have errors and need to be remapped. The faulty rows must be mapped to standby resources.

  1. View the event metric "RemappedRow" to check whether any rows have been remapped.
  2. Restart the GPU to complete the remapping automatically.

The GPU cannot work properly.
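
For reference, a minimal sketch for checking row-remapping counters (assuming an Ampere or later GPU and a driver whose nvidia-smi supports the remapped-rows query):

    # Correctable/uncorrectable remapped rows, pending remappings, and remapping failures per GPU
    nvidia-smi --query-remapped-rows=gpu_bus_id,remapped_rows.correctable,remapped_rows.uncorrectable,remapped_rows.pending,remapped_rows.failure --format=csv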

Insufficient resources for GPU DRAM ECC row remapping

gpuRowRemapperResourceAlarm

Major

  1. This event occurs on GPUs (Ampere and later architectures).
  2. The standby GPU memory row resources are exhausted, so row remapping cannot be continued.

Transfer the issue to the hardware team.

The GPU cannot work properly.

Correctable GPU DRAM ECC error

gpuDRAMCorrectableEccError

Major

  1. This event occurs on GPUs of the Ampere or later architectures.
  2. A correctable ECC error occurs in the DRAM of the GPU. The ECC mechanism automatically rectifies the error, and programs are not affected.

  1. View the event metric "ecc.errors.corrected.volatile" to check whether there are any correctable ECC error values.
  2. Restart the GPU to retire the faulty pages automatically.

The GPU may not work properly.

Uncorrectable GPU DRAM ECC error

gpuDRAMUncorrectableEccError

Major

  1. This event occurs on GPUs of the Ampere or later architectures.
  2. An uncorrectable ECC error occurs in the DRAM of the GPU. The error cannot be corrected automatically by the ECC mechanism, affects system stability, and may cause programs to crash.

  1. View the event metric "ecc.errors.uncorrected.volatile" to check whether there are any uncorrectable ECC error values.
  2. Restart the GPU to retire the faulty pages automatically.

The GPU may not work properly.
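
For reference, a minimal sketch for checking the volatile ECC counters referenced in this event and in the correctable-error event above (assuming nvidia-smi is installed):

    # Volatile (since last driver load) corrected and uncorrected ECC error counts per GPU
    nvidia-smi --query-gpu=index,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv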

Inconsistent GPU kernel versions

gpuKernelVersionInconsistencyAlarm

Major

Inconsistent GPU kernel versions.

During driver installation, the GPU driver is compiled against the kernel running at that time. If the kernel versions are detected as inconsistent, the kernel has been changed after the driver was installed. In this case, the driver may become unavailable and needs to be reinstalled.

  1. Run the following commands to rectify the issue:

    rmmod nvidia_drm
    rmmod nvidia_modeset
    rmmod nvidia

    Then, run nvidia-smi. If the command output is normal, the issue has been rectified.

  2. If the preceding solution does not work, rectify the fault by referring to Why Is the GPU Driver Unavailable?

The GPU cannot work properly.

GPU monitoring dependency not met

gpuCheckEnvFailedAlarm

Major

The plug-in cannot identify the GPU driver library path.

  1. Check whether the driver is installed.
  2. Check whether the driver installation directory has been customized. The driver needs to be installed in the default installation directory /usr/bin/.

The GPU metrics cannot be collected.

Initialization failure of the GPU monitoring driver library

gpuDriverInitFailedAlarm

Major

The GPU driver is unavailable.

Run nvidia-smi to check whether the driver is available. If the driver is unavailable, reinstall it by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.

The GPU metrics cannot be collected.

Initialization timeout of the GPU monitoring driver library

gpuDriverInitTAlarm

Major

The GPU driver initialization timed out (exceeding 10s).

  1. If the driver is not installed, install it by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.
  2. If the driver is installed, run nvidia-smi to check whether the driver is available. If the driver is unavailable, reinstall the driver by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.
  3. If the driver is properly installed, check whether the high-performance mode is enabled. If not, run nvidia-smi -pm 1 to enable it. P0 indicates the high-performance mode.

The GPU metrics cannot be collected.
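
For reference, a minimal sketch for checking and enabling the mode mentioned in step 3 (assuming nvidia-smi is installed; nvidia-smi -pm controls persistence mode, which this document refers to as the high-performance mode):

    # Check the current persistence mode and performance state of each GPU
    nvidia-smi --query-gpu=index,persistence_mode,pstate --format=csv
    # Enable persistence mode on all GPUs
    nvidia-smi -pm 1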

GPU metric collection timeout

gpuCollectMetricTimeoutAlarm

Major

The GPU metric collection timed out (exceeding 10s).

  1. If the library API timed out, run nvidia-smi to check whether the driver is available. If the driver is unavailable, reinstall the driver by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.
  2. If the command execution timed out, check the system logs and determine whether there is an issue with the system.

GPU monitoring metric data is missing. As a result, subsequent metrics may fail to be collected.

GPU handle lost

gpuDeviceHandleLost

Major

The GPU metric information cannot be obtained, and the GPU may be lost.

  1. Run nvidia-smi to check whether there are any errors reported.
  2. Run nvidia-smi -L to check whether the number of GPUs is the same as the server specifications.
  3. Submit a service ticket to contact on-call support.

All metrics of the GPU are lost.

Failed to listen to the XID of the GPU

gpuDeviceXidLost

Major

Failed to listen to the XID metric.

  1. Check whether the GPU is lost or damaged.
  2. Submit a service ticket to contact on-call support.

Failed to obtain XID-related metrics of the GPU.

Multiple NPU HBM ECC errors

NpuHbmMultiEccInfo

Warning

There are NPU HBM ECC errors.

This event is only a reference for other events. You do not need to handle it separately.

The NPU may not work properly.

ReadOnly issues in OS

ReadOnlyFileSystem

Critical

The file system %s is read-only.

Check the disk health status.

The files cannot be written.

NPU: driver and firmware not matching

NpuDriverFirmwareMismatch

Major

The NPU's driver and firmware do not match.

Obtain the matched version from the Ascend official website and reinstall it.

NPUs cannot be used.

NPU: Docker container environment check

NpuContainerEnvSystem

Major

Docker was unavailable.

Check whether Docker is running properly.

Docker cannot be used.

Major

The container plug-in Ascend-Docker-Runtime was not installed.

Install the container plug-in Ascend-Docker-Runtime. Otherwise, containers cannot use Ascend cards.

NPUs cannot be attached to Docker containers.

Major

IP forwarding was not enabled in the OS.

Check the net.ipv4.ip_forward configuration in the /etc/sysctl.conf file.

Docker containers have network communication problems.
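
For reference, a minimal sketch for checking and enabling IP forwarding as described above (assuming root privileges):

    # Check the current value; 1 means IP forwarding is enabled
    sysctl net.ipv4.ip_forward
    # Enable it persistently, then reload the configuration
    echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf
    sysctl -p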

Major

The shared memory of the container was too small.

The default shared memory is 64 MB, which can be modified as needed.

Method 1

Modify the default-shm-size field in the /etc/docker/daemon.json configuration file.

Method 2

Use the --shm-size parameter in the docker run command to set the shared memory size of a container.

Distributed training will fail due to insufficient shared memory.
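
For reference, a minimal sketch of the two methods above (the 2g value is only an example; size the shared memory according to your workload):

    # Method 1: set the daemon-wide default in /etc/docker/daemon.json, then restart Docker
    #   { "default-shm-size": "2g" }
    systemctl restart docker

    # Method 2: set the shared memory size for a single container at run time
    docker run --shm-size=2g <image>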

NPU: RoCE NIC down

RoCELinkStatusDown

Major

The RoCE link of NPU card %d was down.

Check the NPU RoCE network port status.

The NPU NIC is unavailable.

NPU: RoCE NIC health status abnormal

RoCEHealthStatusError

Major

The RoCE network health status of NPU %d was abnormal.

Check the health status of the NPU RoCE NIC.

The NPU NIC is unavailable.

NPU: RoCE NIC configuration file /etc/hccn.conf not found

HccnConfNotExisted

Major

The RoCE NIC configuration file /etc/hccn.conf was not found.

Check whether the NIC configuration file /etc/hccn.conf can be found.

The RoCE NIC is unavailable.

GPU: basic components abnormal

GpuEnvironmentSystem

Major

The nvidia-smi command was abnormal.

Check whether the GPU driver is normal.

The GPU driver is unavailable.

Major

The nvidia-fabricmanager version was inconsistent with the GPU driver version.

Check the GPU driver version and nvidia-fabricmanager version.

The nvidia-fabricmanager cannot work properly, affecting GPU usage.

Major

The container add-on nvidia-container-toolkit was not installed.

Install the container add-on nvidia-container-toolkit.

GPUs cannot be attached to Docker containers.

Local disk attachment inspection

MountDiskSystem

Major

The /etc/fstab file contains invalid UUIDs.

Ensure that the UUIDs in the /etc/fstab configuration file are correct. Otherwise, the server may fail to restart.

The disk attachment process fails, preventing the server from restarting.
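
For reference, a minimal sketch for cross-checking the UUIDs in /etc/fstab against the actual block devices (assuming blkid is available):

    # UUIDs of the block devices currently present on the server
    blkid
    # UUID entries configured for attachment at startup
    grep UUID /etc/fstab
    # Verify that all fstab entries can be mounted without errors
    mount -a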

GPU: incorrectly configured dynamic route for Ant series server

GpuRouteConfigError

Major

The dynamic route of the NIC %s of an Ant series server was not configured or was incorrectly configured. CMD [ip route]: %s | CMD [ip route show table all]: %s.

Configure the RoCE NIC route correctly.

The NPU network communication will be interrupted.

NPU: RoCE port not split

RoCEUdpConfigError

Major

The RoCE UDP port was not split.

Check the RoCE UDP port configuration on the NPU.

The communication performance of NPUs is affected.

Warning of automatic system kernel upgrade

KernelUpgradeWarning

Major

Warning of automatic system kernel upgrade. Old version: %s; new version: %s.

A system kernel upgrade may cause AI software exceptions. Check the system update logs and prevent the server from being restarted.

The AI software may be unavailable.
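
For reference, a minimal sketch for checking whether the running kernel matches the installed kernels (the package commands are assumptions for RPM-based images such as EulerOS or CentOS; dpkg/apt-based images differ):

    # Kernel currently running
    uname -r
    # Kernels installed on the system (RPM-based OS)
    rpm -q kernel
    # Recent package update history (RPM-based OS)
    cat /var/log/yum.log 2>/dev/null || dnf history 2>/dev/null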

NPU environment command detection

NpuToolsWarning

Major

The hccn_tool was unavailable.

Check whether the NPU driver is normal.

The IP address and gateway of the RoCE NIC cannot be configured.

Major

The npu-smi was unavailable.

Check whether the NPU driver is normal.

NPUs cannot be used.

Major

The ascend-dmi was unavailable.

Check whether ToolBox is properly installed.

ascend-dmi cannot be used for performance analysis.

Warning of an NPU driver exception

NpuDriverAbnormalWarning

Major

The NPU driver was abnormal.

Reinstall the NPU driver.

NPUs cannot be used.

GPU: invalid RoCE NIC configuration

GpuRoceNicConfigIncorrect

Major

The RoCE NIC of the GPU is incorrectly configured.

Contact O&M personnel.

The parameter-plane network is abnormal, preventing multi-node tasks from being executed.

Local disk replacement to be authorized

localdisk_recovery_inquiring

Major

The local disk is faulty. Local disk replacement authorization is in progress.

Authorize local disk replacement.

Local disks are unavailable.

Local disks being replaced

localdisk_recovery_executing

Major

Local disks are faulty and being replaced.

When the replacement is complete, check whether the local disks are available.

Local disks are unavailable.

Local disks replaced

localdisk_recovery_completed

Major

Local disks are faulty and have been replaced.

Wait until the services are running properly and check whether local disks are available.

None

Local disk replacement failed

localdisk_recovery_failed

Major

Local disks are faulty and fail to be replaced.

Contact O&M personnel.

Local disks are unavailable.