Updated on 2025-08-12 GMT+08:00

Monitored ECS Events

Scenarios

Event monitoring provides data collection, query, and alarm reporting for events. You can create alarm rules to get notified when specific events happen.

This section describes the ECS events monitored by Cloud Eye.

Namespace

SYS.ECS

Monitored Events

Table 1 ECS

Each event entry below lists the following fields in order: Event Name, Event ID, Event Severity, Description, Solution, and Impact.

Restart triggered due to system faults

startAutoRecovery

Major

ECSs on a faulty host are automatically migrated to another properly running host. During the migration, the ECSs are restarted.

Wait for the event to end and check whether services are affected.

Services may be interrupted.

Redeployment triggered by system faults completed

endAutoRecovery

Major

ECSs are recovered after the automatic migration is complete.

This event indicates that the ECS has recovered and has been working properly.

None

Auto recovery timeout (being processed on the backend)

faultAutoRecovery

Major

Migrating the ECS to a normal host timed out.

Migrate services to another ECS.

Services are interrupted.

GPU link fault

GPULinkFault

Critical

The GPU of the host running the ECS was faulty or recovering from a fault.

Deploy service applications in HA mode.

After the GPU fault is rectified, check whether services recover.

Services are interrupted.

ECS deleted

deleteServer

Major

The ECS was deleted:

  • On the management console
  • By calling APIs

Check whether the operation was intentionally performed by a user.

Services are interrupted.

ECS restarted

rebootServer

Minor

The ECS was restarted:

  • On the management console
  • By calling APIs

Check whether the operation was intentionally performed by a user.

  • Deploy service applications in HA mode.
  • After the ECS starts up, check whether services recover.

Services are interrupted.

ECS stopped

stopServer

Minor

The ECS was stopped:

  • On the management console
  • By calling APIs
NOTE:

This event is reported only after CTS is enabled.

  • Check whether the operation was intentionally performed by a user.
  • Deploy service applications in HA mode.
  • After the ECS starts up, check whether services recover.

Services are interrupted.

NIC deleted

deleteNic

Major

The ECS NIC was deleted:

  • On the management console
  • By calling APIs

  • Check whether the operation was intentionally performed by a user.
  • Deploy service applications in HA mode.
  • After the NIC is deleted, check whether services recover.

If the NIC is deleted, services may be interrupted.

ECS resized

resizeServer

Minor

The ECS specifications were modified:

  • On the management console
  • By calling APIs

  • Check whether the operation was intentionally performed by a user.
  • Deploy service applications in HA mode.
  • After the ECS is resized, check whether services recover.

Services are interrupted.

GuestOS restarted

RestartGuestOS

Minor

The guest OS was restarted.

Contact O&M personnel.

Services may be interrupted.

ECS failure caused by system faults

VMFaultsByHostProcessExceptions

Critical

The host where the ECS resides is faulty. The system will automatically try to start the ECS.

After the ECS is started, check whether this ECS and services on it can run properly.

The ECS is faulty.

Startup failure

faultPowerOn

Major

The ECS failed to start.

Start the ECS again. If the problem persists, contact O&M personnel.

The ECS cannot start.

Host breakdown risk

hostMayCrash

Major

The host where the ECS resides may break down, and the risk cannot be averted through live migration.

Migrate services running on the ECS first and then delete or stop the ECS. Start the ECS only after the O&M personnel handle the risk.

The host may break down, causing service interruptions.

Scheduled migration completed

instance_migrate_completed

Major

Scheduled ECS migration is complete.

Wait until the ECS becomes available and check whether services recover.

Services may be interrupted.

Scheduled migration being executed

instance_migrate_executing

Major

ECSs are being migrated as scheduled.

Wait until the event is complete and check whether services are affected.

Services may be interrupted.

Scheduled migration canceled

instance_migrate_canceled

Major

Scheduled ECS migration is canceled.

None

None

Scheduled migration failed

instance_migrate_failed

Major

ECSs failed to be migrated as scheduled.

Contact O&M personnel.

Services are interrupted.

Scheduled migration to be executed

instance_migrate_scheduled

Major

ECSs will be migrated as scheduled.

Check the impact on services during the execution window.

None

Scheduled specification modification failed

instance_resize_failed

Major

Specifications failed to be modified as scheduled.

Contact O&M personnel.

Services are interrupted.

Scheduled specification modification completed

instance_resize_completed

Major

Scheduled specification modification is complete.

None

None

Scheduled specification modification being executed

instance_resize_executing

Major

Specifications are being modified as scheduled.

Wait until the event is complete and check whether the specifications were modified.

Services are interrupted.

Scheduled specification modification canceled

instance_resize_canceled

Major

Scheduled specification modification is canceled.

None

None

Scheduled specification modification to be executed

instance_resize_scheduled

Major

Specifications will be modified as scheduled.

Check the impact on services during the execution window.

None

Scheduled redeployment to be executed

instance_redeploy_scheduled

Major

ECSs will be redeployed on new hosts as scheduled.

Check the impact on services during the execution window.

None

Scheduled restart to be executed

instance_reboot_scheduled

Major

ECSs will be restarted as scheduled.

Check the impact on services during the execution window.

None

Scheduled stop to be executed

instance_stop_scheduled

Major

ECSs will be stopped as scheduled.

Check the impact on services during the execution window.

None

ECC uncorrectable error alarm generated on GPU SRAM

SRAMUncorrectableEccError

Major

There are ECC uncorrectable errors generated on GPU SRAM.

If services are affected, submit a service ticket.

The GPU hardware may be faulty. As a result, the SRAM is faulty, and services exit abnormally.

FPGA link fault

FPGALinkFault

Critical

The FPGA of the host running ECSs was faulty or recovering from a fault.

Deploy service applications in HA mode.

After the FPGA fault is rectified, check whether services recover.

Services are interrupted.

Scheduled redeployment to be authorized

instance_redeploy_inquiring

Major

Scheduled ECS redeployment is to be authorized.

Authorize scheduled redeployment.

None

Local disk replacement canceled

localdisk_recovery_canceled

Major

Local disk replacement is canceled.

None

None

Local disk replacement to be executed

localdisk_recovery_scheduled

Major

Faulty local disks are waiting to be replaced.

Check the impact on services during the execution window.

None

Xid event alarm generated on GPU

commonXidError

Major

An Xid event alarm was generated on the GPU.

If services are affected, submit a service ticket.

An Xid error is caused by GPU hardware, driver, or application problems and may cause services to exit abnormally.

nvidia-smi suspended

nvidiaSmiHangEvent

Major

nvidia-smi timed out.

If services are affected, submit a service ticket.

The driver may report an error during service running.

NPU: uncorrectable ECC error

UncorrectableEccErrorCount

Major

There are uncorrectable ECC errors on the NPU.

If services are affected, replace the NPU with another one.

Services may be interrupted.

Scheduled redeployment canceled

instance_redeploy_canceled

Major

Scheduled redeployment is canceled.

None

None

Scheduled redeployment being executed

instance_redeploy_executing

Major

ECSs are being redeployed on a new host as scheduled.

Wait until the event is complete and check whether services are affected.

Services are interrupted.

Scheduled redeployment completed

instance_redeploy_completed

Major

Scheduled redeployment is complete.

Wait until the redeployed ECSs are available and check whether services are affected.

None

Scheduled redeployment failed

instance_redeploy_failed

Major

ECSs failed to be redeployed as scheduled.

Contact O&M personnel.

Services are interrupted.

Local disk replacement to be authorized

localdisk_recovery_inquiring

Major

Local disks are faulty.

Authorize local disk replacement.

Local disks are unavailable.

Local disks being replaced

localdisk_recovery_executing

Major

Local disks are faulty.

Wait until the local disks are replaced and check whether the local disks are available.

Local disks are unavailable.

Local disks replaced

localdisk_recovery_completed

Major

Faulty local disks have been replaced.

Check whether local disks are available.

None

Local disk replacement failed

localdisk_recovery_failed

Major

Local disks are faulty.

Contact O&M personnel.

Local disks are unavailable.

GPU throttle alarm

gpuClocksThrottleReasonsAlarm

Informational

  1. The GPU power may exceed the maximum operating power threshold (continuous full load). The clock frequency automatically decreases to prevent the GPU from being damaged.
  2. The GPU temperature may exceed the maximum operating temperature threshold (continuous full load). The clock frequency automatically decreases to reduce heat.
  3. The GPU may remain idle, with the clock frequency automatically decreasing to reduce power consumption.
  4. Hardware faults may cause a decrease in clock frequency.

Check whether the clock frequency decrease is caused by hardware faults. If yes, transfer it to the hardware team.

The GPU clock frequency decreases, reducing compute performance.

Pending page retirement for GPU DRAM ECC

gpuRetiredPagesPendingAlarm

Major

  1. An ECC error occurred on the hardware. DRAM pages need to be retired.
  2. An uncorrectable ECC error occurred on a GPU memory page, and the page needs to be retired. However, the retirement is still pending and has not been completed.

  1. View the event details and check whether the value of retired_pages.pending is yes.
  2. Restart the GPU for automatic retirement.

The GPU cannot work properly.

Pending row remapping for GPU DRAM ECC

gpuRemappedRowsAlarm

Major

Some rows in the GPU memory have errors and need to be remapped. The faulty rows must be mapped to standby resources.

  1. View the event metric "RemappedRow" to check if there are any rows that have been remapped.
  2. Restart the GPU for automatic retirement.

The GPU cannot work properly.

Insufficient resources for GPU DRAM ECC row remapping

gpuRowRemapperResourceAlarm

Major

  1. This event occurs on GPUs (Ampere and later architectures).
  2. The standby GPU memory row resources are exhausted, so row remapping cannot be continued.

Transfer the issue to the hardware team.

The GPU cannot work properly.

Correctable GPU DRAM ECC error

gpuDRAMCorrectableEccError

Major

  1. This event occurs on GPUs (Ampere and later architectures).
  2. A correctable ECC error occurs in the DRAM of the GPU. However, the ECC mechanism can automatically rectify the error and programs are not affected.

  1. View the event metric "ecc.errors.corrected.volatile" to check whether there are any correctable ECC error values.
  2. Restart the GPU for automatic retirement.

The GPU may not work properly.

Uncorrectable GPU DRAM ECC error

gpuDRAMUncorrectableEccError

Major

  1. This event occurs on GPUs (Ampere and later architectures).
  2. An uncorrectable ECC error occurs in the DRAM of the GPU. This error cannot be automatically corrected using the ECC mechanism. The verification process affects system stability and may cause program crashes.

  1. View the event metric "ecc.errors.uncorrected.volatile" to check whether there are any uncorrectable ECC error values.
  2. Restart the GPU for automatic retirement.

The GPU may not work properly.

Inconsistent GPU kernel versions

gpuKernelVersionInconsistencyAlarm

Major

The current kernel version of the GPU is inconsistent with that during the driver installation.

During driver installation, the GPU driver is compiled against the kernel running at that time. If the kernel versions are found to be inconsistent, the kernel has been customized after the driver installation. In this case, the driver becomes unavailable and needs to be reinstalled.

  1. Run the following commands to rectify the issue:

    rmmod nvidia_drm

    rmmod nvidia_modeset

    rmmod nvidia

    Then, run nvidia-smi. If the command output is normal, the issue has been rectified.

  2. If the preceding solution does not work, rectify the fault by referring to Why Is the GPU Driver Unavailable?

The GPU cannot work properly.
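The following is a minimal command sketch of the kernel version check described above, assuming a Linux guest with the standard NVIDIA driver installed; the module and command names are the stock ones, not site-specific values:

    # Compare the running kernel with the kernel the nvidia module was built against.
    uname -r
    modinfo nvidia | grep vermagic

    # If the versions differ, unload the driver modules and retest as described above.
    rmmod nvidia_drm
    rmmod nvidia_modeset
    rmmod nvidia
    nvidia-smi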

GPU monitoring dependency not met

gpuCheckEnvFailedAlarm

Major

The plug-in cannot identify the GPU driver library path.

  1. Check whether the driver is installed.
  2. Check whether the driver installation directory has been customized. The driver needs to be installed in the default installation directory /usr/bin/.

The GPU metrics cannot be collected.
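A minimal check sketch for this dependency, assuming the default installation directory /usr/bin mentioned above:

    # Verify that the NVIDIA driver utilities exist in the default directory.
    which nvidia-smi
    ls -l /usr/bin/nvidia-smi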

Initialization failure of the GPU monitoring driver library

gpuDriverInitFailedAlarm

Major

The GPU driver is unavailable.

Run nvidia-smi to check whether the driver is unavailable. If the driver is unavailable, reinstall the driver by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.

The GPU metrics cannot be collected.

Initialization timeout of the GPU monitoring driver library

gpuDriverInitTAlarm

Major

The GPU driver initialization timed out (exceeding 10s).

  1. If the driver is not installed, install it by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.
  2. If the driver is installed, run nvidia-smi to check whether the driver is available. If the driver is unavailable, reinstall the driver by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.
  3. If the driver is properly installed, check whether the high-performance mode is enabled. If not, run nvidia-smi -pm 1 to enable it. P0 indicates the high-performance mode.

The GPU metrics cannot be collected.
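A minimal command sketch for the checks above, assuming an installed Tesla driver; the query fields used are standard nvidia-smi options:

    # Confirm that the driver responds.
    nvidia-smi

    # Check the persistence mode and performance state (P0 indicates the high-performance mode).
    nvidia-smi --query-gpu=persistence_mode,pstate --format=csv

    # Enable the high-performance mode if it is disabled.
    nvidia-smi -pm 1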

GPU metric collection timeout

gpuCollectMetricTimeoutAlarm

Major

The GPU metric collection timed out (exceeding 10s).

  1. If the library API timed out, run nvidia-smi to check whether the driver is available. If the driver is unavailable, reinstall the driver by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.
  2. If the command execution timed out, check the system logs and determine whether there is an issue with the system.

GPU monitoring metric data is missing. As a result, subsequent metrics may fail to be collected.

GPU handle lost

gpuDeviceHandleLost

Major

The GPU metric information cannot be obtained, and the GPU may be lost.

  1. Run nvidia-smi to check whether there are any errors reported.
  2. Run nvidia-smi -L to check whether the number of GPUs is the same as the server specifications.
  3. Submit a service ticket to contact on-call support.

All metrics of the GPU are lost.
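A minimal command sketch for these checks, assuming nvidia-smi is available on the ECS:

    # Check whether the driver reports errors.
    nvidia-smi

    # List the GPUs and compare the count with the number defined by the ECS specifications.
    nvidia-smi -L
    nvidia-smi -L | wc -l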

Failed to listen to the XID of the GPU

gpuDeviceXidLost

Major

Failed to listen to the Xid metric.

  1. Check whether the GPU is lost or damaged.
  2. Submit a service ticket to contact on-call support.

Failed to obtain Xid-related metrics of the GPU.

ReadOnly issues in OS

ReadOnlyFileSystem

Critical

The file system %s is read-only.

Check the disk health status.

The files cannot be written.
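A minimal sketch for locating read-only file systems and the underlying disk errors, assuming a Linux guest; vendor disk tools can be used for deeper health checks where installed:

    # List file systems currently mounted read-only.
    awk '$4 ~ /(^|,)ro(,|$)/ {print $1, $2, $4}' /proc/mounts

    # Look for I/O errors or read-only remount messages in the kernel log.
    dmesg | grep -iE 'i/o error|read-only'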

NPU: driver and firmware not matching

NpuDriverFirmwareMismatch

Major

The NPU's driver and firmware do not match.

Obtain the matched version from the Ascend official website and reinstall it.

NPUs cannot be used.

NPU: Docker container environment check

NpuContainerEnvSystem

Major

Docker was unavailable.

Check whether Docker is running properly.

Docker cannot be used.

Major

The container plug-in Ascend-Docker-Runtime was not installed.

Install the container plug-in Ascend-Docker-Runtime. Otherwise, containers cannot use Ascend cards.

NPUs cannot be attached to Docker containers.

Major

IP forwarding was not enabled in the OS.

Check the net.ipv4.ip_forward configuration in the /etc/sysctl.conf file.

Docker containers have network communication problems.
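A minimal sketch for checking and enabling IP forwarding, assuming a Linux host with sysctl available; a value of 1 means enabled:

    # Check the current setting.
    sysctl net.ipv4.ip_forward

    # If it is 0, add or update the following line in /etc/sysctl.conf, then reload:
    #   net.ipv4.ip_forward = 1
    sysctl -p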

Major

The shared memory of the container was too small.

The default shared memory is 64 MB and can be modified as needed (see the sketch after this entry).

Method 1: Modify the default-shm-size field in the /etc/docker/daemon.json configuration file.

Method 2: Use the --shm-size parameter in the docker run command to set the shared memory size of a container.

Distributed training will fail due to insufficient shared memory.
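A minimal sketch of the two methods above; the 1g value and the image name are example values only, to be adjusted as needed:

    # Method 1: raise the default shared memory for all containers.
    # Merge the following into /etc/docker/daemon.json and restart Docker:
    #   { "default-shm-size": "1g" }
    systemctl restart docker

    # Method 2: set the shared memory size for a single container at run time.
    docker run --shm-size=1g <image>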

NPU: RoCE NIC down

RoCELinkStatusDown

Major

The RoCE link of NPU card %d was down.

Check the NPU RoCE network port status.

The NPU NIC is unavailable.

NPU: RoCE NIC health status abnormal

RoCEHealthStatusError

Major

The RoCE network health status of NPU %d was abnormal.

Check the health status of the NPU RoCE NIC.

The NPU NIC is unavailable.

NPU: RoCE NIC configuration file /etc/hccn.conf not found

HccnConfNotExisted

Major

The RoCE NIC configuration file /etc/hccn.conf was not found.

Check whether the NIC configuration file /etc/hccn.conf can be found.

The RoCE NIC is unavailable.

GPU: basic components abnormal

GpuEnvironmentSystem

Major

The nvidia-smi command was abnormal.

Check whether the GPU driver is normal.

The GPU driver is unavailable.

Major

The nvidia-fabricmanager version was inconsistent with the GPU driver version.

Check the GPU driver version and nvidia-fabricmanager version.

The nvidia-fabricmanager cannot work properly, affecting GPU usage.

Major

The container add-on nvidia-container-toolkit was not installed.

Install the container add-on nvidia-container-toolkit.

GPUs cannot be attached to Docker containers.

Local disk attachment inspection

MountDiskSystem

Major

The /etc/fstab file contains invalid UUIDs.

Ensure that the UUIDs in the /etc/fstab configuration file are correct; otherwise, the server may fail to restart.

The disk attachment process fails, preventing the server from restarting.
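A minimal sketch for validating the /etc/fstab entries against the attached disks, assuming a Linux guest; findmnt --verify requires a relatively new util-linux release:

    # List the UUIDs of the attached disks and compare them with /etc/fstab.
    blkid
    cat /etc/fstab

    # Verify the fstab entries without rebooting (newer util-linux releases).
    findmnt --verify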

GPU: incorrectly configured dynamic route for Ant series server

GpuRouteConfigError

Major

The dynamic route of the NIC %s of an Ant series server was not configured or was incorrectly configured. CMD [ip route]: %s | CMD [ip route show table all]: %s.

Configure the RoCE NIC route correctly.

The NPU network communication will be interrupted.

NPU: RoCE port not split

RoCEUdpConfigError

Major

The RoCE UDP port was not split.

Check the RoCE UDP port configuration on the NPU.

The communication performance of NPUs is affected.

Warning of automatic system kernel upgrade

KernelUpgradeWarning

Major

Warning of automatic system kernel upgrade. Old version: %s; new version: %s.

A system kernel upgrade may cause AI software exceptions. Check the system update logs and avoid restarting the server.

The AI software may be unavailable.

NPU environment command detection

NpuToolsWarning

Major

The hccn_tool was unavailable.

Check whether the NPU driver is normal.

The IP address and gateway of the RoCE NIC cannot be configured.

Major

The npu-smi was unavailable.

Check whether the NPU driver is normal.

NPUs cannot be used.

Major

The ascend-dmi was unavailable.

Check whether ToolBox is properly installed.

The ascend-dmi cannot be used for performance analysis.

Warning of an NPU driver exception

NpuDriverAbnormalWarning

Major

The NPU driver was abnormal.

Reinstall the NPU driver.

NPUs cannot be used.