Updated on 2025-08-12 GMT+08:00

Monitored ECS Events

Scenarios

Event monitoring provides data collection, query, and alarm reporting for events. You can create alarm rules to get notified when specific events happen.

This section describes the ECS events monitored by Cloud Eye.

Namespace

SYS.ECS

Monitored Events

Table 1 ECS

Each event entry below lists the following fields in order: Event Name, Event ID, Event Severity, Description, Solution, and Impact.

Restart triggered due to system faults

startAutoRecovery

Major

ECSs on a faulty host are automatically migrated to another properly running host. During the migration, the ECSs are restarted.

Wait for the event to end and check whether services are affected.

Services may be interrupted.

Redeployment triggered by system faults completed

endAutoRecovery

Major

ECSs are recovered after the automatic migration is complete.

This event indicates that the ECS has recovered and has been working properly.

None

Auto recovery timeout (being processed on the backend)

faultAutoRecovery

Major

Migrating the ECS to a normal host timed out.

Migrate services to another ECS.

Services are interrupted.

GPU link fault

GPULinkFault

Critical

The GPU of the host running the ECS was faulty or recovering from a fault.

Deploy service applications in HA mode.

After the GPU fault is rectified, check whether services recover.

Services are interrupted.

ECS deleted

deleteServer

Major

The ECS was deleted:

  • On the management console
  • By calling APIs

Check whether the operation was intentionally performed by a user.

Services are interrupted.

ECS restarted

rebootServer

Minor

The ECS was restarted:

  • On the management console
  • By calling APIs

Check whether the operation was intentionally performed by a user.

  • Deploy service applications in HA mode.
  • After the ECS starts up, check whether services recover.

Services are interrupted.

ECS stopped

stopServer

Minor

The ECS was stopped:

  • On the management console
  • By calling APIs
NOTE:

This event is reported only after CTS is enabled.

  • Check whether the operation was intentionally performed by a user.
  • Deploy service applications in HA mode.
  • After the ECS starts up, check whether services recover.

Services are interrupted.

NIC deleted

deleteNic

Major

The ECS NIC was deleted:

  • On the management console
  • By calling APIs

  • Check whether the operation was intentionally performed by a user.
  • Deploy service applications in HA mode.
  • After the NIC is deleted, check whether services recover.

If the NIC is deleted, services may be interrupted.

ECS resized

resizeServer

Minor

The ECS specifications were modified:

  • On the management console
  • By calling APIs

  • Check whether the operation was intentionally performed by a user.
  • Deploy service applications in HA mode.
  • After the ECS is resized, check whether services recover.

Services are interrupted.

GuestOS restarted

RestartGuestOS

Minor

The guest OS was restarted.

Contact O&M personnel.

Services may be interrupted.

ECS failure caused by system faults

VMFaultsByHostProcessExceptions

Critical

The host where the ECS resides is faulty. The system will automatically try to start the ECS.

After the ECS is started, check whether this ECS and services on it can run properly.

The ECS is faulty.

Startup failure

faultPowerOn

Major

The ECS failed to start.

Start the ECS again. If the problem persists, contact O&M personnel.

The ECS cannot start.

Host breakdown risk

hostMayCrash

Major

The host where the ECS resides may break down, and the risk cannot be averted through live migration.

Migrate services running on the ECS first and then delete or stop the ECS. Start the ECS only after the O&M personnel handle the risk.

The host may break down, causing service interruptions.

Scheduled migration completed

instance_migrate_completed

Major

Scheduled ECS migration is complete.

Wait until the ECS becomes available and check whether services recover.

Services may be interrupted.

Scheduled migration being executed

instance_migrate_executing

Major

ECSs are being migrated as scheduled.

Wait until the event is complete and check whether services are affected.

Services may be interrupted.

Scheduled migration canceled

instance_migrate_canceled

Major

Scheduled ECS migration is canceled.

None

None

Scheduled migration failed

instance_migrate_failed

Major

ECSs failed to be migrated as scheduled.

Contact O&M personnel.

Services are interrupted.

Scheduled migration to be executed

instance_migrate_scheduled

Major

ECSs will be migrated as scheduled.

Check the impact on services during the execution window.

None

Scheduled specification modification failed

instance_resize_failed

Major

Specifications failed to be modified as scheduled.

Contact O&M personnel.

Services are interrupted.

Scheduled specification modification completed

instance_resize_completed

Major

Scheduled specification modification is complete.

None

None

Scheduled specification modification being executed

instance_resize_executing

Major

Specifications are being modified as scheduled.

Wait until the event is complete and check whether the specifications were modified.

Services are interrupted.

Scheduled specification modification canceled

instance_resize_canceled

Major

Scheduled specification modification is canceled.

None

None

Scheduled specification modification to be executed

instance_resize_scheduled

Major

Specifications will be modified as scheduled.

Check the impact on services during the execution window.

None

Scheduled redeployment to be executed

instance_redeploy_scheduled

Major

ECSs will be redeployed on new hosts as scheduled.

Check the impact on services during the execution window.

None

Scheduled restart to be executed

instance_reboot_scheduled

Major

ECSs will be restarted as scheduled.

Check the impact on services during the execution window.

None

Scheduled stop to be executed

instance_stop_scheduled

Major

ECSs will be stopped as scheduled.

Check the impact on services during the execution window.

None

ECC uncorrectable error alarm generated on GPU SRAM

SRAMUncorrectableEccError

Major

There are ECC uncorrectable errors generated on GPU SRAM.

If services are affected, submit a service ticket.

The GPU hardware may be faulty. As a result, the SRAM is faulty, and services exit abnormally.

FPGA link fault

FPGALinkFault

Critical

The FPGA of the host running ECSs was faulty or recovering from a fault.

Deploy service applications in HA mode.

After the FPGA fault is rectified, check whether services recover.

Services are interrupted.

Scheduled redeployment to be authorized

instance_redeploy_inquiring

Major

Scheduled ECS redeployment is to be authorized.

Authorize scheduled redeployment.

None

Local disk replacement canceled

localdisk_recovery_canceled

Major

Local disk replacement is canceled.

None

None

Local disk replacement to be executed

localdisk_recovery_scheduled

Major

Faulty local disks are waiting to be replaced.

Check the impact on services during the execution window.

None

Xid event alarm generated on GPU

commonXidError

Major

An Xid event alarm was generated on the GPU.

If services are affected, submit a service ticket.

An Xid error is caused by GPU hardware, driver, or application problems and may cause services to exit abnormally.

nvidia-smi suspended

nvidiaSmiHangEvent

Major

nvidia-smi timed out.

If services are affected, submit a service ticket.

The driver may report an error during service running.

NPU: uncorrectable ECC error

UncorrectableEccErrorCount

Major

There are uncorrectable ECC errors on the NPU.

If services are affected, replace the NPU with another one.

Services may be interrupted.

Scheduled redeployment canceled

instance_redeploy_canceled

Major

Scheduled redeployment is canceled.

None

None

Scheduled redeployment being executed

instance_redeploy_executing

Major

ECSs are being redeployed on a new host as scheduled.

Wait until the event is complete and check whether services are affected.

Services are interrupted.

Scheduled redeployment completed

instance_redeploy_completed

Major

Scheduled redeployment is complete.

Wait until the redeployed ECSs are available and check whether services are affected.

None

Scheduled redeployment failed

instance_redeploy_failed

Major

ECSs failed to be redeployed as scheduled.

Contact O&M personnel.

Services are interrupted.

Local disk replacement to be authorized

localdisk_recovery_inquiring

Major

Local disks are faulty.

Authorize local disk replacement.

Local disks are unavailable.

Local disks being replaced

localdisk_recovery_executing

Major

Local disks are faulty.

Wait until the local disks are replaced and check whether the local disks are available.

Local disks are unavailable.

Local disks replaced

localdisk_recovery_completed

Major

Faulty local disks have been replaced.

Check whether local disks are available.

None

Local disk replacement failed

localdisk_recovery_failed

Major

Local disks are faulty.

Contact O&M personnel.

Local disks are unavailable.

GPU throttle alarm

gpuClocksThrottleReasonsAlarm

Informational

  1. The GPU power may exceed the maximum operating power threshold (continuous full load). The clock frequency automatically decreases to prevent the GPU from being damaged.
  2. The GPU temperature may exceed the maximum operating temperature threshold (continuous full load). The clock frequency automatically decreases to reduce heat.
  3. The GPU may remain idle, with the clock frequency automatically decreasing to reduce power consumption.
  4. Hardware faults may cause a decrease in clock frequency.

Check whether the clock frequency decrease is caused by hardware faults. If yes, transfer it to the hardware team.

The GPU clock frequency decreases, reducing compute performance.

Pending page retirement for GPU DRAM ECC

gpuRetiredPagesPendingAlarm

Major

  1. An ECC error occurred on the hardware. DRAM pages need to be retired.
  2. An uncorrectable ECC error occurred on a GPU memory page, and the page needs to be retired. However, the retirement is still pending and has not been completed.

  1. View the event details and check whether the value of retired_pages.pending is yes.
  2. Restart the GPU for automatic retirement.

The GPU cannot work properly.

Pending row remapping for GPU DRAM ECC

gpuRemappedRowsAlarm

Major

Some rows in the GPU memory have errors and need to be remapped. The faulty rows must be mapped to standby resources.

  1. View the event metric "RemappedRow" to check if there are any rows that have been remapped.
  2. Restart the GPU for automatic retirement.

The GPU cannot work properly.

Insufficient resources for GPU DRAM ECC row remapping

gpuRowRemapperResourceAlarm

Major

  1. This event occurs on GPUs (Ampere and later architectures).
  2. The standby GPU memory row resources are exhausted, so row remapping cannot be continued.

Transfer the issue to the hardware team.

The GPU cannot work properly.

Correctable GPU DRAM ECC error

gpuDRAMCorrectableEccError

Major

  1. This event occurs on GPUs (Ampere and later architectures).
  2. A correctable ECC error occurs in the DRAM of the GPU. However, the ECC mechanism can automatically rectify the error and programs are not affected.

  1. View the event metric "ecc.errors.corrected.volatile" to check whether there are any correctable ECC error values.
  2. Restart the GPU for automatic retirement.

The GPU may not work properly.

Uncorrectable GPU DRAM ECC error

gpuDRAMUncorrectableEccError

Major

  1. This event occurs on GPUs (Ampere and later architectures).
  2. An uncorrectable ECC error occurs in the DRAM of the GPU. This error cannot be automatically corrected using the ECC mechanism. The verification process affects system stability and may cause program crashes.

  1. View the event metric "ecc.errors.uncorrected.volatile" to check whether there are any uncorrectable ECC error values.
  2. Restart the GPU for automatic retirement.

The GPU may not work properly.

Inconsistent GPU kernel versions

gpuKernelVersionInconsistencyAlarm

Major

The current kernel version of the GPU is inconsistent with that during the driver installation.

During driver installation, the GPU driver is compiled against the kernel running at that time. If the kernel versions are found to be inconsistent, the kernel has been customized after the driver installation. In this case, the driver becomes unavailable and needs to be reinstalled.

  1. Run the following commands to rectify the issue:

    rmmod nvidia_drm

    rmmod nvidia_modeset

    rmmod nvidia

    Then, run nvidia-smi. If the command output is normal, the issue has been rectified.

  2. If the preceding solution does not work, rectify the fault by referring to Why Is the GPU Driver Unavailable?

The GPU cannot work properly.
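The following is a minimal command sketch of the kernel version check described above, assuming a Linux guest with the standard NVIDIA driver installed; the module and command names are the stock ones, not site-specific values:

    # Compare the running kernel with the kernel the nvidia module was built against.
    uname -r
    modinfo nvidia | grep vermagic

    # If the versions differ, unload the driver modules and retest as described above.
    rmmod nvidia_drm
    rmmod nvidia_modeset
    rmmod nvidia
    nvidia-smi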

GPU monitoring dependency not met

gpuCheckEnvFailedAlarm

Major

The plug-in cannot identify the GPU driver library path.

  1. Check whether the driver is installed.
  2. Check whether the driver installation directory has been customized. The driver needs to be installed in the default installation directory /usr/bin/.

The GPU metrics cannot be collected.
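A minimal check sketch for this dependency, assuming the default installation directory /usr/bin mentioned above:

    # Verify that the NVIDIA driver utilities exist in the default directory.
    which nvidia-smi
    ls -l /usr/bin/nvidia-smi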

Initialization failure of the GPU monitoring driver library

gpuDriverInitFailedAlarm

Major

The GPU driver is unavailable.

Run nvidia-smi to check whether the driver is unavailable. If the driver is unavailable, reinstall the driver by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.

The GPU metrics cannot be collected.

Initialization timeout of the GPU monitoring driver library

gpuDriverInitTAlarm

Major

The GPU driver initialization timed out (exceeding 10s).

  1. If the driver is not installed, install it by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.
  2. If the driver is installed, run nvidia-smi to check whether the driver is available. If the driver is unavailable, reinstall the driver by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.
  3. If the driver is properly installed, check whether the high-performance mode is enabled. If not, run nvidia-smi -pm 1 to enable it. P0 indicates the high-performance mode.

The GPU metrics cannot be collected.
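A minimal command sketch for the checks above, assuming an installed Tesla driver; the query fields used are standard nvidia-smi options:

    # Confirm that the driver responds.
    nvidia-smi

    # Check the persistence mode and performance state (P0 indicates the high-performance mode).
    nvidia-smi --query-gpu=persistence_mode,pstate --format=csv

    # Enable the high-performance mode if it is disabled.
    nvidia-smi -pm 1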

GPU metric collection timeout

gpuCollectMetricTimeoutAlarm

Major

The GPU metric collection timed out (exceeding 10s).

  1. If the library API timed out, run nvidia-smi to check whether the driver is available. If the driver is unavailable, reinstall the driver by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.
  2. If the command execution timed out, check the system logs and determine whether there is an issue with the system.

GPU monitoring metric data is missing. As a result, subsequent metrics may fail to be collected.

GPU handle lost

gpuDeviceHandleLost

Major

The GPU metric information cannot be obtained, and the GPU may be lost.

  1. Run nvidia-smi to check whether there are any errors reported.
  2. Run nvidia-smi -L to check whether the number of GPUs is the same as the server specifications.
  3. Submit a service ticket to contact on-call support.

All metrics of the GPU are lost.
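A minimal command sketch for these checks, assuming nvidia-smi is available on the ECS:

    # Check whether the driver reports errors.
    nvidia-smi

    # List the GPUs and compare the count with the number defined by the ECS specifications.
    nvidia-smi -L
    nvidia-smi -L | wc -l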

Failed to listen to the XID of the GPU

gpuDeviceXidLost

Major

Failed to listen to the Xid metric.

  1. Check whether the GPU is lost or damaged.
  2. Submit a service ticket to contact on-call support.

Failed to obtain Xid-related metrics of the GPU.

ReadOnly issues in OS

ReadOnlyFileSystem

Critical

The file system %s is read-only.

Check the disk health status.

The files cannot be written.
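A minimal sketch for locating read-only file systems and the underlying disk errors, assuming a Linux guest; vendor disk tools can be used for deeper health checks where installed:

    # List file systems currently mounted read-only.
    awk '$4 ~ /(^|,)ro(,|$)/ {print $1, $2, $4}' /proc/mounts

    # Look for I/O errors or read-only remount messages in the kernel log.
    dmesg | grep -iE 'i/o error|read-only'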

NPU: driver and firmware not matching

NpuDriverFirmwareMismatch

Major

The NPU's driver and firmware do not match.

Obtain the matched version from the Ascend official website and reinstall it.

NPUs cannot be used.

NPU: Docker container environment check

NpuContainerEnvSystem

Major

Docker was unavailable.

Check whether Docker is running properly.

Docker cannot be used.

Major

The container plug-in Ascend-Docker-Runtime was not installed.

Install the container plug-in Ascend-Docker-Runtime. Otherwise, containers cannot use Ascend cards.

NPUs cannot be attached to Docker containers.

Major

IP forwarding was not enabled in the OS.

Check the net.ipv4.ip_forward configuration in the /etc/sysctl.conf file.

Docker containers have network communication problems.
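A minimal sketch for checking and enabling IP forwarding, assuming a Linux host with sysctl available; a value of 1 means enabled:

    # Check the current setting.
    sysctl net.ipv4.ip_forward

    # If it is 0, add or update the following line in /etc/sysctl.conf, then reload:
    #   net.ipv4.ip_forward = 1
    sysctl -p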

Major

The shared memory of the container was too small.

The default shared memory is 64 MB and can be modified as needed (see the sketch after this entry).

Method 1: Modify the default-shm-size field in the /etc/docker/daemon.json configuration file.

Method 2: Use the --shm-size parameter in the docker run command to set the shared memory size of a container.

Distributed training will fail due to insufficient shared memory.
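A minimal sketch of the two methods above; the 1g value and the image name are example values only, to be adjusted as needed:

    # Method 1: raise the default shared memory for all containers.
    # Merge the following into /etc/docker/daemon.json and restart Docker:
    #   { "default-shm-size": "1g" }
    systemctl restart docker

    # Method 2: set the shared memory size for a single container at run time.
    docker run --shm-size=1g <image>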

NPU: RoCE NIC down

RoCELinkStatusDown

Major

The RoCE link of NPU card %d was down.

Check the NPU RoCE network port status.

The NPU NIC is unavailable.

NPU: RoCE NIC health status abnormal

RoCEHealthStatusError

Major

The RoCE network health status of NPU %d was abnormal.

Check the health status of the NPU RoCE NIC.

The NPU NIC is unavailable.

NPU: RoCE NIC configuration file /etc/hccn.conf not found

HccnConfNotExisted

Major

The RoCE NIC configuration file /etc/hccn.conf was not found.

Check whether the NIC configuration file /etc/hccn.conf can be found.

The RoCE NIC is unavailable.

GPU: basic components abnormal

GpuEnvironmentSystem

Major

The nvidia-smi command was abnormal.

Check whether the GPU driver is normal.

The GPU driver is unavailable.

Major

The nvidia-fabricmanager version was inconsistent with the GPU driver version.

Check the GPU driver version and nvidia-fabricmanager version.

The nvidia-fabricmanager cannot work properly, affecting GPU usage.

Major

The container add-on nvidia-container-toolkit was not installed.

Install the container add-on nvidia-container-toolkit.

GPUs cannot be attached to Docker containers.

Local disk attachment inspection

MountDiskSystem

Major

The /etc/fstab file contains invalid UUIDs.

Ensure that the UUIDs in the /etc/fstab configuration file are correct; otherwise, the server may fail to restart.

The disk attachment process fails, preventing the server from restarting.
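A minimal sketch for validating the /etc/fstab entries against the attached disks, assuming a Linux guest; findmnt --verify requires a relatively new util-linux release:

    # List the UUIDs of the attached disks and compare them with /etc/fstab.
    blkid
    cat /etc/fstab

    # Verify the fstab entries without rebooting (newer util-linux releases).
    findmnt --verify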

GPU: incorrectly configured dynamic route for Ant series server

GpuRouteConfigError

Major

The dynamic route of the NIC %s of an Ant series server was not configured or was incorrectly configured. CMD [ip route]: %s | CMD [ip route show table all]: %s.

Configure the RoCE NIC route correctly.

The NPU network communication will be interrupted.

NPU: RoCE port not split

RoCEUdpConfigError

Major

The RoCE UDP port was not split.

Check the RoCE UDP port configuration on the NPU.

The communication performance of NPUs is affected.

Warning of automatic system kernel upgrade

KernelUpgradeWarning

Major

Warning of automatic system kernel upgrade. Old version: %s; new version: %s.

A system kernel upgrade may cause AI software exceptions. Check the system update logs and avoid restarting the server.

The AI software may be unavailable.

NPU environment command detection

NpuToolsWarning

Major

The hccn_tool was unavailable.

Check whether the NPU driver is normal.

The IP address and gateway of the RoCE NIC cannot be configured.

Major

The npu-smi was unavailable.

Check whether the NPU driver is normal.

NPUs cannot be used.

Major

The ascend-dmi was unavailable.

Check whether ToolBox is properly installed.

The ascend-dmi cannot be used for performance analysis.

Warning of an NPU driver exception

NpuDriverAbnormalWarning

Major

The NPU driver was abnormal.

Reinstall the NPU driver.

NPUs cannot be used.