Monitored ECS Events
Scenarios
Event monitoring provides data collection, query, and alarm reporting for events. You can create alarm rules to get notified when specific events happen.
This section describes the ECS events monitored by Cloud Eye.
Namespace
SYS.ECS
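As a rough illustration of how an alarm rule might select events from this namespace, the following Python sketch filters event records by event ID and minimum severity. The record layout (the `namespace`, `event_id`, and `severity` keys) and the `AlarmRule` class are hypothetical and do not reflect the Cloud Eye API. Only the namespace, the event IDs, and the severity names are taken from this page.

```python
from dataclasses import dataclass

NAMESPACE = "SYS.ECS"

# Severity names as used on this page, ordered from least to most severe.
SEVERITY_RANK = {"Informational": 0, "Minor": 1, "Major": 2, "Critical": 3}


@dataclass(frozen=True)
class AlarmRule:
    """Hypothetical in-memory rule: match listed event IDs at or above a severity."""
    event_ids: frozenset
    min_severity: str = "Minor"

    def matches(self, event: dict) -> bool:
        # Ignore events from other namespaces.
        if event.get("namespace") != NAMESPACE:
            return False
        # Only the event IDs named in the rule are of interest.
        if event.get("event_id") not in self.event_ids:
            return False
        # Unknown severities rank below every known one and never match.
        rank = SEVERITY_RANK.get(event.get("severity"), -1)
        return rank >= SEVERITY_RANK[self.min_severity]


rule = AlarmRule(frozenset({"GPULinkFault", "faultPowerOn", "rebootServer"}), "Major")

# Illustrative event records (assumed shape, not real Cloud Eye output).
events = [
    {"namespace": "SYS.ECS", "event_id": "GPULinkFault", "severity": "Critical"},
    {"namespace": "SYS.ECS", "event_id": "rebootServer", "severity": "Minor"},
    {"namespace": "SYS.ECS", "event_id": "deleteNic", "severity": "Major"},
]
matched = [e["event_id"] for e in events if rule.matches(e)]
print(matched)
```

In this sketch, `rebootServer` is dropped because its severity is below the rule's threshold, and `deleteNic` is dropped because the rule does not list it.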
Monitored Events
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
---|---|---|---|---|---|
Restart triggered due to system faults |
startAutoRecovery |
Major |
ECSs on a faulty host were automatically migrated to a properly running host. The ECSs were restarted during the migration. |
Wait for the event to end and check whether services are affected. |
Services may be interrupted. |
Redeployment triggered by system faults completed |
endAutoRecovery |
Major |
ECSs are recovered after the automatic migration is complete. |
This event indicates that the ECS has recovered and is working properly. |
None |
Auto recovery timeout (being processed on the backend) |
faultAutoRecovery |
Major |
Migrating the ECS to a normal host timed out. |
Migrate services to another ECS. |
Services are interrupted. |
GPU link fault |
GPULinkFault |
Critical |
The GPU of the host running the ECS was faulty or recovering from a fault. |
Deploy service applications in HA mode. After the GPU fault is rectified, check whether services recover. |
Services are interrupted. |
ECS deleted |
deleteServer |
Major |
The ECS was deleted:
|
Check whether the operation was intentionally performed by a user. |
Services are interrupted. |
ECS restarted |
rebootServer |
Minor |
The ECS was restarted:
|
Check whether the operation was intentionally performed by a user. |
Services are interrupted. |
ECS stopped |
stopServer |
Minor |
The ECS was stopped:
NOTE:
The ECS is stopped only after CTS is enabled. |
|
Services are interrupted. |
NIC deleted |
deleteNic |
Major |
The ECS NIC was deleted:
|
|
If the NIC is deleted, services may be interrupted. |
ECS resized |
resizeServer |
Minor |
The ECS specifications were modified:
|
|
Services are interrupted. |
GuestOS restarted |
RestartGuestOS |
Minor |
The guest OS was restarted. |
Contact O&M personnel. |
Services may be interrupted. |
ECS failure caused by system faults |
VMFaultsByHostProcessExceptions |
Critical |
The host where the ECS resides is faulty. The system will automatically try to start the ECS. |
After the ECS is started, check whether this ECS and services on it can run properly. |
The ECS is faulty. |
Startup failure |
faultPowerOn |
Major |
The ECS failed to start. |
Start the ECS again. If the problem persists, contact O&M personnel. |
The ECS cannot start. |
Host breakdown risk |
hostMayCrash |
Major |
The host where the ECS resides may break down, and the risk cannot be avoided through live migration. |
Migrate services running on the ECS first and then delete or stop the ECS. Start the ECS only after the O&M personnel handle the risk. |
The host may break down, causing service interruptions. |
Scheduled migration completed |
instance_migrate_completed |
Major |
Scheduled ECS migration is complete. |
Wait until the ECS becomes available and check whether services recover. |
Services may be interrupted. |
Scheduled migration being executed |
instance_migrate_executing |
Major |
ECSs are being migrated as scheduled. |
Wait until the event is complete and check whether services are affected. |
Services may be interrupted. |
Scheduled migration canceled |
instance_migrate_canceled |
Major |
Scheduled ECS migration is canceled. |
None |
None |
Scheduled migration failed |
instance_migrate_failed |
Major |
ECSs failed to be migrated as scheduled. |
Contact O&M personnel. |
Services are interrupted. |
Scheduled migration to be executed |
instance_migrate_scheduled |
Major |
ECSs will be migrated as scheduled. |
Check the impact on services during the execution window. |
None |
Scheduled specification modification failed |
instance_resize_failed |
Major |
Specifications failed to be modified as scheduled. |
Contact O&M personnel. |
Services are interrupted. |
Scheduled specification modification completed |
instance_resize_completed |
Major |
Scheduled specification modification is complete. |
None |
None |
Scheduled specification modification being executed |
instance_resize_executing |
Major |
Specifications are being modified as scheduled. |
Wait until the event is complete and check whether the specifications were modified. |
Services are interrupted. |
Scheduled specification modification canceled |
instance_resize_canceled |
Major |
Scheduled specification modification is canceled. |
None |
None |
Scheduled specification modification to be executed |
instance_resize_scheduled |
Major |
Specifications will be modified as scheduled. |
Check the impact on services during the execution window. |
None |
Scheduled redeployment to be executed |
instance_redeploy_scheduled |
Major |
ECSs will be redeployed on new hosts as scheduled. |
Check the impact on services during the execution window. |
None |
Scheduled restart to be executed |
instance_reboot_scheduled |
Major |
ECSs will be restarted as scheduled. |
Check the impact on services during the execution window. |
None |
Scheduled stop to be executed |
instance_stop_scheduled |
Major |
ECSs will be stopped as scheduled. |
Check the impact on services during the execution window. |
None |
ECC uncorrectable error alarm generated on GPU SRAM |
SRAMUncorrectableEccError |
Major |
Uncorrectable ECC errors occurred in the GPU SRAM. |
If services are affected, submit a service ticket. |
The GPU hardware may be faulty. As a result, the SRAM is faulty, and services exit abnormally. |
FPGA link fault |
FPGALinkFault |
Critical |
The FPGA of the host running ECSs was faulty or recovering from a fault. |
Deploy service applications in HA mode. After the FPGA fault is rectified, check whether services recover. |
Services are interrupted. |
Scheduled redeployment to be authorized |
instance_redeploy_inquiring |
Major |
Scheduled ECS redeployment is to be authorized. |
Authorize scheduled redeployment. |
None |
Local disk replacement canceled |
localdisk_recovery_canceled |
Major |
Local disk replacement is canceled. |
None |
None |
Local disk replacement to be executed |
localdisk_recovery_scheduled |
Major |
Faulty local disks are waiting to be replaced. |
Check the impact on services during the execution window. |
None |
Xid event alarm generated on GPU |
commonXidError |
Major |
An Xid event alarm was generated on the GPU. |
If services are affected, submit a service ticket. |
An Xid error is caused by GPU hardware, driver, or application problems and may cause services to exit abnormally. |
nvidia-smi suspended |
nvidiaSmiHangEvent |
Major |
nvidia-smi timed out. |
If services are affected, submit a service ticket. |
The driver may report an error during service running. |
NPU: uncorrectable ECC error |
UncorrectableEccErrorCount |
Major |
There are uncorrectable ECC errors on the NPU. |
If services are affected, replace the NPU with another one. |
Services may be interrupted. |
Scheduled redeployment canceled |
instance_redeploy_canceled |
Major |
Scheduled redeployment is canceled. |
None |
None |
Scheduled redeployment being executed |
instance_redeploy_executing |
Major |
ECSs are being redeployed on a new host as scheduled. |
Wait until the event is complete and check whether services are affected. |
Services are interrupted. |
Scheduled redeployment completed |
instance_redeploy_completed |
Major |
Scheduled redeployment is complete. |
Wait until the redeployed ECSs are available and check whether services are affected. |
None |
Scheduled redeployment failed |
instance_redeploy_failed |
Major |
ECSs failed to be redeployed as scheduled. |
Contact O&M personnel. |
Services are interrupted. |
Local disk replacement to be authorized |
localdisk_recovery_inquiring |
Major |
Local disks are faulty. |
Authorize local disk replacement. |
Local disks are unavailable. |
Local disks being replaced |
localdisk_recovery_executing |
Major |
Local disks are faulty. |
Wait until the local disks are replaced and check whether the local disks are available. |
Local disks are unavailable. |
Local disks replaced |
localdisk_recovery_completed |
Major |
Local disks are faulty. |
Check whether local disks are available. |
None |
Local disk replacement failed |
localdisk_recovery_failed |
Major |
Local disks are faulty. |
Contact O&M personnel. |
Local disks are unavailable. |
GPU throttle alarm |
gpuClocksThrottleReasonsAlarm |
Informational |
|
Check whether the clock frequency decrease is caused by a hardware fault. If so, escalate the issue to the hardware team. |
The GPU clock frequency decreases, which reduces compute performance. |
Pending page retirement for GPU DRAM ECC |
gpuRetiredPagesPendingAlarm |
Major |
|
|
The GPU cannot work properly. |
Pending row remapping for GPU DRAM ECC |
gpuRemappedRowsAlarm |
Major |
Some rows in the GPU memory have errors and need to be remapped. The faulty rows must be mapped to standby resources. |
|
The GPU cannot work properly. |
Insufficient resources for GPU DRAM ECC row remapping |
gpuRowRemapperResourceAlarm |
Major |
|
Transfer the issue to the hardware team. |
The GPU cannot work properly. |
Correctable GPU DRAM ECC error |
gpuDRAMCorrectableEccError |
Major |
|
|
The GPU may not work properly. |
Uncorrectable GPU DRAM ECC error |
gpuDRAMUncorrectableEccError |
Major |
|
|
The GPU may not work properly. |
Inconsistent GPU kernel versions |
gpuKernelVersionInconsistencyAlarm |
Major |
The current kernel version is inconsistent with the kernel version that was in use when the GPU driver was installed. The GPU driver is compiled against the kernel present at installation time. If the versions differ, the kernel was changed after the driver was installed. In this case, the driver becomes unavailable and needs to be reinstalled. |
|
The GPU cannot work properly. |
GPU monitoring dependency not met |
gpuCheckEnvFailedAlarm |
Major |
The plug-in cannot identify the GPU driver library path. |
|
The GPU metrics cannot be collected. |
Initialization failure of the GPU monitoring driver library |
gpuDriverInitFailedAlarm |
Major |
The GPU driver is unavailable. |
Run nvidia-smi to check whether the driver is available. If it is not, reinstall the driver by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS. |
The GPU metrics cannot be collected. |
Initialization timeout of the GPU monitoring driver library |
gpuDriverInitTAlarm |
Major |
The GPU driver initialization timed out (exceeding 10s). |
|
The GPU metrics cannot be collected. |
GPU metric collection timeout |
gpuCollectMetricTimeoutAlarm |
Major |
The GPU metric collection timed out (exceeding 10s). |
|
GPU monitoring metric data is missing. As a result, subsequent metrics may fail to be collected. |
GPU handle lost |
gpuDeviceHandleLost |
Major |
The GPU metric information cannot be obtained, and the GPU may be lost. |
|
All metrics of the GPU are lost. |
Failed to listen to the XID of the GPU |
gpuDeviceXidLost |
Major |
Failed to listen to the Xid metric. |
|
Failed to obtain Xid-related metrics of the GPU. |
ReadOnly issues in OS |
ReadOnlyFileSystem |
Critical |
The file system %s is read-only. |
Check the disk health status. |
The files cannot be written. |
NPU: driver and firmware not matching |
NpuDriverFirmwareMismatch |
Major |
The NPU's driver and firmware do not match. |
Obtain the matched version from the Ascend official website and reinstall it. |
NPUs cannot be used. |
NPU: Docker container environment check |
NpuContainerEnvSystem |
Major |
Docker was unavailable. |
Check whether Docker is running properly. |
Docker cannot be used. |
NPU: Docker container environment check |
NpuContainerEnvSystem |
Major |
The container plug-in Ascend-Docker-Runtime was not installed. |
Install the container plug-in Ascend-Docker-Runtime. Otherwise, containers cannot use Ascend cards. |
NPUs cannot be attached to Docker containers. |
NPU: Docker container environment check |
NpuContainerEnvSystem |
Major |
IP forwarding was not enabled in the OS. |
Check the net.ipv4.ip_forward configuration in the /etc/sysctl.conf file. |
Docker containers have network communication problems. |
NPU: Docker container environment check |
NpuContainerEnvSystem |
Major |
The shared memory of the container was too small. |
The default shared memory is 64 MB, which can be modified as needed. Method 1: Modify the default-shm-size field in the /etc/docker/daemon.json configuration file. Method 2: Use the --shm-size parameter in the docker run command to set the shared memory size of a container. |
Distributed training will fail due to insufficient shared memory. |
NPU: RoCE NIC down |
RoCELinkStatusDown |
Major |
The RoCE link of NPU card %d was down. |
Check the NPU RoCE network port status. |
The NPU NIC is unavailable. |
NPU: RoCE NIC health status abnormal |
RoCEHealthStatusError |
Major |
The RoCE network health status of NPU %d was abnormal. |
Check the health status of the NPU RoCE NIC. |
The NPU NIC is unavailable. |
NPU: RoCE NIC configuration file /etc/hccn.conf not found |
HccnConfNotExisted |
Major |
The RoCE NIC configuration file /etc/hccn.conf was not found. |
Check whether the NIC configuration file /etc/hccn.conf can be found. |
The RoCE NIC is unavailable. |
GPU: basic components abnormal |
GpuEnvironmentSystem |
Major |
The nvidia-smi command was abnormal. |
Check whether the GPU driver is working properly. |
The GPU driver is unavailable. |
GPU: basic components abnormal |
GpuEnvironmentSystem |
Major |
The nvidia-fabricmanager version was inconsistent with the GPU driver version. |
Check the GPU driver version and nvidia-fabricmanager version. |
The nvidia-fabricmanager cannot work properly, affecting GPU usage. |
GPU: basic components abnormal |
GpuEnvironmentSystem |
Major |
The container add-on nvidia-container-toolkit was not installed. |
Install the container add-on nvidia-container-toolkit. |
GPUs cannot be attached to Docker containers. |
Local disk attachment inspection |
MountDiskSystem |
Major |
The /etc/fstab file contains invalid UUIDs. |
Ensure that the UUIDs in the /etc/fstab configuration file are correct. Otherwise, the server may fail to restart. |
The disk attachment process fails, preventing the server from restarting. |
GPU: incorrectly configured dynamic route for Ant series server |
GpuRouteConfigError |
Major |
The dynamic route of the NIC %s of an Ant series server was not configured or was incorrectly configured. CMD [ip route]: %s | CMD [ip route show table all]: %s. |
Configure the RoCE NIC route correctly. |
The NPU network communication will be interrupted. |
NPU: RoCE port not split |
RoCEUdpConfigError |
Major |
The RoCE UDP port was not split. |
Check the RoCE UDP port configuration on the NPU. |
The communication performance of NPUs is affected. |
Warning of automatic system kernel upgrade |
KernelUpgradeWarning |
Major |
Warning of automatic system kernel upgrade. Old version: %s; new version: %s. |
System kernel upgrade may cause AI software exceptions. Check the system update logs and prevent the server from restarting. |
The AI software may be unavailable. |
NPU environment command detection |
NpuToolsWarning |
Major |
The hccn_tool was unavailable. |
Check whether the NPU driver is working properly. |
The IP address and gateway of the RoCE NIC cannot be configured. |
NPU environment command detection |
NpuToolsWarning |
Major |
The npu-smi was unavailable. |
Check whether the NPU driver is working properly. |
NPUs cannot be used. |
NPU environment command detection |
NpuToolsWarning |
Major |
The ascend-dmi was unavailable. |
Check whether ToolBox is properly installed. |
The ascend-dmi cannot be used for performance analysis. |
Warning of an NPU driver exception |
NpuDriverAbnormalWarning |
Major |
The NPU driver was abnormal. |
Reinstall the NPU driver. |
NPUs cannot be used. |
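Two of the container environment checks in the table above come down to reading configuration files: net.ipv4.ip_forward in /etc/sysctl.conf and default-shm-size in /etc/docker/daemon.json. The Python sketch below shows one way such checks could be written. The function names and sample file contents are illustrative assumptions, the parsing is deliberately simplified, and nothing here reflects how the monitoring plug-in itself performs the checks. Note also that /etc/sysctl.conf holds the persistent setting only, so the value in the running kernel may differ.

```python
import json


def ip_forward_enabled(sysctl_text: str) -> bool:
    """Return True if the last net.ipv4.ip_forward assignment in the text is 1.

    Simplified parse: handles 'key = value' lines and skips comments.
    """
    value = None
    for raw in sysctl_text.splitlines():
        line = raw.strip()
        # sysctl.conf treats lines starting with '#' or ';' as comments.
        if not line or line[0] in "#;" or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == "net.ipv4.ip_forward":
            value = val.strip()  # a later assignment overrides an earlier one
    return value == "1"


def default_shm_size(daemon_json_text: str):
    """Return the default-shm-size value from daemon.json text, or None if unset."""
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    return config.get("default-shm-size")


# Sample contents (assumed for illustration, not read from a real host).
sample_sysctl = """
# kernel parameters
net.ipv4.ip_forward = 0
net.ipv4.ip_forward=1
"""
sample_daemon = '{"default-shm-size": "8G"}'

print(ip_forward_enabled(sample_sysctl))
print(default_shm_size(sample_daemon))
```

On a real host, the two functions would be given the text read from /etc/sysctl.conf and /etc/docker/daemon.json. A result of `None` from `default_shm_size` means the field is not set, in which case the 64 MB default mentioned in the table applies unless a container is started with `--shm-size`.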