Events Supported by Event Monitoring

The name of a resource that supports event reporting can contain a maximum of 128 characters, including letters, digits, underscores (_), hyphens (-), and periods (.). If it contains other characters, the event may fail to be reported to Cloud Eye.
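
The naming rule above can be checked on the client side before events are reported. The following is a minimal sketch, assuming that "letters" means ASCII letters and that an empty name is invalid. The function name `is_valid_resource_name` is hypothetical and is not part of any Cloud Eye SDK.

```python
import re

# Assumption: "letters" means ASCII A-Z and a-z, and empty names are rejected.
# Allowed: letters, digits, underscores (_), hyphens (-), and periods (.).
_RESOURCE_NAME = re.compile(r"[A-Za-z0-9_.-]{1,128}")


def is_valid_resource_name(name: str) -> bool:
    """Return True if the name has 1 to 128 allowed characters."""
    return _RESOURCE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("ecs-web_01.prod", "ecs web 01", "a" * 129):
        print(len(candidate), is_valid_resource_name(candidate))
```

A name that fails this check contains a character outside the allowed set or exceeds 128 characters, so its events may fail to be reported.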

**Table 1** Elastic Cloud Server (ECS)
Event Source	Event Name	Event ID	Event Severity	Description	Solution	Impact
ECS	Restart triggered due to hardware fault	startAutoRecovery	Major	ECSs on a faulty host were automatically migrated to another properly running host. The ECSs were restarted during the migration.	Wait for the event to end and check whether services are affected.	Services may be interrupted.
	Restart completed due to hardware failure	endAutoRecovery	Major	The ECS was recovered after the automatic migration.	This event indicates that the ECS has recovered and is working properly.	None
	Auto recovery timeout (being processed on the backend)	faultAutoRecovery	Major	Migrating the ECS to a normal host timed out.	Migrate services to other ECSs.	Services are interrupted.
	GPU link fault	GPULinkFault	Critical	The GPU of the host running the ECS was faulty or recovering from a fault.	Deploy service applications in HA mode. After the GPU fault is rectified, check whether services are restored.	Services are interrupted.
	ECS deleted	deleteServer	Major	The ECS was deleted on the management console or by calling APIs.	Check whether the deletion was performed intentionally by a user.	Services are interrupted.
	ECS restarted	rebootServer	Minor	The ECS was restarted on the management console or by calling APIs.	Check whether the restart was performed intentionally by a user. Deploy service applications in HA mode. After the ECS starts up, check whether services recover.	Services are interrupted.
	ECS stopped	stopServer	Minor	The ECS was stopped on the management console or by calling APIs. NOTE: The ECS is stopped only after CTS is enabled.	Check whether the operation was intentionally performed by a user. Deploy service applications in HA mode. After the ECS starts up, check whether services recover.	Services are interrupted.
	NIC deleted	deleteNic	Major	The ECS NIC was deleted on the management console or by calling APIs.	Check whether the deletion was performed intentionally by a user. Deploy service applications in HA mode. After the NIC is deleted, check whether services recover.	Services may be interrupted.
	ECS resized	resizeServer	Minor	The ECS specifications were modified on the management console or by calling APIs.	Check whether the operation was performed by a user. Deploy service applications in HA mode. After the ECS is resized, check whether services have recovered.	Services are interrupted.
	GuestOS restarted	RestartGuestOS	Minor	The guest OS was restarted.	Contact O&M personnel.	Services may be interrupted.
	ECS failure caused by system faults	VMFaultsByHostProcessExceptions	Critical	The host where the ECS resides is faulty. The system will automatically try to start the ECS.	After the ECS is started, check whether this ECS and services on it can run properly.	The ECS is faulty.
	Startup failure	faultPowerOn	Major	The ECS failed to start.	Start the ECS again. If the problem persists, contact O&M personnel.	The ECS cannot start.
	Host breakdown risk	hostMayCrash	Major	The host where the ECS resides may break down, and the risk cannot be prevented through live migration.	Migrate services running on the ECS first, and then delete or stop the ECS. Start the ECS only after O&M personnel eliminate the risk.	The host may break down, causing service interruption.
	Scheduled migration completed	instance_migrate_completed	Major	Scheduled ECS migration is completed.	Wait until the ECSs become available and check whether services are affected.	Services may be interrupted.
	Scheduled migration being executed	instance_migrate_executing	Major	ECSs are being migrated as scheduled.	Wait until the event is complete and check whether services are affected.	Services may be interrupted.
	Scheduled migration canceled	instance_migrate_canceled	Major	Scheduled ECS migration is canceled.	None	None
	Scheduled migration failed	instance_migrate_failed	Major	ECSs failed to be migrated as scheduled.	Contact O&M personnel.	Services are interrupted.
	Scheduled migration to be executed	instance_migrate_scheduled	Major	ECSs will be migrated as scheduled.	Check the impact on services during the execution window.	None
	Scheduled specification modification failed	instance_resize_failed	Major	Specifications failed to be modified as scheduled.	Contact O&M personnel.	Services are interrupted.
	Scheduled specification modification completed	instance_resize_completed	Major	Scheduled specification modification is completed.	None	None
	Scheduled specification modification being executed	instance_resize_executing	Major	Specifications are being modified as scheduled.	Wait until the event is completed and check whether services are affected.	Services are interrupted.
	Scheduled specification modification canceled	instance_resize_canceled	Major	Scheduled specification modification is canceled.	None	None
	Scheduled specification modification to be executed	instance_resize_scheduled	Major	Specifications will be modified as scheduled.	Check the impact on services during the execution window.	None
	Scheduled redeployment to be executed	instance_redeploy_scheduled	Major	ECSs will be redeployed on new hosts as scheduled.	Check the impact on services during the execution window.	None
	Scheduled restart to be executed	instance_reboot_scheduled	Major	ECSs will be restarted as scheduled.	Check the impact on services during the execution window.	None
	Scheduled stop to be executed	instance_stop_scheduled	Major	ECSs will be stopped as scheduled as they are affected by underlying hardware or system O&M.	Check the impact on services during the execution window.	None
	Live migration started	liveMigrationStarted	Major	The host where the ECS is located may be faulty. The ECS is being live migrated in advance to prevent service interruptions caused by host breakdown.	Wait for the event to end and check whether services are affected.	Services may be interrupted for less than 1s.
	Live migration completed	liveMigrationCompleted	Major	The live migration is complete, and the ECS is running properly.	Check whether services are running properly.	None
	Live migration failure	liveMigrationFailed	Major	An error occurred during the live migration of an ECS.	Check whether services are running properly.	There is a low probability that services are interrupted.
	ECC uncorrectable error alarm generated on GPU SRAM	SRAMUncorrectableEccError	Major	Uncorrectable ECC errors were generated on GPU SRAM.	If services are affected, submit a service ticket.	The GPU hardware may be faulty. As a result, the SRAM is faulty, and services exit abnormally.
	FPGA link fault	FPGALinkFault	Critical	The FPGA of the host running the ECS was faulty or recovering from a fault.	Deploy service applications in HA mode. After the FPGA fault is rectified, check whether services are restored.	Services are interrupted.
	Scheduled redeployment to be authorized	instance_redeploy_inquiring	Major	ECSs will be redeployed on new hosts as scheduled because they are affected by underlying hardware or system O&M.	Authorize scheduled redeployment.	None
	Local disk replacement canceled	localdisk_recovery_canceled	Major	Local disks are faulty.	None	None
	Local disk replacement to be executed	localdisk_recovery_scheduled	Major	Local disks are faulty.	Check the impact on services during the execution window.	None
	Xid event alarm generated on GPU	commonXidError	Major	An Xid event alarm was generated on the GPU.	If services are affected, submit a service ticket.	GPU hardware, driver, or application problems lead to Xid events, which may interrupt services.
	nvidia-smi suspended	nvidiaSmiHangEvent	Major	nvidia-smi timed out.	If services are affected, submit a service ticket.	The driver may report an error during service running.
	NPU: uncorrectable ECC error	UncorrectableEccErrorCount	Major	There are uncorrectable ECC errors on the NPU.	If services are affected, replace the NPU with another one.	Services may be interrupted.
	Scheduled redeployment canceled	instance_redeploy_canceled	Major	ECSs will be redeployed on new hosts as scheduled because they are affected by underlying hardware or system O&M.	None	None
	Scheduled redeployment being executed	instance_redeploy_executing	Major	ECSs will be redeployed on new hosts as scheduled because they are affected by underlying hardware or system O&M.	Wait until the event is complete and check whether services are affected.	Services are interrupted.
	Scheduled redeployment completed	instance_redeploy_completed	Major	ECSs will be redeployed on new hosts as scheduled because they are affected by underlying hardware or system O&M.	Wait until the redeployed ECSs are available and check whether services are affected.	None
	Scheduled redeployment failed	instance_redeploy_failed	Major	ECSs will be redeployed on new hosts as scheduled because they are affected by underlying hardware or system O&M.	Contact O&M personnel.	Services are interrupted.
	Local disk replacement to be authorized	localdisk_recovery_inquiring	Major	Local disks are faulty.	Authorize local disk replacement.	Local disks are unavailable.
	Local disks being replaced	localdisk_recovery_executing	Major	Local disks are faulty.	Wait until the local disks are replaced and check whether the local disks are available.	Local disks are unavailable.
	Local disks replaced	localdisk_recovery_completed	Major	Local disks are faulty.	Wait until the services are running properly and check whether local disks are available.	None
	Local disk replacement failed	localdisk_recovery_failed	Major	Local disks are faulty.	Contact O&M personnel.	Local disks are unavailable.
	GPU throttle alarm	gpuClocksThrottleReasonsAlarm	Informational	The GPU power may exceed the maximum operating power threshold (continuous full load). The clock frequency automatically decreases to prevent the GPU from being damaged. The GPU temperature may exceed the maximum operating temperature threshold (continuous full load). The clock frequency automatically decreases to reduce heat. The GPU may remain idle, with the clock frequency automatically decreasing to reduce power consumption. Hardware faults may cause a decrease in clock frequency.	Check whether the clock frequency decrease is caused by hardware faults. If yes, transfer it to the hardware team.	The GPU slows down, resulting in less powerful compute.
	Pending page retirement for GPU DRAM ECC	gpuRetiredPagesPendingAlarm	Major	An ECC error occurred on the hardware, and DRAM pages need to be retired. An uncorrectable ECC error occurred on a GPU memory page and the page needs to be retired. However, the retirement is pending and the page has not been retired yet.	View the event details and check whether the value of retired_pages.pending is yes. Restart the GPU for automatic retirement.	The GPU cannot work properly.
	Pending row remapping for GPU DRAM ECC	gpuRemappedRowsAlarm	Major	Some rows in the GPU memory have errors and need to be remapped. The faulty rows must be mapped to standby resources.	View the event metric "RemappedRow" to check if there are any rows that have been remapped. Restart the GPU for automatic retirement.	The GPU cannot work properly.
	Insufficient resources for GPU DRAM ECC row remapping	gpuRowRemapperResourceAlarm	Major	This event occurs on GPUs (Ampere and later architectures). The standby GPU memory row resources are exhausted, so row remapping cannot be continued.	Transfer the issue to the hardware team.	The GPU cannot work properly.
	Correctable GPU DRAM ECC error	gpuDRAMCorrectableEccError	Major	This event occurs on GPUs (Ampere and later architectures). A correctable ECC error occurs in the DRAM of the GPU. However, the ECC mechanism can automatically rectify the error and programs are not affected.	View the event metric "ecc.errors.corrected.volatile" to check whether there are any correctable ECC error values. Restart the GPU for automatic retirement.	The GPU may not work properly.
	Uncorrectable GPU DRAM ECC error	gpuDRAMUncorrectableEccError	Major	This event occurs on GPUs (Ampere and later architectures). An uncorrectable ECC error occurs in the DRAM of the GPU. This error cannot be automatically corrected using the ECC mechanism. The verification process affects system stability and may cause program crashes.	View the event metric "ecc.errors.uncorrected.volatile" to check whether there are any uncorrectable ECC error values. Restart the GPU for automatic retirement.	The GPU may not work properly.
	Inconsistent GPU kernel versions	gpuKernelVersionInconsistencyAlarm	Major	The GPU kernel versions are inconsistent. During installation, the GPU driver is compiled against the kernel in use at that time. If the kernel versions are found to be inconsistent, the kernel was customized after the driver was installed. In this case, the driver becomes unavailable and needs to be reinstalled.	Run the following commands in sequence to rectify the issue: rmmod nvidia_drm, rmmod nvidia_modeset, and rmmod nvidia. Then, run nvidia-smi. If the command output is normal, the issue has been rectified. If the preceding solution does not work, rectify the fault by referring to Why Is the GPU Driver Unavailable?	The GPU cannot work properly.
	GPU monitoring dependency not met	gpuCheckEnvFailedAlarm	Major	The plug-in cannot identify the GPU driver library path.	Check whether the driver is installed. Check whether the driver installation directory has been customized. The driver needs to be installed in the default installation directory /usr/bin/.	Collection failure of GPU monitoring metrics
	Initialization failure of the GPU monitoring driver library	gpuDriverInitFailedAlarm	Major	The GPU driver is unavailable.	Run nvidia-smi to check whether the driver is unavailable. If the driver is unavailable, reinstall the driver by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS.	Collection failure of GPU monitoring metrics
	Initialization timeout of the GPU monitoring driver library	gpuDriverInitTAlarm	Major	The GPU driver initialization timed out (exceeding 10s).	If the driver is not installed, install it by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS. If the driver is installed, run nvidia-smi to check whether the driver is available. If the driver is unavailable, reinstall the driver by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS. If the driver is properly installed, check whether the high-performance mode is enabled. If not, run nvidia-smi -pm 1 to enable it. P0 indicates the high-performance mode.	Collection failure of GPU monitoring metrics
	GPU metric collection timeout	gpuCollectMetricTimeoutAlarm	Major	The GPU metric collection timed out (exceeding 10s).	If the library API timed out, run nvidia-smi to check whether the driver is available. If the driver is unavailable, reinstall the driver by referring to Manually Installing a Tesla Driver on a GPU-accelerated ECS. If the command execution timed out, check the system logs and determine whether there is an issue with the system.	GPU monitoring metric data is missing. As a result, subsequent metrics may fail to be collected.
	GPU handle lost	gpuDeviceHandleLost	Major	The GPU metric information cannot be obtained, and the GPU may be lost.	Run nvidia-smi to check whether there are any errors reported. Run nvidia-smi -L to check whether the number of GPUs is the same as the server specifications. Submit a service ticket to contact on-call support.	All metrics of the GPU are lost.
	Failed to listen to the XID of the GPU.	gpuDeviceXidLost	Major	Failed to listen to the XID metric.	Check whether the GPU is lost or damaged. Submit a service ticket to contact on-call support.	Failed to obtain XID-related metrics of the GPU.
	ReadOnly issues in OS	ReadOnlyFileSystem	Critical	The file system %s is read-only.	Check the disk health status.	The files cannot be written.
	NPU: driver and firmware not matching	NpuDriverFirmwareMismatch	Major	The NPU's driver and firmware do not match.	Obtain the matched version from the Ascend official website and reinstall it.	NPUs cannot be used.
	NPU: Docker container environment check	NpuContainerEnvSystem	Major	Docker was unavailable.	Check whether Docker is running properly.	Docker cannot be used.
			Major	The container plug-in Ascend-Docker-Runtime was not installed.	Install the container plug-in Ascend-Docker-Runtime. Otherwise, containers cannot use Ascend cards.	NPUs cannot be attached to Docker containers.
			Major	IP forwarding was not enabled in the OS.	Check the net.ipv4.ip_forward configuration in the /etc/sysctl.conf file.	Docker containers experience network communication problems.
			Major	The shared memory of the container was too small.	The default shared memory is 64 MB, which can be modified as needed. Method 1: Modify the default-shm-size field in the /etc/docker/daemon.json configuration file. Method 2: Use the --shm-size parameter in the docker run command to set the shared memory size of a container.	Distributed training will fail due to insufficient shared memory.
	NPU: RoCE NIC down	RoCELinkStatusDown	Major	The RoCE link of NPU card %d was down.	Check the NPU RoCE network port status.	The NPU NIC becomes unavailable.
	NPU: RoCE NIC health status abnormal	RoCEHealthStatusError	Major	The RoCE network health status of NPU %d was abnormal.	Check the health status of the NPU RoCE NIC.	The NPU NIC becomes unavailable.
	NPU: RoCE NIC configuration file /etc/hccn.conf not found	HccnConfNotExisted	Major	The RoCE NIC configuration file /etc/hccn.conf was not found.	Check whether the /etc/hccn.conf NIC configuration file can be found.	The RoCE NIC is unavailable.
	GPU: basic components abnormal	GpuEnvironmentSystem	Major	The nvidia-smi command was abnormal.	Check whether the GPU driver is normal.	The GPU driver is unavailable.
			Major	The nvidia-fabricmanager version was inconsistent with the GPU driver version.	Check the GPU driver version and nvidia-fabricmanager version.	The nvidia-fabricmanager cannot work properly, affecting GPU usage.
			Major	The container plug-in nvidia-container-toolkit was not installed.	Install the container plug-in nvidia-container-toolkit.	GPUs cannot be attached to Docker containers.
	Local disk attachment inspection	MountDiskSystem	Major	The /etc/fstab file contains invalid UUIDs.	Ensure that the UUIDs in the /etc/fstab configuration file are correct. Otherwise, the server may fail to restart.	The disk attachment process fails, preventing the server from restarting.
	GPU: incorrectly configured dynamic route for Ant series server	GpuRouteConfigError	Major	The dynamic route of the NIC %s of an Ant series server was not configured or was incorrectly configured. CMD [ip route]: %s | CMD [ip route show table all]: %s.	Configure the RoCE NIC route correctly.	The NPU network communication will be interrupted.
	NPU: RoCE port not split	RoCEUdpConfigError	Major	The RoCE UDP port was not split.	Check the RoCE UDP port configuration on the NPU.	The communication performance of NPUs is affected.
	Warning of automatic system kernel upgrade	KernelUpgradeWarning	Major	Warning of automatic system kernel upgrade. Old version: %s; new version: %s.	System kernel upgrade may cause AI software exceptions. Check the system update logs and prevent the server from restarting.	The AI software may be unavailable.
	NPU environment command detection	NpuToolsWarning	Major	The hccn_tool was unavailable.	Check whether the NPU driver is normal.	The IP address and gateway of the RoCE NIC cannot be configured.
			Major	The npu-smi was unavailable.	Check whether the NPU driver is normal.	NPUs cannot be used.
			Major	The ascend-dmi was unavailable.	Check whether ToolBox is properly installed.	ascend-dmi cannot be used for performance analysis.
	Warning of an NPU driver exception	NpuDriverAbnormalWarning	Major	The NPU driver was abnormal.	Reinstall the NPU driver.	NPUs cannot be used.

Automatic recovery: If the host running an ECS becomes faulty, the system automatically migrates the ECS to a healthy physical host. The ECS restarts during the migration.
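
The three automatic recovery events in Table 1 (startAutoRecovery, endAutoRecovery, and faultAutoRecovery) can be read as a simple lifecycle. The sketch below derives the current recovery state of one ECS, assuming its events are available as a chronologically ordered list of event ID strings. The function name, the state labels, and the input shape are illustrative assumptions, not a Cloud Eye API.

```python
from typing import Iterable

# Event IDs are taken from Table 1. The state labels are hypothetical.
_STATE_BY_EVENT = {
    "startAutoRecovery": "recovering",  # migration started, ECS restarting
    "endAutoRecovery": "recovered",     # ECS recovered after the migration
    "faultAutoRecovery": "timed_out",   # migration to a normal host timed out
}


def auto_recovery_state(event_ids: Iterable[str]) -> str:
    """Return the state implied by the last automatic recovery event seen.

    Assumes event_ids are in chronological order for a single ECS.
    Event IDs that are unrelated to automatic recovery are ignored.
    """
    state = "idle"
    for event_id in event_ids:
        state = _STATE_BY_EVENT.get(event_id, state)
    return state
```

With this sketch, a "recovering" result corresponds to waiting for the event to end, and a "timed_out" result corresponds to migrating services to other ECSs, as the Solution column recommends.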

**Table 2** Bare metal server (BMS)
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
BMS	SYS.BMS	ECC uncorrectable error alarm generated on GPU SRAM	SRAMUncorrectableEccError	Major	Uncorrectable ECC errors were generated on GPU SRAM.	If services are affected, submit a service ticket.	The GPU hardware may be faulty. As a result, the SRAM is faulty, and services exit abnormally.
		BMS restarted	osReboot	Major	The BMS instance was restarted on the management console or by calling APIs.	Deploy service applications in HA mode. After the BMS is restarted, check whether services recover.	Services are interrupted.
		BMS unexpected restart	serverReboot	Major	The BMS instance restarted unexpectedly due to OS faults or hardware faults.	Deploy service applications in HA mode. After the BMS is restarted, check whether services recover.	Services are interrupted.
		BMS stopped	osShutdown	Major	The BMS instance was stopped on the management console or by calling APIs.	Deploy service applications in HA mode. After the BMS is restarted, check whether services recover.	Services are interrupted.
		BMS unexpected shutdown	serverShutdown	Major	The BMS stopped unexpectedly due to unexpected power-off or hardware faults.	Deploy service applications in HA mode. After the BMS is restarted, check whether services recover.	Services are interrupted.
		Network disconnection	linkDown	Major	The BMS network is disconnected. Possible causes are as follows: The BMS was stopped or restarted unexpectedly. The switch was faulty. The gateway was faulty.	Deploy service applications in HA mode. After the BMS is restarted, check whether services recover.	Services are interrupted.
		PCIe error	pcieError	Major	The PCIe device or main board on the BMS is faulty. Possible causes are main board faults or PCIe device faults.	Deploy service applications in HA mode. After the BMS is started, check whether services recover.	The network or disk read/write services are affected.
		Disk fault	diskError	Major	The disk of the BMS is faulty. Possible causes are disk backplane faults or disk faults.	Deploy service applications in HA mode. After the fault is rectified, check whether services recover.	Data read/write services are affected, or the BMS cannot be started.
		EVS error	storageError	Major	The BMS failed to connect to EVS disks. Possible causes are as follows: The SDI card was faulty. Remote storage devices were faulty.	Deploy service applications in HA mode. After the fault is rectified, check whether services recover.	Data read/write services are affected, or the BMS cannot be started.
		Inforom alarm generated on GPU	gpuInfoROMAlarm	Major	The infoROM of the GPU is abnormal. ROM is an important storage area of the GPU firmware and stores key data loaded during startup.	Non-critical services can continue to use the GPU. For critical services, submit a service ticket to resolve this issue. Restart the VM and check that the issue is not caused by a temporary cache or communication error. If the fault persists after the restart, the hardware may be faulty. Submit a service ticket to check whether the GPU needs to be replaced.	Services will not be affected. If ECC errors are reported on a GPU, faulty pages may not be automatically retired and services are affected.
		Double-bit ECC alarm generated on GPU	doubleBitEccError	Major	A double-bit error occurs in the ECC memory of the GPU. The ECC cannot correct the error, which may cause program breakdown.	If services are interrupted, restart the services. If services cannot be restarted, restart the VM where services are running. If services still cannot be restored, submit a service ticket.	Services may be interrupted. After faulty pages are retired, the GPU can continue to be used.
		Too many retired pages	gpuTooManyRetiredPagesAlarm	Major	An ECC page retirement error occurred on the GPU. When an uncorrectable ECC error occurs on a GPU memory page, the GPU marks the page as retired.	If services are affected, submit a service ticket.	If there are too many ECC errors, services may be affected. If there are too many retired pages and the GPU memory capacity decreases too much, the system performance may deteriorate. If there are too many retired pages and the GPU memory capacity decreases too much, the system may run unstably.
		ECC alarm generated on GPU A100	gpuA100EccAlarm	Major	An ECC error occurred on the GPU.	If services are interrupted, restart the services. If services cannot be restarted, restart the VM where services are running. If services still cannot be restored, submit a service ticket.	Services may be interrupted. After faulty pages are retired, the GPU can continue to be used.
		ECC alarm generated on GPU Ant1	gpuAnt1EccAlarm	Major	An ECC error occurred on the GPU.	If services are interrupted, restart the services. If services cannot be restarted, restart the VM where services are running. If services still cannot be restored, submit a service ticket.	Services may be interrupted. After faulty pages are retired, the GPU can continue to be used.
		GPU ECC memory page retirement failure	eccPageRetirementRecordingFailure	Major	Automatic page retirement failed due to ECC errors.	If services are interrupted, restart the services. If services cannot be restarted, restart the VM where services are running. If services still cannot be restored, submit a service ticket.	Services may be interrupted, and memory page retirement fails. As a result, services can no longer use the GPU.
		GPU ECC page retirement alarm generated	eccPageRetirementRecordingEvent	Minor	Memory pages are automatically retired due to ECC errors.	1. If services are interrupted, restart the services. 2. If services cannot be restarted, restart the VM where services are running. 3. If services still cannot be restored, submit a service ticket.	Generally, this alarm is generated together with the ECC error alarm. If this alarm is generated independently, services are not affected.
		Too many single-bit ECC errors on GPU	highSingleBitEccErrorRate	Major	There are too many single-bit errors occurring in the ECC memory of the GPU.	If services are interrupted, restart the services. If services cannot be restarted, restart the VM where services are running. If services still cannot be restored, submit a service ticket.	Single-bit errors can be automatically rectified and do not affect GPU-related applications.
		GPU card not found	gpuDriverLinkFailureAlarm	Major	The GPU link is normal, but the GPU cannot be found by the NVIDIA driver.	1. Restart the VM to restore services. 2. If services still cannot be restored, submit a service ticket.	The GPU cannot be found.
		GPU link faulty	gpuPcieLinkFailureAlarm	Major	GPU hardware information cannot be queried through lspci due to a GPU link fault.	If services are affected, submit a service ticket.	The driver cannot use the GPU.
		VM GPU lost	vmLostGpuAlarm	Major	The number of GPUs on the VM is less than the number specified in the specifications.	If services are affected, submit a service ticket.	GPUs are lost.
		GPU memory page faulty	gpuMemoryPageFault	Major	The GPU memory page is faulty, which may be caused by applications, drivers, or hardware.	If services are affected, submit a service ticket.	The GPU hardware may be faulty. As a result, the GPU memory is faulty, and services exit abnormally.
		GPU image engine faulty	graphicsEngineException	Major	The GPU image engine is faulty, which may be caused by applications, drivers, or hardware.	If services are affected, submit a service ticket.	The GPU hardware may be faulty. As a result, the image engine is faulty, and services exit abnormally.
		GPU temperature too high	highTemperatureEvent	Major	The GPU temperature is too high.	If services are affected, submit a service ticket.	If the GPU temperature exceeds the threshold, the GPU performance may deteriorate.
		GPU NVLink faulty	nvlinkError	Major	A hardware fault occurs on the NVLink.	If services are affected, submit a service ticket.	The NVLink link is faulty and unavailable.
		System maintenance inquiring	system_maintenance_inquiring	Major	The scheduled BMS maintenance task is waiting for authorization.	Authorize the maintenance.	None
		System maintenance waiting	system_maintenance_scheduled	Major	The scheduled BMS maintenance task is waiting to be executed.	Check the impact on services during the execution window.	None
		System maintenance canceled	system_maintenance_canceled	Major	The scheduled BMS maintenance is canceled.	None	None
		System maintenance executing	system_maintenance_executing	Major	BMSs are being maintained as scheduled.	After the maintenance is complete, check whether services are affected.	Services are interrupted.
		System maintenance completed	system_maintenance_completed	Major	The scheduled BMS maintenance is completed.	Wait until the BMSs become available and check whether services recover.	None
		System maintenance failure	system_maintenance_failed	Major	The scheduled BMS maintenance task failed.	Contact O&M personnel.	Services are interrupted.
		GPU Xid error	commonXidError	Major	An Xid event alarm was generated on the GPU.	If services are affected, submit a service ticket.	GPU hardware, driver, or application problems lead to Xid events, which may interrupt services.
		NPU: device not found by npu-smi info	NPUSMICardNotFound	Major	The Ascend driver is faulty or the NPU is disconnected.	Transfer this issue to the Ascend or hardware team for handling.	The NPU cannot be used normally.
		NPU: PCIe link error	PCIeErrorFound	Major	The lspci command returns rev ff, indicating that the NPU is abnormal.	Restart the BMS. If the issue persists, transfer it to the hardware team for processing.	The NPU cannot be used normally.
		NPU: device not found by lspci	LspciCardNotFound	Major	The NPU is disconnected.	Transfer this issue to the hardware team for handling.	The NPU cannot be used normally.
		NPU: overtemperature	TemperatureOverUpperLimit	Major	The temperature of DDR or software is too high.	Stop services, restart the BMS, check the heat dissipation system, and reset the devices.	The BMS may be powered off and devices may not be found.
		NPU: uncorrectable ECC error	UncorrectableEccErrorCount	Major	There are uncorrectable ECC errors on the NPU.	If services are affected, replace the NPU with another one.	Services may be interrupted.
		NPU: request for BMS restart	RebootVirtualMachine	Informational	A fault occurs and the BMS needs to be restarted.	Collect the fault information, and restart the BMS.	Services may be interrupted.
		NPU: request for SoC reset	ResetSOC	Informational	A fault occurs and the SoC needs to be reset.	Collect the fault information, and reset the SoC.	Services may be interrupted.
		NPU: request for restart AI process	RestartAIProcess	Informational	A fault occurs and the AI process needs to be restarted.	Collect the fault information, and restart the AI process.	The current AI task will be interrupted.
		NPU: error codes	NPUErrorCodeWarning	Major	A large number of NPU error codes indicating major or higher-level errors are returned. You can further locate the faults based on the error codes.	Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition.	Services may be interrupted.
		nvidia-smi suspended	nvidiaSmiHangEvent	Major	nvidia-smi timed out.	If services are affected, submit a service ticket.	The driver may report an error during service running.
		nv_peer_mem loading error	NvPeerMemException	Minor	The NVLink or nv_peer_mem cannot be loaded.	Restore or reinstall the NVLink.	nv_peer_mem cannot be used.
		Fabric Manager error	NvFabricManagerException	Minor	The BMS meets the NVLink conditions and NVLink is installed, but Fabric Manager is abnormal.	Restore or reinstall the NVLink.	NVLink cannot be used normally.
		IB card error	InfinibandStatusException	Major	The IB card or its physical status is abnormal.	Transfer this issue to the hardware team for handling.	The IB card cannot work normally.
		GPU throttle alarm	gpuClocksThrottleReasonsAlarm	Informational	The GPU power may exceed the maximum operating power threshold (continuous full load). The clock frequency automatically decreases to prevent the GPU from being damaged. The GPU temperature may exceed the maximum operating temperature threshold (continuous full load). The clock frequency automatically decreases to reduce heat. The GPU may remain idle, with the clock frequency automatically decreasing to reduce power consumption. Hardware faults may cause a decrease in clock frequency.	Check whether the clock frequency decrease is caused by hardware faults. If yes, transfer it to the hardware team.	The GPU slows down, resulting in less powerful compute.
		Pending page retirement for GPU DRAM ECC	gpuRetiredPagesPendingAlarm	Major	An ECC error occurred on the hardware. DRAM pages need to be retired. An uncorrectable ECC error occurred on the GPU memory page and the page needs to be retired. However, the page is suspended and has not been retired yet.	View the event details and check whether the value of retired_pages.pending is yes. Restart the GPU for automatic retirement.	The GPU cannot work properly.
		Pending row remapping for GPU DRAM ECC	gpuRemappedRowsAlarm	Major	Some rows in the GPU memory have errors and need to be remapped. The faulty rows must be mapped to standby resources.	View the event metric "RemappedRow" to check if there are any rows that have been remapped. Restart the GPU for automatic retirement.	The GPU cannot work properly.
		Insufficient resources for GPU DRAM ECC row remapping	gpuRowRemapperResourceAlarm	Major	This event occurs on GPUs (Ampere and later architectures). The standby GPU memory row resources are exhausted, so row remapping cannot be continued.	Transfer the issue to the hardware team.	The GPU cannot work properly.
		Correctable GPU DRAM ECC error	gpuDRAMCorrectableEccError	Major	This event occurs on GPUs (Ampere and later architectures). A correctable ECC error occurs in the DRAM of the GPU. However, the ECC mechanism can automatically rectify the error and programs are not affected.	View the event metric "ecc.errors.corrected.volatile" to check whether there are any correctable ECC error values. Restart the GPU for automatic retirement.	The GPU may not work properly.
		Uncorrectable GPU DRAM ECC error	gpuDRAMUncorrectableEccError	Major	This event occurs on GPUs (Ampere and later architectures). An uncorrectable ECC error occurs in the DRAM of the GPU. This error cannot be automatically corrected using the ECC mechanism. The verification process affects system stability and may cause program crashes.	View the event metric "ecc.errors.uncorrected.volatile" to check whether there are any uncorrectable ECC error values. Restart the GPU for automatic retirement.	The GPU may not work properly.
		Inconsistent GPU kernel versions	gpuKernelVersionInconsistencyAlarm	Major	Inconsistent GPU kernel versions. During driver installation, the GPU driver is compiled based on the kernel at that time. If the kernel versions are found to be inconsistent, the kernel has been changed since the driver was installed. In this case, the driver becomes unavailable and needs to be reinstalled.	Run the following commands to rectify the issue: rmmod nvidia_drm rmmod nvidia_modeset rmmod nvidia Then, run nvidia-smi. If the command output is normal, the issue has been rectified. If the preceding solution does not work, rectify the fault by referring to	The GPU cannot work properly.
		GPU monitoring dependency not met	gpuCheckEnvFailedAlarm	Major	The plug-in cannot identify the GPU driver library path.	Check whether the driver is installed. Check whether the driver installation directory has been customized. The driver needs to be installed in the default installation directory /usr/bin/.	Collection failure of GPU monitoring metrics
		Initialization failure of the GPU monitoring driver library	gpuDriverInitFailedAlarm	Major	The GPU driver is unavailable.	Run nvidia-smi to check whether the driver is unavailable. If the driver is unavailable, reinstall the driver by referring to .	Collection failure of GPU monitoring metrics
		Initialization timeout of the GPU monitoring driver library	gpuDriverInitTAlarm	Major	The GPU driver initialization timed out (exceeding 10s).	If the driver is not installed, install it by referring to . If the driver is installed, run nvidia-smi to check whether the driver is available. If the driver is unavailable, reinstall the driver by referring to . If the driver is properly installed, check whether the high-performance mode is enabled. If not, run nvidia-smi -pm 1 to enable it. P0 indicates the high-performance mode.	Collection failure of GPU monitoring metrics
		GPU metric collection timeout	gpuCollectMetricTimeoutAlarm	Major	The GPU metric collection timed out (exceeding 10s).	If the library API timed out, run nvidia-smi to check whether the driver is available. If the driver is unavailable, reinstall the driver by referring to . If the command execution timed out, check the system logs and determine whether there is an issue with the system.	GPU monitoring metric data is missing. As a result, subsequent metrics may fail to be collected.
		GPU handle lost	gpuDeviceHandleLost	Major	The GPU metric information cannot be obtained, and the GPU may be lost.	Run nvidia-smi to check whether there are any errors reported. Run nvidia-smi -L to check whether the number of GPUs is the same as the server specifications. Submit a service ticket to contact on-call support.	All metrics of the GPU are lost.
		Failed to listen to the XID of the GPU.	gpuDeviceXidLost	Major	Failed to listen to the XID metric.	Check whether the GPU is lost or damaged. Submit a service ticket to contact on-call support.	Failed to obtain XID-related metrics of the GPU.
		Multiple NPU HBM ECC errors	NpuHbmMultiEccInfo	Informational	There are NPU HBM ECC errors.	This event is only a reference for other events. You do not need to handle it separately.	The NPU may not work properly.
		ReadOnly issues in OS	ReadOnlyFileSystem	Critical	The file system %s is read-only.	Check the disk health status.	The files cannot be written.
		NPU: driver and firmware not matching	NpuDriverFirmwareMismatch	Major	The NPU's driver and firmware do not match.	Obtain the matched version from the Ascend official website and reinstall it.	NPUs cannot be used.
		NPU: Docker container environment check	NpuContainerEnvSystem	Major	Docker was unavailable.	Check if Docker is normal.	Docker cannot be used.
				Major	The container plug-in Ascend-Docker-Runtime was not installed.	Install the container plug-in Ascend-Docker-Runtime. Otherwise, containers cannot use Ascend cards.	NPUs cannot be attached to Docker containers.
				Major	IP forwarding was not enabled in the OS.	Check the net.ipv4.ip_forward configuration in the /etc/sysctl.conf file.	Docker containers experience network communication problems.
				Major	The shared memory of the container was too small.	The default shared memory is 64 MB, which can be modified as needed. Method 1: Modify the default-shm-size field in the /etc/docker/daemon.json configuration file. Method 2: Use the --shm-size parameter in the docker run command to set the shared memory size of a container.	Distributed training will fail due to insufficient shared memory.
		NPU: RoCE NIC down	RoCELinkStatusDown	Major	The RoCE link of NPU card %d was down.	Check the NPU RoCE network port status.	The NPU NIC becomes unavailable.
		NPU: RoCE NIC health status abnormal	RoCEHealthStatusError	Major	The RoCE network health status of NPU %d was abnormal.	Check the health status of the NPU RoCE NIC.	The NPU NIC becomes unavailable.
		NPU: RoCE NIC configuration file /etc/hccn.conf not found	HccnConfNotExisted	Major	The RoCE NIC configuration file /etc/hccn.conf was not found.	Check whether the /etc/hccn.conf NIC configuration file can be found.	The RoCE NIC becomes unavailable.
		GPU: basic components abnormal	GpuEnvironmentSystem	Major	The nvidia-smi command was abnormal.	Check whether the GPU driver is normal.	The GPU driver is unavailable.
				Major	The nvidia-fabricmanager version was inconsistent with the GPU driver version.	Check the GPU driver version and nvidia-fabricmanager version.	The nvidia-fabricmanager cannot work properly, affecting GPU usage.
				Major	The container plug-in nvidia-container-toolkit was not installed.	Install the container plug-in nvidia-container-toolkit.	GPUs cannot be attached to Docker containers.
		Local disk attachment inspection	MountDiskSystem	Major	The /etc/fstab file contains invalid UUIDs.	Ensure that the UUIDs in the /etc/fstab configuration file are correct. Otherwise, the server may fail to restart.	The disk attachment process fails, preventing the server from restarting.
		GPU: incorrectly configured dynamic route for Ant series server	GpuRouteConfigError	Major	The dynamic route of the NIC %s of an Ant series server was not configured or was incorrectly configured. CMD [ip route]: %s \| CMD [ip route show table all]: %s.	Configure the RoCE NIC route correctly.	The NPU network communication will be interrupted.
		NPU: RoCE port not split	RoCEUdpConfigError	Major	The RoCE UDP port was not split.	Check the RoCE UDP port configuration on the NPU.	The communication performance of NPUs is affected.
		Warning of automatic system kernel upgrade	KernelUpgradeWarning	Major	Warning of automatic system kernel upgrade. Old version: %s; new version: %s.	System kernel upgrade may cause AI software exceptions. Check the system update logs and prevent the server from restarting.	The AI software may be unavailable.
		NPU environment command detection	NpuToolsWarning	Major	The hccn_tool was unavailable.	Check whether the NPU driver is normal.	The IP address and gateway of the RoCE NIC cannot be configured.
				Major	The npu-smi was unavailable.	Check whether the NPU driver is normal.	NPUs cannot be used.
				Major	The ascend-dmi was unavailable.	Check whether ToolBox is properly installed.	ascend-dmi cannot be used for performance analysis.
		Warning of an NPU driver exception	NpuDriverAbnormalWarning	Major	The NPU driver was abnormal.	Reinstall the NPU driver.	NPUs cannot be used.
		GPU: invalid RoCE NIC configuration	GpuRoceNicConfigIncorrect	Major	The RoCE NIC of the GPU is incorrectly configured.	Contact O&M personnel.	The parameter plane network is abnormal, preventing the execution of the multi-node task.
		Local disk replacement to be authorized	localdisk_recovery_inquiring	Major	The local disk is faulty. Local disk replacement authorization is in progress.	Authorize local disk replacement.	Local disks are unavailable.
		Local disks being replaced	localdisk_recovery_executing	Major	The local disk is faulty and is being replaced.	When the replacement is complete, check whether the local disks are available.	Local disks are unavailable.
		Local disks replaced	localdisk_recovery_completed	Major	The faulty local disk has been replaced.	Wait until the services are running properly and check whether local disks are available.	None
		Local disk replacement failed	localdisk_recovery_failed	Major	The local disk is faulty and fails to be replaced.	Contact O&M personnel.	Local disks are unavailable.
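
Two of the checks named in the table above (IP forwarding in /etc/sysctl.conf and UUIDs in /etc/fstab) can be approximated offline. The following is a minimal sketch, not the plug-in's actual logic: the function names are hypothetical, the functions take file contents as strings, and the set of known UUIDs is assumed to be collected separately (for example, from the disks attached to the server).

```python
import re

# Matches an fstab device field of the form UUID=<value>.
UUID_RE = re.compile(r"^UUID=(\S+)$")


def ip_forward_enabled(sysctl_text: str) -> bool:
    """Return True if the last net.ipv4.ip_forward setting in the text is 1."""
    enabled = False
    for raw in sysctl_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comment lines, and lines without an assignment.
        if not line or line[0] in "#;" or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "net.ipv4.ip_forward":
            enabled = value == "1"
    return enabled


def unknown_fstab_uuids(fstab_text: str, known_uuids: set[str]) -> list[str]:
    """List UUIDs referenced in the fstab text that are not in known_uuids."""
    missing = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        device = line.split()[0]
        match = UUID_RE.match(device)
        if match and match.group(1) not in known_uuids:
            missing.append(match.group(1))
    return missing
```

A non-empty result from `unknown_fstab_uuids` corresponds to the condition that the MountDiskSystem event describes. Only the file contents are inspected here, so a value changed at runtime is not detected.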

**Table 3** Elastic IP (EIP)
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
EIP	SYS.EIP	EIP bandwidth exceeded	EIPBandwidthOverflow	Major	The used bandwidth exceeded the purchased one, which may slow down the network or cause packet loss. The value of this event is the maximum value in a monitoring period, and the value of the EIP inbound and outbound bandwidth is the value at a specific time point in the period. The metrics are described as follows: egressDropBandwidth: dropped outbound packets (bytes); egressAcceptBandwidth: accepted outbound packets (bytes); egressMaxBandwidthPerSec: peak outbound bandwidth (byte/s); ingressAcceptBandwidth: accepted inbound packets (bytes); ingressMaxBandwidthPerSec: peak inbound bandwidth (byte/s); ingressDropBandwidth: dropped inbound packets (bytes).	Check whether the EIP bandwidth keeps increasing and whether services are normal. Increase bandwidth if necessary.	The network becomes slow or packets are lost.
		EIP released	deleteEip	Minor	The EIP was released.	Check whether the EIP was released by mistake.	The server that has the EIP bound cannot access the Internet.
		EIP blocked	blockEIP	Critical	The used bandwidth of an EIP exceeded 5 Gbit/s, so the EIP was blocked and packets were discarded. Such an event may be caused by DDoS attacks.	Replace the EIP to prevent services from being affected. Locate and deal with the fault.	Services are impacted.
		EIP unblocked	unblockEIP	Critical	The EIP was unblocked.	Use the previous EIP again.	None
		EIP traffic scrubbing started	ddosCleanEIP	Major	Traffic scrubbing on the EIP was started to prevent DDoS attacks.	Check whether the EIP was attacked.	Services may be interrupted.
		EIP traffic scrubbing ended	ddosEndCleanEip	Major	Traffic scrubbing on the EIP to prevent DDoS attacks was ended.	Check whether the EIP was attacked.	Services may be interrupted.
		QoS bandwidth exceeded	EIPBandwidthRuleOverflow	Major	The used QoS bandwidth exceeded the allocated one, which may slow down the network or cause packet loss. The value of this event is the maximum value in a monitoring period, and the value of the EIP inbound and outbound bandwidth is the value at a specific time point in the period. egressDropBandwidth: dropped outbound packets (bytes); egressAcceptBandwidth: accepted outbound packets (bytes); egressMaxBandwidthPerSec: peak outbound bandwidth (byte/s); ingressAcceptBandwidth: accepted inbound packets (bytes); ingressMaxBandwidthPerSec: peak inbound bandwidth (byte/s); ingressDropBandwidth: dropped inbound packets (bytes).	Check whether the EIP bandwidth keeps increasing and whether services are normal. Increase bandwidth if necessary.	The network becomes slow or packets are lost.
		EIP unbound with resources	EipNotBoundStatus	Major	The EIP is not bound to any instance resource.	None	When an EIP is unbound, you will be billed for IP reservation fees and bandwidth fees (billed by bandwidth).
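
The byte-based metrics carried by the bandwidth events can be converted into figures that are easier to compare with a purchased bandwidth. The following is a rough sketch under stated assumptions: the function name and the dictionary shape of the metrics are hypothetical, the purchased bandwidth is assumed to be given in Mbit/s, and 1 Mbit is taken as 1,000,000 bits.

```python
def summarize_eip_overflow(metrics: dict, purchased_mbit_s: float) -> dict:
    """Convert byte-based event metrics into Mbit/s peaks and drop ratios."""
    summary = {}
    for direction in ("egress", "ingress"):
        dropped = metrics.get(f"{direction}DropBandwidth", 0)
        accepted = metrics.get(f"{direction}AcceptBandwidth", 0)
        peak_bytes_per_s = metrics.get(f"{direction}MaxBandwidthPerSec", 0)
        total = dropped + accepted
        # byte/s -> Mbit/s (assumes 1 Mbit = 1,000,000 bits)
        peak_mbit_s = peak_bytes_per_s * 8 / 1_000_000
        summary[direction] = {
            "peak_mbit_s": peak_mbit_s,
            "drop_ratio": dropped / total if total else 0.0,
            "over_purchased": peak_mbit_s > purchased_mbit_s,
        }
    return summary
```

A sustained non-zero `drop_ratio` in one direction suggests which side of the bandwidth needs to be increased.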

**Table 4** Advanced Anti-DDoS (AAD)
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
AAD	SYS.DDOS	DDoS Attack Events	ddosAttackEvents	Major	A DDoS attack occurs in the AAD protected lines.	Judge the impact on services based on the attack traffic and attack type. If the attack traffic exceeds your purchased elastic bandwidth, change to another line or increase your bandwidth.	Services may be interrupted.
		Domain name scheduling event	domainNameDispatchEvents	Major	The high-defense CNAME corresponding to the domain name is scheduled, and the domain name is resolved to another high-defense IP address.	Pay attention to the workloads involving the domain name.	Services are not affected.
		Blackhole event	blackHoleEvents	Major	The attack traffic exceeds the purchased AAD protection threshold.	A blackhole is canceled after 30 minutes by default. The actual blackhole duration is related to the blackhole triggering times and peak attack traffic on the current day. The maximum duration is 24 hours. If you need to permit access before a blackhole becomes ineffective, contact technical support.	Services may be interrupted.
		Cancel Blackhole	cancelBlackHole	Informational	The customer's AAD instance recovers from the blackhole state.	This is only a prompt and no action is required.	Customer services recover.
		IP address scheduling triggered	ipDispatchEvents	Major	The IP route changed.	Check the workloads of the IP address.	Services are not affected.
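
The blackhole timing stated for blackHoleEvents (canceled after 30 minutes by default, 24 hours at most) can be turned into a simple release window. This is only an illustrative sketch: the function name is hypothetical, and the actual duration depends on factors that the event does not expose.

```python
from datetime import datetime, timedelta

DEFAULT_BLACKHOLE = timedelta(minutes=30)  # default duration from the table
MAX_BLACKHOLE = timedelta(hours=24)        # maximum duration from the table


def blackhole_release_window(started_at: datetime) -> tuple[datetime, datetime]:
    """Return (default release time, latest possible release time)."""
    return started_at + DEFAULT_BLACKHOLE, started_at + MAX_BLACKHOLE
```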

**Table 5** Elastic Load Balance (ELB)
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
ELB	SYS.ELB	The backend servers are unhealthy.	healthCheckUnhealthy	Major	Generally, this problem occurs because backend server services are offline. This event is no longer reported after it has been reported several times.	Ensure that the backend servers are running properly.	ELB does not forward requests to unhealthy backend servers. If all backend servers in the backend server group are detected as unhealthy, services will be interrupted.
ELB	SYS.ELB	The backend server is detected healthy.	healthCheckRecovery	Minor	The backend server is detected healthy.	No further action is required.	The load balancer can properly route requests to the backend server.
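
Because services are interrupted only when every backend server in a group is unhealthy, it can help to track the two ELB events per group. The following is a minimal sketch: the event record fields (`event_id`, `group`, `server`) and the `group_members` mapping are assumptions made for illustration, not a documented event format.

```python
from collections import defaultdict


def interrupted_groups(events, group_members):
    """Return the backend server groups whose members are all unhealthy."""
    unhealthy = defaultdict(set)
    for event in events:  # events are assumed to be in chronological order
        group, server = event["group"], event["server"]
        if event["event_id"] == "healthCheckUnhealthy":
            unhealthy[group].add(server)
        elif event["event_id"] == "healthCheckRecovery":
            unhealthy[group].discard(server)
    # A group counts as interrupted only if it has members and all are unhealthy.
    return sorted(
        group
        for group, members in group_members.items()
        if members and unhealthy[group] >= set(members)
    )
```

Unhealthy servers are kept in a set, so repeated healthCheckUnhealthy events for the same server do not change the result.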

**Table 6** Cloud Backup and Recovery (CBR)
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
CBR	SYS.CBR	Failed to create the backup.	backupFailed	Critical	The backup failed to be created.	Manually create a backup or contact customer service.	Data loss may occur.
		Failed to restore the resource using a backup.	restorationFailed	Critical	The resource failed to be restored using a backup.	Restore the resource using another backup or contact customer service.	Data loss may occur.
		Failed to delete the backup.	backupDeleteFailed	Critical	The backup failed to be deleted.	Try again later or contact customer service.	Charging may be abnormal.
		Failed to delete the vault.	vaultDeleteFailed	Critical	The vault failed to be deleted.	Try again later or contact technical support.	Charging may be abnormal.
		Replication failure	replicationFailed	Critical	The backup failed to be replicated.	Try again later or contact technical support.	Data loss may occur.
		The backup is created successfully.	backupSucceeded	Major	The backup was created.	None	None
		Resource restoration using a backup succeeded.	restorationSucceeded	Major	The resource was restored using a backup.	Check whether the data is successfully restored.	None
		The backup is deleted successfully.	backupDeletionSucceeded	Major	The backup was deleted.	None	None
		The vault is deleted successfully.	vaultDeletionSucceeded	Major	The vault was deleted.	None	None
		Replication success	replicationSucceeded	Major	The backup was replicated successfully.	None	None
		Client offline	agentOffline	Critical	The backup client was offline.	Ensure that the Agent status is normal and the backup client can be connected to Huawei Cloud.	Backup tasks may fail.
		Client online	agentOnline	Major	The backup client was online.	None	None

**Table 7** Relational Database Service (RDS) — resource exception
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
RDS	SYS.RDS	DB instance creation failure	createInstanceFailed	Major	Generally, the cause is that the number of disks is insufficient due to quota limits, or underlying resources are exhausted.	The selected resource specifications are insufficient. Select other available specifications and try again.	DB instances cannot be created.
		Full backup failure	fullBackupFailed	Major	A single full backup failure does not affect the files that have been successfully backed up, but prolongs the incremental backup time during the point-in-time restore (PITR).	Try again.	Full backup failed.
		Read replica promotion failure	activeStandBySwitchFailed	Major	The standby DB instance does not take over workloads from the primary DB instance due to network or server failures. The original primary DB instance continues to provide services within a short time.	Perform the switchover again during off-peak hours.	The primary/standby switchover will fail.
		Replication status abnormal	abnormalReplicationStatus	Major	The possible causes are as follows: The replication delay between the primary instance and the standby instance or a read replica is too long, which usually occurs when a large amount of data is being written to databases or a large transaction is being processed. During peak hours, data may be blocked. The network between the primary instance and the standby instance or a read replica is disconnected.	Database replication is being repaired. You will be notified immediately after the repair.	The replication status is abnormal.
		Replication status recovered	replicationStatusRecovered	Major	The replication delay between the primary and standby instances is within the normal range, or the network connection between them has been restored.	Check whether services are running properly.	Replication status is recovered.
		DB instance faulty	faultyDBInstance	Major	A single or primary DB instance was faulty due to a catastrophic failure, for example, server failure.	Instance status is being repaired. You will be notified immediately after the repair.	The instance status is abnormal.
		DB instance recovered	DBInstanceRecovered	Major	RDS uses its high availability mechanism to rebuild the standby DB instance. After the instance is rebuilt, this event is reported.	The DB instance status is normal. Check whether services are running properly.	The instance is recovered.
		Failure of changing single DB instance to primary/standby	singleToHaFailed	Major	A fault occurs when RDS is creating the standby DB instance or configuring replication between the primary and standby DB instances. The fault may occur because resources are insufficient in the data center where the standby DB instance is located.	Automatic retry is in progress.	Changing a single DB instance to primary/standby failed.
		Database process restarted	DatabaseProcessRestarted	Major	The database process is stopped due to insufficient memory or high load.	Check whether services are running properly.	The primary instance is restarted. Services are interrupted for a short period of time.
		Instance storage full	instanceDiskFull	Major	Generally, the cause is that the data space usage is too high.	Scale up the storage.	The instance storage is used up. No data can be written into databases.
		Instance storage full recovered	instanceDiskFullRecovered	Major	The instance disk is recovered.	Check whether services are running properly.	The instance has available storage.
		Kafka connection failed	kafkaConnectionFailed	Major	The network is unstable or the Kafka server does not work properly.	Check whether services are affected.	None
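
Several RDS events in the table above come in fault and recovery pairs. The following sketch lists faults that have not yet been followed by a recovery event. The pairing is inferred from the event names and is an assumption, as are the function name and the event record fields (`instance`, `event_id`).

```python
# Recovery event ID -> fault event ID it is assumed to close (inferred from names).
RECOVERY_OF = {
    "replicationStatusRecovered": "abnormalReplicationStatus",
    "DBInstanceRecovered": "faultyDBInstance",
    "instanceDiskFullRecovered": "instanceDiskFull",
}
FAULT_IDS = set(RECOVERY_OF.values())


def open_faults(events):
    """Return (instance, fault event ID) pairs with no later recovery event."""
    active = set()
    for event in events:  # events are assumed to be in chronological order
        instance, event_id = event["instance"], event["event_id"]
        if event_id in FAULT_IDS:
            active.add((instance, event_id))
        elif event_id in RECOVERY_OF:
            active.discard((instance, RECOVERY_OF[event_id]))
    return sorted(active)
```

Events without a counterpart, such as fullBackupFailed, are ignored by this sketch and need separate handling.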

**Table 8** Relational Database Service (RDS) — operations
Event Source	Namespace	Event Name	Event ID	Event Severity	Description
RDS	SYS.RDS	Reset administrator password	resetPassword	Major	The password of the database administrator is reset.
		Operate DB instance	instanceAction	Major	The storage space is scaled or the instance class is changed.
		Delete DB instance	deleteInstance	Minor	The DB instance is deleted.
		Modify backup policy	setBackupPolicy	Minor	The backup policy is modified.
		Modify parameter group	updateParameterGroup	Minor	The parameter group is modified.
		Delete parameter group	deleteParameterGroup	Minor	The parameter group is deleted.
		Reset parameter group	resetParameterGroup	Minor	The parameter group is reset.
		Change database port	changeInstancePort	Major	The database port is changed.
		Primary/standby switchover or failover	PrimaryStandbySwitched	Major	A switchover or failover is performed.

**Table 9** Document Database Service (DDS)
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
DDS	SYS.DDS	DB instance creation failure	DDSCreateInstanceFailed	Major	A DDS instance fails to be created due to insufficient disks, quotas, or underlying resources.	Check the number and quota of disks. Release resources and create DDS instances again.	DDS instances cannot be created.
		Replication failed	DDSAbnormalReplicationStatus	Major	The possible causes are as follows: The replication delay between the primary instance and the standby instance or a read replica is too long, which usually occurs when a large amount of data is being written to databases or a large transaction is being processed. During peak hours, data may be blocked. The network between the primary instance and the standby instance or a read replica is disconnected.	Submit a service ticket.	Read and write operations on the original instance are not interrupted, but data updates on the standby instance may experience delays. The replication delay keeps growing between the primary and standby instances, and the standby instance may be disconnected.
		Replication status recovered	DDSReplicationStatusRecovered	Major	The replication delay between the primary and standby instances is within the normal range, or the network connection between them has been restored.	No action is required.	None
		DB instance failed	DDSFaultyDBInstance	Major	This event is a key alarm event and is reported when an instance is faulty due to a disaster or a server failure.	Submit a service ticket.	The database service may be unavailable.
		DB instance recovered	DDSDBInstanceRecovered	Major	If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported.	No action is required.	None
		Faulty node	DDSFaultyDBNode	Major	This event is a key alarm event and is reported when a database node is faulty due to a disaster or a server failure.	Check whether the database service is available and submit a service ticket.	The database service may be unavailable.
		Node recovered	DDSDBNodeRecovered	Major	If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported.	No action is required.	None
		Primary/standby switchover or failover	DDSPrimaryStandbySwitched	Major	This event is reported when a primary/standby switchover or a failover is triggered.	No action is required.	None
		Insufficient storage space	DDSRiskyDataDiskUsage	Major	The storage space is insufficient.	Scale up storage space. For details, see section "Scaling Up Storage Space" in the corresponding user guide.	The instance is set to read-only and data cannot be written to the instance.
		Data disk expanded and being writable	DDSDataDiskUsageRecovered	Major	The capacity of a data disk has been expanded and the data disk becomes writable.	No further action is required.	No adverse impact.
		Schedule for deleting a KMS key	planDeleteKmsKey	Major	A request to schedule deletion of a KMS key was submitted.	After the KMS key is scheduled to be deleted, either decrypt the data encrypted by the KMS key in a timely manner or cancel the key deletion.	After the KMS key is deleted, users cannot encrypt disks.
		Full backup failure	DDSFullBackupFailed	Major	A single full backup failure does not affect the files that have been successfully backed up, but prolongs the incremental backup time during the point-in-time restore (PITR).	Try again.	Full backup failed.

**Table 10** GeminiDB
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
GeminiDB	SYS.NoSQL	DB instance creation failed	NoSQLCreateInstanceFailed	Major	The instance quota or underlying resources are insufficient.	Release the instances that are no longer used and try to provision them again, or submit a service ticket to adjust the quota.	DB instances cannot be created.
		Specifications modification failed	NoSQLResizeInstanceFailed	Major	The underlying resources are insufficient.	Submit a service ticket. The O&M personnel will coordinate resources in the background, and then you need to change the specifications again.	Services are interrupted.
		Node adding failed	NoSQLAddNodesFailed	Major	The underlying resources are insufficient.	Submit a service ticket. The O&M personnel will coordinate resources in the background. Then, delete the node that failed to be added and add a new node.	None
		Node deletion failed	NoSQLDeleteNodesFailed	Major	The underlying resources fail to be released.	Delete the node again.	None
		Storage space scale-up failed	NoSQLScaleUpStorageFailed	Major	The underlying resources are insufficient.	Submit a service ticket. The O&M personnel will coordinate resources in the background. Then, scale up the storage space again.	Services may be interrupted.
		Password reset failed	NoSQLResetPasswordFailed	Major	Resetting the password times out.	Reset the password again.	None
		Parameter group change failed	NoSQLUpdateInstanceParamGroupFailed	Major	Changing a parameter group times out.	Change the parameter group again.	None
		Backup policy configuration failed	NoSQLSetBackupPolicyFailed	Major	The database connection is abnormal.	Configure the backup policy again.	None
		Manual backup creation failed	NoSQLCreateManualBackupFailed	Major	The backup files fail to be exported or uploaded.	Submit a service ticket to the O&M personnel.	Data cannot be backed up.
		Automated backup creation failed	NoSQLCreateAutomatedBackupFailed	Major	The backup files fail to be exported or uploaded.	Submit a service ticket to the O&M personnel.	Data cannot be backed up.
		Faulty DB instance	NoSQLFaultyDBInstance	Major	This event is a key alarm event and is reported when an instance is faulty due to a disaster or a server failure.	Submit a service ticket.	The database service may be unavailable.
		DB instance recovered	NoSQLDBInstanceRecovered	Major	If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported.	No action is required.	None
		Faulty node	NoSQLFaultyDBNode	Major	This event is a key alarm event and is reported when a database node is faulty due to a disaster or a server failure.	Check whether the database service is available and submit a service ticket.	The database service may be unavailable.
		Node recovered	NoSQLDBNodeRecovered	Major	If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported.	No action is required.	None
		Primary/standby switchover or failover	NoSQLPrimaryStandbySwitched	Major	This event is reported when a primary/standby switchover or failover is triggered.	No action is required.	None
		HotKey occurred	HotKeyOccurs	Major	The primary key is improperly configured. As a result, hotspot data is distributed in one partition. The improper application design causes frequent read and write operations on a key.	1. Choose a proper partition key. 2. Add service cache. The service application reads hotspot data from the cache first.	The service request success rate is affected, and the cluster performance and stability are also affected.
		BigKey occurred	BigKeyOccurs	Major	The primary key design is improper. The number of records or data in a single partition is too large, causing unbalanced node loads.	1. Choose a proper partition key. 2. Add a new partition key for hashing data.	As the data in the large partition increases, the cluster stability deteriorates.
		Insufficient storage space	NoSQLRiskyDataDiskUsage	Major	The storage space is insufficient.	Scale up storage space. For details, see section "Scaling Up Storage Space" in the corresponding user guide.	The instance is set to read-only and data cannot be written to the instance.
		Data disk expanded and being writable	NoSQLDataDiskUsageRecovered	Major	The capacity of a data disk has been expanded and the data disk becomes writable.	No operation is required.	None
		Index creation failed	NoSQLCreateIndexFailed	Major	The service load exceeds what the instance specifications can handle. In this case, creating indexes consumes more instance resources. As a result, responses slow down or the instance even stops responding, and the creation times out.	Select the matched instance specifications based on the service load. Create indexes during off-peak hours. Create indexes in the background. Select indexes as required.	The index fails to be created or is incomplete. As a result, the index is invalid. Delete the index and create it again.
		Write speed decreased	NoSQLStallingOccurs	Major	The write speed is close to the maximum write capability allowed by the cluster scale and instance specifications. As a result, the flow control mechanism of the database is triggered, and requests may fail.	1. Adjust the cluster scale or node specifications based on the maximum write rate of services. 2. Measure the maximum write rate of services.	The success rate of service requests is affected.
		Data write stopped	NoSQLStoppingOccurs	Major	The data write is too fast, reaching the maximum write capability allowed by the cluster scale and instance specifications. As a result, the flow control mechanism of the database is triggered, and requests may fail.	1. Adjust the cluster scale or node specifications based on the maximum write rate of services. 2. Measure the maximum write rate of services.	The success rate of service requests is affected.
		Database restart failed	NoSQLRestartDBFailed	Major	The instance status is abnormal.	Submit a service ticket to the O&M personnel.	The DB instance status may be abnormal.
		Restoration to new DB instance failed	NoSQLRestoreToNewInstanceFailed	Major	The underlying resources are insufficient.	Submit a service ticket to ask the O&M personnel to coordinate resources in the background and add new nodes.	Data cannot be restored to a new DB instance.
		Restoration to existing DB instance failed	NoSQLRestoreToExistInstanceFailed	Major	The backup file fails to be downloaded or restored.	Submit a service ticket to the O&M personnel.	The current DB instance may be unavailable.
		Backup file deletion failed	NoSQLDeleteBackupFailed	Major	The backup files fail to be deleted from OBS.	Delete the backup files again.	None
		Failed to enable Show Original Log	NoSQLSwitchSlowlogPlainTextFailed	Major	The DB engine does not support this function.	Refer to the GaussDB NoSQL User Guide to ensure that the DB engine supports Show Original Log. Submit a service ticket to the O&M personnel.	None
		EIP binding failed	NoSQLBindEipFailed	Major	The node status is abnormal, an EIP has been bound to the node, or the EIP to be bound is invalid.	Check whether the node is normal and whether the EIP is valid.	The DB instance cannot be accessed from the Internet.
		EIP unbinding failed	NoSQLUnbindEipFailed	Major	The node status is abnormal or the EIP has been unbound from the node.	Check whether the node and EIP status are normal.	None
		Parameter modification failed	NoSQLModifyParameterFailed	Major	The parameter value is invalid.	Check whether the parameter value is within the valid range and submit a service ticket to the O&M personnel.	None
		Parameter group application failed	NoSQLApplyParameterGroupFailed	Major	The instance status is abnormal. As a result, the parameter group cannot be applied.	Submit a service ticket to the O&M personnel.	None
		Failed to enable or disable SSL	NoSQLSwitchSSLFailed	Major	Enabling or disabling SSL times out.	Try again or submit a service ticket. Do not change the connection mode.	The connection mode cannot be changed.
		Row size too large	LargeRowOccurs	Major	If there is too much data in a single row, queries may time out, causing faults such as out-of-memory (OOM) errors.	1. Control the length of each column and row so that the sum of key and value lengths in each row does not exceed the preset threshold. 2. Check whether there are invalid writes or encoding resulting in large keys or values.	If there are rows that are too large, the cluster performance will deteriorate as the data volume grows.
		Schedule for deleting a KMS key	planDeleteKmsKey	Major	A request to schedule deletion of a KMS key was submitted.	After the KMS key is scheduled to be deleted, either decrypt the data encrypted by the KMS key in a timely manner or cancel the key deletion.	After the KMS key is deleted, users cannot encrypt disks.
		Too many query tombstones	TooManyQueryTombstones	Major	If there are too many query tombstones, queries may time out, affecting query performance.	Select appropriate query and deletion methods and avoid long range queries.	Queries may time out, affecting query performance.
		Too large collection column	TooLargeCollectionColumn	Major	If there are too many elements in a collection column, queries to the column will fail.	Limit elements in a collection column. Check for abnormal writes or coding at the service side.	Queries to the collection column will fail.
		GeminiDB Influx instance connection limit reached	InfluxDBConnectionFull	Major	The connections on the instance node reach the upper limit.	1. Upgrade specifications if they cannot meet service requirements. 2. Check whether the client properly manages connections, for example, whether there are unreleased or long connections.	If no new connection can be created on a node, the client may fail to connect to a GeminiDB Influx instance. As a result, services may become unstable.
		High availability switchover	nodeHaSwitch	Major	The high availability switchover is triggered by underlying network jitters.	Check whether services are running properly. Services can recover automatically.	The network jitter causes a few seconds of delay.
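
The HotKeyOccurs solution above suggests adding a service cache so that the application reads hotspot data from the cache first. The following is a minimal sketch of that read path, not a GeminiDB API: `HotKeyCache`, its `load` callback (standing in for the real database query), and the 5-second TTL are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class HotKeyCache:
    """Tiny in-process read-through cache with a per-entry time-to-live."""

    def __init__(
        self,
        load: Callable[[Hashable], object],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load  # hypothetical function that queries the database
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so that expiry can be tested
        self._entries: Dict[Hashable, Tuple[float, object]] = {}

    def get(self, key: Hashable) -> object:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the database is not read
        value = self._load(key)  # miss or expired entry: read the database once
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop a cached entry, for example after the key is written."""
        self._entries.pop(key, None)
```

A short TTL keeps repeated reads of a hot key away from the partition that owns it, at the cost of serving data that may be up to `ttl_seconds` old. Whether that staleness is acceptable depends on the service.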

**Table 11** TaurusDB
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
TaurusDB	SYS.GAUSSDB	Incremental backup failure	TaurusIncrementalBackupInstanceFailed	Major	The network between the instance and the management plane (or the OBS) is disconnected, or the backup environment created for the instance is abnormal.	Submit a service ticket.	Backup jobs fail.
		Read replica creation failure	addReadonlyNodesFailed	Major	The quota is insufficient or underlying resources are exhausted.	Check the read replica quota. Release resources and create read replicas again.	Read replicas fail to be created.
		DB instance creation failure	createInstanceFailed	Major	The instance quota or underlying resources are insufficient.	Check the instance quota. Release resources and create instances again.	DB instances fail to be created.
		Read replica promotion failure	activeStandBySwitchFailed	Major	The read replica fails to be promoted to the primary node due to network or server failures. The original primary node takes over services quickly.	Submit a service ticket.	The read replica fails to be promoted to the primary node.
		Instance specifications change failure	flavorAlterationFailed	Major	The quota is insufficient or underlying resources are exhausted.	Submit a service ticket.	Instance specifications fail to be changed.
		Faulty DB instance	TaurusInstanceRunningStatusAbnormal	Major	The instance process is faulty or the communications between the instance and the DFV storage are abnormal.	Submit a service ticket.	Services may be affected.
		DB instance recovered	TaurusInstanceRunningStatusRecovered	Major	The instance is recovered.	Observe the service running status.	None
		Faulty node	TaurusNodeRunningStatusAbnormal	Major	The node process is faulty or the communications between the node and the DFV storage are abnormal.	Observe the instance and service running statuses.	A read replica may be promoted to the primary node.
		Node recovered	TaurusNodeRunningStatusRecovered	Major	The node is recovered.	Observe the service running status.	None
		Read replica deletion failure	TaurusDeleteReadOnlyNodeFailed	Major	The communications between the management plane and the read replica are abnormal or the VM fails to be deleted from IaaS.	Submit a service ticket.	Read replicas fail to be deleted.
		Password reset failure	TaurusResetInstancePasswordFailed	Major	The communications between the management plane and the instance are abnormal or the instance is abnormal.	Check the instance status and try again. If the fault persists, submit a service ticket.	Passwords fail to be reset for instances.
		DB instance reboot failure	TaurusRestartInstanceFailed	Major	The network between the management plane and the instance is abnormal or the instance is abnormal.	Check the instance status and try again. If the fault persists, submit a service ticket.	Instances fail to be rebooted.
		Restoration to new DB instance failure	TaurusRestoreToNewInstanceFailed	Major	The instance quota is insufficient, underlying resources are exhausted, or the data restoration logic is incorrect.	If the new instance fails to be created, check the instance quota, release resources, and try to restore to a new instance again. In other cases, submit a service ticket.	Backup data fails to be restored to new instances.
		EIP binding failure	TaurusBindEIPToInstanceFailed	Major	The binding task fails.	Submit a service ticket.	EIPs fail to be bound to instances.
		EIP unbinding failure	TaurusUnbindEIPFromInstanceFailed	Major	The unbinding task fails.	Submit a service ticket.	EIPs fail to be unbound from instances.
		Parameter modification failure	TaurusUpdateInstanceParameterFailed	Major	The network between the management plane and the instance is abnormal or the instance is abnormal.	Check the instance status and try again. If the fault persists, submit a service ticket.	Instance parameters fail to be modified.
		Parameter template application failure	TaurusApplyParameterGroupToInstanceFailed	Major	The network between the management plane and instances is abnormal or the instances are abnormal.	Check the instance status and try again. If the fault persists, submit a service ticket.	Parameter templates fail to be applied to instances.
		Full backup failure	TaurusBackupInstanceFailed	Major	The network between the instance and the management plane (or the OBS) is disconnected, or the backup environment created for the instance is abnormal.	Submit a service ticket.	Backup jobs fail.
		Primary/standby failover	TaurusActiveStandbySwitched	Major	When the network, physical machine, or database of the primary node is faulty, the system promotes a read replica to primary based on the failover priority to ensure service continuity.	Check whether the service is running properly. Check whether an alarm is generated, indicating that the read replica failed to be promoted to primary.	During the failover, database connection is interrupted for a short period of time. After the failover is complete, you can reconnect to the database.
		Database read-only	NodeReadonlyMode	Major	The database supports only query operations.	Submit a service ticket.	After the database becomes read-only, write operations cannot be processed.
		Database read/write	NodeReadWriteMode	Major	The database supports both write and read operations.	Submit a service ticket.	None
		Instance DR switchover	DisasterSwitchOver	Major	If an instance is faulty and unavailable, a switchover is performed to ensure that the instance continues to provide services.	Contact technical support.	The database connection is intermittently interrupted. The HA service switches workloads from the primary node to a read replica and continues to provide services.
		Database process restarted	TaurusDatabaseProcessRestarted	Major	The database process is stopped due to insufficient memory or high load.	Log in to the Cloud Eye console. Check whether the memory usage increases sharply or the CPU usage is too high for a long time. You can increase the specifications or optimize the service logic.	When the database process is suspended, workloads on the node are interrupted. In this case, the HA service automatically restarts the database process and attempts to recover the workloads.
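
The TaurusActiveStandbySwitched row above notes that the database connection is interrupted briefly during a failover and that clients can reconnect afterwards. A client-side reconnect loop with exponential backoff could look like the sketch below. It is an illustration under assumptions: `connect` is a hypothetical zero-argument callable supplied by the application, it is assumed to raise `ConnectionError` while the failover is in progress, and the attempt count and delays are arbitrary.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise  # give up and surface the last error to the caller
            sleep(delay)
            delay = min(delay * 2, max_delay)  # double the wait, up to max_delay
    raise AssertionError("unreachable")
```

Real database drivers raise their own exception types, so the `except` clause would need to be adapted to the driver in use.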

**Table 12** GaussDB
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
GaussDB	SYS.GAUSSDBV5	Process status alarm	ProcessStatusAlarm	Major	Key processes exit, including CMS/CMA, ETCD, GTM, CN, and DN processes.	Wait until the process is automatically recovered or a primary/standby failover is automatically performed. Check whether services are recovered. If not, contact SRE engineers.	If processes on primary nodes are faulty, services are interrupted and then rolled back. If processes on standby nodes are faulty, services are not affected.
		Component status alarm	ComponentStatusAlarm	Major	Key components do not respond, including CMA, ETCD, GTM, CN, and DN components.	Wait until the process is automatically recovered or a primary/standby failover is automatically performed. Check whether services are recovered. If not, contact SRE engineers.	If processes on primary nodes do not respond, neither do the services. If processes on standby nodes are faulty, services are not affected.
		Cluster status alarm	ClusterStatusAlarm	Major	The cluster status is abnormal. For example, the cluster is read-only, the majority of ETCDs are faulty, or the cluster resources are unevenly distributed.	Contact SRE engineers.	If the cluster status is read-only, only read services are processed. If the majority of ETCDs are faulty, the cluster is unavailable. If resources are unevenly distributed, the instance performance and reliability deteriorate.
		Hardware resource alarm	HardwareResourceAlarm	Major	A major hardware fault occurs in the instance, such as disk damage or GTM network fault.	Contact SRE engineers.	Some or all services are affected.
		Status transition alarm	StateTransitionAlarm	Major	The following events occur in the instance: DN build failure, forcible DN promotion, primary/standby DN switchover/failover, or primary/standby GTM switchover/failover.	Wait until the fault is automatically rectified and check whether services are recovered. If not, contact SRE engineers.	Some services are interrupted.
		Other abnormal alarm	OtherAbnormalAlarm	Major	Disk usage threshold alarm	Focus on service changes and scale up storage space as needed.	If the used storage space exceeds the threshold, storage space cannot be scaled up.
		DB instance creation failure	GaussDBV5CreateInstanceFailed	Major	Instances fail to be created because the quota is insufficient or underlying resources are exhausted.	Release the instances that are no longer used and try to provision them again, or submit a service ticket to adjust the quota.	DB instances cannot be created.
		Node adding failure	GaussDBV5ExpandClusterFailed	Major	The underlying resources are insufficient.	Submit a service ticket. After the O&M personnel coordinate resources in the background, delete the node that failed to be added and add a new node.	None
		Storage scale-up failure	GaussDBV5EnlargeVolumeFailed	Major	The underlying resources are insufficient.	Submit a service ticket. After the O&M personnel coordinate resources in the background, scale up the storage space again.	Services may be interrupted.
		Reboot failure	GaussDBV5RestartInstanceFailed	Major	The network is abnormal.	Retry the reboot operation or submit a service ticket to the O&M personnel.	The database service may be unavailable.
		Full backup failure	GaussDBV5FullBackupFailed	Major	The backup files fail to be exported or uploaded.	Submit a service ticket to the O&M personnel.	Data cannot be backed up.
		Differential backup failure	GaussDBV5DifferentialBackupFailed	Major	The backup files fail to be exported or uploaded.	Submit a service ticket to the O&M personnel.	Data cannot be backed up.
		Backup deletion failure	GaussDBV5DeleteBackupFailed	Major	The backup files fail to be deleted from OBS.	Delete the backup files again.	None
		EIP binding failure	GaussDBV5BindEIPFailed	Major	The EIP is bound to another resource.	Submit a service ticket to the O&M personnel.	The instance cannot be accessed from the public network.
		EIP unbinding failure	GaussDBV5UnbindEIPFailed	Major	The network is faulty or EIP is abnormal.	Unbind the IP address again or submit a service ticket to the O&M personnel.	IP addresses may be residual.
		Parameter template application failure	GaussDBV5ApplyParamFailed	Major	Modifying a parameter template times out.	Modify the parameter template again.	None
		Parameter modification failure	GaussDBV5UpdateInstanceParamGroupFailed	Major	Modifying a parameter template times out.	Modify the parameter template again.	None
		Backup and restoration failure	GaussDBV5RestoreFromBcakupFailed	Major	The underlying resources are insufficient or backup files fail to be downloaded.	Submit a service ticket.	The database service may be unavailable during the restoration failure.
		Failed to upgrade the hot patch	GaussDBV5UpgradeHotfixFailed	Major	Generally, this fault is caused by an error reported during kernel upgrade.	View the error information about the workflow and redo or skip the job.	None
		DB instance faulty	GaussDBV5FaultyDBInstance	Major	This event is a key alarm event and is reported when an instance is faulty due to a disaster or a server failure.	Submit a service ticket.	The database service may be unavailable.
		DB instance recovered	GaussDBV5InstanceRecovered	Major	GaussDB provides an HA tool for automated or manual rectification of faults. After the fault is rectified, this event is reported.	No action is required.	None
		Faulty node	GaussDBV5FaultyDBNode	Major	This event is a key alarm event and is reported when a database node is faulty due to a disaster or a server failure.	Submit a service ticket.	The database service may be unavailable.
		Node recovered	GaussDBV5FaultyDBNodeRecovered	Major	GaussDB provides an HA tool for automated or manual rectification of faults. After the fault is rectified, this event is reported.	No action is required.	None
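
The ClusterStatusAlarm row above says the cluster becomes unavailable when the majority of ETCDs are faulty. The sketch below illustrates the usual majority rule for consensus-based stores, where more than half of the members must be healthy. It is an assumption for illustration only: the exact condition that GaussDB evaluates before reporting the alarm is not stated in this table, and `has_majority` is a hypothetical helper.

```python
def has_majority(total_members: int, healthy_members: int) -> bool:
    """Return True if more than half of the members are healthy."""
    if total_members < 1 or not 0 <= healthy_members <= total_members:
        raise ValueError("member counts are out of range")
    # A strict majority is floor(n / 2) + 1 members.
    return healthy_members >= total_members // 2 + 1
```

Under this rule, a three-member group tolerates one faulty member and a five-member group tolerates two. An even-sized group tolerates no more faults than the next smaller odd size.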

**Table 13** Distributed Database Middleware (DDM)
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
DDM	SYS.DDM (DDM 1.0) SYS.DDMS (DDM 2.0)	Failed to create a DDM instance	createDdmInstanceFailed	Major	The underlying resources are insufficient.	Release resources and create the instance again.	DDM instances cannot be created.
		Failed to change class of a DDM instance	resizeFlavorFailed	Major	The underlying resources are insufficient.	Submit a service ticket to the O&M personnel to coordinate resources and try again.	Services on some nodes are interrupted.
		Failed to scale out a DDM instance	enlargeNodeFailed	Major	The underlying resources are insufficient.	Submit a service ticket to the O&M personnel to coordinate resources, delete the node that fails to be added, and add a node again.	The instance fails to be scaled out.
		Failed to scale in a DDM instance	reduceNodeFailed	Major	The underlying resources fail to be released.	Submit a service ticket to the O&M personnel to release resources.	The instance fails to be scaled in.
		Failed to restart a DDM instance	restartInstanceFailed	Major	The DB instances associated are abnormal.	Check whether DB instances associated are normal. If the instances are normal, submit a service ticket to the O&M personnel.	Services on some nodes are interrupted.
		Failed to create a schema	createLogicDbFailed	Major	The possible causes are as follows: The password for the DB instance account is incorrect. The security groups of the DDM instance and the associated DB instance are incorrectly configured. As a result, the DDM instance cannot communicate with the associated DB instance.	Check whether the username and password of the DB instance are correct and whether the security groups associated with the DDM instance and the underlying DB instance are correctly configured.	Services cannot run properly.
		Failed to bind an EIP	bindEipFailed	Major	The EIP is abnormal.	Try again later. In case of emergency, contact O&M personnel to rectify the fault.	The DDM instance cannot be accessed from the Internet.
		Failed to scale out a schema	migrateLogicDbFailed	Major	The underlying resources fail to be processed.	Submit a service ticket to the O&M personnel.	The schema cannot be scaled out.
		Failed to re-scale out a schema	retryMigrateLogicDbFailed	Major	The underlying resources fail to be processed.	Submit a service ticket to the O&M personnel.	The schema cannot be scaled out.

**Table 14** Cloud Phone Server
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
CPH	SYS.CPH	Server shutdown	cphServerOsShutdown	Major	The cloud phone server was stopped on the management console or by calling APIs.	Deploy service applications in HA mode. After the fault is rectified, check whether services recover.	Services are interrupted.
		Server abnormal shutdown	cphServerShutdown	Major	The cloud phone server was stopped unexpectedly. Possible causes are as follows: The cloud phone server was powered off unexpectedly. The cloud phone server was stopped due to hardware faults.	Deploy service applications in HA mode. After the fault is rectified, check whether services recover.	Services are interrupted.
		Server reboot	cphServerOsReboot	Major	The cloud phone server was rebooted on the management console or by calling APIs.	Deploy service applications in HA mode. After the fault is rectified, check whether services recover.	Services are interrupted.
		Server abnormal reboot	cphServerReboot	Major	The cloud phone server was rebooted unexpectedly due to OS faults or hardware faults.	Deploy service applications in HA mode. After the fault is rectified, check whether services recover.	Services are interrupted.
		Network disconnection	cphServerlinkDown	Major	The network where the cloud phone server was deployed was disconnected. Possible causes are as follows: The cloud phone server was stopped unexpectedly and rebooted. The switch was faulty. The gateway node was faulty.	Deploy service applications in HA mode. After the fault is rectified, check whether services recover.	Services are interrupted.
		PCIe error	cphServerPcieError	Major	The PCIe device or main board on the cloud phone server was faulty.	Deploy service applications in HA mode. After the fault is rectified, check whether services recover.	The network or disk read/write services are affected.
		Disk error	cphServerDiskError	Major	The disk on the cloud phone server was faulty due to disk backplane faults or disk faults.	Deploy service applications in HA mode. After the fault is rectified, check whether services recover.	Data read/write services are affected, or the BMS cannot be started.
		Storage error	cphServerStorageError	Major	The cloud phone server could not connect to EVS disks. Possible causes are as follows: The SDI card was faulty. Remote storage devices were faulty.	Deploy service applications in HA mode. After the fault is rectified, check whether services recover.	Data read/write services are affected, or the BMS cannot be started.
		GPU offline	cphServerGpuOffline	Major	The GPU of the cloud phone server was loose and disconnected.	Stop the cloud phone server and reboot it.	Faults occur on cloud phones whose GPUs are disconnected. Cloud phones cannot run properly even if they are restarted or reconfigured.
		GPU timeout	cphServerGpuTimeOut	Major	The GPU of the cloud phone server timed out.	Reboot the cloud phone server.	Cloud phones whose GPUs timed out cannot run properly and are still faulty even if they are restarted or reconfigured.
		Disk space full	cphServerDiskFull	Major	The disk space of the cloud phone server was used up.	Clear the application data in the cloud phone to release space.	The cloud phone is sub-healthy, prone to failure, and may be unable to start.
		Disk readonly	cphServerDiskReadOnly	Major	The disk of the cloud phone server became read-only.	Reboot the cloud phone server.	The cloud phone is sub-healthy, prone to failure, and may be unable to start.
		Cloud phone metadata damaged	cphPhoneMetaDataDamage	Major	Cloud phone metadata was damaged.	Contact O&M personnel.	The cloud phone cannot run properly even if it is restarted or reconfigured.
		GPU failed	gpuAbnormal	Critical	The GPU was faulty.	Submit a service ticket.	Services are interrupted.
		GPU recovered	gpuNormal	Informational	The GPU was running properly.	No further action is required.	None
		Kernel crash	kernelCrash	Critical	The kernel log indicated a crash.	Submit a service ticket.	Services are interrupted during the crash.
		Kernel OOM	kernelOom	Major	The kernel log indicated an out-of-memory (OOM) condition.	Submit a service ticket.	Services are interrupted.
		Hardware malfunction	hardwareError	Critical	The kernel log indicated Hardware Error.	Submit a service ticket.	Services are interrupted.
		PCIe error	pcieAer	Critical	The kernel log indicated PCIe Bus Error.	Submit a service ticket.	Services are interrupted.
		SCSI error	scsiError	Critical	The kernel log indicated SCSI Error.	Submit a service ticket.	Services are interrupted.
		Image storage became read-only	partReadOnly	Critical	The image storage became read-only.	Submit a service ticket.	Services are interrupted.
		Image storage superblock damaged	badSuperBlock	Critical	The superblock of the file system of the image storage was damaged.	Submit a service ticket.	Services are interrupted.
		Image storage /.sharedpath/master became read-only	isuladMasterReadOnly	Critical	Mount point /.sharedpath/master of the image storage became read-only.	Submit a service ticket.	Services are interrupted.
		Cloud phone data disk became read-only	cphDiskReadOnly	Critical	The cloud phone data disk became read-only.	Submit a service ticket.	Services are interrupted.
		Cloud phone data disk superblock damaged	cphDiskBadSuperBlock	Critical	The superblock of the file system of the cloud phone data disk was damaged.	Submit a service ticket.	Services are interrupted.

**Table 15** Layer 2 Connection Gateway (L2CG)
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
L2CG	SYS.ESW	IP addresses conflicted	IPConflict	Major	A cloud server and an on-premises server that need to communicate use the same IP address.	Check the ARP and switch information to locate the servers that have the same IP address and change the IP address.	The communications between the on-premises and cloud servers may be abnormal.
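
The IPConflict solution above relies on ARP and switch information to find servers that share an IP address. Once that information has been collected as (IP address, MAC address) pairs, the duplicates can be found with a few lines of code. This is a rough sketch: `find_ip_conflicts` is a hypothetical helper, collecting the pairs from ARP tables or switches is out of scope, and the addresses in the usage note are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_ip_conflicts(entries: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Return the IP addresses that are claimed by more than one MAC address."""
    macs_by_ip: Dict[str, Set[str]] = defaultdict(set)
    for ip, mac in entries:
        # Normalize MAC notation so that AA-BB-... and aa:bb:... compare equal.
        normalized = mac.strip().lower().replace("-", ":")
        macs_by_ip[ip.strip()].add(normalized)
    return {ip: macs for ip, macs in macs_by_ip.items() if len(macs) > 1}
```

For example, passing `[("192.168.0.10", "fa:16:3e:00:00:01"), ("192.168.0.10", "fa:16:3e:00:00:02")]` reports `192.168.0.10` with both MAC addresses, which identifies the two servers whose IP address needs to be changed.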

**Table 16** Virtual Private Cloud (VPC)
Event Source	Namespace	Event Name	Event ID	Event Severity
VPC	SYS.VPC	VPC deleted	deleteVpc	Major
		VPC modified	modifyVpc	Minor
		Subnet deleted	deleteSubnet	Minor
		Subnet modified	modifySubnet	Minor
		Bandwidth modified	modifyBandwidth	Minor
		VPN deleted	deleteVpn	Major
		VPN modified	modifyVpn	Minor

**Table 17** Elastic Volume Service (EVS)
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
EVS	SYS.EVS	Update disk	updateVolume	Minor	Update the name and description of an EVS disk.	No further action is required.	None
		Expand disk	extendVolume	Minor	Expand an EVS disk.	No further action is required.	None
		Delete disk	deleteVolume	Major	Delete an EVS disk.	No further action is required.	Deleted disks cannot be recovered.
		QoS upper limit reached NOTE: This event is no longer supported for EVS and will be removed from Cloud Eye.	reachQoS	Major	The I/O latency increases because the QoS upper limits of the disk are frequently reached and flow control is triggered.	Change the disk type to one with a higher specification.	The current disk may fail to meet service requirements.

**Table 18** Identity and Access Management (IAM)
Event Source	Namespace	Event Name	Event ID	Event Severity
IAM	SYS.IAM	Login	login	Minor
		Logout	logout	Minor
		Password changed	changePassword	Major
		User created	createUser	Minor
		User deleted	deleteUser	Major
		User updated	updateUser	Minor
		User group created	createUserGroup	Minor
		User group deleted	deleteUserGroup	Major
		User group updated	updateUserGroup	Minor
		Identity provider created	createIdentityProvider	Minor
		Identity provider deleted	deleteIdentityProvider	Major
		Identity provider updated	updateIdentityProvider	Minor
		Metadata updated	updateMetadata	Minor
		Security policy updated	updateSecurityPolicies	Major
		Credential added	addCredential	Major
		Credential deleted	deleteCredential	Major
		Project created	createProject	Minor
		Project updated	updateProject	Minor
		Project suspended	suspendProject	Major

**Table 19** Key Management Service (KMS)
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
KMS	SYS.KMS	Key disabled	disableKey	Major	A key is disabled and cannot be used.	If the customer needs to disable the key, no action is required. However, if the key is disabled by mistake, the customer needs to log in to the DEW console and enable it again.	Services may be affected if the key is being used.
		Key deletion scheduled	scheduleKeyDeletion	Minor	A key is scheduled to be deleted and cannot be used.	If the customer needs to delete the key, no action is required. However, if the deletion of the key is scheduled by mistake, the customer needs to log in to the DEW console, cancel the scheduled deletion, and enable the key again.	Services may be affected if the key is being used.
		Grant retired	retireGrant	Major	A grant is retired and the key cannot be used.	If the customer needs to cancel the key grant, no action is required. However, if the grant is canceled by mistake, the customer needs to log in to the DEW console and create the grant again.	Services may be affected if the key is being used.
		Grant revoked	revokeGrant	Major	A grant is revoked and the key cannot be used.	If the customer needs to cancel the key grant, no action is required. However, if the grant is canceled by mistake, the customer needs to log in to the DEW console and create the grant again.	Services may be affected if the key is being used.

**Table 20** Object Storage Service (OBS)
Event Source	Namespace	Event Name	Event ID	Event Severity
OBS	SYS.OBS	Bucket deleted	deleteBucket	Major
		Bucket policy deleted	deleteBucketPolicy	Major
		Bucket ACL configured	setBucketAcl	Minor
		Bucket policy configured	setBucketPolicy	Minor

**Table 21** Cloud Eye
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
Cloud Eye	SYS.CES	Agent heartbeat interruption	agentHeartbeatInterrupted	Major	The collecting process of the Agent is faulty.	Check whether the Agent domain name can be resolved. Check whether your account is in arrears. If the Agent process is faulty, restart the Agent. If the Agent process is still faulty after the restart, the Agent files may be damaged. In this case, reinstall the Agent. Check whether the server time is consistent with the local standard time. If the DNS server is not a Huawei Cloud DNS server, run the dig domain-name command to obtain the IP address of agent.ces.myhuaweicloud.com as resolved by the Huawei Cloud DNS server over the intranet, and then add the IP address to the corresponding hosts file. Update the Agent to the latest version.	The Agent will stop collecting and reporting metrics.
		Agent back to normal	agentResumed	Informational	The Agent was back to normal.	No action is required.	None
		Agent faulty	agentFaulted	Major	The Agent was faulty and this status was reported to Cloud Eye.	If the Agent process is faulty, restart the Agent. If the Agent process is still faulty after the restart, the Agent files may be damaged. In this case, reinstall the Agent. Update the Agent to the latest version.	The Agent will stop collecting and reporting metrics.
		Agent disconnected	agentDisconnected	Major	The communication process of the Agent is faulty.	Check whether the Agent domain name can be resolved. Check whether your account is in arrears. If the Agent process is faulty, restart the Agent. If the Agent process is still faulty after the restart, the Agent files may be damaged. In this case, reinstall the Agent. Check whether the server time is consistent with the local standard time. If the DNS server is not a Huawei Cloud DNS server, run the dig domain-name command to obtain the IP address of agent.ces.myhuaweicloud.com as resolved by the Huawei Cloud DNS server over the intranet, and then add the IP address to the corresponding hosts file. Update the Agent to the latest version.	The Agent will stop collecting and reporting metrics.
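
The Agent troubleshooting steps above end with adding the resolved IP address of agent.ces.myhuaweicloud.com to the hosts file. The sketch below shows one way to build that hosts entry and to check whether an existing hosts file already contains it. It makes several assumptions: `hosts_line` and `needs_update` are hypothetical helpers, the code performs no DNS lookup and does not write to any file, and the IP address must come from the dig result described in the table.

```python
import ipaddress

AGENT_HOST = "agent.ces.myhuaweicloud.com"  # host name taken from the table above


def hosts_line(ip: str, hostname: str = AGENT_HOST) -> str:
    """Format a hosts-file entry after validating the IP address."""
    ipaddress.ip_address(ip)  # raises ValueError for a malformed address
    return f"{ip}\t{hostname}"


def needs_update(hosts_text: str, ip: str, hostname: str = AGENT_HOST) -> bool:
    """Return True if the hosts content lacks the entry or maps it to another IP."""
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2 and hostname in fields[1:]:
            return fields[0] != ip  # the first matching entry decides
    return True
```

With a placeholder address such as `10.0.0.5`, `hosts_line("10.0.0.5")` yields a tab-separated entry for the Agent host name. Appending it to the hosts file is left to the operator because that step requires administrator permissions.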

**Table 22** Enterprise Switch
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
Enterprise Switch	SYS.ESW	IP addresses conflicted	IPConflict	Major	A cloud server and an on-premises server that need to communicate use the same IP address.	Check the ARP and switch information to locate the servers that have the same IP address and change the IP address.	The communications between the on-premises and cloud servers may be abnormal.

**Table 23** Cloud Secret Management Service (CSMS)
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
CSMS	SYS.CSMS	Operation on secret scheduled for deletion	operateDeletedSecret	Major	A user attempts to perform operations on a secret that is scheduled to be deleted.	Check whether the scheduled secret deletion needs to be canceled.	The user cannot perform operations on the secret scheduled to be deleted.

**Table 24** Distributed Cache Service (DCS)
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
DCS	SYS.DCS	Full sync retry during online migration	migrationFullResync	Minor	If online migration fails, full synchronization will be triggered because incremental synchronization cannot be performed.	Check whether full sync retries are triggered repeatedly. Check whether the source instance is connected and whether it is overloaded. If full sync retries are triggered repeatedly, contact O&M personnel.	The migration task is disconnected from the source instance, triggering another full sync. As a result, the CPU usage of the source instance may increase sharply.
		Automatic failover	masterStandbyFailover	Minor	The master node was abnormal, promoting a replica to master.	Check whether services can recover by themselves. If applications cannot recover, restart them.	Persistent connections to the instance will be interrupted.
		Memcached master/standby switchover	memcachedMasterStandbyFailover	Minor	The master node was abnormal, promoting the standby node to master.	Check whether services can recover by themselves. If applications cannot recover, restart them.	Persistent connections to the instance will be interrupted.
		Redis server abnormal	redisNodeStatusAbnormal	Major	The Redis server status was abnormal.	Check whether services are affected. If yes, contact O&M personnel.	If the master node is abnormal, an automatic failover is performed. If a standby node is abnormal and the client directly connects to the standby node for read/write splitting, no data can be read.
		Redis server recovered	redisNodeStatusNormal	Major	The Redis server status recovered.	Check whether services can recover. If the applications are not reconnected, restart them.	The instance has recovered from the exception.
		Sync failure in data migration	migrateSyncDataFail	Major	Online migration failed.	Reconfigure the migration task and migrate data again. If the fault persists, contact O&M personnel.	Data migration fails.
		Memcached instance abnormal	memcachedInstanceStatusAbnormal	Major	The Memcached node status was abnormal.	Check whether services are affected. If yes, contact O&M personnel.	The Memcached instance is abnormal and may not be accessed.
		Memcached instance recovered	memcachedInstanceStatusNormal	Major	The Memcached node status recovered.	Check whether services can recover. If the applications are not reconnected, restart them.	The instance has recovered from the exception.
		Instance backup failure	instanceBackupFailure	Major	The DCS instance fails to be backed up due to an OBS access failure.	Retry backup manually.	Automated backup fails.
		Instance node abnormal restart	instanceNodeAbnormalRestart	Major	DCS nodes restarted unexpectedly when they became faulty.	Check whether services can recover by themselves. If applications cannot recover, restart them.	Persistent connections to the instance will be interrupted.
		Long-running Lua scripts stopped	scriptsStopped	Informational	Lua scripts that had timed out automatically stopped running.	Optimize Lua scripts to prevent execution timeout.	The Lua scripts took too long to execute and were forcibly interrupted. Long-running Lua scripts block the entire instance.
		Node restarted	nodeRestarted	Informational	After write operations had been performed, the node automatically restarted to stop Lua scripts that had timed out.	Check whether services can recover by themselves. If applications cannot recover, restart them.	Persistent connections to the instance will be interrupted.
		Bandwidth scaling	bandwidthAutoScalingTriggered	Informational	The instance bandwidth was used up.	Check the services on this instance.	A bandwidth increase incurs fees.
		Specification auto scaling triggered	specAutoScalingTriggered	Informational	Specifications auto scaling was triggered.	Check the services on this instance.	The instance specifications were used up, triggering auto scaling. The billing will be changed if the instance specifications are changed.
		Specifications scaled	specAutoScalingTriggeredSuccess	Informational	The instance specifications were scaled successfully.	Check the services on this instance.	Instance scaled up. Check its information.
		Scale specifications failed	specAutoScalingTriggeredFail	Critical	The instance specifications fail to be scaled.	Contact technical support.	Instance scaling failed. Log in to the console to check whether services are affected.

**Table 25** Intelligent Cloud Access (ICA)
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
ICA	SYS.ICA	BGP peer disconnection	BgpPeerDisconnection	Major	The BGP peer is disconnected.	Log in to the gateway and locate the cause.	Service traffic may be interrupted.
		BGP peer connection success	BgpPeerConnectionSuccess	Major	The BGP peer is successfully connected.	None	None
		Abnormal GRE tunnel status	AbnormalGreTunnelStatus	Major	The GRE tunnel status is abnormal.	Log in to the gateway and locate the cause.	Service traffic may be interrupted.
		Normal GRE tunnel status	NormalGreTunnelStatus	Major	The GRE tunnel status is normal.	None	None
		WAN interface goes up	EquipmentWanGoingOnline	Major	The WAN interface goes online.	None	None
		WAN interface goes down	EquipmentWanGoingOffline	Major	The WAN interface goes offline.	Check whether the event is caused by a manual operation or device fault.	The device cannot be used.
		Intelligent enterprise gateway going online	IntelligentEnterpriseGatewayGoingOnline	Major	The intelligent enterprise gateway goes online.	None	None
		Intelligent enterprise gateway going offline	IntelligentEnterpriseGatewayGoingOffline	Major	The intelligent enterprise gateway goes offline.	Check whether the event is caused by a manual operation or device fault.	The device cannot be used.

**Table 26** Host Security Service (HSS)
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
HSS	SYS.HSS	HSS agent disconnected	hssAgentAbnormalOffline	Major	The communication between the agent and the server is abnormal, or the agent process on the server is abnormal.	Fix your network connection. If the agent is still offline for a long time after the network recovers, the agent process may be abnormal. In this case, log in to the server and restart the agent process.	Services are interrupted.
HSS	SYS.HSS	Abnormal HSS agent status	hssAgentAbnormalProtection	Major	The agent is abnormal probably because it does not have sufficient resources.	Log in to the server and check your resources. If the usage of memory or other system resources is too high, increase their capacity first. If the resources are sufficient but the fault persists after the agent process is restarted, submit a service ticket to the O&M personnel.	Services are interrupted.

**Table 27** Cloud Storage Gateway (CSG)
Event Source	Namespace	Event Name	Event ID	Event Severity	Description
CSG	SYS.CSG	Abnormal CSG process status	gatewayProcessStatusAbnormal	Major	This event is triggered when an exception occurs in the CSG process status.
		Abnormal CSG connection status	gatewayToServiceConnectAbnormal	Major	This event is triggered when no CSG status report is returned for five consecutive periods.
		Abnormal connection status between CSG and OBS	gatewayToObsConnectAbnormal	Major	This event is triggered when CSG cannot connect to OBS.
		Read-only file system	gatewayFileSystemReadOnly	Major	This event is triggered when the partition file system on CSG becomes read-only.
		Read-only file share	gatewayFileShareReadOnly	Major	This event is triggered when the file share becomes read-only due to insufficient cache disk storage space.
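
The trigger condition for the abnormal CSG connection status event (no status report for five consecutive periods) can be illustrated with a short Python sketch. It shows only the counting logic and rests on assumptions: the function name, the boolean-per-period input, and the reset of the counter on any received report are hypothetical and are not taken from the CSG implementation.

```python
from typing import Iterable

# From the table: the event fires after five consecutive periods without a report.
MISSED_PERIODS_THRESHOLD = 5


def connection_abnormal(reports: Iterable[bool],
                        threshold: int = MISSED_PERIODS_THRESHOLD) -> bool:
    """Return True once `threshold` consecutive periods have no status report.

    `reports` yields one bool per period: True if a report arrived in that period.
    """
    missed = 0
    for received in reports:
        # Assumption: any received report resets the consecutive-miss counter.
        missed = 0 if received else missed + 1
        if missed >= threshold:
            return True
    return False
```

For example, four missed periods followed by a received report do not trigger the condition, because the run of misses is broken before it reaches five.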

**Table 28** Enterprise connection
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
EC	SYS.EC	WAN interface goes up	EquipmentWanGoesOnline	Major	The WAN interface goes online.	None	None
		WAN interface goes down	EquipmentWanGoesOffline	Major	The WAN interface goes offline.	Check whether the event is caused by a manual operation or device fault.	The device cannot be used.
		BGP peer disconnection	BgpPeerDisconnection	Major	The BGP peer is disconnected.	Check whether the event is caused by a manual operation or device fault.	The device cannot be used.
		BGP peer connection success	BgpPeerConnectionSuccess	Major	The BGP peer is successfully connected.	None	None
		Abnormal GRE tunnel status	AbnormalGreTunnelStatus	Major	The GRE tunnel status is abnormal.	Check whether the event is caused by a manual operation or device fault.	The device cannot be used.
		Normal GRE tunnel status	NormalGreTunnelStatus	Major	The GRE tunnel status is normal.	None	None
		Intelligent enterprise gateway going online	IntelligentEnterpriseGatewayGoesOnline	Major	The intelligent enterprise gateway goes online.	None	None
		Intelligent enterprise gateway going offline	IntelligentEnterpriseGatewayGoesOffline	Major	The intelligent enterprise gateway goes offline.	Check whether the event is caused by a manual operation or device fault.	The device cannot be used.

**Table 29** Cloud Certificate Manager (CCM)
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
CCM	SYS.CCM	Certificate revocation	CCMRevokeCertificate	Major	The certificate has entered the revocation process. Once revoked, the certificate can no longer be used.	Check whether the certificate really needs to be revoked. Certificate revocation can be canceled.	If a certificate is revoked, the website is inaccessible using HTTPS.
		Certificate auto-deployment failure	CCMAutoDeploymentFailure	Major	The certificate fails to be automatically deployed.	Check service resources whose certificates need to be replaced.	If no new certificate is deployed after a certificate expires, the website is inaccessible using HTTPS.
		Certificate expiration	CCMCertificateExpiration	Major	An SSL certificate has expired.	Purchase a new certificate in a timely manner.	If no new certificate is deployed after a certificate expires, the website is inaccessible using HTTPS.
		Certificate about to expire	CCMcertificateAboutToExpiration	Major	This alarm is generated when an SSL certificate is about to expire within one week, one month, or two months.	Renew or purchase a new certificate in a timely manner.	If no new certificate is deployed after a certificate expires, the website is inaccessible using HTTPS.
		Private certificate is about to expire	CCMPrivateCertificateAboutToExpiration	Major	A private certificate is considered about to expire if it is within 7 or 30 days of its expiration date.	Purchase a new private certificate in a timely manner.	If no new private certificate has been deployed before the certificate expires, services may be interrupted.
		Private CA is about to expire	CCMPrivateCAAboutToExpiration	Major	A private CA is considered about to expire if it is within one month, three months, or six months of its expiration date.	Purchase a new private CA in a timely manner.	If no new CA has been deployed before a private CA expires, services may be interrupted.
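
The "about to expire" windows in the CCM table can be expressed as a simple date check. The following Python sketch uses the private certificate windows (7 or 30 days) from the table. It assumes a window applies when the remaining lifetime is at or below that many days, which this document does not state explicitly. The function name `expiry_window` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence

# From the table: a private certificate is about to expire within 7 or 30 days.
PRIVATE_CERT_WINDOWS_DAYS = (7, 30)


def expiry_window(not_after: datetime, now: datetime,
                  windows_days: Sequence[int] = PRIVATE_CERT_WINDOWS_DAYS) -> Optional[int]:
    """Return the smallest window (in days) that contains the remaining lifetime.

    Returns None if the certificate has already expired or lies outside every window.
    """
    remaining = not_after - now
    if remaining <= timedelta(0):
        return None  # already expired: a different event applies
    for days in sorted(windows_days):
        if remaining <= timedelta(days=days):
            return days
    return None


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(expiry_window(now + timedelta(days=20), now))
```

Passing a different `windows_days` tuple adapts the same check to other rows, provided the month-based windows are first converted to a number of days of your choosing.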

**Table 30** Workspace
Event Source	Namespace	Event Name	Event ID	Event Severity	Description	Solution	Impact
Workspace	SYS.Workspace	Abnormal desktop heartbeat	desktopStatusAbnormal	Major	The network is disconnected or the key is lost.	Restart the desktop. Check whether the desktop time is the current time. If not, change the desktop time to the current time. Check whether special security software or network connection software is installed on the desktop. If so, uninstall the software and restart the system. Alternatively, uninstall the software, reinstall the HDCAgent, and restart the system.	The desktop cannot be accessed.
		Failure of assigning desktops in a desktop pool	desktopPoolAssignFailed	Major	This fault is caused by policies.	Adjust the desktop pool policy to ensure that there are idle desktops in the desktop pool or desktops can be automatically created. If Linux desktops cannot be assigned to users with digit-only usernames, enable the username prefix function.	New desktops cannot be assigned.
		Desktop access failure	desktopAccessFailed	Major	This fault is caused by a VM being stopped or restarted, access gateway exceptions, or network faults.	If you stopped or restarted a VM, wait for a period of time and try again when the desktop status is normal. Check the network environment and reconnect when the network is normal.	The desktop cannot be accessed.
		Desktop startup failure	desktopStartFailed	Major	The underlying resources are insufficient.	Wait for a while and try again.	The desktop cannot be accessed.
		Failure of automatic desktop pool capacity expansion	desktopPoolExpandFailed	Major	The instance quota or underlying resources are insufficient.	If the quota is insufficient, request a higher quota (such as the number of desktops, CPUs, memory, and VPCs). If underlying resources are insufficient, make purchases in the next capacity expansion period. If automatic desktop capacity expansion is not required, disable the function of automatic desktop pool capacity expansion.	Desktop capacity cannot be expanded.
		Failure of migrating a desktop running on a dedicated host	desktopMigrateFailed	Major	The host malfunctions.	Replace the faulty host with a normal one. Contact technical support to rectify the host fault.	No dedicated host is available for desktop scheduling.
		User login failure	userLoginFailed	Major	The client network is disconnected, or the enterprise ID, username, or password is incorrect.	Check the network environment and reconnect to the network when the network is normal. Check whether the enterprise ID, username, and password are valid.	Desktops or applications are unavailable.
		Screen recording failure	screenRecordFailed	Major	An unknown exception occurred on the desktop.	Try reconnecting to the desktop. Check whether special security software is installed on the desktop. If it is, uninstall the software and restart the system.	Screen recording malfunctions and the desktop is disconnected.
		Screen recording upload failure	screenRecordUploadFailed	Major	The network between the desktop and OBS malfunctions.	Check whether the desktop network is normal. Check whether security group interception has been configured. Check whether interception on access control with VPCEP and OBS has been configured.	Screen recording file upload failed.
		Damaged screen recording file	screenRecordFileDamaged	Major	The screen recording file was maliciously damaged.	Wait until the screen recording function is automatically restored. Check whether malicious damage occurred.	The screen recording file is abnormal.
		Abnormal agent process	agentAbnormal	Major	The agent process has been killed or reset.	The agent process can be automatically restarted after being killed.	Functions such as application control and upgrade will be affected.
		Bypassing controlled applications	appRestrictFailed	Major	The application control agent was continuously killed.	Check whether a script is used to continuously kill the application control agent.	Application control will fail.
