Events Supported by Event Monitoring
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
ECS | Auto recovery timeout (being processed on the backend) | faultAutoRecovery | Major | Migrating the ECS to a normal host timed out. | Migrate services to other ECSs. | Services are interrupted.
 | Restart triggered due to hardware fault | startAutoRecovery | Major | ECSs on a faulty host are automatically migrated to another properly running host. During the migration, the ECSs are restarted. | Wait for the event to end and check whether services are affected. | Services may be interrupted.
 | Restart completed due to hardware failure | endAutoRecovery | Major | The ECS was recovered after the automatic migration. | This event indicates that the ECS has recovered and is working properly. | None
 | GPU link fault | GPULinkFault | Critical | The GPU of the host running the ECS was faulty or was recovering from a fault. | Deploy service applications in HA mode. After the GPU fault is rectified, check whether services are restored. | Services are interrupted.
 | FPGA link fault | FPGALinkFault | Critical | The FPGA of the host running the ECS was faulty or was recovering from a fault. | Deploy service applications in HA mode. After the FPGA fault is rectified, check whether services are restored. | Services are interrupted.
 | ECS deleted | deleteServer | Major | The ECS was deleted. | Check whether the deletion was performed intentionally by a user. | Services are interrupted.
 | ECS restarted | rebootServer | Minor | The ECS was restarted. | Check whether the restart was performed intentionally by a user. | Services are interrupted.
 | ECS stopped | stopServer | Minor | The ECS was stopped. NOTE: This event is reported only after CTS is enabled. For details, see the Cloud Trace Service User Guide. | | Services are interrupted.
 | NIC deleted | deleteNic | Major | The ECS NIC was deleted. | | Services may be interrupted.
 | ECS resized | resizeServer | Minor | The ECS was resized. | | Services are interrupted.
 | GuestOS restarted | RestartGuestOS | Minor | The guest OS was restarted. | Contact O&M personnel. | Services may be interrupted.
 | ECS failure due to abnormal host processes | VMFaultsByHostProcessExceptions | Critical | The processes of the host accommodating the ECS were abnormal. | Contact O&M personnel. | The ECS is faulty.
 | Startup failure | faultPowerOn | Major | The ECS failed to start. | Start the ECS again. If the problem persists, contact O&M personnel. | The ECS cannot start.
 | Host breakdown risk | hostMayCrash | Major | The host where the ECS resides may break down, and the risk cannot be prevented through live migration. | Migrate services running on the ECS first and delete or stop the ECS. Start the ECS only after the O&M personnel eliminate the risk. | The host may break down, causing service interruption.
 | Live migration started | liveMigrationStarted | Major | The host where the ECS is located may be faulty. The ECS is live migrated in advance to prevent service interruptions caused by a host breakdown. | Wait for the event to end and check whether services are affected. | Services may be interrupted for less than 1s.
 | Live migration completed | liveMigrationCompleted | Major | The live migration is complete, and the ECS is running properly. | Check whether services are running properly. | None
 | Live migration failure | liveMigrationFailed | Major | An error occurred during the live migration of an ECS. | Check whether services are running properly. | There is a low probability that services are interrupted.
Once a physical host running ECSs breaks down, the ECSs are automatically migrated to a functional physical host. During the migration, the ECSs will be restarted.
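Because the event IDs in these tables are stable identifiers, they are a convenient key for automated alert routing. Below is a minimal, SDK-agnostic Python sketch that maps some of the ECS event IDs above to their documented severities and picks a handling action. The severity map is taken from the table; the routing rules and handler strings are hypothetical placeholders for your own alerting pipeline.

```python
# Sketch: route ECS events from the table above by documented severity.
# Event IDs and severities come from the table; handlers are illustrative.

ECS_EVENT_SEVERITY = {
    "faultAutoRecovery": "Major",
    "startAutoRecovery": "Major",
    "endAutoRecovery": "Major",
    "GPULinkFault": "Critical",
    "FPGALinkFault": "Critical",
    "deleteServer": "Major",
    "rebootServer": "Minor",
    "stopServer": "Minor",
    "liveMigrationStarted": "Major",
    "liveMigrationCompleted": "Major",
    "liveMigrationFailed": "Major",
}

def route_event(event_id: str) -> str:
    """Return a handling hint based on the documented severity."""
    severity = ECS_EVENT_SEVERITY.get(event_id, "Unknown")
    if severity == "Critical":
        return "page the on-call engineer"
    if severity == "Major":
        return "raise an alarm and check service impact"
    return "log for audit"

print(route_event("GPULinkFault"))  # -> page the on-call engineer
```

The same pattern extends naturally to the other event sources in this section.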
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
BMS | BMS restarted | osReboot | Major | The BMS was restarted. | | Services are interrupted.
 | Unexpected restart | serverReboot | Major | The BMS restarted unexpectedly. | | Services are interrupted.
 | BMS stopped | osShutdown | Major | The BMS was stopped. | | Services are interrupted.
 | Unexpected shutdown | serverShutdown | Major | The BMS was stopped unexpectedly. | | Services are interrupted.
 | Network disconnection | linkDown | Major | The BMS network was disconnected. | | Services are interrupted.
 | PCIe error | pcieError | Major | The PCIe device or mainboard on the BMS was faulty. | | The network or disk read/write services are affected.
 | Disk fault | diskError | Major | The hard disk backplane or a hard disk on the BMS was faulty. | | Data read/write services are affected, or the BMS cannot be started.
 | EVS error | storageError | Major | The BMS failed to connect to EVS disks. | | Data read/write services are affected, or the BMS cannot be started.
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
EIP | EIP bandwidth exceeded | EIPBandwidthOverflow | Major | The used bandwidth exceeded the purchased bandwidth, which may slow down the network or cause packet loss. The value of this event is the maximum value in a monitoring period, and the value of the EIP inbound and outbound bandwidth is the value at a specific time point in the period. The metrics are as follows: egressDropBandwidth (dropped outbound packets, bytes), egressAcceptBandwidth (accepted outbound packets, bytes), egressMaxBandwidthPerSec (peak outbound bandwidth, byte/s), ingressAcceptBandwidth (accepted inbound packets, bytes), ingressMaxBandwidthPerSec (peak inbound bandwidth, byte/s), and ingressDropBandwidth (dropped inbound packets, bytes). | Check whether the EIP bandwidth keeps increasing and whether services are normal. Increase the bandwidth if necessary. | The network becomes slow or packets are lost.
 | EIP released | deleteEip | Minor | The EIP was released. | Check whether the EIP was released by mistake. | The server that has the EIP bound cannot access the Internet.
 | EIP blocked | blockEIP | Critical | The used bandwidth of an EIP exceeded 5 Gbit/s, so the EIP was blocked and packets were discarded. Such an event may be caused by DDoS attacks. | Replace the EIP to prevent services from being affected. Locate and deal with the fault. | Services are impacted.
 | EIP unblocked | unblockEIP | Critical | The EIP was unblocked. | Use the previous EIP again. | None
 | EIP traffic scrubbing started | ddosCleanEIP | Major | Traffic scrubbing on the EIP was started to prevent DDoS attacks. | Check whether the EIP was attacked. | Services may be interrupted.
 | EIP traffic scrubbing ended | ddosEndCleanEip | Major | Traffic scrubbing on the EIP to prevent DDoS attacks was ended. | Check whether the EIP was attacked. | Services may be interrupted.
 | QoS bandwidth exceeded | EIPBandwidthRuleOverflow | Major | The used QoS bandwidth exceeded the allocated bandwidth, which may slow down the network or cause packet loss. The value of this event is the maximum value in a monitoring period, and the value of the EIP inbound and outbound bandwidth is the value at a specific time point in the period. The metrics are as follows: egressDropBandwidth (dropped outbound packets, bytes), egressAcceptBandwidth (accepted outbound packets, bytes), egressMaxBandwidthPerSec (peak outbound bandwidth, byte/s), ingressAcceptBandwidth (accepted inbound packets, bytes), ingressMaxBandwidthPerSec (peak inbound bandwidth, byte/s), and ingressDropBandwidth (dropped inbound packets, bytes). | Check whether the EIP bandwidth keeps increasing and whether services are normal. Increase the bandwidth if necessary. | The network becomes slow or packets are lost.
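To make the bandwidth-overflow metrics above concrete, here is a small worked example in Python. The metric names come from the table; the sample values and the 200 Mbit/s purchased bandwidth are made-up assumptions. The event indicates overflow when packets were dropped in the period or the peak rate exceeded the purchased rate.

```python
# Sketch: interpreting EIPBandwidthOverflow metrics (sample values are invented).

sample = {
    "egressAcceptBandwidth": 1_500_000_000,  # accepted outbound bytes in the period
    "egressDropBandwidth": 120_000_000,      # dropped outbound bytes in the period
    "egressMaxBandwidthPerSec": 26_250_000,  # peak outbound bandwidth, byte/s
}

purchased_bits_per_sec = 200 * 10**6  # hypothetical 200 Mbit/s purchased bandwidth

peak_bits_per_sec = sample["egressMaxBandwidthPerSec"] * 8  # 210 Mbit/s here
if sample["egressDropBandwidth"] > 0 or peak_bits_per_sec > purchased_bits_per_sec:
    print("Bandwidth overflow: packets were dropped or the peak exceeded "
          "the purchased bandwidth; consider increasing the bandwidth.")
```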
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
AAD | DDoS Attack Events | ddosAttackEvents | Major | A DDoS attack occurs in the AAD protected lines. | Assess the impact on services based on the attack traffic and attack type. If the attack traffic exceeds your purchased elastic bandwidth, change to another line or increase your bandwidth. | Services may be interrupted.
 | Domain name scheduling event | domainNameDispatchEvents | Major | The high-defense CNAME corresponding to the domain name is scheduled, and the domain name is resolved to another high-defense IP address. | Pay attention to the workloads involving the domain name. | Services are not affected.
 | Blackhole event | blackHoleEvents | Major | The attack traffic exceeds the purchased AAD protection threshold. | A blackhole is canceled after 30 minutes by default. The actual blackhole duration depends on how many times the blackhole has been triggered and the peak attack traffic on the current day; the maximum duration is 24 hours. If you need access to be restored before the blackhole is canceled, contact technical support. | Services may be interrupted.
 | Cancel Blackhole | cancelBlackHole | Informational | The customer's AAD instance recovers from the blackhole state. | This is only a prompt and no action is required. | Customer services recover.
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
CBR | Failed to create the backup. | backupFailed | Critical | The backup failed to be created. | Manually create a backup or contact customer service. | Data loss may occur.
 | Failed to restore the resource using a backup. | restorationFailed | Critical | The resource failed to be restored using a backup. | Restore the resource using another backup or contact customer service. | Data loss may occur.
 | Failed to delete the backup. | backupDeleteFailed | Critical | The backup failed to be deleted. | Try again later or contact customer service. | Charging may be abnormal.
 | Failed to delete the vault. | vaultDeleteFailed | Critical | The vault failed to be deleted. | Try again later or contact technical support. | Charging may be abnormal.
 | Replication failure | replicationFailed | Critical | The backup failed to be replicated. | Try again later or contact technical support. | Data loss may occur.
 | The backup is created successfully. | backupSucceeded | Major | The backup was created. | None | None
 | Resource restoration using a backup succeeded. | restorationSucceeded | Major | The resource was restored using a backup. | Check whether the data is successfully restored. | None
 | The backup is deleted successfully. | backupDeletionSucceeded | Major | The backup was deleted. | None | None
 | The vault is deleted successfully. | vaultDeletionSucceeded | Major | The vault was deleted. | None | None
 | Replication success | replicationSucceeded | Major | The backup was replicated successfully. | None | None
 | Client offline | agentOffline | Critical | The backup client was offline. | Ensure that the Agent status is normal and the backup client can connect to the cloud service platform. | Backup tasks may fail.
 | Client online | agentOnline | Major | The backup client was online. | None | None
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
RDS | DB instance creation failure | createInstanceFailed | Major | A DB instance fails to be created because the number of disks is insufficient, the quota is insufficient, or underlying resources are exhausted. | Check the disk quantity and quota. Release resources and create the DB instance again. | DB instances cannot be created.
 | Full backup failure | fullBackupFailed | Major | A single full backup failure does not affect the files that have been successfully backed up, but it prolongs the incremental backup time during a point-in-time restore (PITR). | Create a manual backup again. | Backup failed.
 | Primary/standby switchover failure | activeStandBySwitchFailed | Major | The standby DB instance does not take over workloads from the primary DB instance due to network or server failures. The original primary DB instance continues to provide services within a short time. | Check whether the connection between your application and the database is re-established. | None
 | Replication status abnormal | abnormalReplicationStatus | Major | The possible causes are as follows: the replication delay between the primary and standby instances is too long, which usually occurs when a large amount of data is being written to databases or a large transaction is being processed (during peak hours, data may be blocked); or the network between the primary and standby instances is disconnected. | Submit a service ticket. | Your applications are not affected because this event does not interrupt data read and write.
 | Replication status recovered | replicationStatusRecovered | Major | The replication delay between the primary and standby instances is within the normal range, or the network connection between them has been restored. | No action is required. | None
 | DB instance faulty | faultyDBInstance | Major | A single or primary DB instance was faulty due to a disaster or a server failure. | Check whether an automated backup policy has been configured for the DB instance and submit a service ticket. | The database service may be unavailable.
 | DB instance recovered | DBInstanceRecovered | Major | RDS rebuilds the standby DB instance with its high availability capability. After the instance is rebuilt, this event is reported. | No action is required. | None
 | Failure of changing single DB instance to primary/standby | singleToHaFailed | Major | A fault occurs when RDS is creating the standby DB instance or configuring replication between the primary and standby DB instances. The fault may occur because resources are insufficient in the data center where the standby DB instance is located. | Submit a service ticket. | Your applications are not affected because this event does not interrupt data read and write of the DB instance.
 | Database process restarted | DatabaseProcessRestarted | Major | The database process is stopped due to insufficient memory or high load. | Log in to the Cloud Eye console. Check whether the memory usage increases sharply, the CPU usage is too high for a long time, or the storage space is insufficient. You can increase the CPU and memory specifications or optimize the service logic. | Downtime occurs. When this happens, RDS automatically restarts the database process and attempts to recover the workloads.
 | Instance storage full | instanceDiskFull | Major | Generally, the cause is that the data space usage is too high. | Scale up the instance. | The DB instance becomes read-only because the storage space is full, and data cannot be written to the database.
 | Instance storage full recovered | instanceDiskFullRecovered | Major | The instance disk is recovered. | No action is required. | The instance is restored and supports both read and write operations.
 | Kafka connection failed | kafkaConnectionFailed | Major | The network is unstable or the Kafka server does not work properly. | Check your network connection and the Kafka server status. | Audit logs cannot be sent to the Kafka server.
Event Source | Event Name | Event ID | Event Severity | Description
---|---|---|---|---
RDS | Reset administrator password | resetPassword | Major | The password of the database administrator is reset.
 | Operate DB instance | instanceAction | Major | The storage space is scaled or the instance class is changed.
 | Delete DB instance | deleteInstance | Minor | The DB instance is deleted.
 | Modify backup policy | setBackupPolicy | Minor | The backup policy is modified.
 | Modify parameter group | updateParameterGroup | Minor | The parameter group is modified.
 | Delete parameter group | deleteParameterGroup | Minor | The parameter group is deleted.
 | Reset parameter group | resetParameterGroup | Minor | The parameter group is reset.
 | Change database port | changeInstancePort | Major | The database port is changed.
 | Primary/standby switchover or failover | PrimaryStandbySwitched | Major | A switchover or failover is performed.
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
DDS | DB instance creation failure | DDSCreateInstanceFailed | Major | A DDS instance fails to be created due to insufficient disks, quota, or underlying resources. | Check the disk quantity and quota. Release resources and create the DDS instance again. | DDS instances cannot be created.
 | Replication failed | DDSAbnormalReplicationStatus | Major | The replication delay between the primary and standby instances was too long, or the network between them was disconnected. | Submit a service ticket. | Your applications are not affected because this event does not interrupt data read and write.
 | Replication recovered | DDSReplicationStatusRecovered | Major | The replication delay between the primary and standby instances is within the normal range, or the network connection between them has been restored. | No action is required. | None
 | DB instance failed | DDSFaultyDBInstance | Major | This event is a key alarm event and is reported when an instance is faulty due to a disaster or a server failure. | Submit a service ticket. | The database service may be unavailable.
 | DB instance recovered | DDSDBInstanceRecovered | Major | If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported. | No action is required. | None
 | Faulty node | DDSFaultyDBNode | Major | This event is a key alarm event and is reported when a database node is faulty due to a disaster or a server failure. | Check whether the database service is available and submit a service ticket. | The database service may be unavailable.
 | Node recovered | DDSDBNodeRecovered | Major | If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported. | No action is required. | None
 | Primary/standby switchover or failover | DDSPrimaryStandbySwitched | Major | A primary/standby switchover is performed or a failover is triggered. | No action is required. | None
 | Insufficient storage space | DDSRiskyDataDiskUsage | Major | The storage space is insufficient. | Scale up the storage space. For details, see section "Scaling Up Storage Space" in the corresponding user guide. | The instance is set to read-only and data cannot be written to the instance.
 | Data disk expanded and being writable | DDSDataDiskUsageRecovered | Major | The capacity of a data disk has been expanded and the data disk becomes writable. | No action is required. | No adverse impact.
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
GaussDB NoSQL | DB instance creation failed | NoSQLCreateInstanceFailed | Major | The instance quota or underlying resources are insufficient. | Release the instances that are no longer used and try to provision them again, or submit a service ticket to adjust the quota. | DB instances cannot be created.
 | Specifications modification failed | NoSQLResizeInstanceFailed | Major | The underlying resources are insufficient. | Submit a service ticket. The O&M personnel will coordinate resources in the background, and then you can change the specifications again. | Services are interrupted.
 | Node adding failed | NoSQLAddNodesFailed | Major | The underlying resources are insufficient. | Submit a service ticket. The O&M personnel will coordinate resources in the background; then delete the node that failed to be added and add a new node. | None
 | Node deletion failed | NoSQLDeleteNodesFailed | Major | The underlying resources fail to be released. | Delete the node again. | None
 | Storage space scale-up failed | NoSQLScaleUpStorageFailed | Major | The underlying resources are insufficient. | Submit a service ticket. The O&M personnel will coordinate resources in the background; then scale up the storage space again. | Services may be interrupted.
 | Password reset failed | NoSQLResetPasswordFailed | Major | Resetting the password times out. | Reset the password again. | None
 | Parameter group change failed | NoSQLUpdateInstanceParamGroupFailed | Major | Changing a parameter group times out. | Change the parameter group again. | None
 | Backup policy configuration failed | NoSQLSetBackupPolicyFailed | Major | The database connection is abnormal. | Configure the backup policy again. | None
 | Manual backup creation failed | NoSQLCreateManualBackupFailed | Major | The backup files fail to be exported or uploaded. | Submit a service ticket to the O&M personnel. | Data cannot be backed up.
 | Automated backup creation failed | NoSQLCreateAutomatedBackupFailed | Major | The backup files fail to be exported or uploaded. | Submit a service ticket to the O&M personnel. | Data cannot be backed up.
 | Faulty DB instance | NoSQLFaultyDBInstance | Major | This event is a key alarm event and is reported when an instance is faulty due to a disaster or a server failure. | Submit a service ticket. | The database service may be unavailable.
 | DB instance recovered | NoSQLDBInstanceRecovered | Major | If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported. | No action is required. | None
 | Faulty node | NoSQLFaultyDBNode | Major | This event is a key alarm event and is reported when a database node is faulty due to a disaster or a server failure. | Check whether the database service is available and submit a service ticket. | The database service may be unavailable.
 | Node recovered | NoSQLDBNodeRecovered | Major | If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported. | No action is required. | None
 | Primary/standby switchover or failover | NoSQLPrimaryStandbySwitched | Major | This event is reported when a primary/standby switchover is performed or a failover is triggered. | No action is required. | None
 | HotKey occurred | HotKeyOccurs | Major | The primary key is improperly configured, so hotspot data is concentrated in one partition, or the application design causes frequent read and write operations on a single key. | 1. Choose a proper partition key (see the partition-key sketch after this table). 2. Add a service cache so that the application reads hotspot data from the cache first. | The service request success rate is affected, and the cluster performance and stability may also be affected.
 | BigKey occurred | BigKeyOccurs | Major | The primary key design is improper, and the number of records or the amount of data in a single partition is too large, causing unbalanced node loads. | 1. Choose a proper partition key. 2. Add a new partition key for hashing data (see the partition-key sketch after this table). | As the data in the large partition increases, the cluster stability deteriorates.
 | Insufficient storage space | NoSQLRiskyDataDiskUsage | Major | The storage space is insufficient. | Scale up the storage space. For details, see section "Scaling Up Storage Space" in the corresponding user guide. | The instance is set to read-only and data cannot be written to the instance.
 | Data disk expanded and being writable | NoSQLDataDiskUsageRecovered | Major | The capacity of a data disk has been expanded and the data disk becomes writable. | No operation is required. | None
 | Index creation failed | NoSQLCreateIndexFailed | Major | The service load exceeds what the instance specifications can handle. In this case, creating indexes consumes additional instance resources, the response slows or even freezes, and the creation times out. | Select instance specifications that match the service load. Create indexes during off-peak hours, create them in the background, and create only the indexes you need. | The index fails to be created or is incomplete and is therefore invalid. Delete the index and create a new one.
 | Write speed decreased | NoSQLStallingOccurs | Major | The write speed is close to the maximum write capability allowed by the cluster scale and instance specifications. As a result, the flow control mechanism of the database is triggered, and requests may fail. | 1. Adjust the cluster scale or node specifications based on the maximum write rate of services. 2. Measure the maximum write rate of services. | The success rate of service requests is affected.
 | Data write stopped | NoSQLStoppingOccurs | Major | The write speed has reached the maximum write capability allowed by the cluster scale and instance specifications. As a result, the flow control mechanism of the database is triggered, and requests may fail. | 1. Adjust the cluster scale or node specifications based on the maximum write rate of services. 2. Measure the maximum write rate of services. | The success rate of service requests is affected.
 | Database restart failed | NoSQLRestartDBFailed | Major | The instance status is abnormal. | Submit a service ticket to the O&M personnel. | The DB instance status may be abnormal.
 | Restoration to new DB instance failed | NoSQLRestoreToNewInstanceFailed | Major | The underlying resources are insufficient. | Submit a service ticket so that the O&M personnel can coordinate resources in the background and add new nodes. | Data cannot be restored to a new DB instance.
 | Restoration to existing DB instance failed | NoSQLRestoreToExistInstanceFailed | Major | The backup file fails to be downloaded or restored. | Submit a service ticket to the O&M personnel. | The current DB instance may be unavailable.
 | Backup file deletion failed | NoSQLDeleteBackupFailed | Major | The backup files fail to be deleted from OBS. | Delete the backup files again. | None
 | Failed to enable Show Original Log | NoSQLSwitchSlowlogPlainTextFailed | Major | The DB engine does not support this function. | Refer to the GaussDB NoSQL User Guide to ensure that the DB engine supports Show Original Log, then submit a service ticket to the O&M personnel. | None
 | EIP binding failed | NoSQLBindEipFailed | Major | The node status is abnormal, an EIP has already been bound to the node, or the EIP to be bound is invalid. | Check whether the node is normal and whether the EIP is valid. | The DB instance cannot be accessed from the Internet.
 | EIP unbinding failed | NoSQLUnbindEipFailed | Major | The node status is abnormal or the EIP has already been unbound from the node. | Check whether the node and EIP status are normal. | None
 | Parameter modification failed | NoSQLModifyParameterFailed | Major | The parameter value is invalid. | Check whether the parameter value is within the valid range and submit a service ticket to the O&M personnel. | None
 | Parameter group application failed | NoSQLApplyParameterGroupFailed | Major | The instance status is abnormal, so the parameter group cannot be applied. | Submit a service ticket to the O&M personnel. | None
 | Failed to enable or disable SSL | NoSQLSwitchSSLFailed | Major | Enabling or disabling SSL times out. | Try again or submit a service ticket. Do not change the connection mode. | The connection mode cannot be changed.
 | Row size too large | LargeRowOccurs | Major | Rows that are too large may cause query timeouts and other faults such as out-of-memory (OOM) errors. | 1. Limit the length of each column and row so that the sum of the key and value lengths in each row does not exceed the preset threshold. 2. Check whether invalid writes or encoding produce large keys or values. | If there are rows that are too large, the cluster performance will deteriorate as the data volume grows.
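The HotKey and BigKey rows above both come down to partition-key design. One common mitigation is to salt the partition key so that a single hot logical key is spread across several physical partitions. The sketch below illustrates the idea only; the bucket count, key format, and helper function are illustrative assumptions, not a GaussDB NoSQL API.

```python
# Sketch of partition-key salting: a hash-derived suffix spreads one hot
# logical key across several physical partitions.

import hashlib

N_BUCKETS = 16  # number of salt buckets; tune to the observed hotspot size

def salted_partition_key(logical_key: str, discriminator: str) -> str:
    """Derive a stable bucket from a per-record discriminator (e.g. a record ID)."""
    bucket = int(hashlib.sha256(discriminator.encode()).hexdigest(), 16) % N_BUCKETS
    return f"{logical_key}#{bucket}"

# Writes for the same hot logical key now land in up to N_BUCKETS partitions.
print(salted_partition_key("popular_item", "order-10001"))  # e.g. popular_item#7
```

The trade-off is that reads for the logical key must fan out over all buckets and merge the results, so the bucket count should stay small.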
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
GaussDB(for MySQL) | Incremental backup failure | TaurusIncrementalBackupInstanceFailed | Major | The network between the instance and the management plane (or OBS) is disconnected, or the backup environment created for the instance is abnormal. | Submit a service ticket. | Backup jobs fail.
 | Read replica creation failure | addReadonlyNodesFailed | Major | The quota is insufficient or underlying resources are exhausted. | Check the read replica quota. Release resources and create read replicas again. | Read replicas fail to be created.
 | DB instance creation failure | createInstanceFailed | Major | The instance quota or underlying resources are insufficient. | Check the instance quota. Release resources and create instances again. | DB instances fail to be created.
 | Read replica promotion failure | activeStandBySwitchFailed | Major | The read replica fails to be promoted to the primary node due to network or server failures. The original primary node takes over services quickly. | Submit a service ticket. | The read replica fails to be promoted to the primary node.
 | Instance specifications change failure | flavorAlterationFailed | Major | The quota is insufficient or underlying resources are exhausted. | Submit a service ticket. | Instance specifications fail to be changed.
 | Faulty DB instance | TaurusInstanceRunningStatusAbnormal | Major | The instance process is faulty or the communications between the instance and the DFV storage are abnormal. | Submit a service ticket. | Services may be affected.
 | DB instance recovered | TaurusInstanceRunningStatusRecovered | Major | The instance is recovered. | Observe the service running status. | None
 | Faulty node | TaurusNodeRunningStatusAbnormal | Major | The node process is faulty or the communications between the node and the DFV storage are abnormal. | Observe the instance and service running statuses. | A read replica may be promoted to the primary node.
 | Node recovered | TaurusNodeRunningStatusRecovered | Major | The node is recovered. | Observe the service running status. | None
 | Read replica deletion failure | TaurusDeleteReadOnlyNodeFailed | Major | The communications between the management plane and the read replica are abnormal or the VM fails to be deleted from IaaS. | Submit a service ticket. | Read replicas fail to be deleted.
 | Password reset failure | TaurusResetInstancePasswordFailed | Major | The communications between the management plane and the instance are abnormal or the instance is abnormal. | Check the instance status and try again. If the fault persists, submit a service ticket. | Passwords fail to be reset for instances.
 | DB instance reboot failure | TaurusRestartInstanceFailed | Major | The network between the management plane and the instance is abnormal or the instance is abnormal. | Check the instance status and try again. If the fault persists, submit a service ticket. | Instances fail to be rebooted.
 | Restoration to new DB instance failure | TaurusRestoreToNewInstanceFailed | Major | The instance quota is insufficient, underlying resources are exhausted, or the data restoration logic is incorrect. | If the new instance fails to be created, check the instance quota, release resources, and try to restore to a new instance again. In other cases, submit a service ticket. | Backup data fails to be restored to new instances.
 | EIP binding failure | TaurusBindEIPToInstanceFailed | Major | The binding task fails. | Submit a service ticket. | EIPs fail to be bound to instances.
 | EIP unbinding failure | TaurusUnbindEIPFromInstanceFailed | Major | The unbinding task fails. | Submit a service ticket. | EIPs fail to be unbound from instances.
 | Parameter modification failure | TaurusUpdateInstanceParameterFailed | Major | The network between the management plane and the instance is abnormal or the instance is abnormal. | Check the instance status and try again. If the fault persists, submit a service ticket. | Instance parameters fail to be modified.
 | Parameter template application failure | TaurusApplyParameterGroupToInstanceFailed | Major | The network between the management plane and the instances is abnormal or the instances are abnormal. | Check the instance status and try again. If the fault persists, submit a service ticket. | Parameter templates fail to be applied to instances.
 | Full backup failure | TaurusBackupInstanceFailed | Major | The network between the instance and the management plane (or OBS) is disconnected, or the backup environment created for the instance is abnormal. | Submit a service ticket. | Backup jobs fail.
 | Primary/standby failover | TaurusActiveStandbySwitched | Major | When the network, physical machine, or database of the primary node is faulty, the system promotes a read replica to primary based on the failover priority to ensure service continuity. | | During the failover, the database connection is interrupted for a short period of time. After the failover is complete, you can reconnect to the database.
 | Database read-only | NodeReadonlyMode | Major | The database supports only query operations. | Submit a service ticket. | After the database becomes read-only, write operations cannot be processed.
 | Database read/write | NodeReadWriteMode | Major | The database supports both write and read operations. | Submit a service ticket. | None
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
GaussDB | Process status alarm | ProcessStatusAlarm | Major | Key processes exit, including the CMS/CMA, ETCD, GTM, CN, and DN processes. | Wait until the process is automatically recovered or a primary/standby failover is automatically performed. Check whether services are recovered. If not, contact SRE engineers. | If processes on primary nodes are faulty, services are interrupted and then rolled back. If processes on standby nodes are faulty, services are not affected.
 | Component status alarm | ComponentStatusAlarm | Major | Key components do not respond, including the CMA, ETCD, GTM, CN, and DN components. | Wait until the process is automatically recovered or a primary/standby failover is automatically performed. Check whether services are recovered. If not, contact SRE engineers. | If processes on primary nodes do not respond, neither do services. If processes on standby nodes are faulty, services are not affected.
 | Cluster status alarm | ClusterStatusAlarm | Major | The cluster status is abnormal. For example, the cluster is read-only, the majority of ETCDs are faulty, or the cluster resources are unevenly distributed. | Contact SRE engineers. | If the cluster is read-only, only read services are processed. If the majority of ETCDs are faulty, the cluster is unavailable. If resources are unevenly distributed, the instance performance and reliability deteriorate.
 | Hardware resource alarm | HardwareResourceAlarm | Major | A major hardware fault occurs in the instance, such as disk damage or a GTM network fault. | Contact SRE engineers. | Some or all services are affected.
 | Status transition alarm | StateTransitionAlarm | Major | The following events occur in the instance: DN build failure, forcible DN promotion, primary/standby DN switchover/failover, or primary/standby GTM switchover/failover. | Wait until the fault is automatically rectified and check whether services are recovered. If not, contact SRE engineers. | Some services are interrupted.
 | Other abnormal alarm | OtherAbnormalAlarm | Major | A disk usage threshold alarm is reported. | Focus on service changes and scale up the storage space as needed. | If the used storage space exceeds the threshold, storage space cannot be scaled up.
 | Faulty DB instance | TaurusInstanceRunningStatusAbnormal | Major | This event is a key alarm event and is reported when an instance is faulty due to a disaster or a server failure. | Submit a service ticket. | The database service may be unavailable.
 | DB instance recovered | TaurusInstanceRunningStatusRecovered | Major | GaussDB(openGauss) provides an HA tool for automated or manual rectification of faults. After the fault is rectified, this event is reported. | No further action is required. | None
 | Faulty DB node | TaurusNodeRunningStatusAbnormal | Major | This event is a key alarm event and is reported when a database node is faulty due to a disaster or a server failure. | Check whether the database service is available and submit a service ticket. | The database service may be unavailable.
 | DB node recovered | TaurusNodeRunningStatusRecovered | Major | GaussDB(openGauss) provides an HA tool for automated or manual rectification of faults. After the fault is rectified, this event is reported. | No further action is required. | None
 | DB instance creation failure | GaussDBV5CreateInstanceFailed | Major | Instances fail to be created because the quota is insufficient or underlying resources are exhausted. | Release the instances that are no longer used and try to provision them again, or submit a service ticket to adjust the quota. | DB instances cannot be created.
 | Node adding failure | GaussDBV5ExpandClusterFailed | Major | The underlying resources are insufficient. | Submit a service ticket. The O&M personnel will coordinate resources in the background; then delete the node that failed to be added and add a new node. | None
 | Storage scale-up failure | GaussDBV5EnlargeVolumeFailed | Major | The underlying resources are insufficient. | Submit a service ticket. The O&M personnel will coordinate resources in the background; then scale up the storage space again. | Services may be interrupted.
 | Reboot failure | GaussDBV5RestartInstanceFailed | Major | The network is abnormal. | Retry the reboot operation or submit a service ticket to the O&M personnel. | The database service may be unavailable.
 | Full backup failure | GaussDBV5FullBackupFailed | Major | The backup files fail to be exported or uploaded. | Submit a service ticket to the O&M personnel. | Data cannot be backed up.
 | Differential backup failure | GaussDBV5DifferentialBackupFailed | Major | The backup files fail to be exported or uploaded. | Submit a service ticket to the O&M personnel. | Data cannot be backed up.
 | Backup deletion failure | GaussDBV5DeleteBackupFailed | Major | This function does not need to be implemented. | N/A | N/A
 | EIP binding failure | GaussDBV5BindEIPFailed | Major | The EIP is bound to another resource. | Submit a service ticket to the O&M personnel. | The instance cannot be accessed from the Internet.
 | EIP unbinding failure | GaussDBV5UnbindEIPFailed | Major | The network is faulty or the EIP is abnormal. | Unbind the IP address again or submit a service ticket to the O&M personnel. | IP addresses may be residual.
 | Parameter template application failure | GaussDBV5ApplyParamFailed | Major | Modifying a parameter template times out. | Modify the parameter template again. | None
 | Parameter modification failure | GaussDBV5UpdateInstanceParamGroupFailed | Major | Modifying a parameter template times out. | Modify the parameter template again. | None
 | Backup and restoration failure | GaussDBV5RestoreFromBcakupFailed | Major | The underlying resources are insufficient or the backup files fail to be downloaded. | Submit a service ticket. | The database service may be unavailable if the restoration fails.
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
DDM | Failed to create a DDM instance | createDdmInstanceFailed | Major | The underlying resources are insufficient. | Release resources and create the instance again. | DDM instances cannot be created.
 | Failed to change class of a DDM instance | resizeFlavorFailed | Major | The underlying resources are insufficient. | Submit a service ticket to the O&M personnel to coordinate resources and try again. | Services on some nodes are interrupted.
 | Failed to scale out a DDM instance | enlargeNodeFailed | Major | The underlying resources are insufficient. | Submit a service ticket to the O&M personnel to coordinate resources, delete the node that failed to be added, and add a node again. | The instance fails to be scaled out.
 | Failed to scale in a DDM instance | reduceNodeFailed | Major | The underlying resources fail to be released. | Submit a service ticket to the O&M personnel to release resources. | The instance fails to be scaled in.
 | Failed to restart a DDM instance | restartInstanceFailed | Major | The associated DB instances are abnormal. | Check whether the associated DB instances are normal. If they are, submit a service ticket to the O&M personnel. | Services on some nodes are interrupted.
 | Failed to create a schema | createLogicDbFailed | Major | The schema failed to be created. | | Services cannot run properly.
 | Failed to bind an EIP | bindEipFailed | Major | The EIP is abnormal. | Try again later. In case of emergency, contact O&M personnel to rectify the fault. | The DDM instance cannot be accessed from the Internet.
 | Failed to scale out a schema | migrateLogicDbFailed | Major | The underlying resources fail to be processed. | Submit a service ticket to the O&M personnel. | The schema cannot be scaled out.
 | Failed to re-scale out a schema | retryMigrateLogicDbFailed | Major | The underlying resources fail to be processed. | Submit a service ticket to the O&M personnel. | The schema cannot be scaled out.
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
CPH | Server shutdown | cphServerOsShutdown | Major | The cloud phone server was shut down. | Deploy service applications in HA mode. After the fault is rectified, check whether services recover. | Services are interrupted.
 | Server abnormal shutdown | cphServerShutdown | Major | The cloud phone server was shut down unexpectedly. | Deploy service applications in HA mode. After the fault is rectified, check whether services recover. | Services are interrupted.
 | Server reboot | cphServerOsReboot | Major | The cloud phone server was rebooted. | Deploy service applications in HA mode. After the fault is rectified, check whether services recover. | Services are interrupted.
 | Server abnormal reboot | cphServerReboot | Major | The cloud phone server was rebooted unexpectedly. | Deploy service applications in HA mode. After the fault is rectified, check whether services recover. | Services are interrupted.
 | Network disconnection | cphServerlinkDown | Major | The network where the cloud phone server was deployed was disconnected. | Deploy service applications in HA mode. After the fault is rectified, check whether services recover. | Services are interrupted.
 | PCIe error | cphServerPcieError | Major | The PCIe device or mainboard on the cloud phone server was faulty. | Deploy service applications in HA mode. After the fault is rectified, check whether services recover. | The network or disk read/write is affected.
 | Disk error | cphServerDiskError | Major | The disk on the cloud phone server was faulty. | Deploy service applications in HA mode. After the fault is rectified, check whether services recover. | Data read/write services are affected, or the cloud phone server cannot be started.
 | Storage error | cphServerStorageError | Major | The cloud phone server could not connect to EVS disks. | Deploy service applications in HA mode. After the fault is rectified, check whether services recover. | Data read/write services are affected, or the cloud phone server cannot be started.
 | GPU offline | cphServerGpuOffline | Major | The GPU of the cloud phone server was loose and got disconnected. | Stop the cloud phone server and then reboot it. | Cloud phones whose GPUs are disconnected are faulty and cannot run properly even if they are restarted or reconfigured.
 | GPU timeout | cphServerGpuTimeOut | Major | The GPU of the cloud phone server timed out. | Reboot the cloud phone server. | Cloud phones whose GPUs timed out cannot run properly and remain faulty even if they are restarted or reconfigured.
 | Disk space full | cphServerDiskFull | Major | The disk space of the cloud phone server was used up. | Clear the application data in the cloud phones to release space. | Cloud phones are subhealthy, prone to failures, and may fail to start.
 | Disk readonly | cphServerDiskReadOnly | Major | The disk of the cloud phone server became read-only. | Reboot the cloud phone server. | Cloud phones are subhealthy, prone to failures, and may fail to start.
 | Cloud phone metadata damaged | cphPhoneMetaDataDamage | Major | Cloud phone metadata was damaged. | Contact O&M personnel. | The cloud phone cannot run properly even if it is restarted or reconfigured.
 | GPU failed | gpuAbnormal | Critical | The GPU was faulty. | Submit a service ticket. | Services are interrupted.
 | GPU recovered | gpuNormal | Informational | The GPU was running properly. | No action is required. | N/A
 | Kernel crash | kernelCrash | Critical | The kernel log indicated a crash. | Submit a service ticket. | Services are interrupted during the crash.
 | Kernel OOM | kernelOom | Major | The kernel log indicated an out-of-memory (OOM) error. | Submit a service ticket. | Services are interrupted.
 | Hardware malfunction | hardwareError | Critical | The kernel log indicated a hardware error. | Submit a service ticket. | Services are interrupted.
 | PCIe error | pcieAer | Critical | The kernel log indicated a PCIe bus error. | Submit a service ticket. | Services are interrupted.
 | SCSI error | scsiError | Critical | The kernel log indicated a SCSI error. | Submit a service ticket. | Services are interrupted.
 | Image storage became read-only | partReadOnly | Critical | The image storage became read-only. | Submit a service ticket. | Services are interrupted.
 | Image storage superblock damaged | badSuperBlock | Critical | The superblock of the image storage file system was damaged. | Submit a service ticket. | Services are interrupted.
 | Image storage /.sharedpath/master became read-only | isuladMasterReadOnly | Critical | Mount point /.sharedpath/master of the image storage became read-only. | Submit a service ticket. | Services are interrupted.
 | Cloud phone data disk became read-only | cphDiskReadOnly | Critical | The cloud phone data disk became read-only. | Submit a service ticket. | Services are interrupted.
 | Cloud phone data disk superblock damaged | cphDiskBadSuperBlock | Critical | The superblock of the cloud phone data disk file system was damaged. | Submit a service ticket. | Services are interrupted.
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
L2CG | IP addresses conflicted | IPConflict | Major | A cloud server and an on-premises server that need to communicate use the same IP address. | Check the ARP and switch information to locate the servers that have the same IP address and change the IP address. | The communications between the on-premises and cloud servers may be abnormal.
Event Source | Event Name | Event ID | Event Severity
---|---|---|---
Elastic IP and bandwidth | VPC deleted | deleteVpc | Major
 | VPC modified | modifyVpc | Minor
 | Subnet deleted | deleteSubnet | Minor
 | Subnet modified | modifySubnet | Minor
 | Bandwidth modified | modifyBandwidth | Minor
 | VPN deleted | deleteVpn | Major
 | VPN modified | modifyVpn | Minor
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
EVS | Update disk | updateVolume | Minor | The name or description of an EVS disk was updated. | No further action is required. | None
 | Expand disk | extendVolume | Minor | An EVS disk was expanded. | No further action is required. | None
 | Delete disk | deleteVolume | Major | An EVS disk was deleted. | No further action is required. | Deleted disks cannot be recovered.
 | QoS upper limit reached | reachQoS | Major | The I/O latency increases as the QoS upper limits of the disk are frequently reached and flow control is triggered. | Change the disk type to one with higher specifications. | The current disk may fail to meet service requirements.
Event Source | Event Name | Event ID | Event Severity
---|---|---|---
IAM | Login | login | Minor
 | Logout | logout | Minor
 | Password changed | changePassword | Major
 | User created | createUser | Minor
 | User deleted | deleteUser | Major
 | User updated | updateUser | Minor
 | User group created | createUserGroup | Minor
 | User group deleted | deleteUserGroup | Major
 | User group updated | updateUserGroup | Minor
 | Identity provider created | createIdentityProvider | Minor
 | Identity provider deleted | deleteIdentityProvider | Major
 | Identity provider updated | updateIdentityProvider | Minor
 | Metadata updated | updateMetadata | Minor
 | Security policy updated | updateSecurityPolicies | Major
 | Credential added | addCredential | Major
 | Credential deleted | deleteCredential | Major
 | Project created | createProject | Minor
 | Project updated | updateProject | Minor
 | Project suspended | suspendProject | Major
Event Source | Event Name | Event ID | Event Severity
---|---|---|---
DEW | Key disabled | disableKey | Major
 | Key deletion scheduled | scheduleKeyDeletion | Minor
 | Grant retired | retireGrant | Major
 | Grant revoked | revokeGrant | Major
Event Source | Event Name | Event ID | Event Severity
---|---|---|---
OBS | Bucket deleted | deleteBucket | Major
 | Bucket policy deleted | deleteBucketPolicy | Major
 | Bucket ACL configured | setBucketAcl | Minor
 | Bucket policy configured | setBucketPolicy | Minor
Event Source | Event Name | Event Severity
---|---|---
Cloud Eye | Agent heartbeat interruption | Major
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
Data Space | New revision | newRevision | Minor | An updated version was released. | After receiving the notification, export the data of the updated version as required. | None
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
Enterprise Switch | IP addresses conflicted | IPConflict | Major | A cloud server and an on-premises server that need to communicate use the same IP address. | Check the ARP and switch information to locate the servers that have the same IP address and change the IP address. | The communications between the on-premises and cloud servers may be abnormal.
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
DCS | Full synchronization during online migration retry | migrationFullResync | Minor | If online migration fails, full synchronization will be triggered because incremental synchronization cannot be performed. | Monitor the service volume and bandwidth usage. If the bandwidth usage is high and affects services, manually stop the migration as required. | If the data volume is large, full synchronization may cause bandwidth usage to spike.
 | Redis master/replica switchover | masterStandbyFailover | Minor | The master node was abnormal, so a replica was promoted to master. | Check the original master node and rectify the fault. | None
 | Memcached master/standby switchover | memcachedMasterStandbyFailover | Minor | The master node was abnormal, so the standby node was promoted to master. | Check the original master node and rectify the fault. | None
 | Redis server exception | redisNodeStatusAbnormal | Major | The Redis server status was abnormal. | Check the Redis server status. | The instance may become unavailable.
 | Redis server recovered | redisNodeStatusNormal | Major | The Redis server status recovered. | None | None
 | Synchronization failure in data migration | migrateSyncDataFail | Major | Online migration failed. | Check the network and the ECS service. If the ECS service is abnormal, a migration ECS cannot be created. | Data cannot be synchronized.
 | Memcached instance abnormal | memcachedInstanceStatusAbnormal | Major | The Memcached node status was abnormal. | Check the Memcached node status. | The instance may become unavailable.
 | Memcached instance recovered | memcachedInstanceStatusNormal | Major | The Memcached node status recovered. | None | None
 | Instance backup failure | instanceBackupFailure | Major | The DCS instance failed to be backed up due to an OBS access failure. | Manually back up the instance again. | None
 | Instance node abnormal restart | instanceNodeAbnormalRestart | Major | DCS nodes restarted unexpectedly when they became faulty. | Check whether services are normal. | A master/standby switchover may occur, or access to Redis may fail.
 | Long-running Lua scripts stopped | scriptsStopped | Informational | Lua scripts that had timed out automatically stopped running. | Do not run Lua scripts that take a long time. | Long-running Lua scripts cannot be completed.
 | Node restarted | nodeRestarted | Informational | After write operations had been performed, the node automatically restarted to stop Lua scripts that had timed out. | Do not run Lua scripts that take a long time. | Data is temporarily inconsistent between the restarted node and the master node during the restart.
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact
---|---|---|---|---|---|---
ICA | BGP peer disconnection | BgpPeerDisconnection | Major | The BGP peer is disconnected. | Log in to the gateway and locate the cause. | Service traffic may be interrupted.
 | BGP peer connection success | BgpPeerConnectionSuccess | Major | The BGP peer is successfully connected. | None | None
 | Abnormal GRE tunnel status | AbnormalGreTunnelStatus | Major | The GRE tunnel status is abnormal. | Log in to the gateway and locate the cause. | Service traffic may be interrupted.
 | Normal GRE tunnel status | NormalGreTunnelStatus | Major | The GRE tunnel status is normal. | None | None
 | WAN interface goes up | EquipmentWanGoingOnline | Major | The WAN interface goes online. | None | None
 | WAN interface goes down | EquipmentWanGoingOffline | Major | The WAN interface goes offline. | Check whether the event is caused by a manual operation or a device fault. | The device cannot be used.
 | Intelligent enterprise gateway going online | IntelligentEnterpriseGatewayGoingOnline | Major | The intelligent enterprise gateway goes online. | None | None
 | Intelligent enterprise gateway going offline | IntelligentEnterpriseGatewayGoingOffline | Major | The intelligent enterprise gateway goes offline. | Check whether the event is caused by a manual operation or a device fault. | The device cannot be used.
Event Source |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
---|---|---|---|---|---|---|
MAS |
Abnormal database instance |
dbError |
Major |
Abnormal database instance is detected by MAS. |
Log in to the MAS console to view the cause and rectify the fault. |
Services are interrupted. |
Database instance recovered |
dbRecovery |
Major |
The database instance is recovered. |
N/A |
Services are interrupted. |
|
Abnormal Redis instance |
redisError |
Major |
Abnormal Redis instance is detected by MAS. |
Log in to the MAS console to view the cause and rectify the fault. |
Services are interrupted. |
|
Redis instance recovered |
redisRecovery |
Major |
The Redis instance is recovered. |
N/A |
Services are interrupted. |
|
Abnormal MongoDB database |
mongodbError |
Major |
Abnormal MongoDB database is detected by MAS. |
Log in to the MAS console to view the cause and rectify the fault. |
Services are interrupted. |
|
MongoDB database recovered |
mongodbRecovery |
Major |
The MongoDB database is recovered. |
N/A |
Services are interrupted. |
|
Abnormal Elasticsearch instance |
esError |
Major |
Abnormal Elasticsearch instance is detected by MAS. |
Log in to the MAS console to view the cause and rectify the fault. |
Services are interrupted. |
|
Elasticsearch instance recovered |
esRecovery |
Major |
The Elasticsearch instance is recovered. |
N/A |
Services are interrupted. |
|
Abnormal API |
apiError |
Major |
The abnormal API is detected by MAS. |
Log in to the MAS console to view the cause and rectify the fault. |
Services are interrupted. |
|
API recovered |
apiRecovery |
Major |
The API is recovered. |
N/A |
Services are interrupted. |
|
Area status changed |
netChange |
Major |
Area status changes are detected by MAS. |
Log in to the MAS console to view the cause and rectify the fault. |
The network of the multi-active areas may change. |
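The MAS events above come in error/recovery pairs (dbError/dbRecovery, redisError/redisRecovery, and so on). The sketch below illustrates that pairing with a simple health-state machine; the probe results are canned booleans, and in practice each check would be a real ping of the database, Redis, MongoDB, Elasticsearch, or API endpoint.

```python
# Sketch: the paired error/recovery transitions behind the MAS events above.
# Probe results are canned booleans; wire them to real health checks.
def emit_events(name: str, probe_results: list[bool]) -> list[str]:
    events, healthy = [], True
    for ok in probe_results:
        if healthy and not ok:
            events.append(f"{name}Error")      # e.g. dbError
            healthy = False
        elif not healthy and ok:
            events.append(f"{name}Recovery")   # e.g. dbRecovery
            healthy = True
    return events

# An instance that goes down once and comes back yields exactly one pair.
assert emit_events("db", [True, False, False, True]) == ["dbError", "dbRecovery"]
```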
Event Source |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
---|---|---|---|---|---|---|
RMS |
Configuration noncompliance notification |
configurationNoncomplianceNotification |
Major |
The assignment evaluation result is Non-compliant. |
Modify the noncompliant configuration items of the resource. |
None |
Configuration compliance notification |
configurationComplianceNotification |
Informational |
The assignment evaluation result changed to Compliant. |
None |
None |
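The two RMS events above fire when an assignment's evaluation result changes between Compliant and Non-compliant. The sketch below shows the shape of such an evaluation; the rule itself (every resource must carry an "owner" tag) is a hypothetical example, not a built-in RMS policy.

```python
# Sketch: a compliance evaluation whose result changes drive the two RMS
# events above. The "owner" tag rule is a hypothetical example.
def evaluate(resource: dict) -> str:
    return "Compliant" if "owner" in resource.get("tags", {}) else "Non-compliant"

previous = "Compliant"
result = evaluate({"id": "vm-123", "tags": {}})
if result != previous:
    # Compliant -> Non-compliant maps to configurationNoncomplianceNotification;
    # the reverse maps to configurationComplianceNotification.
    print(f"Evaluation result changed: {previous} -> {result}")
```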
Event Source |
Event Name |
Event ID |
Event Severity |
Description |
---|---|---|---|---|
CSG |
Abnormal CSG process status |
gatewayProcessStatusAbnormal |
Major |
This event is triggered when an exception occurs in the CSG process status. |
Abnormal CSG connection status |
gatewayToServiceConnectAbnormal |
Major |
This event is triggered when no CSG status report is returned for five consecutive periods. |
|
Abnormal connection status between CSG and OBS |
gatewayToObsConnectAbnormal |
Major |
This event is triggered when CSG cannot connect to OBS. |
|
Read-only file system |
gatewayFileSystemReadOnly |
Major |
This event is triggered when the partition file system on CSG becomes read-only. |
|
Read-only file share |
gatewayFileShareReadOnly |
Major |
This event is triggered when the file share becomes read-only due to insufficient cache disk storage space. |
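The `gatewayToServiceConnectAbnormal` event above is defined by a concrete rule: no CSG status report arrives for five consecutive periods. A minimal sketch of that detection logic follows; the period length and report transport are assumptions.

```python
# Sketch: the five-missed-periods rule behind gatewayToServiceConnectAbnormal.
# reports[i] is True if a status report arrived in period i.
MISSED_PERIODS_THRESHOLD = 5

def connection_abnormal(reports: list[bool]) -> bool:
    missed = 0
    for arrived in reports:
        missed = 0 if arrived else missed + 1
        if missed >= MISSED_PERIODS_THRESHOLD:
            return True  # trigger gatewayToServiceConnectAbnormal
    return False

assert connection_abnormal([True] + [False] * 5)
assert not connection_abnormal([False, False, True, False, False, True])
```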
Event Source |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
---|---|---|---|---|---|---|
MRS |
DBServer Switchover |
dbServerSwitchover |
Minor |
A DBServer switchover occurs. |
Confirm with O&M personnel whether the active/standby switchover is caused by normal operations. |
Consecutive active/standby switchovers may affect Hive service availability. |
Flume Channel overflow |
flumeChannelOverflow |
Minor |
The Flume channel overflows. |
Check whether the Flume channel configuration is proper and whether the service volume increases sharply. |
Flume tasks cannot write data to the backend. |
|
NameNode Switchover |
namenodeSwitchover |
Minor |
The NameNode switchover occurs. |
Confirm with O&M personnel whether the active/standby switchover is caused by normal operations. |
Consecutive active/standby switchovers may cause HDFS file read/write failures. |
|
ResourceManager Switchover |
resourceManagerSwitchover |
Minor |
A ResourceManager switchover occurs. |
Confirm with O&M personnel whether the active/standby switchover is caused by normal operations. |
Consecutive active/standby switchovers may cause exceptions or even failures of YARN tasks. |
|
JobHistoryServer Switchover |
jobHistoryServerSwitchover |
Minor |
The JobHistoryServer switchover occurs. |
Confirm with O&M personnel whether the active/standby switchover is caused by normal operations. |
Consecutive active/standby switchovers may cause failures to read MapReduce task logs. |
|
HMaster Failover |
hmasterFailover |
Minor |
The HMaster failover occurs. |
Confirm with O&M personnel whether the active/standby switchover is caused by normal operations. |
Consecutive active/standby switchovers may affect HBase service availability. |
|
Hue Failover |
hueFailover |
Minor |
The Hue failover occurs. |
Confirm with O&M personnel whether the active/standby switchover is caused by normal operations. |
The active/standby switchover may affect the display of the Hue page. |
|
Impala HaProxy Failover |
impalaHaProxyFailover |
Minor |
The Impala HaProxy switchover occurs. |
Confirm with O&M personnel whether the active/standby switchover is caused by normal operations. |
Consecutive active/standby switchovers may affect Impala service availability. |
|
Impala StateStoreCatalog Failover |
impalaStateStoreCatalogFailover |
Minor |
The Impala StateStoreCatalog failover occurs. |
Confirm with O&M personnel whether the active/standby switchover is caused by normal operations. |
Consecutive active/standby switchovers may affect Impala service availability. |
|
LdapServer Failover |
ldapServerFailover |
Minor |
The LdapServer failover occurs. |
Confirm with O&M personnel whether the active/standby switchover is caused by normal operations. |
Consecutive active/standby switchovers may affect LdapServer service availability. |
|
Loader Switchover |
loaderSwitchover |
Minor |
The Loader switchover occurs. |
Confirm with O&M personnel whether the active/standby switchover is caused by normal operations. |
The active/standby switchover may affect Loader service availability. |
|
Manager Switchover |
managerSwitchover |
Informational |
The Manager switchover occurs. |
Confirm with O&M personnel whether the active/standby switchover is caused by normal operations. |
The active/standby Manager switchover may make the Manager page inaccessible and cause abnormal values for some monitoring items. |
|
Job Running Failed |
jobRunningFailed |
Warning |
A job fails to be executed. |
On the Jobs tab page, check the failed job and locate the cause. |
The job fails to be executed. |
|
Job killed |
jobkilled |
Informational |
The job is terminated. |
Check whether the task is manually terminated. |
The job execution process is terminated. |
|
Oozie Workflow Execution Failure |
oozieWorkflowExecutionFailure |
Minor |
Oozie workflows fail to execute. |
View Oozie logs to locate the failure cause. |
Oozie workflows fail to execute. |
|
Oozie Scheduled Job Execution Failure |
oozieScheduledJobExecutionFailure |
Minor |
Oozie scheduled tasks fail to execute. |
View Oozie logs to locate the failure cause. |
Oozie scheduled tasks fail to execute. |
|
ClickHouse service unavailable |
clickHouseServiceUnavailable |
Critical |
The ClickHouse service is unavailable. |
For details, see section "ALM-45425 ClickHouse Service Unavailable" in MapReduce Service User Guide. |
The ClickHouse service is abnormal. Cluster operations cannot be performed on the ClickHouse service on FusionInsight Manager, and the ClickHouse service function cannot be used. |
|
DBService Service Unavailable |
dbServiceServiceUnavailable |
Critical |
DBService is unavailable. |
For details, see section "ALM-27001 DBService Service Unavailable" in MapReduce Service User Guide. |
The database service is unavailable and cannot provide data import and query functions for upper-layer services. As a result, service exceptions occur. |
|
DBService Heartbeat Interruption Between the Active and Standby Nodes |
dbServiceHeartbeatInterruptionBetweentheActiveAndStandbyNodes |
Major |
The heartbeat between the active and standby DBService nodes is interrupted. |
For details, see section "ALM-27003 Heartbeat Interruption Between the Active and Standby Nodes" in MapReduce Service User Guide. |
During the DBService heartbeat interruption, only one node can provide the service. If this node is faulty, no standby node is available for failover and the service is unavailable. |
|
Data Inconsistency Between Active and Standby DBServices |
dataInconsistencyBetweenActiveAndStandbyDBServices |
Critical |
Data is inconsistent between the active and standby DBServices. |
For details, see section "ALM-27004 Data Inconsistency Between Active and Standby DBService" in MapReduce Service User Guide. |
When data is not synchronized between the active and standby DBServices, the data may be lost or become abnormal if the active instance fails. |
|
Database Enters the Read-Only Mode |
databaseEnterstheReadOnlyMode |
Critical |
The database enters the read-only mode. |
For details, see section "ALM-27007 Database Enters the Read-Only Mode" in MapReduce Service User Guide. |
The database enters the read-only mode, causing service data loss. |
|
Flume Service Unavailable |
flumeServiceUnavailable |
Critical |
The Flume service is unavailable. |
For details, see section "ALM-24000 Flume Service Unavailable" in MapReduce Service User Guide. |
Flume is running abnormally and the data transmission service is interrupted. |
|
Flume Agent Exception |
flumeAgentException |
Major |
The Flume agent is abnormal. |
For details, see section "ALM-24001 Flume Agent Exception" in MapReduce Service User Guide. |
The Flume agent instance for which the alarm is generated cannot provide services properly, and the data transmission tasks of the instance are temporarily interrupted. Real-time data is lost during real-time data transmission. |
|
Flume Client Disconnection Alarm |
flumeClientDisconnected |
Major |
The Flume client is disconnected from the Flume server. |
For details, see section "ALM-24003 Flume Client Interrupted" in MapReduce Service User Guide. |
The Flume Client for which the alarm is generated cannot communicate with the Flume Server and the data of the Flume Client cannot be sent to the Flume Server. |
|
Exception Occurs When Flume Reads Data |
exceptionOccursWhenFlumeReadsData |
Major |
Exceptions occur when Flume reads data. |
For details, see section "ALM-24004 Exception Occurs When Flume Reads Data" in MapReduce Service User Guide. |
If data is found in the data source and Flume Source continuously fails to read data, the data collection is stopped. |
|
Exception Occurs When Flume Transmits Data |
exceptionOccursWhenFlumeTransmitsData |
Major |
Exceptions occur when Flume transmits data. |
For details, see section "ALM-24005 Exception Occurs When Flume Transmits Data" in MapReduce Service User Guide. |
If the disk usage of Flume Channel increases continuously, the time required to import data to the specified destination increases. When the disk usage of Flume Channel reaches 100%, the Flume agent process pauses. |
|
Flume Certificate File Is Invalid |
flumeCertificateFileIsinvalid |
Major |
The Flume certificate file is invalid or damaged. |
For details, see section "ALM-24010 Flume Certificate File Is Invalid or Damaged" in MapReduce Service User Guide. |
The Flume certificate file is invalid or damaged, and the Flume client cannot access the Flume server. |
|
Flume Certificate File Is About to Expire |
flumeCertificateFileIsAboutToExpire |
Major |
The Flume certificate file is about to expire. |
For details, see section "ALM-24011 Flume Certificate File Is About to Expire" in MapReduce Service User Guide. |
The Flume certificate file is about to expire, which has no adverse impact on the system. |
|
Flume Certificate File Has Expired |
flumeCertificateFileIsExpired |
Major |
The Flume certificate file has expired. |
For details, see section "ALM-24012 Flume Certificate File Has Expired" in MapReduce Service User Guide. |
The Flume certificate file has expired and functions are restricted. The Flume client cannot access the Flume server. |
|
Flume MonitorServer Certificate File Is Invalid |
flumeMonitorServerCertificateFileIsInvalid |
Major |
The Flume MonitorServer certificate file is invalid. |
For details, see section "ALM-24013 Flume MonitorServer Certificate File Is Invalid or Damaged" in MapReduce Service User Guide. |
The MonitorServer certificate file is invalid or damaged, and the Flume client cannot access the Flume server. |
|
Flume MonitorServer Certificate File Is About to Expire |
flumeMonitorServerCertificateFileIsAboutToExpire |
Major |
The Flume MonitorServer certificate file is about to expire. |
For details, see section "ALM-24014 Flume MonitorServer Certificate Is About to Expire" in MapReduce Service User Guide. |
The MonitorServer certificate is about to expire, which has no adverse impact on the system. |
|
Flume MonitorServer Certificate File Has Expired |
flumeMonitorServerCertificateFileIsExpired |
Major |
The Flume MonitorServer certificate file has expired. |
For details, see section "ALM-24015 Flume MonitorServer Certificate File Has Expired" in MapReduce Service User Guide. |
The MonitorServer certificate file has expired and functions are restricted. The Flume client cannot access the Flume server. |
|
HDFS Service Unavailable |
hdfsServiceUnavailable |
Critical |
The HDFS service is unavailable. |
For details, see section "ALM-14000 HDFS Service Unavailable" in MapReduce Service User Guide. |
HDFS fails to provide services for HDFS service-based upper-layer components, such as HBase and MapReduce. As a result, users cannot read or write files. |
|
NameService Service Unavailable |
nameServiceServiceUnavailable |
Major |
The NameService service is abnormal. |
For details, see section "ALM-14010 NameService Service Is Abnormal" in MapReduce Service User Guide. |
HDFS fails to provide services for upper-layer components based on the NameService service, such as HBase and MapReduce. As a result, users cannot read or write files. |
|
DataNode Data Directory Is Not Configured Properly |
datanodeDataDirectoryIsNotConfiguredProperly |
Major |
The DataNode data directory is not configured properly. |
For details, see section "ALM-14011 DataNode Data Directory Is Not Configured Properly" in MapReduce Service User Guide. |
If the DataNode data directory is mounted on critical directories such as the root directory, the disk space of the root directory will be used up after running for a long time. This causes a system fault. If the DataNode data directory is not configured properly, HDFS performance will deteriorate. |
|
JournalNode Is Out of Synchronization |
journalnodeIsOutOfSynchronization |
Major |
The JournalNode data is not synchronized. |
For details, see section "ALM-14012 JournalNode Is Out of Synchronization" in MapReduce Service User Guide. |
When a JournalNode is working incorrectly, data on the node is not synchronized with that on other JournalNodes. If data on more than half of JournalNodes is not synchronized, the NameNode cannot work correctly, making the HDFS service unavailable. |
|
Failed to Update the NameNode FsImage File |
failedToUpdateTheNameNodeFsImageFile |
Major |
The NameNode FsImage file failed to be updated. |
For details, see section "ALM-14013 Failed to Update the NameNode FsImage File" in MapReduce Service User Guide. |
If the FsImage file in the data directory of the active NameNode is not updated, the HDFS metadata combination function is abnormal and requires rectification. If it is not rectified, the Editlog files increase continuously after HDFS runs for a period of time. In this case, restarting HDFS is time-consuming because a large number of Editlog files need to be loaded. In addition, this alarm also indicates that the standby NameNode is abnormal and the NameNode high availability (HA) mechanism becomes invalid. If the active NameNode then becomes faulty, the HDFS service is unavailable. |
|
DataNode Disk Fault |
datanodeDiskFault |
Major |
The DataNode disk is faulty. |
For details, see section "ALM-14027 DataNode Disk Fault" in MapReduce Service User Guide. |
If a DataNode disk fault alarm is reported, a faulty disk partition exists on the DataNode. As a result, files that have been written may be lost. |
|
Yarn Service Unavailable |
yarnServiceUnavailable |
Critical |
The Yarn service is unavailable. |
For details, see section "ALM-18000 Yarn Service Unavailable" in MapReduce Service User Guide. |
The cluster cannot provide the Yarn service. Users cannot run new applications. Submitted applications cannot be run. |
|
NodeManager Heartbeat Lost |
nodemanagerHeartbeatLost |
Major |
The NodeManager heartbeat is lost. |
For details, see section "ALM-18002 NodeManager Heartbeat Lost" in MapReduce Service User Guide. |
The lost NodeManager node cannot provide the Yarn service. The number of containers decreases, so the cluster performance deteriorates. |
|
NodeManager Unhealthy |
nodemanagerUnhealthy |
Major |
The NodeManager is unhealthy. |
For details, see section "ALM-18003 NodeManager Unhealthy" in MapReduce Service User Guide. |
The faulty NodeManager node cannot provide the Yarn service. The number of containers decreases, so the cluster performance deteriorates. |
|
Yarn Application Timeout |
yarnApplicationTimeout |
Minor |
Yarn task execution timed out. |
For details, see section "ALM-18020 Yarn Task Execution Timeout" in MapReduce Service User Guide. |
The alarm persists after task execution times out. However, the task can still be executed properly, so this alarm has no impact on the system. |
|
MapReduce Service Unavailable |
mapreduceServiceUnavailable |
Critical |
The MapReduce service is unavailable. |
For details, see section "ALM-18021 MapReduce Service Unavailable" in MapReduce Service User Guide. |
The cluster cannot provide the MapReduce service. For example, MapReduce cannot be used to view task logs and the log archive function is unavailable. |
|
Insufficient Yarn Queue Resources |
insufficientYarnQueueResources |
Minor |
Yarn queue resources are insufficient. |
For details, see section "ALM-18022 Insufficient Yarn Queue Resources" in MapReduce Service User Guide. |
It takes a long time for an application to finish, and a newly submitted application cannot run for a long time. |
|
HBase Service Unavailable |
hbaseServiceUnavailable |
Critical |
The HBase service is unavailable. |
For details, see section "ALM-19000 HBase Service Unavailable" in MapReduce Service User Guide. |
Operations cannot be performed, such as reading or writing data and creating tables. |
|
System table path or file of HBase is missing |
systemTablePathOrFileOfHBaseIsMissing |
Critical |
The HBase system table directories or files are lost. |
For details, see section "ALM-19012 HBase System Table Directory or File Lost" in MapReduce Service User Guide. |
The HBase service fails to restart or start. |
|
Hive Service Unavailable |
hiveServiceUnavailable |
Critical |
The Hive service is unavailable. |
For details, see section "ALM-16004 Hive Service Unavailable" in MapReduce Service User Guide. |
Hive cannot provide data loading, query, and extraction services. |
|
Hive Data Warehouse Is Deleted |
hiveDataWarehouseIsDeleted |
Critical |
The Hive data warehouse is deleted. |
For details, see section "ALM-16045 Hive Data Warehouse Is Deleted" in MapReduce Service User Guide. |
If the default Hive data warehouse is deleted, databases and tables fail to be created in the default data warehouse, affecting service usage. |
|
Hive Data Warehouse Permission Is Modified |
hiveDataWarehousePermissionIsModified |
Critical |
The Hive data warehouse permissions are modified. |
For details, see section "ALM-16046 Hive Data Warehouse Permission Is Modified" in MapReduce Service User Guide. |
If the permissions on the Hive default data warehouse are modified, the permissions for users or user groups to create databases or tables in the default data warehouse are affected. The permissions will be expanded or reduced. |
|
HiveServer has been deregistered from ZooKeeper |
hiveServerHasBeenDeregisteredFromZookeeper |
Major |
HiveServer has been deregistered from ZooKeeper. |
For details, see section "ALM-16047 HiveServer Has Been Deregistered from ZooKeeper" in MapReduce Service User Guide. |
If Hive configurations cannot be read from ZooKeeper, HiveServer will be unavailable. |
|
tezlib or sparklib does not exist |
tezlibOrSparklibIsNotExist |
Major |
The Tez or Spark library path does not exist. |
For details, see section "ALM-16048 Tez or Spark Library Path Does Not Exist" in MapReduce Service User Guide. |
The Hive on Tez and Hive on Spark functions are affected. |
|
Hue Service Unavailable |
hueServiceUnavailable |
Critical |
The Hue service is unavailable. |
For details, see section "ALM-20002 Hue Service Unavailable" in MapReduce Service User Guide. |
The system cannot provide data loading, query, and extraction services. |
|
Impala Service Unavailable |
impalaServiceUnavailable |
Critical |
The Impala service is unavailable. |
For details, see section "ALM-29000 Impala Service Unavailable" in MapReduce Service User Guide. |
The Impala service is abnormal. Cluster operations cannot be performed on Impala on FusionInsight Manager, and Impala service functions cannot be used. |
|
Kafka Service Unavailable |
kafkaServiceUnavailable |
Critical |
The Kafka service is unavailable. |
For details, see section "ALM-38000 Kafka Service Unavailable" in MapReduce Service User Guide. |
The cluster cannot provide the Kafka service, and users cannot perform new Kafka tasks. |
|
Status of Kafka Default User Is Abnormal |
statusOfKafkaDefaultUserIsAbnormal |
Critical |
The status of Kafka default user is abnormal. |
For details, see section "ALM-38007 Status of Kafka Default User Is Abnormal" in MapReduce Service User Guide. |
If the Kafka default user status is abnormal, metadata synchronization between Brokers and interaction between Kafka and ZooKeeper will be affected, affecting service production, consumption, and topic creation and deletion. |
|
Abnormal Kafka Data Directory Status |
abnormalKafkaDataDirectoryStatus |
Major |
The status of Kafka data directory is abnormal. |
For details, see section "ALM-38008 Abnormal Kafka Data Directory Status" in MapReduce Service User Guide. |
If the Kafka data directory status is abnormal, the current replicas of all partitions in that data directory are taken offline. If the data directory status of multiple nodes becomes abnormal at the same time, some partitions may become unavailable. |
|
Topics with Single Replica |
topicsWithSingleReplica |
Warning |
A topic with a single replica exists. |
For details, see section "ALM-38010 Topics with Single Replica" in MapReduce Service User Guide. |
There is a single point of failure (SPOF) risk for topics with only one replica. When the node where the replica resides becomes abnormal, the partition has no leader, and services on the topic are affected. |
|
KrbServer Service Unavailable |
krbServerServiceUnavailable |
Critical |
The KrbServer service is unavailable. |
For details, see section "ALM-25500 KrbServer Service Unavailable" in MapReduce Service User Guide. |
When this alarm is generated, no operations can be performed on the KrbServer component in the cluster, KrbServer authentication in other components is affected, and components in the cluster that depend on KrbServer become faulty. |
|
Kudu Service Unavailable |
kuduServiceUnavailable |
Critical |
The Kudu service is unavailable. |
For details, see section "ALM-29100 Kudu Service Unavailable" in MapReduce Service User Guide. |
Users cannot use the Kudu service. |
|
LdapServer Service Unavailable |
ldapServerServiceUnavailable |
Critical |
The LdapServer service is unavailable. |
For details, see section "ALM-25000 LdapServer Service Unavailable" in MapReduce Service User Guide. |
When this alarm is generated, no operations can be performed on the KrbServer and LdapServer users in the cluster. For example, users, user groups, or roles cannot be added, deleted, or modified, and user passwords cannot be changed on the FusionInsight Manager portal. The authentication of existing users in the cluster is not affected. |
|
Abnormal LdapServer Data Synchronization |
abnormalLdapServerDataSynchronization |
Critical |
The LdapServer data synchronization is abnormal. |
For details, see section "ALM-25004 Abnormal LdapServer Data Synchronization" in MapReduce Service User Guide. |
LdapServer data inconsistency occurs because LdapServer data on Manager or in the cluster is damaged. The LdapServer process with damaged data cannot provide services externally, and the authentication functions of Manager and the cluster are affected. |
|
Nscd Service Is Abnormal |
nscdServiceIsAbnormal |
Major |
The Nscd service is abnormal. |
For details, see section "ALM-25005 nscd Service Exception" in MapReduce Service User Guide. |
If the Nscd service is abnormal, the node may fail to synchronize data from LdapServer. In this case, running the id command may fail to obtain data from LdapServer, affecting upper-layer services. |
|
Sssd Service Is Abnormal |
sssdServiceIsAbnormal |
Major |
The Sssd service is abnormal. |
For details, see section "ALM-25006 Sssd Service Exception" in MapReduce Service User Guide. |
If the Sssd service is abnormal, the node may fail to synchronize data from LdapServer. In this case, running the id command may fail to obtain LDAP data, affecting upper-layer services. |
|
Loader Service Unavailable |
loaderServiceUnavailable |
Critical |
The Loader service is unavailable. |
For details, see section "ALM-23001 Loader Service Unavailable" in MapReduce Service User Guide. |
When the Loader service is unavailable, the data loading, import, and conversion functions are unavailable. |
|
Oozie Service Unavailable |
oozieServiceUnavailable |
Critical |
The Oozie service is unavailable. |
For details, see section "ALM-17003 Oozie Service Unavailable" in MapReduce Service User Guide. |
The Oozie service cannot be used to submit jobs. |
|
Ranger Service Unavailable |
rangerServiceUnavailable |
Critical |
The Ranger service is unavailable. |
For details, see section "ALM-45275 Ranger Service Unavailable" in MapReduce Service User Guide. |
When the Ranger service is unavailable, Ranger cannot work properly and the Ranger native UI cannot be accessed. |
|
Abnormal RangerAdmin status |
abnormalRangerAdminStatus |
Major |
The RangerAdmin status is abnormal. |
For details, see section "ALM-45276 Abnormal RangerAdmin Status" in MapReduce Service User Guide. |
If the status of a single RangerAdmin is abnormal, the access to the Ranger native UI is not affected. If the status of two RangerAdmins is abnormal, the Ranger native UI cannot be accessed and operations such as creating, modifying, and deleting policies cannot be performed. |
|
Spark2x Service Unavailable |
spark2xServiceUnavailable |
Critical |
The Spark2x service is unavailable. |
For details, see section "ALM-43001 Spark2x Service Unavailable" in MapReduce Service User Guide. |
The Spark tasks submitted by users fail to be executed. |
|
Storm Service Unavailable |
stormServiceUnavailable |
Critical |
The Storm service is unavailable. |
For details, see section "ALM-26051 Storm Service Unavailable" in MapReduce Service User Guide. |
The cluster cannot provide the Storm service externally, and users cannot execute new Storm tasks. |
|
ZooKeeper Service Unavailable |
zooKeeperServiceUnavailable |
Critical |
The ZooKeeper service is unavailable. |
For details, see section "ALM-13000 ZooKeeper Service Unavailable" in MapReduce Service User Guide. |
ZooKeeper fails to provide coordination services for upper-layer components and the components depending on ZooKeeper may not run properly. |
|
Failed to Set the Quota of Top Directories of ZooKeeper Components |
failedToSetTheQuotaOfTopDirectoriesOfZooKeeperComponent |
Minor |
The quota of top directories of ZooKeeper components failed to be configured. |
For details, see section "ALM-13005 Failed to Set the Quota of Top Directories of ZooKeeper Components" in MapReduce Service User Guide. |
If components write a large amount of data to the top-level directory of ZooKeeper, the ZooKeeper service may become unavailable. |
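Several MRS events above, such as `flumeChannelOverflow`, can be anticipated by watching component metrics rather than waiting for the alarm. The sketch below polls Flume's built-in HTTP JSON monitoring for channel fill levels; it assumes the agent was started with `-Dflume.monitoring.type=http -Dflume.monitoring.port=34545`, and the host and 80% threshold are placeholders.

```python
# Sketch: anticipate flumeChannelOverflow by polling Flume's HTTP JSON
# monitoring endpoint. Host, port, and threshold are placeholders.
import json
from urllib.request import urlopen

METRICS_URL = "http://flume-agent-host:34545/metrics"

with urlopen(METRICS_URL) as resp:
    metrics = json.load(resp)

for component, values in metrics.items():
    if component.startswith("CHANNEL."):
        fill = float(values.get("ChannelFillPercentage", 0.0))
        if fill > 80.0:
            # A channel stuck near 100% is about to overflow, after which
            # Flume tasks cannot write data to the backend.
            print(f"{component} is {fill:.1f}% full; check sources and sinks")
```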