
Events Supported by Event Monitoring

Table 1 ECS

| Event Source | Event Name | Event Severity | Description | Solution | Impact |
|---|---|---|---|---|---|
| ECS | Start auto recovery | Major | If a host is faulty, ECSs on the host are automatically migrated to a properly running host. During the migration, the ECSs are restarted. | Wait for the event to end and check whether services are affected. | Services may be interrupted. |
|  | Stop auto recovery | Major | After the automatic migration is complete, the ECS is restored. | This event indicates that the ECS has been restored to normal and is working properly. No further action is required. | No impact. |
|  | Auto recovery timeout (being processed on the backend) | Major | Migrating the ECS to a normal host timed out. | Migrate services to other ECSs. | Services are interrupted. |
|  | GPU link fault | Critical | The GPU of the host running the ECS is faulty or is recovering from a fault. | Deploy service applications in HA mode. After the GPU fault is rectified, check whether services are restored. | Services are interrupted. |
|  | FPGA link fault | Critical | The FPGA of the host running the ECS is faulty or is recovering from a fault. | Deploy service applications in HA mode. After the FPGA fault is rectified, check whether services are restored. | Services are interrupted. |
|  | Improper ECS running | Major | The ECS is faulty, or its NIC is abnormal, causing the ECS to run abnormally. | Deploy service applications in HA mode. After the fault is rectified, check whether services recover. | Services are interrupted. |
|  | Improper ECS running recovered | Major | The ECS has been restored to the normal status. | Wait for the ECS status to become normal and check whether services are affected. | No impact. |
|  | Delete ECS | Major | The ECS is deleted on the management console or by calling APIs. | Check whether the deletion was performed intentionally by a user. | Services are interrupted. |
|  | Reboot ECS | Minor | The ECS is restarted on the management console or by calling APIs. | Check whether the restart was performed intentionally by a user. Deploy service applications in HA mode, and check whether services recover after the ECS starts up. | Services are interrupted. |
|  | Stop ECS | Minor | The ECS is stopped on the management console or by calling APIs. | Check whether the stop operation was performed intentionally by a user. Deploy service applications in HA mode, and check whether services recover after the ECS is started. | Services are interrupted. |
|  | Delete NIC | Major | The ECS NIC is deleted on the management console or by calling APIs. | Check whether the deletion was performed intentionally by a user. Deploy service applications in HA mode, and check whether services recover after the NIC is deleted. | Services may be interrupted. |
|  | Modify ECS specifications | Minor | The ECS specifications are modified on the management console or by calling APIs. | Check whether the operation was performed intentionally by a user. Deploy service applications in HA mode, and check whether services recover after the specifications are modified. | Services are interrupted. |

NOTE:

Once a physical host running ECSs breaks down, the ECSs are automatically migrated to a functional physical host. The ECSs are restarted during the migration.
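If you want to react to these ECS events programmatically rather than watching the console, you can poll the Cloud Eye events API. The sketch below is a minimal example under stated assumptions: the endpoint path, the event codes (startAutoRecovery, stopAutoRecovery), and the response keys are assumptions to verify against the Cloud Eye API reference for your region, and the IAM token must be obtained separately.

```python
# A minimal polling sketch for ECS auto-recovery events.
# Assumptions (verify against the Cloud Eye API reference): the endpoint
# path, the event codes below, and the "events"/"event_name" response keys.
import time
import requests

ENDPOINT = "https://ces.cn-north-1.myhuaweicloud.com"  # assumed regional endpoint
PROJECT_ID = "<project-id>"   # replace with your project ID
TOKEN = "<iam-token>"         # IAM token, obtained separately

WATCHED = {"startAutoRecovery", "stopAutoRecovery"}  # assumed event codes

def poll_once():
    """Fetch recent events and flag auto-recovery activity."""
    resp = requests.get(
        f"{ENDPOINT}/V1.0/{PROJECT_ID}/events",
        headers={"X-Auth-Token": TOKEN},
        timeout=10,
    )
    resp.raise_for_status()
    for event in resp.json().get("events", []):
        if event.get("event_name") in WATCHED:
            print(f"{event['event_name']}: wait for the event to end, "
                  "then check whether services are affected")

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(60)  # events are not real-time; poll periodically
```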

Table 2 BMS

| Event Source | Event Name | Event Severity | Description | Solution | Impact |
|---|---|---|---|---|---|
| BMS | Reboot BMS | Major | The BMS is restarted on the management console or by calling APIs. | Deploy service applications in HA mode, and check whether services recover after the BMS is restarted. | Services are interrupted. |
|  | Unexpected restart | Major | The BMS restarts unexpectedly, which may be caused by an OS fault or a hardware fault. | Deploy service applications in HA mode, and check whether services recover after the BMS is restarted. | Services are interrupted. |
|  | Stop BMS | Major | The BMS is stopped on the management console or by calling APIs. | Deploy service applications in HA mode, and check whether services recover after the BMS is started. | Services are interrupted. |
|  | Unexpected shutdown | Major | The BMS stops unexpectedly, which may be caused by an unexpected power-off or a hardware fault. | Deploy service applications in HA mode, and check whether services recover after the BMS is started. | Services are interrupted. |
|  | Network interruption | Major | The BMS network is interrupted, possibly because the BMS was unexpectedly stopped or restarted, the switch is faulty, or the gateway is faulty. | Deploy service applications in HA mode, and check whether services recover after the BMS is started. | Services are interrupted. |
|  | PCIe error | Major | The PCIe device or main board on the BMS is faulty. | Deploy service applications in HA mode, and check whether services recover after the BMS is recovered. | The network or disk read/write services are affected. |
|  | Disk fault | Major | The disk backplane or a disk on the BMS is faulty. | Deploy service applications in HA mode, and check whether services recover after the fault is rectified. | Data read/write services are affected, or the BMS cannot be started. |
|  | EVS error | Major | The BMS fails to connect to its EVS disks, possibly because the SDI interface is faulty or the remote storage device is faulty. | Deploy service applications in HA mode, and check whether services recover after the fault is rectified. | Data read/write services are affected, or the BMS cannot be started. |

Table 3 EIP

| Event Source | Event Name | Event Severity | Description | Solution | Impact |
|---|---|---|---|---|---|
| EIP | EIP bandwidth overflow | Major | The used bandwidth exceeds the purchased bandwidth, which may slow down the network or cause packet loss. NOTE: This event is reported only in the CN North-Beijing1, CN East-Shanghai1, CN East-Shanghai2, and CN South-Guangzhou regions. | Check whether the EIP bandwidth keeps increasing and whether services are normal. Expand capacity if necessary (see the sketch after this table). | The network becomes slow or packets are lost. |
|  | Release EIP | Minor | The EIP is deleted. | Check whether the resource was deleted by mistake. | The server cannot access the Internet. |
|  | EIP blocked | Critical | The used bandwidth exceeds 5 Gbit/s, so packets are discarded. This may be caused by DDoS attacks. | Replace the EIP to prevent services from being affected, and locate and deal with the fault. | Services are impacted. |
|  | EIP unblocked | Critical | The EIP has been unblocked. | Use the original EIP again. | No impact. |
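Because the overflow event is reported only in certain regions, it can help to alarm locally before the purchased bandwidth is exhausted. The following is a minimal threshold-check sketch: get_current_bandwidth_mbits() is a hypothetical placeholder for your actual metric source (for example, the EIP bandwidth metrics that Cloud Eye collects), and the numbers are illustrative.

```python
# Threshold check comparing observed EIP throughput to purchased bandwidth.
# get_current_bandwidth_mbits() is a hypothetical placeholder for a real
# metric query; the purchased bandwidth and warning ratio are illustrative.

PURCHASED_MBITS = 100.0   # purchased EIP bandwidth, in Mbit/s
WARN_RATIO = 0.8          # warn before the limit is actually reached

def get_current_bandwidth_mbits() -> float:
    """Stand-in for a real metric query; returns current usage in Mbit/s."""
    return 85.0  # dummy value for illustration

usage = get_current_bandwidth_mbits()
if usage >= PURCHASED_MBITS:
    print("EIP bandwidth overflow: the network may slow down or drop packets.")
elif usage >= WARN_RATIO * PURCHASED_MBITS:
    print(f"Approaching the limit ({usage:.0f}/{PURCHASED_MBITS:.0f} Mbit/s); "
          "consider expanding capacity.")
else:
    print("Bandwidth is within the purchased limit.")
```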

Table 4 CBR

| Event Source | Event Name | Event Severity | Description | Solution | Impact |
|---|---|---|---|---|---|
| CBR | Backup failed | Critical | Failed to create the backup. | Manually create a backup or contact customer service. | Data loss may occur. |
|  | Restoration failed | Critical | Failed to restore the resource using a backup. | Use other backups to restore the resource, or contact customer service. | Data loss may occur. |
|  | Backup deletion failed | Critical | Failed to delete the backup. | Try again later or contact customer service. | Charging may be abnormal. |
|  | Vault deletion failed | Critical | Failed to delete the vault. | Try again later or contact customer service. | Charging may be abnormal. |
|  | Replication failed | Critical | Failed to replicate the backup. | Try again later or contact customer service. | Data loss may occur. |
|  | Backup succeeded | Major | The backup is created successfully. | None | No impact. |
|  | Restoration succeeded | Major | The resource is restored successfully using a backup. | Check whether the data is successfully restored. | No impact. |
|  | Backup deletion succeeded | Major | The backup is deleted successfully. | None | No impact. |
|  | Vault deletion succeeded | Major | The vault is deleted successfully. | None | No impact. |
|  | Replication succeeded | Major | The backup is replicated successfully. | None | No impact. |

Table 5 RDS — resource exception

| Event Source | Event Name | Event Severity | Description | Solution | Impact |
|---|---|---|---|---|---|
| RDS | DB instance creation failure | Major | A DB instance fails to be created because the number of disks is insufficient, the quota is insufficient, or underlying resources are exhausted. | Check the disk quantity and quota. Release resources and create the DB instance again. | DB instances cannot be created. |
|  | Full backup failure | Major | A single full backup failure does not affect the files that have been successfully backed up, but it prolongs the incremental backup time during a point-in-time recovery (PITR). | Create a manual backup again. | Backup failed. |
|  | Primary/standby switchover failure | Major | The standby DB instance does not take over services from the primary DB instance due to network or server failures. The original primary DB instance continues to provide services within a short time. | Check whether the connection between the application and the database is re-established. | No impact. |
|  | Abnormal replication status | Major | The possible causes are as follows: 1. The replication delay between the primary and standby DB instances is too long, which usually occurs when a large amount of data is being written or a large transaction is being processed; during peak hours, data may be blocked. 2. The network between the primary and standby DB instances is disconnected. | Submit a service ticket (see the replication-lag sketch after this table). | This event does not interrupt data read and write on the DB instance, and your applications are not affected. |
|  | Replication status recovered | Major | The replication delay between the primary and standby DB instances is within the normal range, or the network connection between them has been restored. | No action is required. | No impact. |
|  | Faulty DB instance | Major | A single or primary DB instance is faulty due to a disaster or a server failure. | Check whether an automatic backup policy has been configured for the DB instance and submit a service ticket. | The database service may be unavailable. |
|  | DB instance recovered | Major | RDS uses high-availability tools to rebuild the standby DB instance for disaster recovery. After the recovery, this event is reported. | No action is required. | No impact. |
|  | Changing to primary/standby DB instances failure | Major | A fault occurs when the standby DB instance is being created or when synchronization is being configured between the primary and standby DB instances, possibly because resources are insufficient in the data center where the standby DB instance is located. | Submit a service ticket. | This event does not interrupt data read and write on the DB instance, and your applications are not affected. |
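The replication delay behind the "Abnormal replication status" event can also be observed from the database side on an RDS for MySQL replica. This is a minimal sketch under stated assumptions: it uses the PyMySQL driver, assumes network access to the replica and an account permitted to run SHOW SLAVE STATUS, and the 60-second threshold is illustrative. Seconds_Behind_Master is standard MySQL replication output.

```python
# Minimal replication-lag check for an RDS for MySQL read replica.
# Assumes the PyMySQL driver and an account permitted to run
# SHOW SLAVE STATUS; host and credentials below are placeholders.
import pymysql

conn = pymysql.connect(
    host="<replica-endpoint>",   # placeholder, not a real endpoint
    user="monitor",
    password="<password>",
    cursorclass=pymysql.cursors.DictCursor,
)
try:
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
    lag = row["Seconds_Behind_Master"] if row else None
    if lag is None:
        print("Replication is not running; submit a service ticket.")
    elif lag > 60:  # illustrative threshold
        print(f"Abnormal replication status: {lag}s behind the primary.")
    else:
        print(f"Replication healthy: {lag}s behind the primary.")
finally:
    conn.close()
```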

Table 6 RDS — operations

| Event Source | Event Name | Event Severity |
|---|---|---|
| RDS | Reset administrator password | Major |
|  | Operate DB instance | Major |
|  | Delete DB instance | Minor |
|  | Modify backup policy | Minor |
|  | Change parameter group | Minor |
|  | Delete parameter group | Minor |
|  | Reset parameter group | Minor |
|  | Change database port | Major |

Table 7 DDS

| Event Source | Event Name | Event Severity | Description | Solution | Impact |
|---|---|---|---|---|---|
| DDS | DB instance creation failure | Major | A DB instance fails to be created because the number of disks is insufficient, the quota is insufficient, or underlying resources are exhausted. | Check the disk quantity and quota. Release resources and create the DB instance again. | DB instances cannot be created. |
|  | Abnormal replication status | Major | The possible causes are as follows: 1. The replication delay between the primary and standby DB instances is too long, which usually occurs when a large amount of data is being written or a large transaction is being processed; during off-peak hours, the delay gradually decreases. 2. The network between the primary and standby DB instances is disconnected. | Submit a service ticket. | This event does not interrupt data read and write on the DB instance, and your applications are not affected. |
|  | Replication status recovered | Major | The replication delay between the primary and standby DB instances is within the normal range, or the network connection between them has been restored. | No action is required. | No impact. |
|  | Faulty DB instance | Major | This key alarm event is reported when an instance is faulty due to a disaster or a server failure. | Submit a service ticket. | The database service may be unavailable. |
|  | DB instance recovered | Major | If a disaster occurs, DDS provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported. | No action is required. | No impact. |
|  | Faulty node | Major | This key alarm event is reported when a database node is faulty due to a disaster or a server failure. | Check whether the database service is available and submit a service ticket. | The database service may be unavailable. |
|  | Node recovered | Major | If a disaster occurs, DDS provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported. | No action is required. | No impact. |
|  | Primary/standby switchover or failover | Major | This event is reported when a primary/standby switchover or a failover is triggered. | No action is required. | No impact. |

Table 8 GaussDB NoSQL

| Event Source | Event Name | Event Severity | Description | Solution | Impact |
|---|---|---|---|---|---|
| NoSQL | Failed to create a DB instance | Major | The DB instance quota or underlying resources are insufficient. | Release instances that are no longer used and try provisioning again, or submit a service ticket to adjust the quota upper limit. | DB instances cannot be created. |
|  | Failed to modify the specifications | Major | The underlying resources are insufficient. | Submit a service ticket. After the O&M personnel coordinate resources in the background, change the specifications again. | Services are interrupted. |
|  | Failed to add a node | Major | The underlying resources are insufficient. | Submit a service ticket. After the O&M personnel coordinate resources in the background, delete the node that failed to be added and add a new one. | No impact. |
|  | Failed to delete a node | Major | The underlying resources fail to be released. | Delete the node again. | No impact. |
|  | Failed to scale up the storage space | Major | The underlying resources are insufficient. | Submit a service ticket. After the O&M personnel coordinate resources in the background, scale up the storage space again. | Services may be interrupted. |
|  | Failed to reset the password | Major | Resetting the password timed out. | Reset the password again. | No impact. |
|  | Failed to modify a parameter group | Major | Modifying the parameter group timed out. | Modify the parameter group again. | No impact. |
|  | Failed to set the backup policy | Major | The database connection is abnormal. | Set the backup policy again. | No impact. |
|  | Failed to create a manual backup | Major | The backup files fail to be exported or uploaded. | Submit a service ticket to the O&M personnel. | Data cannot be backed up. |
|  | Failed to create an automated backup | Major | The backup files fail to be exported or uploaded. | Submit a service ticket to the O&M personnel. | Data cannot be backed up. |
|  | Faulty DB instance | Major | This key alarm event is reported when an instance is faulty due to a disaster or a server failure. | Submit a service ticket. | The database service may be unavailable. |
|  | DB instance recovered | Major | If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported. | No action is required. | No impact. |
|  | Faulty node | Major | This key alarm event is reported when a database node is faulty due to a disaster or a server failure. | Check whether the database service is available and submit a service ticket. | The database service may be unavailable. |
|  | Node recovered | Major | If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported. | No action is required. | No impact. |
|  | Primary/standby switchover or failover | Major | This event is reported when a primary/standby switchover or a failover is triggered. | No action is required. | No impact. |
|  | HotKeyOccurs | Major | The primary key is improperly designed, so hotspot data is concentrated in a single partition; or the application design is improper, causing frequent read and write operations on a single key. | 1. Choose a proper partition key. 2. Add a service cache so that the application reads hotspot data from the cache first (see the sketch after this table). | The service request success rate is affected, and cluster performance and stability may also be affected. |
|  | BigKeyOccurs | Major | The primary key is improperly designed, so a single partition holds too many records or too much data, causing unbalanced node loads. | 1. Choose a proper partition key. 2. Add a new partition key to hash the data. | As the data in the large partition grows, cluster stability deteriorates. |
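The cache-first mitigation suggested for HotKeyOccurs can be sketched as a read-through cache in front of the database. This is a minimal illustration only: fetch_from_db() is a hypothetical stand-in for a real database read, and the short TTL is an illustrative choice that bounds staleness for hotspot keys.

```python
# Read-through cache sketch for hot keys: the application checks a local
# cache first so that a hotspot key does not hammer a single partition.
# fetch_from_db() is a hypothetical stand-in for a real database read.
import time

CACHE_TTL = 5.0   # seconds; a short TTL bounds staleness for hot data
_cache = {}       # key -> (value, expiry timestamp)

def fetch_from_db(key: str):
    """Stand-in for a real database read (hypothetical helper)."""
    return f"value-for-{key}"

def get(key: str):
    """Serve hot keys from the cache first, falling back to the database."""
    now = time.monotonic()
    entry = _cache.get(key)
    if entry and entry[1] > now:
        return entry[0]                      # cache hit: no database read
    value = fetch_from_db(key)               # cache miss: one database read
    _cache[key] = (value, now + CACHE_TTL)   # repopulate for later reads
    return value

print(get("user:123"))  # first call reads the DB; later calls hit the cache
```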

Table 9 GaussDB(for MySQL)

| Event Source | Event Name | Event Severity | Description | Solution | Impact |
|---|---|---|---|---|---|
| GaussDB(for MySQL) | Failed to create a DB instance | Major | DB instances fail to be created because the quota is insufficient or underlying resources are exhausted. | Check the DB instance quota. Release resources and create the DB instance again. | DB instances cannot be created. |
|  | Read replica promotion failure | Major | The read replica fails to be promoted to the primary DB instance due to network or server failures. The original primary DB instance takes over services within a short time. | Submit a service ticket. | The read replica fails to be promoted to the primary DB instance. |
|  | Read replica creation failure | Major | Read replicas fail to be created because the quota is insufficient or underlying resources are exhausted. | Check the read replica quota. Release resources and create the read replica again. | Read replicas fail to be created. |
|  | Instance class change failure | Major | DB instance classes fail to be changed because the quota is insufficient or underlying resources are exhausted. | Submit a service ticket. | DB instance classes fail to be changed. |

Table 10 GaussDB(for openGauss)

| Event Source | Event Name | Event Severity | Description | Solution | Impact |
|---|---|---|---|---|---|
| GaussDB(for openGauss) | Process Status Alarm | Major | A key process exits, including the CMS/CMA, ETCD, GTM, CN, or DN process. | Wait until the process automatically recovers or a primary/standby failover is automatically performed, and check whether services are recovered. If not, contact SRE engineers. | If processes on primary nodes are faulty, services are interrupted and then rolled back. If processes on standby nodes are faulty, services are not affected. |
|  | Component Status Alarm | Major | A key component does not respond, including the CMA, ETCD, GTM, CN, or DN components. | Wait until the process automatically recovers or a primary/standby failover is automatically performed, and check whether services are recovered. If not, contact SRE engineers. | If processes on primary nodes do not respond, neither do services. If processes on standby nodes are faulty, services are not affected. |
|  | Cluster Status Alarm | Major | The cluster is abnormal. For example, the cluster is read-only, the majority of ETCDs are faulty, or cluster resources are unevenly distributed. | Contact SRE engineers. | If the cluster is read-only, only read-only requests are processed. If the majority of ETCDs are faulty, the cluster is unavailable. If cluster resources are unevenly distributed, cluster performance and reliability deteriorate. |
|  | Hardware Resource Alarm | Major | A major hardware fault occurs in the cluster, for example, a disk is damaged or the GTM network communication is faulty. | Contact SRE engineers. | Some or all services are affected. |
|  | Status Transition Alarm | Major | One of the following events occurs in the cluster: a DN build fails, a DN is forcibly promoted, a primary/standby DN switchover/failover occurs, or a primary/standby GTM switchover/failover occurs. | Wait until the fault is automatically rectified and check whether services are recovered. If not, contact SRE engineers. | Some services are interrupted. |
|  | Other Abnormal Alarm | Major | Disk usage threshold alarm. | Watch service changes and scale up the storage space as needed. | If the used storage space exceeds the threshold, the storage space cannot be scaled up. |

Table 11 VPC

| Event Source | Event Name | Event Severity |
|---|---|---|
| VPC | Delete VPC | Major |
|  | Modify VPC | Minor |
|  | Delete subnet | Minor |
|  | Modify subnet | Minor |
|  | Modify bandwidth | Minor |
|  | Delete VPN | Major |
|  | Modify VPN | Minor |

Table 12 EVS

| Event Source | Event Name | Event Severity |
|---|---|---|
| EVS | Update disk | Minor |
|  | Expand disk | Minor |
|  | Delete disk | Major |

Table 13 IAM

| Event Source | Event Name | Event Severity |
|---|---|---|
| IAM | Login | Minor |
|  | Logout | Minor |
|  | Change password | Major |
|  | Create user | Minor |
|  | Delete user | Major |
|  | Update user | Minor |
|  | Create user group | Minor |
|  | Delete user group | Major |
|  | Update user group | Minor |
|  | Create identity provider | Minor |
|  | Delete identity provider | Major |
|  | Update identity provider | Minor |
|  | Update metadata | Minor |
|  | Update security policy | Major |
|  | Add credential | Major |
|  | Delete credential | Major |
|  | Create project | Minor |
|  | Update project | Minor |
|  | Suspend project | Major |

Table 14 KMS

| Event Source | Event Name | Event Severity |
|---|---|---|
| KMS | Disable key | Major |
|  | Schedule key deletion | Minor |
|  | Retire grant | Major |
|  | Revoke grant | Major |

Table 15 OBS

| Event Source | Event Name | Event Severity |
|---|---|---|
| OBS | Delete bucket | Major |
|  | Delete bucket policy | Major |
|  | Set bucket ACL | Minor |
|  | Set bucket policy | Minor |