Updated on 2024-08-07 GMT+08:00

Events Supported by Event Monitoring

Table 1 Elastic Cloud Server (ECS)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

ECS

Restart triggered due to hardware fault

startAutoRecovery

Major

ECSs on the faulty host were automatically migrated to another properly running host. During the migration, the ECSs were restarted.

Wait for the event to end and check whether services are affected.

Services may be interrupted.

Restart completed due to hardware failure

endAutoRecovery

Major

The ECS was recovered after the automatic migration.

This event indicates that the ECS has recovered and is working properly.

None

Auto recovery timeout (being processed on the backend)

faultAutoRecovery

Major

Migrating the ECS to a normal host timed out.

Migrate services to other ECSs.

Services are interrupted.

GPU link fault

GPULinkFault

Critical

The GPU of the host on which the ECS is located was faulty or was recovering from a fault.

Deploy service applications in HA mode.

After the GPU fault is rectified, check whether services are restored.

Services are interrupted.

ECS deleted

deleteServer

Major

The ECS was deleted

  • on the management console.
  • by calling APIs.

Check whether the deletion was performed intentionally by a user.

Services are interrupted.

ECS restarted

rebootServer

Minor

The ECS was restarted

  • on the management console.
  • by calling APIs.

Check whether the restart was performed intentionally by a user.

  • Deploy service applications in HA mode.
  • After the ECS starts up, check whether services recover.

Services are interrupted.

ECS stopped

stopServer

Minor

The ECS was stopped

  • on the management console.
  • by calling APIs.
NOTE:

This event is reported only after CTS is enabled.

  • Check whether the stop operation was performed intentionally by a user.
  • Deploy service applications in HA mode.
  • After the ECS starts up, check whether services recover.

Services are interrupted.

NIC deleted

deleteNic

Major

The ECS NIC was deleted

  • on the management console.
  • by calling APIs.

  • Check whether the deletion was performed intentionally by a user.
  • Deploy service applications in HA mode.
  • After the NIC is deleted, check whether services recover.

Services may be interrupted.

ECS resized

resizeServer

Minor

The ECS specifications were resized

  • on the management console.
  • by calling APIs.

  • Check whether the operation was performed by a user.
  • Deploy service applications in HA mode.
  • After the ECS is resized, check whether services have recovered.

Services are interrupted.

GuestOS restarted

RestartGuestOS

Minor

The guest OS was restarted.

Contact O&M personnel.

Services may be interrupted.

ECS failure due to abnormal host processes

VMFaultsByHostProcessExceptions

Critical

The processes of the host accommodating the ECS were abnormal.

Contact O&M personnel.

The ECS is faulty.

Startup failure

faultPowerOn

Major

The ECS failed to start.

Start the ECS again. If the problem persists, contact O&M personnel.

The ECS cannot start.

Host breakdown risk

hostMayCrash

Major

The host where the ECS resides may break down, and the risk cannot be mitigated through live migration for certain reasons.

Migrate services running on the ECS first and delete or stop the ECS. Start the ECS only after the O&M personnel eliminate the risk.

The host may break down, causing service interruption.

Scheduled migration completed

instance_migrate_completed

Major

Scheduled ECS migration is completed.

Wait until the ECSs become available and check whether services are affected.

Services may be interrupted.

Scheduled migration being executed

instance_migrate_executing

Major

ECSs are being migrated as scheduled.

Wait until the event is complete and check whether services are affected.

Services may be interrupted.

Scheduled migration canceled

instance_migrate_canceled

Major

Scheduled ECS migration is canceled.

None

None

Scheduled migration failed

instance_migrate_failed

Major

ECSs failed to be migrated as scheduled.

Contact O&M personnel.

Services are interrupted.

Scheduled migration to be executed

instance_migrate_scheduled

Major

ECSs will be migrated as scheduled.

Check the impact on services during the execution window.

None

Scheduled specification modification failed

instance_resize_failed

Major

Specifications failed to be modified as scheduled.

Contact O&M personnel.

Services are interrupted.

Scheduled specification modification completed

instance_resize_completed

Major

Scheduled specifications modification is completed.

None

None

Scheduled specification modification being executed

instance_resize_executing

Major

Specifications are being modified as scheduled.

Wait until the event is completed and check whether services are affected.

Services are interrupted.

Scheduled specification modification canceled

instance_resize_canceled

Major

Scheduled specifications modification is canceled.

None

None

Scheduled specification modification to be executed

instance_resize_scheduled

Major

Specifications will be modified as scheduled.

Check the impact on services during the execution window.

None

Scheduled redeployment to be executed

instance_redeploy_scheduled

Major

ECSs will be redeployed on new hosts as scheduled.

Check the impact on services during the execution window.

None

Scheduled restart to be executed

instance_reboot_scheduled

Major

ECSs will be restarted as scheduled.

Check the impact on services during the execution window.

None

Scheduled stop to be executed

instance_stop_scheduled

Major

ECSs will be stopped as scheduled because they are affected by underlying hardware or system O&M.

Check the impact on services during the execution window.

None

Live migration started

liveMigrationStarted

Major

The host where the ECS is located may be faulty. The ECS is live migrated in advance to prevent service interruptions caused by a host breakdown.

Wait for the event to end and check whether services are affected.

Services may be interrupted for less than 1s.

Live migration completed

liveMigrationCompleted

Major

The live migration is complete, and the ECS is running properly.

Check whether services are running properly.

None

Live migration failure

liveMigrationFailed

Major

An error occurred during the live migration of an ECS.

Check whether services are running properly.

There is a low probability that services are interrupted.

ECC uncorrectable error alarm generated on GPU SRAM

SRAMUncorrectableEccError

Major

There are ECC uncorrectable errors generated on GPU SRAM.

If services are affected, submit a service ticket.

The GPU hardware may be faulty. As a result, the GPU memory is faulty, and services exit abnormally.

FPGA link fault

FPGALinkFault

Critical

The FPGA of the host on which the ECS is located was

  • faulty.
  • recovering from a fault.

Deploy service applications in HA mode.

After the FPGA fault is rectified, check whether services are restored.

Services are interrupted.

Scheduled redeployment to be authorized

instance_redeploy_inquiring

Major

Because they are affected by underlying hardware or system O&M, ECSs will be redeployed on new hosts as scheduled.

Authorize scheduled redeployment.

None

Local disk replacement canceled

localdisk_recovery_canceled

Major

Local disks are faulty.

None

None

Local disk replacement to be executed

localdisk_recovery_scheduled

Major

Local disks are faulty.

Check the impact on services during the execution window.

None

Xid event alarm generated on GPU

commonXidError

Major

An Xid event alarm occurred on the GPU.

If services are affected, submit a service ticket.

GPU hardware, driver, or application problems cause Xid events, which may lead to abnormal service exits.

nvidia-smi suspended

nvidiaSmiHangEvent

Major

nvidia-smi timed out.

If services are affected, submit a service ticket.

The driver may report an error during service running.
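
One way to check this condition on an affected ECS is to run nvidia-smi with a timeout and treat an expired timeout as the hang described above. A minimal Python sketch follows; the 30-second timeout is an assumed value, not a threshold documented by Cloud Eye.

# Hedged sketch: detect a hung nvidia-smi (the condition behind
# nvidiaSmiHangEvent). The 30 s timeout is an assumption.
import subprocess

def nvidia_smi_responsive(timeout_s: float = 30.0) -> bool:
    try:
        subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=True,
        )
        return True
    except subprocess.TimeoutExpired:
        return False  # nvidia-smi hangs: matches this event
    except (subprocess.CalledProcessError, FileNotFoundError):
        return False  # driver error or tool not installed

if __name__ == "__main__":
    print("nvidia-smi responsive:", nvidia_smi_responsive())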

NPU: uncorrectable ECC error

UncorrectableEccErrorCount

Major

There are uncorrectable ECC errors generated on the NPU SRAM.

If services are affected, replace the NPU with another one.

Services may be interrupted.

Scheduled redeployment canceled

instance_redeploy_canceled

Major

The scheduled redeployment of ECSs on new hosts, required by underlying hardware or system O&M, is canceled.

None

None

Scheduled redeployment being executed

instance_redeploy_executing

Major

ECSs are being redeployed on new hosts as scheduled because they are affected by underlying hardware or system O&M.

Wait until the event is complete and check whether services are affected.

Services are interrupted.

Scheduled redeployment completed

instance_redeploy_completed

Major

ECSs have been redeployed on new hosts as scheduled because they were affected by underlying hardware or system O&M.

Wait until the redeployed ECSs are available and check whether services are affected.

None

Scheduled redeployment failed

instance_redeploy_failed

Major

ECSs failed to be redeployed on new hosts as scheduled.

Contact O&M personnel.

Services are interrupted.

Local disk replacement to be authorized

localdisk_recovery_inquiring

Major

Local disks are faulty.

Authorize local disk replacement.

Local disks are unavailable.

Local disks being replaced

localdisk_recovery_executing

Major

Local disks are faulty.

Wait until the local disks are replaced and check whether the local disks are available.

Local disks are unavailable.

Local disks replaced

localdisk_recovery_completed

Major

Local disks are faulty.

Wait until the services are running properly and check whether local disks are available.

None

Local disk replacement failed

localdisk_recovery_failed

Major

Local disks are faulty.

Contact O&M personnel.

Local disks are unavailable.

Once a physical host running ECSs breaks down, the ECSs are automatically migrated to a functional physical host. During the migration, the ECSs will be restarted.

Table 2 Bare Metal Server (BMS)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

BMS

ECC uncorrectable error alarm generated on GPU SRAM

SRAMUncorrectableEccError

Major

There are ECC uncorrectable errors generated on GPU SRAM.

If services are affected, submit a service ticket.

The GPU hardware may be faulty. As a result, the GPU memory is faulty, and services exit abnormally.

BMS restarted

osReboot

Major

The BMS was restarted

  • on the management console.
  • by calling APIs.

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

Services are interrupted.

Unexpected restart

serverReboot

Major

The BMS restarted unexpectedly, which may be caused by

  • OS faults.
  • hardware faults.

  • Deploy service applications in HA mode.
  • After the BMS is restarted, check whether services recover.

Services are interrupted.

BMS stopped

osShutdown

Major

The BMS was stopped

  • on the management console.
  • by calling APIs.

  • Deploy service applications in HA mode.
  • After the BMS is started, check whether services recover.

Services are interrupted.

Unexpected shutdown

serverShutdown

Major

The BMS was stopped unexpectedly, which may be caused by

  • unexpected power-off.
  • hardware faults.

  • Deploy service applications in HA mode.
  • After the BMS is started, check whether services recover.

Services are interrupted.

Network disconnection

linkDown

Major

The BMS network was disconnected. Possible causes are as follows:

  • The BMS was unexpectedly stopped or restarted.
  • The switch was faulty.
  • The gateway was faulty.

  • Deploy service applications in HA mode.
  • After the BMS is started, check whether services recover.

Services are interrupted.

PCIe error

pcieError

Major

The PCIe devices or mainboard of the BMS were faulty.

  • Deploy service applications in HA mode.
  • After the BMS is started, check whether services recover.

The network or disk read/write services are affected.

Disk fault

diskError

Major

The disk backplane or disks of the BMS were faulty.

  • Deploy service applications in HA mode.
  • After the fault is rectified, check whether services recover.

Data read/write services are affected, or the BMS cannot be started.

EVS error

storageError

Major

The BMS failed to connect to EVS disks. Possible causes are as follows:

  • The SDI card was faulty.
  • Remote storage devices were faulty.

  • Deploy service applications in HA mode.
  • After the fault is rectified, check whether services recover.

Data read/write services are affected, or the BMS cannot be started.

Inforom alarm generated on GPU

gpuInfoROMAlarm

Major

The driver failed to read inforom information due to GPU faults.

Non-critical services can continue to use the GPU card. For critical services, submit a service ticket to resolve this issue.

Services will not be affected if inforom information cannot be read. If error correction code (ECC) errors are reported on GPU, faulty pages may not be automatically retired and services are affected.

Double-bit ECC alarm generated on GPU

doubleBitEccError

Major

A double-bit ECC error occurred on GPU.

  1. If services are interrupted, restart the services to restore them.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Services may be interrupted. After faulty pages are retired, the GPU card can continue to be used.

Too many retired pages

gpuTooManyRetiredPagesAlarm

Major

An ECC page retirement error occurred on GPU.

If services are affected, submit a service ticket.

Services may be affected.

ECC alarm generated on GPU A100

gpuA100EccAlarm

Major

An ECC error occurred on GPU.

  1. If services are interrupted, restart the services to restore them.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Services may be interrupted. After faulty pages are retired, the GPU card can continue to be used.

GPU ECC memory page retirement failure

eccPageRetirementRecordingFailure

Major

Automatic page retirement failed due to ECC errors.

  1. If services are interrupted, restart the services to restore them.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Services may be interrupted, and memory page retirement fails. As a result, services can no longer use the GPU card.

GPU ECC page retirement alarm generated

eccPageRetirementRecordingEvent

Minor

Memory pages are automatically retired due to ECC errors.

  1. If services are interrupted, restart the services to restore them.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Generally, this alarm is generated together with the ECC error alarm. If this alarm is generated independently, services are not affected.

Too many single-bit ECC errors on GPU

highSingleBitEccErrorRate

Major

There are too many single-bit ECC errors.

  1. If services are interrupted, restart the services to restore them.
  2. If services cannot be restarted, restart the VM where services are running.
  3. If services still cannot be restored, submit a service ticket.

Single-bit errors can be automatically rectified and do not affect GPU-related applications.

GPU card not found

gpuDriverLinkFailureAlarm

Major

A GPU link is normal, but the NVIDIA driver cannot find the GPU card.

  1. Restart the VM to restore services.
  2. If services still cannot be restored, submit a service ticket.

The GPU card cannot be found.

GPU link faulty

gpuPcieLinkFailureAlarm

Major

GPU hardware information cannot be queried through lspci due to a GPU link fault.

If services are affected, submit a service ticket.

The driver cannot use the GPU.

GPU card lost

vmLostGpuAlarm

Major

The number of GPU cards on the VM is less than the number specified in the specifications.

If services are affected, submit a service ticket.

GPU cards are lost.
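
To verify this condition on the VM, you can compare the number of GPUs the driver enumerates with the number defined by the instance specifications. A minimal Python sketch follows; EXPECTED_GPUS is an assumed value you would take from your flavor.

# Hedged sketch for the vmLostGpuAlarm condition: compare driver-visible
# GPUs against the count the flavor should provide (assumed value below).
import subprocess

EXPECTED_GPUS = 8  # assumption: set this from your instance specifications

def visible_gpu_count() -> int:
    out = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    ).stdout
    # nvidia-smi -L prints one "GPU <n>: ..." line per visible card
    return sum(1 for line in out.splitlines() if line.startswith("GPU "))

if __name__ == "__main__":
    found = visible_gpu_count()
    if found < EXPECTED_GPUS:
        print(f"GPU card lost: expected {EXPECTED_GPUS}, found {found}")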

GPU memory page faulty

gpuMemoryPageFault

Major

The GPU memory page is faulty, which may be caused by applications, drivers, or hardware.

If services are affected, submit a service ticket.

The GPU hardware may be faulty. As a result, the GPU memory is faulty, and services exit abnormally.

GPU image engine faulty

graphicsEngineException

Major

The GPU image engine is faulty, which may be caused by applications, drivers, or hardware.

If services are affected, submit a service ticket.

The GPU hardware may be faulty. As a result, the image engine is faulty, and services exit abnormally.

GPU temperature too high

highTemperatureEvent

Major

The GPU temperature is too high.

If services are affected, submit a service ticket.

If the GPU temperature exceeds the threshold, the GPU performance may deteriorate.

GPU NVLink faulty

nvlinkError

Major

A hardware fault occurs on the NVLink.

If services are affected, submit a service ticket.

The NVLink link is faulty and unavailable.

System maintenance inquiring

system_maintenance_inquiring

Major

Authorization is being requested for the scheduled BMS maintenance task.

Authorize the maintenance.

None

System maintenance waiting

system_maintenance_scheduled

Major

The scheduled BMS maintenance task is waiting to be executed.

Clarify the impact on services during the execution window and ensure that the impact is acceptable to users.

None

System maintenance canceled

system_maintenance_canceled

Major

The scheduled BMS maintenance is canceled.

None

None

System maintenance executing

system_maintenance_executing

Major

BMSs are being maintained as scheduled.

After the maintenance is complete, check whether services are affected.

Services are interrupted.

System maintenance completed

system_maintenance_completed

Major

The scheduled BMS maintenance is completed.

Wait until the BMSs become available and check whether services recover.

None

System maintenance failure

system_maintenance_failed

Major

The scheduled BMS maintenance task failed.

Contact O&M personnel.

Services are interrupted.

GPU Xid error

commonXidError

Major

An Xid event alarm is generated on the GPU.

If services are affected, submit a service ticket.

An Xid error is caused by GPU hardware, driver, or application problems, which may result in abnormal service exit.

NPU: device not found by npu-smi info

NPUSMICardNotFound

Major

The Ascend driver is faulty or the NPU is disconnected.

Transfer this issue to the Ascend or hardware team for handling.

The NPU cannot be used normally.

NPU: PCIe link error

PCIeErrorFound

Major

The lspci command returns rev ff, indicating that the NPU is abnormal.

Restart the BMS. If the issue persists, transfer it to the hardware team for processing.

The NPU cannot be used normally.
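
The description above says lspci reports rev ff for an abnormal NPU, so a quick local check is to scan lspci output for that marker. A minimal sketch, assuming lspci is available on the BMS:

# Hedged sketch for PCIeErrorFound: list PCIe devices whose revision reads
# "ff", which the event description identifies as an abnormal NPU.
import subprocess

def abnormal_pcie_devices() -> list[str]:
    out = subprocess.run(
        ["lspci"], capture_output=True, text=True, check=True
    ).stdout
    return [line for line in out.splitlines() if "(rev ff)" in line]

if __name__ == "__main__":
    for dev in abnormal_pcie_devices():
        print("Abnormal PCIe device:", dev)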

NPU: device not found by lspci

LspciCardNotFound

Major

The NPU is disconnected.

Transfer this issue to the hardware team for handling.

The NPU cannot be used normally.

NPU: overtemperature

TemperatureOverUpperLimit

Major

The temperature of DDR or software is too high.

Stop services, restart the BMS, check the heat dissipation system, and reset the devices.

The BMS may be powered off and devices may not be found.

NPU: uncorrectable ECC error

UncorrectableEccErrorCount

Major

There are uncorrectable ECC errors generated on the NPU SRAM.

If services are affected, replace the NPU with another one.

Services may be interrupted.

NPU: request for BMS restart

RebootVirtualMachine

Informational

A fault occurs and the BMS needs to be restarted.

Collect the fault information, and restart the BMS.

Services may be interrupted.

NPU: request for SoC reset

ResetSOC

Informational

A fault occurs and the SoC needs to be reset.

Collect the fault information, and reset the SoC.

Services may be interrupted.

NPU: request for restart AI process

RestartAIProcess

Informational

A fault occurs and the AI process needs to be restarted.

Collect the fault information, and restart the AI process.

The current AI task will be interrupted.

NPU: error codes

NPUErrorCodeWarning

Major

A large number of NPU error codes indicating major or higher-level errors are returned. You can further locate the faults based on the error codes.

Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition.

Services may be interrupted.

nvidia-smi suspended

nvidiaSmiHangEvent

Major

nvidia-smi timed out.

If services are affected, submit a service ticket.

The driver may report an error during service running.

nv_peer_mem loading error

NvPeerMemException

Minor

The NVLink or nv_peer_mem cannot be loaded.

Restore or reinstall the NVLink.

nv_peer_mem cannot be used.

Fabric Manager error

NvFabricManagerException

Minor

The BMS meets the NVLink conditions and NVLink is installed, but Fabric Manager is abnormal.

Restore or reinstall the NVLink.

NVLink cannot be used normally.

IB card error

InfinibandStatusException

Major

The IB card or its physical status is abnormal.

Transfer this issue to the hardware team for handling.

The IB card cannot work normally.

Table 3 Elastic IP (EIP)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

EIP

EIP bandwidth exceeded

EIPBandwidthOverflow

Major

The used bandwidth exceeded the purchased bandwidth, which may slow down the network or cause packet loss. The value reported for this event is the maximum value in the monitoring period; the EIP inbound and outbound bandwidth values are taken at specific points in time within that period.

The metrics are described as follows:

egressDropBandwidth: dropped outbound packets (bytes)

egressAcceptBandwidth: accepted outbound packets (bytes)

egressMaxBandwidthPerSec: peak outbound bandwidth (byte/s)

ingressAcceptBandwidth: accepted inbound packets (bytes)

ingressMaxBandwidthPerSec: peak inbound bandwidth (byte/s)

ingressDropBandwidth: dropped inbound packets (bytes)

NOTE:

EIP bandwidth overflow is available only in the following regions: CN North-Beijing1, CN North-Beijing4, CN North-Ulanqab1, CN East-Shanghai1, CN East-Shanghai2, CN Southwest-Guiyang1, and CN South-Guangzhou.

Check whether the EIP bandwidth keeps increasing and whether services are normal. Increase bandwidth if necessary.

The network becomes slow or packets are lost.
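
The dropped and accepted byte counters listed above can be combined to estimate how much traffic was lost in a monitoring period. A small illustrative calculation in Python, with made-up sample values:

# Illustrative only: relate egressDropBandwidth and egressAcceptBandwidth
# (bytes per monitoring period) as a drop ratio. Sample values are invented.
def drop_ratio(dropped_bytes: int, accepted_bytes: int) -> float:
    total = dropped_bytes + accepted_bytes
    return dropped_bytes / total if total else 0.0

egress_drop = 5_000_000     # egressDropBandwidth (bytes)
egress_accept = 95_000_000  # egressAcceptBandwidth (bytes)
print(f"Outbound drop ratio: {drop_ratio(egress_drop, egress_accept):.1%}")  # 5.0%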

EIP released

deleteEip

Minor

The EIP was released.

Check whether the EIP was released by mistake.

The server that has the EIP bound cannot access the Internet.

EIP blocked

blockEIP

Critical

The used bandwidth of the EIP exceeded 5 Gbit/s, so the EIP was blocked and packets were discarded. Such an event may be caused by DDoS attacks.

Replace the EIP to prevent services from being affected.

Locate and deal with the fault.

Services are impacted.

EIP unblocked

unblockEIP

Critical

The EIP was unblocked.

Use the previous EIP again.

None

EIP traffic scrubbing started

ddosCleanEIP

Major

Traffic scrubbing on the EIP was started to prevent DDoS attacks.

Check whether the EIP was attacked.

Services may be interrupted.

EIP traffic scrubbing ended

ddosEndCleanEip

Major

Traffic scrubbing on the EIP to prevent DDoS attacks was ended.

Check whether the EIP was attacked.

Services may be interrupted.

QoS bandwidth exceeded

EIPBandwidthRuleOverflow

Major

The used QoS bandwidth exceeded the allocated bandwidth, which may slow down the network or cause packet loss. The value reported for this event is the maximum value in the monitoring period; the EIP inbound and outbound bandwidth values are taken at specific points in time within that period.

egressDropBandwidth: dropped outbound packets (bytes)

egressAcceptBandwidth: accepted outbound packets (bytes)

egressMaxBandwidthPerSec: peak outbound bandwidth (byte/s)

ingressAcceptBandwidth: accepted inbound packets (bytes)

ingressMaxBandwidthPerSec: peak inbound bandwidth (byte/s)

ingressDropBandwidth: dropped inbound packets (bytes)

Check whether the EIP bandwidth keeps increasing and whether services are normal. Increase bandwidth if necessary.

The network becomes slow or packets are lost.

Table 4 Advanced Anti-DDoS (AAD)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

AAD

DDoS Attack Events

ddosAttackEvents

Major

A DDoS attack occurs in the AAD protected lines.

Assess the impact on services based on the attack traffic and attack type. If the attack traffic exceeds your purchased elastic bandwidth, switch to another line or increase your bandwidth.

Services may be interrupted.

Domain name scheduling event

domainNameDispatchEvents

Major

Scheduling is triggered for the high-defense CNAME of the domain name, and the domain name is resolved to another high-defense IP address.

Pay attention to the workloads involving the domain name.

Services are not affected.

Blackhole event

blackHoleEvents

Major

The attack traffic exceeds the purchased AAD protection threshold.

A blackhole is canceled after 30 minutes by default. The actual blackhole duration depends on how many times the blackhole has been triggered and the peak attack traffic on the current day, up to a maximum of 24 hours. If you need to restore access before the blackhole is deactivated, contact technical support.

Services may be interrupted.

Cancel Blackhole

cancelBlackHole

Informational

The customer's AAD instance recovers from the black hole state.

This is only a prompt and no action is required.

Customer services recover.

IP address scheduling triggered

ipDispatchEvents

Major

The IP route was changed.

Check the workloads of the IP address.

Services are not affected.

Table 5 Elastic Load Balance (ELB)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

ELB

The backend servers are unhealthy.

healthCheckUnhealthy

Major

Generally, this problem occurs because services on the backend server are offline. This event is no longer reported after it has been reported several times.

Ensure that the backend servers are running properly.

ELB does not forward requests to unhealthy backend servers. If all backend servers in a backend server group are detected as unhealthy, services will be interrupted.
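
Conceptually, a health check is a periodic probe of each backend; the sketch below imitates one with a plain HTTP request. The URL and timeout are assumptions for illustration; ELB actually uses the protocol, port, and thresholds configured in its own health check settings.

# Hedged sketch of an HTTP health probe (not the ELB implementation).
import urllib.error
import urllib.request

def backend_healthy(url: str, timeout_s: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300  # healthy on a 2xx response
    except (urllib.error.URLError, TimeoutError):
        return False  # connection error or timeout: unhealthy

if __name__ == "__main__":
    print(backend_healthy("http://192.0.2.10:8080/health"))  # assumed backend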

The backend server is detected healthy.

healthCheckRecovery

Minor

The backend server is detected healthy.

No further action is required.

The load balancer can properly route requests to the backend server.

Table 6 Cloud Backup and Recovery (CBR)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

CBR

Failed to create the backup.

backupFailed

Critical

The backup failed to be created.

Manually create a backup or contact customer service.

Data loss may occur.

Failed to restore the resource using a backup.

restorationFailed

Critical

The resource failed to be restored using a backup.

Restore the resource using another backup or contact customer service.

Data loss may occur.

Failed to delete the backup.

backupDeleteFailed

Critical

The backup failed to be deleted.

Try again later or contact customer service.

Charging may be abnormal.

Failed to delete the vault.

vaultDeleteFailed

Critical

The vault failed to be deleted.

Try again later or contact technical support.

Charging may be abnormal.

Replication failure

replicationFailed

Critical

The backup failed to be replicated.

Try again later or contact technical support.

Data loss may occur.

The backup is created successfully.

backupSucceeded

Major

The backup was created.

None

None

Resource restoration using a backup succeeded.

restorationSucceeded

Major

The resource was restored using a backup.

Check whether the data is successfully restored.

None

The backup is deleted successfully.

backupDeletionSucceeded

Major

The backup was deleted.

None

None

The vault is deleted successfully.

vaultDeletionSucceeded

Major

The vault was deleted.

None

None

Replication success

replicationSucceeded

Major

The backup was replicated successfully.

None

None

Client offline

agentOffline

Critical

The backup client was offline.

Ensure that the Agent status is normal and the backup client can be connected to Huawei Cloud.

Backup tasks may fail.

Client online

agentOnline

Major

The backup client was online.

None

None

Table 7 Relational Database Service (RDS) — resource exception

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

RDS

DB instance creation failure

createInstanceFailed

Major

Generally, the cause is that the number of disks is insufficient due to quota limits, or underlying resources are exhausted.

The selected resource specifications are insufficient. Select other available specifications and try again.

DB instances cannot be created.

Full backup failure

fullBackupFailed

Major

A single full backup failure does not affect the files that have been successfully backed up, but it prolongs the incremental backup time during a point-in-time restore (PITR).

Try again.

Restoration using backups will be affected.

Read replica promotion failure

activeStandBySwitchFailed

Major

The standby DB instance failed to take over workloads from the primary DB instance due to network or server failures. The original primary DB instance resumes providing services within a short time.

Perform the operation again during off-peak hours.

Read replica promotion failed.

Replication status abnormal

abnormalReplicationStatus

Major

The possible causes are as follows:

  • The replication delay between the primary instance and the standby instance or a read replica is too long. This usually occurs when a large amount of data is being written to databases or a large transaction is being processed. During peak hours, data may be blocked.
  • The network between the primary instance and the standby instance or a read replica is disconnected.

The issue is being fixed. Please wait for our notifications.

The replication status is abnormal.

Replication status recovered

replicationStatusRecovered

Major

The replication delay between the primary and standby instances is within the normal range, or the network connection between them has been restored.

Check whether services are running properly.

Replication status is recovered.

DB instance faulty

faultyDBInstance

Major

A single or primary DB instance was faulty due to a catastrophic failure, for example, server failure.

The issue is being fixed. Please wait for our notifications.

The instance status is abnormal.

DB instance recovered

DBInstanceRecovered

Major

RDS rebuilds the standby DB instance using its high availability capability. After the instance is rebuilt, this event is reported.

The DB instance status is normal. Check whether services are running properly.

The instance is recovered.

Failure of changing single DB instance to primary/standby

singleToHaFailed

Major

A fault occurs when RDS is creating the standby DB instance or configuring replication between the primary and standby DB instances. The fault may occur because resources are insufficient in the data center where the standby DB instance is located.

Automatic retry is in progress.

Changing a single DB instance to primary/standby failed.

Database process restarted

DatabaseProcessRestarted

Major

The database process is stopped due to insufficient memory or high load.

Check whether services are running properly.

The primary instance is restarted. Services are interrupted for a short period of time.

Instance storage full

instanceDiskFull

Major

Generally, the cause is that the data space usage is too high.

Scale up the storage.

The instance storage is used up. No data can be written into databases.

Instance storage full recovered

instanceDiskFullRecovered

Major

The instance disk is recovered.

Check whether services are running properly.

The instance has available storage.

Kafka connection failed

kafkaConnectionFailed

Major

The network is unstable or the Kafka server does not work properly.

Check whether services are affected.

None

Table 8 Relational Database Service (RDS) — operations

Event Source

Event Name

Event ID

Event Severity

Description

RDS

Reset administrator password

resetPassword

Major

The password of the database administrator is reset.

Operate DB instance

instanceAction

Major

The storage space is scaled or the instance class is changed.

Delete DB instance

deleteInstance

Minor

The DB instance is deleted.

Modify backup policy

setBackupPolicy

Minor

The backup policy is modified.

Modify parameter group

updateParameterGroup

Minor

The parameter group is modified.

Delete parameter group

deleteParameterGroup

Minor

The parameter group is deleted.

Reset parameter group

resetParameterGroup

Minor

The parameter group is reset.

Change database port

changeInstancePort

Major

The database port is changed.

Primary/standby switchover or failover

PrimaryStandbySwitched

Major

A switchover or failover is performed.

Table 9 Document Database Service (DDS)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

DDS

DB instance creation failure

DDSCreateInstanceFailed

Major

A DDS instance fails to be created due to insufficient disks, quota, or underlying resources.

Check the number and quota of disks. Release resources and create DDS instances again.

DDS instances cannot be created.

Replication failed

DDSAbnormalReplicationStatus

Major

The possible causes are as follows:

  • The replication delay between the primary instance and the standby instance or a read replica is too long. This usually occurs when a large amount of data is being written to databases or a large transaction is being processed. During peak hours, data may be blocked.
  • The network between the primary instance and the standby instance or a read replica is disconnected.

Submit a service ticket.

Your applications are not affected because this event does not interrupt data read and write.

Replication recovered

DDSReplicationStatusRecovered

Major

The replication delay between the primary and standby instances is within the normal range, or the network connection between them has been restored.

No action is required.

None

DB instance failed

DDSFaultyDBInstance

Major

This event is a key alarm event and is reported when an instance is faulty due to a disaster or a server failure.

Submit a service ticket.

The database service may be unavailable.

DB instance recovered

DDSDBInstanceRecovered

Major

If a disaster occurs, DDS provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported.

No action is required.

None

Faulty node

DDSFaultyDBNode

Major

This event is a key alarm event and is reported when a database node is faulty due to a disaster or a server failure.

Check whether the database service is available and submit a service ticket.

The database service may be unavailable.

Node recovered

DDSDBNodeRecovered

Major

If a disaster occurs, DDS provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported.

No action is required.

None

Primary/standby switchover or failover

DDSPrimaryStandbySwitched

Major

A primary/standby switchover is performed or a failover is triggered.

No action is required.

None

Insufficient storage space

DDSRiskyDataDiskUsage

Major

The storage space is insufficient.

Scale up storage space. For details, see section "Scaling Up Storage Space" in the corresponding user guide.

The instance is set to read-only and data cannot be written to the instance.

Data disk expanded and being writable

DDSDataDiskUsageRecovered

Major

The capacity of a data disk has been expanded and the data disk becomes writable.

No further action is required.

No adverse impact.

Schedule for deleting a KMS key

DDSplanDeleteKmsKey

Major

A request to schedule deletion of a KMS key was submitted.

After the KMS key is scheduled to be deleted, either decrypt the data encrypted by the KMS key in a timely manner or cancel the key deletion.

After the KMS key is deleted, users cannot encrypt disks.

Table 10 GaussDB NoSQL

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

GaussDB NoSQL

DB instance creation failed

NoSQLCreateInstanceFailed

Major

The instance quota or underlying resources are insufficient.

Release the instances that are no longer used and try to provision them again, or submit a service ticket to adjust the quota.

DB instances cannot be created.

Specifications modification failed

NoSQLResizeInstanceFailed

Major

The underlying resources are insufficient.

Submit a service ticket. The O&M personnel will coordinate resources in the background, and then you need to change the specifications again.

Services are interrupted.

Node adding failed

NoSQLAddNodesFailed

Major

The underlying resources are insufficient.

Submit a service ticket. The O&M personnel will coordinate resources in the background, and then you can delete the node that failed to be added and add a new one.

None

Node deletion failed

NoSQLDeleteNodesFailed

Major

The underlying resources fail to be released.

Delete the node again.

None

Storage space scale-up failed

NoSQLScaleUpStorageFailed

Major

The underlying resources are insufficient.

Submit a service ticket. The O&M personnel will coordinate resources in the background, and then you can scale up the storage space again.

Services may be interrupted.

Password reset failed

NoSQLResetPasswordFailed

Major

Resetting the password times out.

Reset the password again.

None

Parameter group change failed

NoSQLUpdateInstanceParamGroupFailed

Major

Changing a parameter group times out.

Change the parameter group again.

None

Backup policy configuration failed

NoSQLSetBackupPolicyFailed

Major

The database connection is abnormal.

Configure the backup policy again.

None

Manual backup creation failed

NoSQLCreateManualBackupFailed

Major

The backup files fail to be exported or uploaded.

Submit a service ticket to the O&M personnel.

Data cannot be backed up.

Automated backup creation failed

NoSQLCreateAutomatedBackupFailed

Major

The backup files fail to be exported or uploaded.

Submit a service ticket to the O&M personnel.

Data cannot be backed up.

Faulty DB instance

NoSQLFaultyDBInstance

Major

This event is a key alarm event and is reported when an instance is faulty due to a disaster or a server failure.

Submit a service ticket.

The database service may be unavailable.

DB instance recovered

NoSQLDBInstanceRecovered

Major

If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported.

No action is required.

None

Faulty node

NoSQLFaultyDBNode

Major

This event is a key alarm event and is reported when a database node is faulty due to a disaster or a server failure.

Check whether the database service is available and submit a service ticket.

The database service may be unavailable.

Node recovered

NoSQLDBNodeRecovered

Major

If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported.

No action is required.

None

Primary/standby switchover or failover

NoSQLPrimaryStandbySwitched

Major

This event is reported when a primary/standby switchover is performed or a failover is triggered.

No action is required.

None

HotKey occurred

HotKeyOccurs

Major

The primary key is improperly configured, so hotspot data is concentrated in one partition. Improper application design also causes frequent read and write operations on a single key.

1. Choose a proper partition key.

2. Add service cache. The service application reads hotspot data from the cache first.

The service request success rate is affected, and the cluster performance and stability may also be affected.

BigKey occurred

BigKeyOccurs

Major

The primary key design is improper. The number of records or the amount of data in a single partition is too large, causing unbalanced node loads.

1. Choose a proper partition key.

2. Add a new partition key for hashing data.

As the data in the large partition increases, the cluster stability deteriorates.
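
One common remedy for both the HotKey and BigKey problems above is to split a hot or oversized partition by adding a hash-derived bucket to the partition key. A minimal Python sketch of the idea; the bucket count and key layout are assumptions for illustration, not a GaussDB NoSQL API.

# Hedged sketch: spread one logical key across N_BUCKETS partitions by
# hashing a per-record ID into a bucket suffix. All values are illustrative.
import hashlib

N_BUCKETS = 16  # assumption: size this from your partition targets

def partition_key(logical_key: str, record_id: str) -> tuple[str, int]:
    bucket = hashlib.md5(record_id.encode()).digest()[0] % N_BUCKETS
    return (logical_key, bucket)  # use both parts as the partition key

print(partition_key("popular_user", "order-10023"))  # e.g. ('popular_user', 3)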

Insufficient storage space

NoSQLRiskyDataDiskUsage

Major

The storage space is insufficient.

Scale up storage space. For details, see section "Scaling Up Storage Space" in the corresponding user guide.

The instance is set to read-only and data cannot be written to the instance.

Data disk expanded and being writable

NoSQLDataDiskUsageRecovered

Major

The capacity of a data disk has been expanded and the data disk becomes writable.

No operation is required.

None

Index creation failed

NoSQLCreateIndexFailed

Major

The service load exceeds what the instance specifications can handle. In this case, creating indexes consumes additional instance resources. As a result, responses are slow or requests even freeze, and index creation times out.

  • Select instance specifications that match the service load.
  • Create indexes during off-peak hours.
  • Create indexes in the background.
  • Select indexes as required.

The index fails to be created or is incomplete and is therefore invalid. Delete the index and create a new one.

Write speed decreased

NoSQLStallingOccurs

Major

The write speed is close to the maximum write capability allowed by the cluster scale and instance specifications. As a result, the flow control mechanism of the database is triggered, and requests may fail.

1. Adjust the cluster scale or node specifications based on the maximum write rate of services.

2. Measure the maximum write rate of services.

The success rate of service requests is affected.

Data write stopped

NoSQLStoppingOccurs

Major

Data is written too fast, reaching the maximum write capability allowed by the cluster scale and instance specifications. As a result, the flow control mechanism of the database is triggered, and requests may fail.

1. Adjust the cluster scale or node specifications based on the maximum write rate of services.

2. Measure the maximum write rate of services.

The success rate of service requests is affected.

Database restart failed

NoSQLRestartDBFailed

Major

The instance status is abnormal.

Submit a service ticket to the O&M personnel.

The DB instance status may be abnormal.

Restoration to new DB instance failed

NoSQLRestoreToNewInstanceFailed

Major

The underlying resources are insufficient.

Submit a service ticket so the O&M personnel can coordinate resources in the background and add new nodes.

Data cannot be restored to a new DB instance.

Restoration to existing DB instance failed

NoSQLRestoreToExistInstanceFailed

Major

The backup file fails to be downloaded or restored.

Submit a service ticket to the O&M personnel.

The current DB instance may be unavailable.

Backup file deletion failed

NoSQLDeleteBackupFailed

Major

The backup files fail to be deleted from OBS.

Delete the backup files again.

None

Failed to enable Show Original Log

NoSQLSwitchSlowlogPlainTextFailed

Major

The DB engine does not support this function.

Refer to the GaussDB NoSQL User Guide to ensure that the DB engine supports Show Original Log. Submit a service ticket to the O&M personnel.

None

EIP binding failed

NoSQLBindEipFailed

Major

The node status is abnormal, an EIP has been bound to the node, or the EIP to be bound is invalid.

Check whether the node is normal and whether the EIP is valid.

The DB instance cannot be accessed from the Internet.

EIP unbinding failed

NoSQLUnbindEipFailed

Major

The node status is abnormal or the EIP has been unbound from the node.

Check whether the node and EIP status are normal.

None

Parameter modification failed

NoSQLModifyParameterFailed

Major

The parameter value is invalid.

Check whether the parameter value is within the valid range and submit a service ticket to the O&M personnel.

None

Parameter group application failed

NoSQLApplyParameterGroupFailed

Major

The instance status is abnormal. As a result, the parameter group cannot be applied.

Submit a service ticket to the O&M personnel.

None

Failed to enable or disable SSL

NoSQLSwitchSSLFailed

Major

Enabling or disabling SSL times out.

Try again or submit a service ticket. Do not change the connection mode.

The connection mode cannot be changed.

Row size too large

LargeRowOccurs

Major

If there is too much data in a single row, queries may time out, causing faults such as out-of-memory (OOM) errors.

1. Control the length of each column and row so that the sum of key and value lengths in each row does not exceed the preset threshold.

2. Check whether there are invalid writes or encoding resulting in large keys or values.

If there are rows that are too large, the cluster performance will deteriorate as the data volume grows.
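
The first remedy above (capping the sum of key and value lengths per row) can be enforced at the service side before writing. A minimal sketch; the 1 MB threshold is an assumed example, not a documented GaussDB NoSQL limit.

# Hedged sketch: reject writes whose total key + value length exceeds an
# assumed per-row threshold.
MAX_ROW_BYTES = 1 * 1024 * 1024  # assumption: tune to your preset threshold

def row_size_ok(row: dict) -> bool:
    size = sum(len(k.encode()) + len(v) for k, v in row.items())
    return size <= MAX_ROW_BYTES

row = {"id": b"user-1", "payload": b"x" * 2048}
print(row_size_ok(row))  # True: well under the assumed threshold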

Schedule for deleting a KMS key

NoSQLplanDeleteKmsKey

Major

A request to schedule deletion of a KMS key was submitted.

After the KMS key is scheduled to be deleted, either decrypt the data encrypted by the KMS key in a timely manner or cancel the key deletion.

After the KMS key is deleted, users cannot encrypt disks.

Too many query tombstones

TooManyQueryTombstones

Major

If there are too many query tombstones, queries may time out, affecting query performance.

Select appropriate query and deletion methods, and avoid large range queries.

Queries may time out, affecting query performance.

Too large collection column

TooLargeCollectionColumn

Major

If there are too many elements in a collection column, queries to the column will fail.

  1. Limit the number of elements in a collection column.
  2. Check for abnormal writes or encoding on the service side.

Queries to the collection column will fail.

Table 11 GaussDB(for MySQL)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

GaussDB(for MySQL)

Incremental backup failure

TaurusIncrementalBackupInstanceFailed

Major

The network between the instance and the management plane (or the OBS) is disconnected, or the backup environment created for the instance is abnormal.

Submit a service ticket.

Backup jobs fail.

Read replica creation failure

addReadonlyNodesFailed

Major

The quota is insufficient or underlying resources are exhausted.

Check the read replica quota. Release resources and create read replicas again.

Read replicas fail to be created.

DB instance creation failure

createInstanceFailed

Major

The instance quota or underlying resources are insufficient.

Check the instance quota. Release resources and create instances again.

DB instances fail to be created.

Read replica promotion failure

activeStandBySwitchFailed

Major

The read replica fails to be promoted to the primary node due to network or server failures. The original primary node takes over services quickly.

Submit a service ticket.

The read replica fails to be promoted to the primary node.

Instance specifications change failure

flavorAlterationFailed

Major

The quota is insufficient or underlying resources are exhausted.

Submit a service ticket.

Instance specifications fail to be changed.

Faulty DB instance

TaurusInstanceRunningStatusAbnormal

Major

The instance process is faulty or the communications between the instance and the DFV storage are abnormal.

Submit a service ticket.

Services may be affected.

DB instance recovered

TaurusInstanceRunningStatusRecovered

Major

The instance is recovered.

Observe the service running status.

None

Faulty node

TaurusNodeRunningStatusAbnormal

Major

The node process is faulty or the communications between the node and the DFV storage are abnormal.

Observe the instance and service running statuses.

A read replica may be promoted to the primary node.

Node recovered

TaurusNodeRunningStatusRecovered

Major

The node is recovered.

Observe the service running status.

None

Read replica deletion failure

TaurusDeleteReadOnlyNodeFailed

Major

The communications between the management plane and the read replica are abnormal or the VM fails to be deleted from IaaS.

Submit a service ticket.

Read replicas fail to be deleted.

Password reset failure

TaurusResetInstancePasswordFailed

Major

The communications between the management plane and the instance are abnormal or the instance is abnormal.

Check the instance status and try again. If the fault persists, submit a service ticket.

Passwords fail to be reset for instances.

DB instance reboot failure

TaurusRestartInstanceFailed

Major

The network between the management plane and the instance is abnormal or the instance is abnormal.

Check the instance status and try again. If the fault persists, submit a service ticket.

Instances fail to be rebooted.

Restoration to new DB instance failure

TaurusRestoreToNewInstanceFailed

Major

The instance quota is insufficient, underlying resources are exhausted, or the data restoration logic is incorrect.

If the new instance fails to be created, check the instance quota, release resources, and try to restore to a new instance again. In other cases, submit a service ticket.

Backup data fails to be restored to new instances.

EIP binding failure

TaurusBindEIPToInstanceFailed

Major

The binding task fails.

Submit a service ticket.

EIPs fail to be bound to instances.

EIP unbinding failure

TaurusUnbindEIPFromInstanceFailed

Major

The unbinding task fails.

Submit a service ticket.

EIPs fail to be unbound from instances.

Parameter modification failure

TaurusUpdateInstanceParameterFailed

Major

The network between the management plane and the instance is abnormal or the instance is abnormal.

Check the instance status and try again. If the fault persists, submit a service ticket.

Instance parameters fail to be modified.

Parameter template application failure

TaurusApplyParameterGroupToInstanceFailed

Major

The network between the management plane and instances is abnormal or the instances are abnormal.

Check the instance status and try again. If the fault persists, submit a service ticket.

Parameter templates fail to be applied to instances.

Full backup failure

TaurusBackupInstanceFailed

Major

The network between the instance and the management plane (or the OBS) is disconnected, or the backup environment created for the instance is abnormal.

Submit a service ticket.

Backup jobs fail.

Primary/standby failover

TaurusActiveStandbySwitched

Major

When the network, physical machine, or database of the primary node is faulty, the system promotes a read replica to primary based on the failover priority to ensure service continuity.

  1. Check whether the service is running properly.
  2. Check whether an alarm is generated, indicating that the read replica failed to be promoted to primary.

During the failover, database connection is interrupted for a short period of time. After the failover is complete, you can reconnect to the database.

Database read-only

NodeReadonlyMode

Major

The database supports only query operations.

Submit a service ticket.

After the database becomes read-only, write operations cannot be processed.

Database read/write

NodeReadWriteMode

Major

The database supports both write and read operations.

Submit a service ticket.

None.

Instance DR switchover

DisasterSwitchOver

Major

If an instance is faulty and unavailable, a switchover is performed to ensure that the instance continues to provide services.

Contact technical support.

The database connection is intermittently interrupted. The HA service switches workloads from the primary node to a read replica and continues to provide services.

Database process restarted

TaurusDatabaseProcessRestarted

Major

The database process is stopped due to insufficient memory or high load.

Log in to the Cloud Eye console. Check whether the memory usage increases sharply or the CPU usage is too high for a long time. You can increase the specifications or optimize the service logic.

When the database process is suspended, workloads on the node are interrupted. In this case, the HA service automatically restarts the database process and attempts to recover the workloads.

Table 12 GaussDB

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

GaussDB

Process status alarm

ProcessStatusAlarm

Major

Key processes exit, including CMS/CMA, ETCD, GTM, CN, and DN processes.

Wait until the process is automatically recovered or a primary/standby failover is automatically performed. Check whether services are recovered. If not, contact SRE engineers.

If processes on primary nodes are faulty, services are interrupted and then rolled back. If processes on standby nodes are faulty, services are not affected.

Component status alarm

ComponentStatusAlarm

Major

Key components do not respond, including CMA, ETCD, GTM, CN, and DN components.

Wait until the process is automatically recovered or a primary/standby failover is automatically performed. Check whether services are recovered. If not, contact SRE engineers.

If processes on primary nodes do not respond, neither do the services. If processes on standby nodes are faulty, services are not affected.

Cluster status alarm

ClusterStatusAlarm

Major

The cluster status is abnormal. For example, the cluster is read-only, a majority of ETCDs are faulty, or cluster resources are unevenly distributed.

Contact SRE engineers.

If the cluster status is read-only, only read services are processed.

If a majority of ETCDs are faulty, the cluster is unavailable.

If resources are unevenly distributed, the instance performance and reliability deteriorate.

Hardware resource alarm

HardwareResourceAlarm

Major

A major hardware fault occurs in the instance, such as disk damage or GTM network fault.

Contact SRE engineers.

Some or all services are affected.

Status transition alarm

StateTransitionAlarm

Major

The following events occur in the instance: DN build failure, forcible DN promotion, primary/standby DN switchover/failover, or primary/standby GTM switchover/failover.

Wait until the fault is automatically rectified and check whether services are recovered. If not, contact SRE engineers.

Some services are interrupted.

Other abnormal alarm

OtherAbnormalAlarm

Major

The disk usage exceeded the threshold.

Focus on service changes and scale up storage space as needed.

If the used storage space exceeds the threshold, storage space cannot be scaled up.

Faulty DB instance

TaurusInstanceRunningStatusAbnormal

Major

This event is a key alarm event and is reported when an instance is faulty due to a disaster or a server failure.

Submit a service ticket.

The database service may be unavailable.

DB instance recovered

TaurusInstanceRunningStatusRecovered

Major

GaussDB(openGauss) provides an HA tool for automated or manual rectification of faults. After the fault is rectified, this event is reported.

No further action is required.

None

Faulty DB node

TaurusNodeRunningStatusAbnormal

Major

This event is a key alarm event and is reported when a database node is faulty due to a disaster or a server failure.

Check whether the database service is available and submit a service ticket.

The database service may be unavailable.

DB node recovered

TaurusNodeRunningStatusRecovered

Major

GaussDB(openGauss) provides an HA tool for automated or manual rectification of faults. After the fault is rectified, this event is reported.

No further action is required.

None

DB instance creation failure

GaussDBV5CreateInstanceFailed

Major

Instances fail to be created because the quota is insufficient or underlying resources are exhausted.

Release the instances that are no longer used and try to provision them again, or submit a service ticket to adjust the quota.

DB instances cannot be created.

Node adding failure

GaussDBV5ExpandClusterFailed

Major

The underlying resources are insufficient.

Submit a service ticket. The O&M personnel will coordinate resources in the background, and then you can delete the node that failed to be added and add a new one.

None

Storage scale-up failure

GaussDBV5EnlargeVolumeFailed

Major

The underlying resources are insufficient.

Submit a service ticket. The O&M personnel will coordinate resources in the background, and then you can scale up the storage space again.

Services may be interrupted.

Reboot failure

GaussDBV5RestartInstanceFailed

Major

The network is abnormal.

Retry the reboot operation or submit a service ticket to the O&M personnel.

The database service may be unavailable.

Full backup failure

GaussDBV5FullBackupFailed

Major

The backup files fail to be exported or uploaded.

Submit a service ticket to the O&M personnel.

Data cannot be backed up.

Differential backup failure

GaussDBV5DifferentialBackupFailed

Major

The backup files fail to be exported or uploaded.

Submit a service ticket to the O&M personnel.

Data cannot be backed up.

Backup deletion failure

GaussDBV5DeleteBackupFailed

Major

This function does not need to be implemented.

N/A

N/A

EIP binding failure

GaussDBV5BindEIPFailed

Major

The EIP is bound to another resource.

Submit a service ticket to the O&M personnel.

The instance cannot be accessed from the Internet.

EIP unbinding failure

GaussDBV5UnbindEIPFailed

Major

The network is faulty or the EIP is abnormal.

Unbind the IP address again or submit a service ticket to the O&M personnel.

IP addresses may be residual.

Parameter template application failure

GaussDBV5ApplyParamFailed

Major

Modifying a parameter template times out.

Modify the parameter template again.

None

Parameter modification failure

GaussDBV5UpdateInstanceParamGroupFailed

Major

Modifying a parameter template times out.

Modify the parameter template again.

None

Backup and restoration failure

GaussDBV5RestoreFromBcakupFailed

Major

The underlying resources are insufficient or backup files fail to be downloaded.

Submit a service ticket.

If the restoration fails, the database service may be unavailable.

Failed to upgrade the hot patch

GaussDBV5UpgradeHotfixFailed

Major

Generally, this fault is caused by an error reported during kernel upgrade.

View the error information about the workflow and redo or skip the job.

None

Table 13 Distributed Database Middleware (DDM)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

DDM

Failed to create a DDM instance

createDdmInstanceFailed

Major

The underlying resources are insufficient.

Release resources and create the instance again.

DDM instances cannot be created.

Failed to change class of a DDM instance

resizeFlavorFailed

Major

The underlying resources are insufficient.

Submit a service ticket to the O&M personnel to coordinate resources and try again.

Services on some nodes are interrupted.

Failed to scale out a DDM instance

enlargeNodeFailed

Major

The underlying resources are insufficient.

Submit a service ticket to the O&M personnel to coordinate resources, delete the node that fails to be added, and add a node again.

The instance fails to be scaled out.

Failed to scale in a DDM instance

reduceNodeFailed

Major

The underlying resources fail to be released.

Submit a service ticket to the O&M personnel to release resources.

The instance fails to be scaled in.

Failed to restart a DDM instance

restartInstanceFailed

Major

The associated DB instances are abnormal.

Check whether the associated DB instances are normal. If they are, submit a service ticket to the O&M personnel.

Services on some nodes are interrupted.

Failed to create a schema

createLogicDbFailed

Major

The possible causes are as follows:

  • The password for the DB instance account is incorrect.
  • The security groups of the DDM instance and the associated DB instance are incorrectly configured, so the DDM instance cannot communicate with the DB instance.

Check whether:

  • The username and password of the DB instance are correct.
  • The security groups associated with the DDM instance and the underlying DB instance are correctly configured (a connectivity-check sketch follows this row).

Services cannot run properly.
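When schema creation fails, a quick way to separate the two causes above is to test TCP reachability from a client in the same network as the DDM instance to the DB instance: if the TCP handshake fails, suspect security group rules; if it succeeds, suspect credentials. Below is a minimal sketch using only the Python standard library; the host and port are placeholders, not values from this document.

```python
import socket

# Placeholder values: replace with your DB instance's private address and port.
DB_HOST = "192.168.0.10"   # hypothetical DB instance IP
DB_PORT = 3306             # hypothetical port; adjust to your instance

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection can be established.

    A failure here usually points at security group or network ACL rules,
    not at credentials: an authentication error would only surface after
    the TCP handshake succeeds.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    if tcp_reachable(DB_HOST, DB_PORT):
        print("TCP reachable: check the DB account username/password next.")
    else:
        print("TCP unreachable: review the security groups of both instances.")
```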

Failed to bind an EIP

bindEipFailed

Major

The EIP is abnormal.

Try again later. In case of emergency, contact O&M personnel to rectify the fault.

The DDM instance cannot be accessed from the Internet.

Failed to scale out a schema

migrateLogicDbFailed

Major

The underlying resources fail to be processed.

Submit a service ticket to the O&M personnel.

The schema cannot be scaled out.

Failed to re-scale out a schema

retryMigrateLogicDbFailed

Major

The underlying resources fail to be processed.

Submit a service ticket to the O&M personnel.

The schema cannot be scaled out.

Table 14 Cloud Phone Server

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

CPH

Server shutdown

cphServerOsShutdown

Major

The cloud phone server was stopped

  • on the management console.
  • by calling APIs.

Deploy service applications in HA mode.

After the fault is rectified, check whether services recover.

Services are interrupted.

Server abnormal shutdown

cphServerShutdown

Major

The cloud phone server was stopped unexpectedly. Possible causes are as follows:

  • The cloud phone server was powered off unexpectedly.
  • The cloud phone server was stopped due to hardware faults.

Deploy service applications in HA mode.

After the fault is rectified, check whether services recover.

Services are interrupted.

Server reboot

cphServerOsReboot

Major

The cloud phone server was rebooted

  • on the management console.
  • by calling APIs.

Deploy service applications in HA mode.

After the fault is rectified, check whether services recover.

Services are interrupted.

Server abnormal reboot

cphServerReboot

Major

The cloud phone server was rebooted unexpectedly due to

  • OS faults.
  • hardware faults.

Deploy service applications in HA mode.

After the fault is rectified, check whether services recover.

Services are interrupted.

Network disconnection

cphServerlinkDown

Major

The network where the cloud phone server was deployed was disconnected. Possible causes are as follows:

  • The cloud phone server was stopped unexpectedly and rebooted.
  • The switch was faulty.
  • The gateway node was faulty.

Deploy service applications in HA mode.

After the fault is rectified, check whether services recover.

Services are interrupted.

PCIe error

cphServerPcieError

Major

The PCIe device or main board on the cloud phone server was faulty.

Deploy service applications in HA mode.

After the fault is rectified, check whether services recover.

The network or disk read/write is affected.

Disk error

cphServerDiskError

Major

The disk on the cloud phone server was faulty due to

  • disk backplane faults.
  • disk faults.

Deploy service applications in HA mode.

After the fault is rectified, check whether services recover.

Data read/write services are affected, or the BMS cannot be started.

Storage error

cphServerStorageError

Major

The cloud phone server could not connect to EVS disks. Possible causes are as follows:

  • The SDI card was faulty.
  • The remote storage device was faulty.

Deploy service applications in HA mode.

After the fault is rectified, check whether services recover.

Data read/write services are affected, or the BMS cannot be started.

GPU offline

cphServerGpuOffline

Major

The GPU of the cloud phone server was loose and became disconnected.

Stop the cloud phone server and reboot it.

Cloud phones whose GPUs are disconnected become faulty and cannot run properly even if they are restarted or reconfigured.

GPU timeout

cphServerGpuTimeOut

Major

The GPU of the cloud phone server timed out.

Reboot the cloud phone server.

Cloud phones whose GPUs timed out cannot run properly and are still faulty even if they are restarted or reconfigured.

Disk space full

cphServerDiskFull

Major

Disk space of the cloud phone server was used up.

Clear the application data in the cloud phone to release space (see the sketch after this row).

The cloud phone becomes sub-healthy, is prone to failures, and may fail to start.
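Where the cloud phones are reachable over ADB, per-application data can be cleared from a script. A minimal sketch, assuming the adb tool is installed and connected; the serial number and package name are hypothetical, not values from this document.

```python
import subprocess

# Hypothetical values: replace with your cloud phone's ADB serial and the
# package whose data you want to clear.
SERIAL = "127.0.0.1:5555"
PACKAGE = "com.example.app"

def clear_app_data(serial: str, package: str) -> None:
    """Run `adb shell pm clear <package>` to free that package's data space."""
    subprocess.run(
        ["adb", "-s", serial, "shell", "pm", "clear", package],
        check=True,
    )

if __name__ == "__main__":
    clear_app_data(SERIAL, PACKAGE)
```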

Disk readonly

cphServerDiskReadOnly

Major

The disk of the cloud phone server became read-only.

Reboot the cloud phone server.

The cloud phone becomes sub-healthy, is prone to failures, and may fail to start.

Cloud phone metadata damaged

cphPhoneMetaDataDamage

Major

Cloud phone metadata was damaged.

Contact O&M personnel.

The cloud phone cannot run properly even if it is restarted or reconfigured.

GPU failed

gpuAbnormal

Critical

The GPU was faulty.

Submit a service ticket.

Services are interrupted.

GPU recovered

gpuNormal

Informational

The GPU was running properly.

No further action is required.

N/A

Kernel crash

kernelCrash

Critical

The kernel log indicated a crash.

Submit a service ticket.

Services are interrupted during the crash.

Kernel OOM

kernelOom

Major

The kernel log indicated an out-of-memory (OOM) error.

Submit a service ticket.

Services are interrupted.

Hardware malfunction

hardwareError

Critical

The kernel log indicated a hardware error.

Submit a service ticket.

Services are interrupted.

PCIe error

pcieAer

Critical

The kernel log indicated a PCIe bus error.

Submit a service ticket.

Services are interrupted.

SCSI error

scsiError

Critical

The kernel log indicated a SCSI error.

Submit a service ticket.

Services are interrupted.

Image storage became read-only

partReadOnly

Critical

The image storage became read-only.

Submit a service ticket.

Services are interrupted.

Image storage superblock damaged

badSuperBlock

Critical

The superblock of the file system of the image storage was damaged.

Submit a service ticket.

Services are interrupted.

Image storage /.sharedpath/master became read-only

isuladMasterReadOnly

Critical

Mount point /.sharedpath/master of the image storage became read-only.

Submit a service ticket.

Services are interrupted.

Cloud phone data disk became read-only

cphDiskReadOnly

Critical

The cloud phone data disk became read-only.

Submit a service ticket.

Services are interrupted.

Cloud phone data disk superblock damaged

cphDiskBadSuperBlock

Critical

The superblock of the file system of the cloud phone data disk was damaged.

Submit a service ticket.

Services are interrupted.

Table 15 Layer 2 Connection Gateway (L2CG)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

L2CG

IP addresses conflicted

IPConflict

Major

A cloud server and an on-premises server that need to communicate use the same IP address.

Check the ARP and switch information to locate the servers that have the same IP address and change one of the IP addresses (a neighbor-table watch sketch follows this row).

The communications between the on-premises and cloud servers may be abnormal.
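A duplicate address often shows up as a single IP address alternating between two MAC addresses. The following is a minimal sketch that watches a Linux host's neighbor table for that pattern, assuming the iproute2 ip command is available; it is a diagnostic aid, not part of L2CG.

```python
import subprocess
import time

def neighbor_table() -> dict:
    """Parse `ip neigh show` into an {ip: mac} mapping (Linux, iproute2)."""
    out = subprocess.run(["ip", "neigh", "show"],
                         capture_output=True, text=True, check=True).stdout
    table = {}
    for line in out.splitlines():
        fields = line.split()
        if "lladdr" in fields:
            ip = fields[0]
            mac = fields[fields.index("lladdr") + 1]
            table[ip] = mac
    return table

if __name__ == "__main__":
    seen: dict = {}
    while True:
        for ip, mac in neighbor_table().items():
            if ip in seen and seen[ip] != mac:
                # The same IP answered from two MACs: a likely address conflict.
                print(f"Possible IP conflict: {ip} moved {seen[ip]} -> {mac}")
            seen[ip] = mac
        time.sleep(10)
```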

Table 16 Elastic IP and bandwidth

Event Source

Event Name

Event ID

Event Severity

Elastic IP and bandwidth

VPC deleted

deleteVpc

Major

VPC modified

modifyVpc

Minor

Subnet deleted

deleteSubnet

Minor

Subnet modified

modifySubnet

Minor

Bandwidth modified

modifyBandwidth

Minor

VPN deleted

deleteVpn

Major

VPN modified

modifyVpn

Minor

Table 17 Elastic Volume Service (EVS)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

EVS

Update disk

updateVolume

Minor

Update the name and description of an EVS disk.

No further action is required.

None

Expand disk

extendVolume

Minor

Expand an EVS disk.

No further action is required.

None

Delete disk

deleteVolume

Major

Delete an EVS disk.

No further action is required.

Deleted disks cannot be recovered.

QoS upper limit reached

reachQoS

Major

The I/O latency increases because the QoS upper limits of the disk are frequently reached and flow control is triggered.

Change the disk type to one with a higher specification (see the saturation check below).

The current disk may fail to meet service requirements.
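"Frequently reached" can be made concrete with a simple saturation rule: flag the disk when measured IOPS sits near its QoS ceiling for most of a sampling window. A self-contained sketch; the ceiling, thresholds, and sample series are illustrative assumptions, not values from this table.

```python
# Illustrative saturation check: the IOPS ceiling and samples are made up.
QOS_IOPS_LIMIT = 2000          # hypothetical QoS ceiling for the disk type
SATURATION_RATIO = 0.95        # treat >= 95% of the ceiling as "at the limit"

def is_saturated(iops_samples: list[float],
                 limit: float = QOS_IOPS_LIMIT,
                 ratio: float = SATURATION_RATIO,
                 min_fraction: float = 0.8) -> bool:
    """Return True if most samples in the window sit at the QoS ceiling.

    Sustained saturation, rather than a single spike, is what drives the
    latency increase described above and suggests moving to a disk type
    with a higher specification.
    """
    at_limit = sum(1 for s in iops_samples if s >= limit * ratio)
    return at_limit / len(iops_samples) >= min_fraction

if __name__ == "__main__":
    window = [1995, 1990, 2000, 1998, 1500, 1999, 2000, 1992, 1996, 1991]
    print("saturated" if is_saturated(window) else "headroom remains")
```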

Table 18 Identity and Access Management (IAM)

Event Source

Event Name

Event ID

Event Severity

IAM

Login

login

Minor

Logout

logout

Minor

Password changed

changePassword

Major

User created

createUser

Minor

User deleted

deleteUser

Major

User updated

updateUser

Minor

User group created

createUserGroup

Minor

User group deleted

deleteUserGroup

Major

User group updated

updateUserGroup

Minor

Identity provider created

createIdentityProvider

Minor

Identity provider deleted

deleteIdentityProvider

Major

Identity provider updated

updateIdentityProvider

Minor

Metadata updated

updateMetadata

Minor

Security policy updated

updateSecurityPolicies

Major

Credential added

addCredential

Major

Credential deleted

deleteCredential

Major

Project created

createProject

Minor

Project updated

updateProject

Minor

Project suspended

suspendProject

Major

Table 19 Key Management Service (KMS)

Event Source

Event Name

Event ID

Event Severity

KMS

Key disabled

disableKey

Major

Key deletion scheduled

scheduleKeyDeletion

Minor

Grant retired

retireGrant

Major

Grant revoked

revokeGrant

Major

Table 20 Object Storage Service (OBS)

Event Source

Event Name

Event ID

Event Severity

OBS

Bucket deleted

deleteBucket

Major

Bucket policy deleted

deleteBucketPolicy

Major

Bucket ACL configured

setBucketAcl

Minor

Bucket policy configured

setBucketPolicy

Minor

Table 21 Cloud Eye

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Cloud Eye

Agent heartbeat interruption

agentHeartbeatInterrupted

Major

The Agent sends a heartbeat message to Cloud Eye every minute. If Cloud Eye cannot receive a heartbeat for 3 minutes, Agent Status is displayed as Faulty.

  • Check whether the Agent domain name can be resolved (see the sketch after this list).
  • Check whether your account is in arrears.
  • The Agent process is faulty. Restart the Agent. If the Agent process is still faulty after the restart, the Agent files may be damaged. In this case, reinstall the Agent.
  • Check whether the server time is consistent with the local standard time.
  • If the DNS server is not a Huawei Cloud DNS server, run the dig command to obtain the IP address of agent.ces.myhuaweicloud.com resolved by the Huawei Cloud DNS server over the intranet, and then add the IP address to the corresponding hosts file.
  • Update the Agent to the latest version.
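The DNS check above can be scripted. A minimal sketch using only the Python standard library; it simply verifies that the Agent endpoint resolves from this server and is a diagnostic aid, not part of Cloud Eye.

```python
import socket

# The Cloud Eye Agent endpoint named in the checklist above.
AGENT_DOMAIN = "agent.ces.myhuaweicloud.com"

def domain_resolves(domain: str) -> bool:
    """Return True if this server can resolve the Agent domain name."""
    try:
        socket.gethostbyname(domain)
        return True
    except socket.gaierror:
        return False

if __name__ == "__main__":
    if domain_resolves(AGENT_DOMAIN):
        print(f"{AGENT_DOMAIN} resolves; check arrears, server time, "
              "and the Agent process next.")
    else:
        # Mirrors the manual fix above: resolve the domain via a Huawei Cloud
        # DNS server (for example, with dig) and add a hosts file entry.
        print(f"{AGENT_DOMAIN} does not resolve; fix DNS or add a hosts entry.")
```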

Agent back to normal

agentResumed

Informational

The Agent was back to normal.

No further action is required.

Agent faulty

agentFaulty

Major

The Agent was faulty and this status was reported to Cloud Eye.

The Agent process is faulty. Restart the Agent. If the Agent process is still faulty after the restart, the Agent files may be damaged. In this case, reinstall the Agent.

Update the Agent to the latest version.

Agent disconnected

agentDisconnected

Major

The Agent sends a heartbeat message to Cloud Eye every minute. If Cloud Eye cannot receive a heartbeat for 3 minutes, Agent Status is displayed as Faulty.

  • Check whether the Agent domain name can be resolved.
  • Check whether your account is in arrears.
  • The Agent process is faulty. Restart the Agent. If the Agent process is still faulty after the restart, the Agent files may be damaged. In this case, reinstall the Agent.
  • Check whether the server time is consistent with the local standard time.
  • If the DNS server is not a Huawei Cloud DNS server, run the dig command to obtain the IP address of agent.ces.myhuaweicloud.com resolved by the Huawei Cloud DNS server over the intranet, and then add the IP address to the corresponding hosts file.
  • Update the Agent to the latest version.

Table 22 DataSpace

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

DataSpace

New revision

newRevision

Minor

An updated version was released.

After receiving the notification, export the data of the updated version as required.

None

Table 23 Enterprise Switch

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

Enterprise Switch

IP addresses conflicted

IPConflict

Major

A cloud server and an on-premises server that need to communicate use the same IP address.

Check the ARP and switch information to locate the servers that have the same IP address and change the IP address.

The communications between the on-premises and cloud servers may be abnormal.

Table 24 Cloud Secret Management Service (CSMS)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

CSMS

Operation on secret scheduled for deletion

operateDeletedSecret

Major

A user attempts to perform operations on a secret that is scheduled to be deleted.

Check whether the scheduled secret deletion needs to be canceled.

The user cannot perform operations on the secret scheduled to be deleted.

Table 25 Distributed Cache Service (DCS)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

DCS

Full sync retry during online migration

migrationFullResync

Minor

If online migration fails, full synchronization will be triggered because incremental synchronization cannot be performed.

Check whether the source instance is connected and whether it is overloaded. If full sync retries are triggered repeatedly, contact O&M personnel.

The migration task is disconnected from the source instance, triggering another full sync. As a result, the CPU usage of the source instance may increase sharply.

Automatic failover

masterStandbyFailover

Minor

The master node was abnormal, promoting a replica to master.

Check whether services can recover by themselves. If applications cannot recover, restart them.

Persistent connections to the instance are interrupted.

Memcached master/standby switchover

memcachedMasterStandbyFailover

Minor

The master node was abnormal, promoting the standby node to master.

Check whether services can recover by themselves. If applications cannot recover, restart them.

Persistent connections to the instance are interrupted.

Redis server abnormal

redisNodeStatusAbnormal

Major

The Redis server status was abnormal.

Check whether services are affected. If yes, contact O&M personnel.

If the master node is abnormal, an automatic failover is performed. If a standby node is abnormal and the client directly connects to the standby node for read/write splitting, no data can be read.

Redis server recovered

redisNodeStatusNormal

Major

The Redis server status recovered.

Check whether services can recover. If the applications are not reconnected, restart them.

The instance has recovered from an exception.

Sync failure in data migration

migrateSyncDataFail

Major

Online migration failed.

Reconfigure the migration task and migrate data again. If the fault persists, contact O&M personnel.

Data migration fails.

Memcached instance abnormal

memcachedInstanceStatusAbnormal

Major

The Memcached node status was abnormal.

Check whether services are affected. If yes, contact O&M personnel.

The Memcached instance is abnormal and may not be accessed.

Memcached instance recovered

memcachedInstanceStatusNormal

Major

The Memcached node status recovered.

Check whether services can recover. If the applications are not reconnected, restart them.

The instance has recovered from an exception.

Instance backup failure

instanceBackupFailure

Major

The DCS instance fails to be backed up due to an OBS access failure.

Retry backup manually.

Automatic backup fails.

Instance node abnormal restart

instanceNodeAbnormalRestart

Major

DCS nodes restarted unexpectedly when they became faulty.

Check whether services can recover. If the applications are not reconnected, restart them.

Persistent connections to the instance are interrupted.

Long-running Lua scripts stopped

scriptsStopped

Informational

Lua scripts that had timed out automatically stopped running.

Optimize Lua scripts to prevent execution timeout (see the batching sketch below).

If Lua scripts take a long time to execute, they will be forcibly stopped to avoid blocking the entire instance.
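One common way to avoid script timeouts is to replace a single long-running EVAL (for example, one that deletes every key matching a pattern) with short, client-driven batches. A minimal sketch using the redis-py client; the address and key pattern are placeholders, not values from this document.

```python
import redis

# Placeholder connection details; replace with your DCS instance address.
r = redis.Redis(host="127.0.0.1", port=6379)

def delete_in_batches(pattern: str, batch_size: int = 500) -> int:
    """Delete keys matching `pattern` in small batches via SCAN.

    Each round trip is short, so no single operation runs long enough to
    trip the server-side script time limit or block other clients.
    """
    deleted = 0
    batch = []
    for key in r.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += r.delete(*batch)
            batch.clear()
    if batch:
        deleted += r.delete(*batch)
    return deleted

if __name__ == "__main__":
    print(delete_in_batches("session:*"))  # hypothetical key pattern
```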

Node restarted

nodeRestarted

Informational

After write operations had been performed, the node automatically restarted to stop Lua scripts that had timed out.

Check whether services can recover by themselves. If applications cannot recover, restart them.

Persistent connections to the instance are interrupted.

Table 26 Intelligent Cloud Access (ICA)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

ICA

BGP peer disconnection

BgpPeerDisconnection

Major

The BGP peer is disconnected.

Log in to the gateway and locate the cause.

Service traffic may be interrupted.

BGP peer connection success

BgpPeerConnectionSuccess

Major

The BGP peer is successfully connected.

None

None

Abnormal GRE tunnel status

AbnormalGreTunnelStatus

Major

The GRE tunnel status is abnormal.

Log in to the gateway and locate the cause.

Service traffic may be interrupted.

Normal GRE tunnel status

NormalGreTunnelStatus

Major

The GRE tunnel status is normal.

None

None

WAN interface goes up

EquipmentWanGoingOnline

Major

The WAN interface goes online.

None

None

WAN interface goes down

EquipmentWanGoingOffline

Major

The WAN interface goes offline.

Check whether the event is caused by a manual operation or device fault.

The device cannot be used.

Intelligent enterprise gateway going online

IntelligentEnterpriseGatewayGoingOnline

Major

The intelligent enterprise gateway goes online.

None

None

Intelligent enterprise gateway going offline

IntelligentEnterpriseGatewayGoingOffline

Major

The intelligent enterprise gateway goes offline.

Check whether the event is caused by a manual operation or device fault.

The device cannot be used.

Table 27 Multi-Site High Availability Service (MAS)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

MAS

Abnormal database instance

dbError

Major

MAS detected an abnormal database instance.

Log in to the MAS console to view the cause and rectify the fault.

Services are interrupted.

Database instance recovered

dbRecovery

Major

The database instance is recovered.

None

Services are restored.

Abnormal Redis instance

redisError

Major

MAS detected an abnormal Redis instance.

Log in to the MAS console to view the cause and rectify the fault.

Services are interrupted.

Redis instance recovered

redisRecovery

Major

The Redis instance is recovered.

None

Services are restored.

Abnormal MongoDB database

mongodbError

Major

MAS detected an abnormal MongoDB database.

Log in to the MAS console to view the cause and rectify the fault.

Services are interrupted.

MongoDB database recovered

mongodbRecovery

Major

The MongoDB database is recovered.

None

Services are restored.

Abnormal Elasticsearch instance

esError

Major

MAS detected an abnormal Elasticsearch instance.

Log in to the MAS console to view the cause and rectify the fault.

Services are interrupted.

Elasticsearch instance recovered

esRecovery

Major

The Elasticsearch instance is recovered.

None

Services are restored.

Abnormal API

apiError

Major

MAS detected an abnormal API.

Log in to the MAS console to view the cause and rectify the fault.

Services are interrupted.

API recovered

apiRecovery

Major

The API is recovered.

None

Services are restored.

Area status changed

netChange

Major

MAS detected area status changes.

Log in to the MAS console to view the cause and rectify the fault.

The network of the multi-active areas may change.

Table 28 Config

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

Config

Configuration noncompliance notification

configurationNoncomplianceNotification

Major

The assignment evaluation result is Non-compliant.

Modify the noncompliant configuration items of the resource.

None

Configuration compliance notification

configurationComplianceNotification

Informational

The assignment evaluation result changed to Compliant.

None

None

Table 29 SecMaster

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

SecMaster

Exclusive engine creation failed

createEngineFailed

Major

The underlying resources are insufficient.

Submit a ticket to request sufficient resources from the O&M personnel and try again.

The exclusive engine cannot be created.

Exclusive engine exception

engineException

Critical

The traffic is too heavy or there are malicious processes or plug-ins.

  1. Check the execution of plug-ins and processes to see whether they occupy too many resources.
  2. Check the instance monitoring information to see whether there is a sharp increase in the number of instances.

The instance cannot be executed.

Playbook instance execution failed

playbookInstanceExecFailed

Minor

Playbooks or processes are incorrectly configured.

Check the instance monitoring information to find the cause of the failure, and modify the playbook and process configuration.

None

Playbook instance increased sharply

playbookInstanceIncreaseSharply

Minor

Playbooks or processes are incorrectly configured.

Check the instance monitoring information to find the cause of the increase, and modify the playbook and process configuration.

None

Log messages increased sharply

logIncrease

Major

The upstream services suddenly generate a large number of log messages.

Check whether the upstream services are normal.

None

Log messages decreased sharply

logsDecrease

Major

Logs generated by the upstream services suddenly decrease.

Check whether the upstream services are normal.

None

Table 30 Key Pair Service

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

KPS

Key pair deleted

KPSDeleteKeypair

Informational

A key pair was deleted. This operation cannot be undone.

If this event occurred frequently within a short period of time, check whether malicious deletion took place.

Deleted key pairs cannot be restored.

Table 31 Host Security Service

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

HSS

HSS agent disconnected

hssAgentAbnormalOffline

Major

The communication between the agent and the server is abnormal, or the agent process on the server is abnormal.

Fix your network connection. If the agent is still offline for a long time after the network recovers, the agent process may be abnormal. In this case, log in to the server and restart the agent process.

Services are interrupted.

Abnormal HSS agent status

hssAgentAbnormalProtection

Major

The agent is abnormal, probably because it does not have sufficient resources.

Log in to the server and check your resources. If the usage of memory or other system resources is too high, increase their capacity first. If the resources are sufficient but the fault persists after the agent process is restarted, submit a service ticket to the O&M personnel.

Services are interrupted.

Table 32 Image Management Service

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

IMS

Create Image

createImage

Major

An image was created.

None

You can use this image to create cloud servers.

Update Image

updateImage

Major

Metadata of an image was modified.

None

Cloud servers may fail to be created from this image.

Delete Image

deleteImage

Major

An image was deleted.

None

This image will be unavailable on the management console.

Table 33 Cloud Storage Gateway (CSG)

Event Source

Event Name

Event ID

Event Severity

Description

CSG

Abnormal CSG process status

gatewayProcessStatusAbnormal

Major

This event is triggered when an exception occurs in the CSG process status.

Abnormal CSG connection status

gatewayToServiceConnectAbnormal

Major

This event is triggered when no CSG status report is returned for five consecutive periods.

Abnormal connection status between CSG and OBS

gatewayToObsConnectAbnormal

Major

This event is triggered when CSG cannot connect to OBS.

Read-only file system

gatewayFileSystemReadOnly

Major

This event is triggered when the partition file system on CSG becomes read-only.

Read-only file share

gatewayFileShareReadOnly

Major

This event is triggered when the file share becomes read-only due to insufficient cache disk storage space.

Table 34 Global Accelerator (GA)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

GA

Anycast IP address blocked

blockAIP

Critical

The used bandwidth of an EIP exceeded 5 Gbit/s, so the EIP was blocked and packets were discarded. Such an event may be caused by DDoS attacks.

Locate the root cause and rectify the fault.

Services are affected. The traffic will not be properly forwarded.

Anycast IP address unblocked

unblockAIP

Critical

The anycast IP address was unblocked.

Ensure that traffic can be properly forwarded.

None

Unhealthy endpoint

healthCheckError

Major

The health check detected that the endpoint was unhealthy.

Perform operations as described in What Should I Do If an Endpoint Is Unhealthy? If the endpoint is still unhealthy, submit a service ticket (a sample probe sketch follows this row).

If an endpoint is considered unhealthy, traffic will not be forwarded to it until the endpoint recovers.
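To distinguish an endpoint problem from a health-check configuration problem, it can help to run the same kind of probe yourself from a host that can reach the endpoint. A minimal TCP probe sketch; the address, interval, and failure threshold are assumptions for illustration, not GA's actual health-check parameters.

```python
import socket
import time

# Hypothetical endpoint under test: replace with your backend's address.
ENDPOINT = ("192.168.0.20", 80)

def probe(addr: tuple, timeout: float = 2.0) -> bool:
    """One TCP health probe: healthy if the connection completes in time."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Three consecutive failures as an illustrative "unhealthy" rule.
    failures = 0
    for _ in range(3):
        failures = failures + 1 if not probe(ENDPOINT) else 0
        time.sleep(2)
    print("unhealthy" if failures >= 3 else "healthy")
```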

Table 35 Enterprise connection

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

EC

WAN interface goes up

EquipmentWanGoesOnline

Major

The WAN interface goes online.

None

None

WAN interface goes down

EquipmentWanGoesOffline

Major

The WAN interface goes offline.

Check whether the event is caused by a manual operation or device fault.

The device cannot be used.

BGP peer disconnection

BgpPeerDisconnection

Major

The BGP peer is disconnected.

Check whether the event is caused by a manual operation or device fault.

The device cannot be used.

BGP peer connection success

BgpPeerConnectionSuccess

Major

The BGP peer is successfully connected.

None

None

Abnormal GRE tunnel status

AbnormalGreTunnelStatus

Major

The GRE tunnel status is abnormal.

Check whether the event is caused by a manual operation or device fault.

The device cannot be used.

Normal GRE tunnel status

NormalGreTunnelStatus

Major

The GRE tunnel status is normal.

None

None

Intelligent enterprise gateway going online

IntelligentEnterpriseGatewayGoesOnline

Major

The intelligent enterprise gateway goes online.

None

None

Intelligent enterprise gateway going offline

IntelligentEnterpriseGatewayGoesOffline

Major

The intelligent enterprise gateway goes offline.

Check whether the event is caused by a manual operation or device fault.

The device cannot be used.

Table 36 Cloud Certificate Manager (CCM)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

CCM

Certificate revocation

CCMRevokeCertificate

Major

The certificate enters the revocation process. Once revoked, the certificate cannot be used anymore.

Check whether the certificate revocation is really needed. Certificate revocation can be canceled.

If a certificate is revoked, the website is inaccessible using HTTPS.

Certificate auto-deployment failure

CCMAutoDeploymentFailure

Major

The certificate fails to be automatically deployed.

Check service resources whose certificates need to be replaced.

If no new certificate is deployed after a certificate expires, the website is inaccessible using HTTPS.

Certificate expiration

CCMCertificateExpiration

Major

An SSL certificate has expired.

Purchase a new certificate in a timely manner.

If no new certificate is deployed after a certificate expires, the website is inaccessible using HTTPS.

Certificate about to expire

CCMcertificateAboutToExpiration

Major

This alarm is generated when an SSL certificate is about to expire in one week, one month, or two months.

Renew or purchase a new certificate in a timely manner.

If no new certificate is deployed after a certificate expires, the website is inaccessible using HTTPS.