Alarm Configurations
CCE interworks with Application Operations Management (AOM) to report alarms and events. By setting alarm rules on AOM, you can promptly detect abnormal resources in your clusters.
Recommended Alarm Configurations
Table 1 lists the recommended alarm configurations.
| Resource Type | Monitoring Item | Description | Recommended Trigger Condition | Configuration Method |
|---|---|---|---|---|
| Cluster | Abnormal Node Status | This metric is used to check abnormal node status. | Immediate triggering (that is, an alarm is generated when a node is abnormal) | |
| Cluster | CPU Usage | This metric is used to calculate the CPU usage of the measured object. | Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3 | |
| Cluster | Disk Usage | This metric is used to calculate the percentage of the in-use disk space to the total disk space. | Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3 | |
| Cluster | Physical Memory Usage | This metric is used to calculate the percentage of the physical memory used by the measured object to the total physical memory. | Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3 | |
| Cluster | Virtual Memory Usage | This metric is used to calculate the percentage of the virtual memory used by the measured object to the total virtual memory. | Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3 | |
| Host | CPU Usage | This metric is used to calculate the CPU usage of the measured object. | Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3 | |
| Host | Physical Memory Size | This metric is used to calculate the available physical memory of the measured object. | Threshold condition: = 0; statistical period (minutes): 1; consecutive periods: 3 | |
| Host | Available Virtual Memory | This metric is used to calculate the available virtual memory of the measured object. | Threshold condition: = 0; statistical period (minutes): 1; consecutive periods: 3 | |
| Host | Physical Memory Usage | This metric is used to calculate the percentage of the physical memory used by the measured object to the total physical memory. | Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3 | |
| Host | Virtual Memory Usage | This metric is used to calculate the percentage of the virtual memory used by the measured object to the total virtual memory. | Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3 | |
| Host–network | Received Error Packet Rate | This metric is used to calculate the number of error packets received by an NIC per second. | Threshold condition: > 0; statistical period (minutes): 1; consecutive periods: 3 | |
| Host–network | Send Error Packet Rate | This metric is used to calculate the number of error packets sent by an NIC per second. | Threshold condition: > 0; statistical period (minutes): 1; consecutive periods: 3 | |
| Host–file system | Disk Usage | This metric is used to calculate the percentage of the in-use disk space to the total disk space. | Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3 | |
| Host–file system | Disk Read/Write Status | This metric is used to collect statistics on the read and write status of disks on a host. | Threshold condition: >= 1; statistical period (minutes): 1; consecutive periods: 1 | |
| Workload | Workload Status | This metric is used to check abnormal workload status. | Threshold condition: >= 1; statistical period (minutes): 1; consecutive periods: 1 | |
| Workload | CPU Usage | This metric is used to calculate the CPU usage of the measured object, namely, the ratio of the CPU cores actually used by the measured object to the total CPU cores that the measured object has applied for. | Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3 | |
| Workload | Physical Memory Usage | This metric is used to calculate the percentage of the physical memory used by the measured object to the total physical memory. | Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3 | |
| Workload | File System Usage | This metric is used to calculate the file system usage of a measured object, that is, the percentage of the used file system to the total file system. This metric is supported only for containers that use Device Mapper in Kubernetes clusters of version 1.11 or later. | Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3 | |
| Pod | CPU Usage | This metric is used to calculate the CPU usage of the measured object, namely, the ratio of the CPU cores actually used by the measured object to the total CPU cores that the measured object has applied for. | Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3 | |
| Pod | File System Usage | This metric is used to calculate the file system usage of a measured object, that is, the percentage of the used file system to the total file system. This metric is supported only for containers that use Device Mapper in Kubernetes clusters of version 1.11 or later. | Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3 | |
| Pod | Physical Memory Usage | This metric is used to calculate the percentage of the physical memory used by the measured object to the total physical memory. | Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3 | |
| Pod | Received Error Packet Rate | This metric is used to calculate the number of error packets received by an NIC per second. | Threshold condition: > 0; statistical period (minutes): 1; consecutive periods: 3 | |
| Pod | Error Packets Received | This metric is used to calculate the number of error packets received by a measured object. | Threshold condition: > 0; statistical period (minutes): 1; consecutive periods: 3 | |
| Pod | Send Error Packet Rate | This metric is used to calculate the number of error packets sent by an NIC per second. | Threshold condition: > 0; statistical period (minutes): 1; consecutive periods: 3 | |
| Pod | Container Status | This metric is used to check whether the status of the Docker container is normal. | Threshold condition: >= 1; statistical period (minutes): 1; consecutive periods: 1 | |
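The trigger conditions in the table combine a threshold, a statistical period, and a number of consecutive periods. The following Python sketch shows one plausible reading of such a condition: the alarm fires only when the aggregated value breaches the threshold in each of the most recent consecutive periods. The `ThresholdRule` class and its fields are hypothetical names used for illustration; they are not part of AOM, and the way AOM actually evaluates rules may differ.

```python
from dataclasses import dataclass
import operator

# Comparison operators that appear in the "Threshold condition" column.
_OPS = {">": operator.gt, ">=": operator.ge, "=": operator.eq}


@dataclass
class ThresholdRule:
    """Hypothetical model of a threshold trigger condition (not an AOM API)."""
    op: str                   # one of ">", ">=", "="
    threshold: float          # for example 85 for "> 85%"
    consecutive_periods: int  # for example 3

    def is_triggered(self, samples):
        """Return True if the last `consecutive_periods` samples all breach the threshold.

        `samples` holds one aggregated value per statistical period, oldest first.
        """
        n = self.consecutive_periods
        if len(samples) < n:
            return False
        compare = _OPS[self.op]
        return all(compare(value, self.threshold) for value in samples[-n:])


cpu_rule = ThresholdRule(op=">", threshold=85, consecutive_periods=3)
print(cpu_rule.is_triggered([70, 90, 92, 88]))  # True: the last three periods all exceed 85
print(cpu_rule.is_triggered([90, 92, 80, 88]))  # False: the 80 breaks the streak
```

Under this reading, requiring three consecutive periods filters out short spikes, while rules with one consecutive period (such as the status checks) react to the first abnormal sample.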
Creating a Topic on SMN
Simple Message Notification (SMN) pushes messages to subscribers through emails, SMS messages, and HTTP/HTTPS requests.
A topic is used to publish messages and subscribe to notifications. It serves as a message transmission channel between publishers and subscribers.
You need to create a topic and subscribe to it. For details, see Creating a Topic and Subscribing to a Topic.
Creating an Action Policy
AOM allows you to customize alarm action policies. You can create an alarm action policy to associate an SMN topic and a message template. You can also customize notification content by using a message template.
For details, see Creating Alarm Action Policies. When creating an action policy, select the topic that is created and subscribed to in Creating a Topic on SMN.
Adding Event Alarms
The following uses the NodeNotReady alarm as an example to describe how to add an event alarm.
This function is provided by AOM. For details about parameters, see Creating Event Alarm Rules.
- Log in to the AOM console.
- In the navigation pane, choose Alarm Center > Alarm Rules and click Add Alarm.
- Set an alarm rule.
  - Rule Type: Select Event alarms.
  - Alarm Source: Select CCE.
  - Trigger Object: Select NodeNotReady based on the event name. You can filter trigger objects by notification type, event name, alarm severity, custom attribute, namespace, and cluster name.
  - Triggering Mode: Select Immediate Triggering.
  - Alarm Mode: Select Direct Alarm Reporting.
  - Action Policy: Select the action policy created in Creating an Action Policy.
The meaning of the alarm rule is as follows:
If a node in the cluster becomes abnormal, CCE reports the NodeNotReady event to AOM. Based on the configured alarm rule, AOM triggers an alarm notification immediately when the NodeNotReady event occurs and notifies you through SMN based on the action policy.
Figure 1 Creating an event alarm
- Click Create Now.
If the new rule is displayed in the rule list, the rule is created successfully.
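The flow that this rule sets up (CCE reports an event, AOM matches it against the rule, and the action policy notifies subscribers through an SMN topic) can be illustrated with a small Python sketch. Everything here is a simplified, hypothetical model: `Topic`, `EventAlarmRule`, the event dictionary keys, and the topic name `cce-alarms` are assumptions made for illustration and do not correspond to real AOM or SMN APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Hypothetical stand-in for an SMN topic: fans a message out to subscribers."""
    name: str
    subscribers: list = field(default_factory=list)

    def publish(self, message):
        for deliver in self.subscribers:
            deliver(message)


@dataclass
class EventAlarmRule:
    """Hypothetical stand-in for an event alarm rule with immediate triggering."""
    alarm_source: str
    event_name: str
    topic: Topic  # the topic referenced by the action policy

    def handle(self, event):
        # Immediate triggering: a single matching event is enough to raise the alarm.
        if event["source"] == self.alarm_source and event["name"] == self.event_name:
            self.topic.publish(f"[{event['name']}] {event['resource']}")
            return True
        return False


inbox = []  # stands in for an email or SMS subscriber
topic = Topic("cce-alarms", subscribers=[inbox.append])
rule = EventAlarmRule(alarm_source="CCE", event_name="NodeNotReady", topic=topic)

rule.handle({"source": "CCE", "name": "NodeNotReady", "resource": "node-1"})
rule.handle({"source": "CCE", "name": "Rebooted", "resource": "node-2"})
print(inbox)  # ['[NodeNotReady] node-1']
```

Only the NodeNotReady event produces a notification; the Rebooted event is ignored because no rule was created for it.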

Event alarms are generated from the events that CCE reports to AOM. You can view the available events in the Alarm Rule Settings area and add event alarms as required.
The following events are supported:
- FailedCreate: Creation failed.
- FailedUpdate: Update failed.
- FailedRollback: Rollback failed.
- FailedDelete: Deletion failed.
- FailedScaleIn: Scale-in failed.
- FailedScaleOut: Scale-out failed.
- FailedStandBy: Failed to enter the standby state.
- FailedActive: Activation failed.
- FailedRestart: Restart failed.
- FailedStart: Startup failed.
- BackOffStart: Start retry failed.
- FailedReconfig: Failed to update the configurations.
- Unhealthy: Abnormal status.
- FailedScheduling: Scheduling failed.
- FailedPullImage: Image pull failed.
- BackOffPullImage: Image pull retry failed.
- ErrImageNeverPull: Image not pulled.
- NodeNotReady: Abnormal node.
- NodeHasInsufficientMemory: Insufficient node memory.
- NodeHasDiskPressure: Insufficient node disk space.
- NodeOutOfDisk: Full node disk space.
- NodeNotSchedulable: Node not schedulable.
- NetworkCardNotFound: NIC not found.
- CNIIsDown: Faulty node CNI plug-in.
- DOCKERIsDown: Abnormal Docker on the node.
- KUBELETIsDown: Abnormal kubelet on the node.
- KUBEPROXYIsDown: Abnormal kube-proxy on the node.
- NTPIsDown: Abnormal NTP service on the node.
- NodeCreateFailed: Node creation failed.
- NodeInstallFailed: Node registration failed.
- Rebooted: Node rebooted.
- CIDRNotAvailable: CIDR unavailable.
- CIDRAssignmentFailed: CIDR allocation failed.
- ConntrackFull: Full connection tracking table of the node.
- OOMKilling: Container or pod terminated because it used more memory than allowed.
- TaskHung: Node task suspended.
- UnregisterNetDevice: Unregistered network devices detected on the node.
- KernelOops: Faulty node OS kernel.
- AUFSUmountHung: Node disk unmounting suspended.
- DockerHung: Docker on the node hung.
- FilesystemIsReadOnly: Read-only node file system.
- NodeUninstallFailed: Node uninstall failed.
- SelectingAll: Abnormal label selector.
- DeploymentRollbackRevisionNotFound: Deployment rollback version not found.
- ReplicaSetCreateError: ReplicaSet creation failed.
- SelectorOverlap: Label selectors conflicted.
- CreatingLoadBalancerFailed: Load balancer creation failed.
- UpdateLoadBalancerFailed: Load balancer update failed.
- DeletingLoadBalancerFailed: Load balancer deletion failed.
- VolumeFailedRecycle: Data volume reclamation failed.
- VolumeFailedDelete: Data volume deletion failed.
- VolumeUnknownReclaimPolicy: Volume reclamation policy unknown.
- ClaimLost: PVC lost.
- ClaimMisbound: Volume incorrectly bound.
- ProvisioningFailed: Volume creation failed.
- ProvisioningCleanupFailed: Volume cleanup failed.
- FailedGet: Query failed.
- FailedList: Pod list query failed.
- UnexpectedJob: Unknown job.
- TooManyActivePods: Too many active pods.
- TooManySucceededPods: Too many successful pods.
- NotTriggerScaleUp: Node scale-out not triggered.
- TriggerScaleUp: Node scale-out triggered.
- ScaleDownFailed: Node scale-in failed.
- ScaleDown: Node scale-in executed.
- FailedToScaleUpGroup: Nodes failed to be added to a node pool.
- StartScaledUpGroup: Nodes are being added to a node pool.
- ScaledUpGroup: Nodes are successfully added to a node pool.
- ScaleUpTimedOut: Node scale-out timed out.
- ScaleUpFailed: Node scale-out failed.
- NodeGroupInBackOff: Node pool in backoff state.
- ScaleDownEmpty: Idle nodes deleted.
- StartScaleDownEmpty: Idle node deletion started.
- DeleteUnregistered: Unregistered node deleted.
- DeleteUnregisteredFailed: Deletion of unregistered node failed.
- FixNodeGroupSizeError: Failed to restore the number of nodes in a node pool.
- NodePoolSoldOut: Node pool sold out.
- FixNodeGroupSizeDone: Number of nodes in a node pool restored.
- NodePoolAvailable: Node pool resources sufficient.
- VolumeResizeFailed: Data volume capacity expansion failed.
- AttachVolumeFailed: Failed to mount the block storage to the host.
- DetachVolumeFailed: Failed to unmount the block storage from the host.
- WaitForAttachVolumeFailed: Failed to wait for the host to mount the block storage.
- MountDeviceFailed: Device mount failed.
- UnmountDeviceFailed: Device unmount failed.
- SetUpAtVolumeFailed: Data volume mount failed.
- TearDownAtVolumeFailed: Data volume unmount failed.
- DeleteNodeWithNoServer: Discarded nodes deleted.
Adding Threshold Alarms
The following uses the Workload CPU Usage alarm as an example to describe how to add a threshold-based alarm. You can also use this method to add other threshold alarms.
This function is provided by AOM. For details, see Customizing Static Threshold Rules.
- Log in to the AOM console.
- In the navigation pane, choose Alarm Center > Alarm Rules and click Add Alarm.
- Set an alarm rule.
  - Rule Type: Select Threshold Rule.
  - Monitored Object: Click Select resource objects, set Add By to Dimension, and select CCE/Deployment/CPU Usage for Metric Name. You can filter resources by multiple dimensions as required.
  - Alarm Condition: Set parameters such as the statistical period, number of consecutive periods, and threshold condition as required.
  - Triggering Mode: Select Immediate Triggering.
  - Alarm Mode: Select Direct Alarm Reporting.
  - Action Policy: Select the action policy created in Creating an Action Policy.
- Click Create Now.
If the new rule is displayed in the rule list, the rule is created successfully. In this example, no workload is specified in the filter criteria, so the rule applies to all workloads in the cluster and all of them are displayed.
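The effect of the dimension filter can be sketched in Python as follows. This is only an illustration of the matching idea, assuming that a rule applies to every resource whose dimensions equal all of the specified filter values; the `matches` function, the dimension keys, and the sample workloads are hypothetical and are not taken from AOM.

```python
def matches(filters, workload):
    """Return True if the workload satisfies every dimension in `filters`.

    An empty filter dict matches every workload.
    """
    return all(workload.get(key) == value for key, value in filters.items())


# Hypothetical workloads described by their dimensions.
workloads = [
    {"cluster": "prod", "namespace": "default", "name": "web"},
    {"cluster": "prod", "namespace": "default", "name": "api"},
    {"cluster": "prod", "namespace": "kube-system", "name": "coredns"},
]

# No workload dimension specified: the rule covers all workloads.
print([w["name"] for w in workloads if matches({}, w)])  # ['web', 'api', 'coredns']

# Narrowing by namespace and workload name selects a single workload.
only_web = {"namespace": "default", "name": "web"}
print([w["name"] for w in workloads if matches(only_web, w)])  # ['web']
```

To limit the alarm to specific workloads, add the corresponding dimensions when selecting the monitored object.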
