
Alarm Configurations

CCE works with Application Operations Management (AOM) to report alarms and events. By setting alarm rules on AOM, you can promptly detect when resources in a cluster become abnormal.

Recommended Alarm Configurations

Table 1 lists the recommended alarm configurations.

**Table 1** Recommended alarm configurations
Resource Type	Monitoring Item	Description	Recommended Trigger Condition	Configuration Method
Cluster	Abnormal Node Status	This metric is used to check abnormal node status.	Immediate triggering (that is, an alarm is generated when a node is abnormal)	Adding Event Alarms
	CPU Usage	This metric is used to calculate the CPU usage of the measured object.	Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3	Adding Threshold Alarms
	Disk Usage	This metric is used to calculate the percentage of the in-use disk space to the total disk space.	Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3
	Physical Memory Usage	This metric is used to calculate the percentage of the physical memory used by the measured object to the total physical memory.	Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3
	Virtual Memory Usage	This metric is used to calculate the percentage of the virtual memory used by the measured object to the total virtual memory.	Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3
Host	CPU Usage	This metric is used to calculate the CPU usage of the measured object.	Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3
	Physical Memory Size	This metric is used to calculate the available physical memory of the measured object.	Threshold condition: = 0; statistical period (minutes): 1; consecutive periods: 3
	Available Virtual Memory	This metric is used to calculate the available virtual memory of the measured object.	Threshold condition: = 0; statistical period (minutes): 1; consecutive periods: 3
	Physical Memory Usage	This metric is used to calculate the percentage of the physical memory used by the measured object to the total physical memory.	Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3
	Virtual Memory Usage	This metric is used to calculate the percentage of the virtual memory used by the measured object to the total virtual memory.	Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3
Host–network	Received Error Packet Rate	This metric is used to calculate the number of error packets received by an NIC per second.	Threshold condition: > 0; statistical period (minutes): 1; consecutive periods: 3
Host–network	Send Error Packet Rate	This metric is used to calculate the number of error packets sent by an NIC per second.	Threshold condition: > 0; statistical period (minutes): 1; consecutive periods: 3
Host–file system	Disk Usage	This metric is used to calculate the percentage of the in-use disk space to the total disk space.	Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3
Host–file system	Disk read/write status	This metric is used to collect statistics on the read and write status of disks on a host.	Threshold condition: >= 1; statistical period (minutes): 1; consecutive periods: 1
Workload	Workload Status	This metric is used to check abnormal workload status.	Threshold condition: >= 1; statistical period (minutes): 1; consecutive periods: 1
	CPU Usage	This metric is used to calculate the CPU usage of the measured object, namely, the ratio of the CPU cores actually used by the measured object to the total CPU cores that the measured object has applied for.	Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3
	Physical Memory Usage	This metric is used to calculate the percentage of the physical memory used by the measured object to the total physical memory.	Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3
	File System Usage	This metric is used to calculate the file system usage of a measured object, that is, the percentage of the used file system to the total file system. This metric is supported only for the containers using the device mapper in the Kubernetes cluster of version 1.11 or later.	Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3
Pod	CPU Usage	This metric is used to calculate the CPU usage of the measured object, namely, the ratio of the CPU cores actually used by the measured object to the total CPU cores that the measured object has applied for.	Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3
	File System Usage	This metric is used to calculate the file system usage of a measured object, that is, the percentage of the used file system to the total file system. This metric is supported only for the containers using the device mapper in the Kubernetes cluster of version 1.11 or later.	Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3
	Physical Memory Usage	This metric is used to calculate the percentage of the physical memory used by the measured object to the total physical memory.	Threshold condition: > 85%; statistical period (minutes): 1; consecutive periods: 3
	Received Error Packet Rate	This metric is used to calculate the number of error packets received by an NIC per second.	Threshold condition: > 0; statistical period (minutes): 1; consecutive periods: 3
	Error Packets Received	This metric is used to calculate the number of error packets received by a measured object.	Threshold condition: > 0; statistical period (minutes): 1; consecutive periods: 3
	Send Error Packet Rate	This metric is used to calculate the number of error packets sent by an NIC per second.	Threshold condition: > 0; statistical period (minutes): 1; consecutive periods: 3
	Container Status	This metric is used to check whether the status of the Docker container is normal.	Threshold condition: >= 1; statistical period (minutes): 1; consecutive periods: 1
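
The trigger conditions in Table 1 combine a threshold, a statistical period, and a number of consecutive periods. The following Python sketch shows one plausible reading of such a condition: an alarm is raised only when the most recent N per-period values all satisfy the threshold condition. The function name, the operator table, and the list-of-values input are illustrative assumptions, not AOM's actual evaluation logic or API.

```python
import operator

# Comparison operators that appear in the trigger conditions of Table 1.
OPERATORS = {">": operator.gt, ">=": operator.ge, "=": operator.eq}


def should_alarm(period_values, op, threshold, consecutive_periods):
    """Return True if the latest `consecutive_periods` values all meet the condition.

    period_values: one aggregated metric value per statistical period, oldest first.
    This is a hypothetical model of the rule, not AOM's implementation.
    """
    if consecutive_periods < 1:
        raise ValueError("consecutive_periods must be at least 1")
    if len(period_values) < consecutive_periods:
        return False  # not enough periods collected yet
    compare = OPERATORS[op]
    recent = period_values[-consecutive_periods:]
    return all(compare(value, threshold) for value in recent)


# CPU usage (%) with one value per 1-minute statistical period.
cpu_usage = [70.0, 88.0, 90.5, 86.1]
print(should_alarm(cpu_usage, ">", 85, 3))           # last three periods all exceed 85
print(should_alarm([90.0, 80.0, 91.0], ">", 85, 3))  # one period dipped below 85
```

Under this reading, requiring three consecutive periods filters out short spikes, whereas the status metrics in Table 1 use a single period so that one abnormal reading is enough.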

Creating a Topic on SMN

Simple Message Notification (SMN) pushes messages to subscribers through emails, SMS messages, and HTTP/HTTPS requests.

A topic is used to publish messages and subscribe to notifications. It serves as a message transmission channel between publishers and subscribers.

You need to create a topic and subscribe to it. For details, see Creating a Topic and Subscribing to a Topic.
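
To illustrate the role of a topic as a channel between publishers and subscribers, here is a minimal in-memory sketch in Python. The `Topic` class, its methods, and the topic name are hypothetical and exist only for illustration; this is not the SMN API, and real subscriptions are created on the SMN console as described in the linked topics.

```python
class Topic:
    """Hypothetical in-memory stand-in for a message topic (not the SMN API)."""

    def __init__(self, name):
        self.name = name
        self._subscribers = []

    def subscribe(self, callback):
        """Register a callable that receives every message published to the topic."""
        self._subscribers.append(callback)

    def publish(self, message):
        """Deliver the message to all subscribers and return how many received it."""
        for callback in self._subscribers:
            callback(message)
        return len(self._subscribers)


inbox_email = []
inbox_sms = []

topic = Topic("cce-alarms")
topic.subscribe(inbox_email.append)  # stands in for an email subscription
topic.subscribe(inbox_sms.append)    # stands in for an SMS subscription

delivered = topic.publish("NodeNotReady on node-1")
print(delivered, inbox_email, inbox_sms)
```

The publisher only needs to know the topic, not the individual subscribers, which is why the action policy later refers to a topic rather than to specific recipients.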

Creating an Action Policy

AOM allows you to customize alarm action policies. You can create an alarm action policy to associate an SMN topic and a message template. You can also customize notification content by using a message template.

For details, see Creating Alarm Action Policies. When creating an action policy, select the topic that you created and subscribed to in Creating a Topic on SMN.

Adding Event Alarms

The following uses the NodeNotReady alarm as an example to describe how to add an event alarm.

This function is provided by AOM. For details about parameters, see Creating Event Alarm Rules.

Log in to the AOM console.
In the navigation pane, choose Alarm Center > Alarm Rules and click Add Alarm.
Set an alarm rule.
- Rule Type: Select Event alarms.
- Alarm Source: Select CCE.
- Trigger Object: Select NodeNotReady based on the event name. You can filter trigger objects by notification type, event name, alarm severity, custom attribute, namespace, and cluster name.
- Triggering Mode: Select Immediate Triggering.
- Alarm Mode: Select Direct Alarm Reporting.
- Action Policy: Select the action policy created in Creating an Action Policy.
The meaning of the alarm rule is as follows:

If a node in the cluster becomes abnormal, CCE reports the NodeNotReady event to AOM. Based on the configured alarm rule, AOM triggers an alarm notification immediately when the NodeNotReady event occurs and notifies you through SMN based on the action policy.

Figure 1 Creating an event alarm
Click Create Now.

If the new rule is displayed in the rule list, the rule has been created successfully.
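
The behavior of the NodeNotReady rule described above can be modeled with a short Python sketch. The `EventAlarmRule` class, the event dictionary fields, and the policy name `notify-ops-team` are hypothetical names chosen for illustration; they do not reflect AOM's data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventAlarmRule:
    """Hypothetical model of an event alarm rule with immediate triggering."""
    alarm_source: str
    event_name: str
    action_policy: str


def notifications_for(event, rules):
    """Return the action policies to invoke for one reported event.

    With immediate triggering, a single matching event is enough to notify.
    """
    return [
        rule.action_policy
        for rule in rules
        if rule.alarm_source == event["source"] and rule.event_name == event["name"]
    ]


rules = [EventAlarmRule("CCE", "NodeNotReady", "notify-ops-team")]

node_event = {"source": "CCE", "name": "NodeNotReady", "cluster": "demo-cluster"}
other_event = {"source": "CCE", "name": "Rebooted", "cluster": "demo-cluster"}

print(notifications_for(node_event, rules))   # the NodeNotReady rule matches
print(notifications_for(other_event, rules))  # no rule is configured for this event
```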

Event alarms are generated from the events that CCE reports to AOM. You can view the specific events in the Alarm Rule Settings area and add event alarms as required.

Figure 2 Events reported by CCE

The following events are supported:

FailedCreate: Creation failed.
FailedUpdate: Update failed.
FailedRollback: Rollback failed.
FailedDelete: Deletion failed.
FailedScaleIn: Scale-in failed.
FailedScaleOut: Scale-out failed.
FailedStandBy: Failed to enter the standby state.
FailedActive: Activation failed.
FailedRestart: Restart failed.
FailedStart: Startup failed.
BackOffStart: Start retry failed.
FailedReconfig: Failed to update the configurations.
Unhealthy: Abnormal status.
FailedScheduling: Scheduling failed.
FailedPullImage: Image pull failed.
BackOffPullImage: Image pull retry failed.
ErrImageNeverPull: Image not pulled.
NodeNotReady: Abnormal node.
NodeHasInsufficientMemory: Insufficient node memory.
NodeHasDiskPressure: Insufficient node disk space.
NodeOutOfDisk: Full node disk space.
NodeNotSchedulable: Node not schedulable.
NetworkCardNotFound: NIC not found.
CNIIsDown: Faulty node CNI plug-in.
DOCKERIsDown: Abnormal Docker on the node.
KUBELETIsDown: Abnormal kubelet on the node.
KUBEPROXYIsDown: Abnormal kube-proxy on the node.
NTPIsDown: Abnormal NTP service on the node.
NodeCreateFailed: Node creation failed.
NodeInstallFailed: Node registration failed.
Rebooted: Node rebooted.
CIDRNotAvailable: CIDR unavailable.
CIDRAssignmentFailed: CIDR allocation failed.
ConntrackFull: Full connection tracking table of the node.
OOMKilling: Container or pod terminated because it used more memory than allowed.
TaskHung: Node task suspended.
UnregisterNetDevice: Unregistered network devices detected on the node.
KernelOops: Faulty node OS kernel.
AUFSUmountHung: Node disk unmounting suspended.
DockerHung: Docker on the node hung.
FilesystemIsReadOnly: Read-only node file system.
NodeUninstallFailed: Node uninstall failed.
SelectingAll: Abnormal label selector.
DeploymentRollbackRevisionNotFound: Deployment rollback version not found.
ReplicaSetCreateError: ReplicaSet creation failed.
SelectorOverlap: Label selectors conflicted.
CreatingLoadBalancerFailed: Load balancer creation failed.
UpdateLoadBalancerFailed: Load balancer update failed.
DeletingLoadBalancerFailed: Load balancer deletion failed.
VolumeFailedRecycle: Data volume reclamation failed.
VolumeFailedDelete: Data volume deletion failed.
VolumeUnknownReclaimPolicy: Volume reclamation policy unknown.
ClaimLost: PVC lost.
ClaimMisbound: Volume incorrectly bound.
ProvisioningFailed: Volume creation failed.
ProvisioningCleanupFailed: Volume cleanup failed.
FailedGet: Query failed.
FailedList: Pod list query failed.
UnexpectedJob: Unknown job.
TooManyActivePods: Too many active pods.
TooManySucceededPods: Too many successful pods.
NotTriggerScaleUp: Node scale-out not triggered.
TriggerScaleUp: Node scale-out triggered.
ScaleDownFailed: Node scale-in failed.
ScaleDown: Node scale-in executed.
FailedToScaleUpGroup: Nodes failed to be added to a node pool.
StartScaledUpGroup: Nodes are being added to a node pool.
ScaledUpGroup: Nodes are successfully added to a node pool.
ScaleUpTimedOut: Node scale-out timed out.
ScaleUpFailed: Node scale-out failed.
NodeGroupInBackOff: Node pool in backoff state.
ScaleDownEmpty: Idle nodes deleted.
StartScaleDownEmpty: Idle node deletion started.
DeleteUnregistered: Unregistered node deleted.
DeleteUnregisteredFailed: Deletion of unregistered node failed.
FixNodeGroupSizeError: Failed to restore the number of nodes in a node pool.
NodePoolSoldOut: Node pool sold out.
FixNodeGroupSizeDone: Number of nodes in a node pool restored.
NodePoolAvailable: Node pool resources sufficient.
VolumeResizeFailed: Data volume capacity expansion failed.
AttachVolumeFailed: Failed to mount the block storage to the host.
DetachVolumeFailed: Failed to unmount the block storage from the host.
WaitForAttachVolumeFailed: Failed to wait for the host to mount the block storage.
MountDeviceFailed: Drive letter mount failed.
UnmountDeviceFailed: Drive letter unmount failed.
SetUpAtVolumeFailed: Data volume mount failed.
TearDownAtVolumeFailed: Data volume unmount failed.
DeleteNodeWithNoServer: Discarded nodes deleted.

Adding Threshold Alarms

The following uses the Workload CPU Usage alarm as an example to describe how to add a threshold-based alarm. You can also use this method to add other threshold alarms.

This function is provided by AOM. For details, see Customizing Static Threshold Rules.

Log in to the AOM console.
In the navigation pane, choose Alarm Center > Alarm Rules and click Add Alarm.
Set an alarm rule.
- Rule Type: Select Threshold Rule.
- Monitored Object: Click Select resource objects, set Add By to Dimension, and select CCE/Deployment/CPU Usage for Metric Name. You can filter resources by multiple dimensions as required.
- Alarm Condition: Set parameters such as the statistical period, consecutive periods, and threshold condition as required.
- Triggering Mode: Select Immediate Triggering.
- Alarm Mode: Select Direct Alarm Reporting.
- Action Policy: Select the action policy created in Creating an Action Policy.
Click Create Now.

If the new rule is displayed in the rule list, the rule has been created successfully. In this example, no workload is specified in the filter criteria, so the rule applies to all workloads in the cluster and all of them are displayed.
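
The effect of the filter criteria can be sketched in Python as follows. The `select_workloads` function and the sample workload records are hypothetical; they only illustrate why leaving the filter empty selects every workload, and they are not part of AOM.

```python
def select_workloads(workloads, **filters):
    """Return the workloads whose dimensions match every given filter.

    With no filters, every workload is selected, as in the example above.
    """
    return [
        w for w in workloads
        if all(w.get(key) == value for key, value in filters.items())
    ]


# Hypothetical workloads in one cluster.
workloads = [
    {"name": "frontend", "namespace": "default", "cluster": "demo-cluster"},
    {"name": "backend", "namespace": "default", "cluster": "demo-cluster"},
    {"name": "metrics", "namespace": "monitoring", "cluster": "demo-cluster"},
]

print(len(select_workloads(workloads)))                     # no filter: all workloads
print(select_workloads(workloads, namespace="monitoring"))  # one dimension narrows the set
```

Adding a dimension such as the namespace or workload name narrows the rule to the resources you actually want to monitor.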
