Configuring Alarms in Alarm Center

Alarm Center uses Application Operations Management (AOM) to promptly detect cluster faults and generate alarms, helping keep services stable. It provides built-in alarm rules, so you do not need to configure alarm rules manually on AOM. These rules are based on the cluster O&M experience of the Huawei Cloud container team. They cover container service exceptions, key metrics of basic cluster resources, and metrics of applications in a cluster, which meets routine O&M requirements.

Constraints

The cluster version must be v1.17 or later.
Only Huawei Cloud accounts, HUAWEI IDs, and IAM users with the CCE Administrator or CCE FullAccess permission can perform all operations in Alarm Center. IAM users with the CCE ReadOnlyAccess permission can only view resources.

Enabling Alarm Center

Alarm Center can be enabled for CCE standard clusters and CCE Turbo clusters.

Click the cluster name to access the cluster console. In the navigation pane, choose Alarm Center.
On the Alarm Rules tab, click Enable Alarm Center. In the window that slides out from the right, select one or more contact groups to manage subscription endpoints and receive alarm messages by group. If no contact group is available, create one by referring to Configuring Alarm Notification Recipients.
Click OK.

Metric alarm rules can be created in Alarm Center only after the Cloud Native Cluster Monitoring add-on is installed and the AOM Prometheus instance is interconnected. For details about how to enable Monitoring Center, see Enabling Cluster Monitoring.

The alarm rules in Table 1 that use the problem_gauge metric depend on the CCE Node Problem Detector add-on. To use these alarm rules, ensure that the add-on has been installed and is running normally.

Event alarms in Table 1 can be reported only when Kubernetes event collection is enabled in Logging. For details, see Collecting Kubernetes Events.
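
The three prerequisites above can be summarized as a small lookup. The following is a minimal sketch, not part of any CCE or AOM API: the AlarmRule class and the required_prerequisites function are hypothetical names introduced only for illustration, and the Dependency Item column of Table 1 remains the authoritative list for each rule.

```python
from dataclasses import dataclass
from typing import List

# Prerequisite names are taken from the notes above. Everything else in this
# block (class, function, and constant names) is hypothetical.
MONITORING = "Cloud Native Cluster Monitoring add-on with an AOM Prometheus instance"
NPD = "CCE Node Problem Detector add-on"
EVENT_COLLECTION = "Kubernetes event collection in Logging"


@dataclass(frozen=True)
class AlarmRule:
    name: str
    alarm_type: str       # "Metric" or "Event"
    expression: str = ""  # PromQL for metric rules, event name for event rules


def required_prerequisites(rule: AlarmRule) -> List[str]:
    """Return the prerequisites that the notes above imply for a rule."""
    if rule.alarm_type == "Metric":
        needed = [MONITORING]
        # Rules built on problem_gauge additionally need Node Problem Detector.
        if "problem_gauge" in rule.expression:
            needed.append(NPD)
        return needed
    if rule.alarm_type == "Event":
        return [EVENT_COLLECTION]
    raise ValueError(f"unknown alarm type: {rule.alarm_type!r}")


disk_rule = AlarmRule("Node disk read-only", "Metric",
                      'problem_gauge{type="DiskReadonly"} >= 1')
print(required_prerequisites(disk_rule))
```

For the sample rule, the function lists both the monitoring add-on and Node Problem Detector, because the expression uses problem_gauge.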

Configuring Alarm Rules

After Alarm Center is enabled for CCE standard clusters and CCE Turbo clusters, you can configure and manage alarm rules.

Log in to the CCE console.
On the cluster list page, click the cluster name to access the cluster console.
In the navigation pane, choose Alarm Center. Then, click the Alarm Rules tab and configure and manage alarm rules.

By default, Alarm Center generates alarm rules for containers. These rules cover both event alarms and metric alarms for exceptions. Alarm rules are grouped into rule sets. An alarm rule set consists of multiple alarm rules, and each alarm rule corresponds to the check items for a single exception. You can associate an alarm rule set with multiple contact groups and enable or disable individual alarm items. Table 1 lists the default alarm rules.

**Table 1** Default alarm rules
Rule Type	Alarm Item	Description	Alarm Type	Dependency Item	PromQL/Event Name
Load rule set	Abnormal pod	Check whether the pod is running normally.	Metric	Cloud Native Cluster Monitoring	sum(min_over_time(kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}[10m]) and count_over_time(kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}[10m]) > 18) by (namespace, pod, phase, cluster_name, cluster) > 0
	Frequent pod restarts	Check whether the pod frequently restarts.	Metric	Cloud Native Cluster Monitoring	increase(kube_pod_container_status_restarts_total[5m]) > 3
	Unexpected number of Deployment replicas	Check whether the number of Deployment replicas is the same as the expected value.	Metric	Cloud Native Cluster Monitoring	(kube_deployment_spec_replicas != kube_deployment_status_replicas_available ) and ( changes(kube_deployment_status_replicas_updated[5m]) == 0)
	Unexpected number of StatefulSet replicas	Check whether the number of StatefulSet replicas is the same as the expected value.	Metric	Cloud Native Cluster Monitoring	(kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas) and (changes(kube_statefulset_status_replicas_updated[5m]) == 0)
	Container CPU usage higher than 80%	Check whether the container CPU usage is higher than 80%.	Metric	Cloud Native Cluster Monitoring	100 * (sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD"}[1m])) by (cluster_name,pod,node,namespace,container, cluster) / sum(kube_pod_container_resource_limits{resource="cpu"}) by (cluster_name,pod,node,namespace,container, cluster)) > 80
	Container memory usage higher than 80%	Check whether the container memory usage is higher than 80%.	Metric	Cloud Native Cluster Monitoring	(sum(container_memory_working_set_bytes{image!="", container!="POD"}) BY (cluster_name, node,container, pod , namespace, cluster) / sum(container_spec_memory_limit_bytes > 0) BY (cluster_name, node, container, pod , namespace, cluster) * 100) > 80
	Abnormal container	Check whether the container is running normally.	Metric	Cloud Native Cluster Monitoring	sum by (namespace, pod, container, cluster_name, cluster) (kube_pod_container_status_waiting_reason) > 0
	Load balancer update failed	Check whether a load balancer update failed.	Event	Cloud Native Logging	N/A
	Pod OOM	Check whether OOM occurs on the pod.	Event	CCE Node Problem Detector (1.18.41 or later), Cloud Native Logging (1.3.2 or later)	PodOOMKilling
Node resource rule set	High usage of Kubernetes PV	Check whether the PV usage on a node is too high.	Metric	Cloud Native Cluster Monitoring	(kubelet_volume_stats_available_bytes{job="kubelet"} / kubelet_volume_stats_capacity_bytes{job="kubelet"}) < 0.03 and kubelet_volume_stats_used_bytes{job="kubelet"} > 0
	Abnormal Kubernetes PVC	Check whether the PVC is normal.	Metric	Cloud Native Cluster Monitoring	kube_persistentvolumeclaim_status_phase{phase=~"Failed|Pending|Lost"} > 0
	Abnormal Kubernetes PV	Check whether the PV is normal.	Metric	Cloud Native Cluster Monitoring	kube_persistentvolume_status_phase{phase=~"Failed|Pending"} > 0
	Node CPU usage higher than 80%	Check whether the node CPU usage is higher than 80%.	Metric	Cloud Native Cluster Monitoring	100 - (avg by(node, cluster_name, cluster) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
	Available node memory less than 10%	Check whether the available node memory is less than 10%.	Metric	Cloud Native Cluster Monitoring	node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
	Available node disk space less than 10%	Check whether the available node disk space is less than 10%.	Metric	Cloud Native Cluster Monitoring	avg((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes) by (device, node, cluster_name, cluster) < 10
	Insufficient node disk space	Check whether the node disk space is sufficient.	Event	Cloud Native Logging	N/A
	emptyDir storage pool error	Check whether the node's EV storage pool is functional.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="EmptyDirVolumeGroupStatusError"} >= 1
	Insufficient node memory	Check whether the overall node memory is sufficient.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="MemoryProblem"} >= 1
	PV storage pool error	Check whether the node's PV storage pool is functional.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="LocalPvVolumeGroupStatusError"} >= 1
	Abnormal node mount point	Check whether the node's mount point is available.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="MountPointProblem"} >= 1
	Insufficient node file handles	Check whether the FD file handles are sufficient.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="FDProblem"} >= 1
	Node disk I/O suspension	Check whether I/O suspension occurs on the node disk.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="DiskHung"} >= 1
	Node disk read-only	Check whether the node disk is read-only.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="DiskReadonly"} >= 1
	Abnormal node disk	Check the usage of the node's system disk and CCE data disks (including Docker and kubelet logical disks).	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="DiskProblem"} >= 1
	Slow node disk I/O	Check whether slow I/O occurs on the node disk.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="DiskSlow"} >= 1
	Insufficient node PIDs	Check whether the PIDs are sufficient.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="PIDProblem"} >= 1
	Node conntrack table full	Check whether the node's conntrack table space is sufficient.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="ConntrackFullProblem"} >= 1
Node status rule set	ResolvConf error	Check whether the ResolvConf configuration file is available.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="ResolvConfFileProblem"} >= 1
	Abnormal node CNI component	Check whether the CNI component of the node is running properly.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="CNIProblem"} >= 1
	Abnormal node CRI component	Check the running of the key component CRI (Docker or containerd).	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="CRIProblem"} >= 1
	Node kube-proxy error	Check whether kube-proxy is running properly.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="KUBEPROXYProblem"} >= 1
	Abnormal node kubelet	Check whether kubelet is running normally.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="KUBELETProblem"} >= 1
	Scheduled event on the node	Check whether there is a scheduled event on the node.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="ScheduledEvent"} >= 1
	Unstable node status	Check whether the node status alternates between normal and abnormal.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (cluster_name, node, cluster) > 2
	Frequent node containerd restarts	Check whether containerd frequently restarts.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="FrequentContainerdRestart"} >= 1
	Node task suspended	Check whether a task is suspended on the node.	Event	Cloud Native Logging	TaskHung
	Incorrect node storage pool configuration	Check whether the node's EV and PV storage pools are correctly configured.	Event	Cloud Native Logging	InvalidStoragePool
	Abnormal node	Check whether the node is running normally.	Event	Cloud Native Logging	NodeNotReady
	Abnormal node process D	Check whether there is a D state process on the node.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="ProcessD"} >= 1
	Abnormal node process Z	Check whether there is a Z state process on the node.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="ProcessZ"} >= 1
	Frequent node CRI restarts	Check whether CRI frequently restarts.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="FrequentCRIRestart"} >= 1
	Frequent node Docker restarts	Check whether Docker frequently restarts.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="FrequentDockerRestart"} >= 1
	Frequent node kubelet restarts	Check whether kubelet frequently restarts.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="FrequentKubeletRestart"} >= 1
	Node NTP service error	Check whether the node clock synchronization service ntpd or chronyd is running properly.	Metric	Cloud Native Cluster Monitoring, CCE Node Problem Detector	problem_gauge{type="NTPProblem"} >= 1
	Processes forcibly stopped due to node OOM	Check whether an OOM event occurred on the node.	Event	CCE Node Problem Detector	OOMKilling
Node scaling rule set	Node pool resources sold out	Check whether the node pool resources are sufficient.	Event	Cloud Native Logging	NodePoolSoldOut
	Scale-out timed out	Check whether adding nodes to the node pool timed out.	Event	Cloud Native Logging	ScaleUpTimedOut
	Node pool scale-out failed	Check whether an error occurred during a node pool scale-out.	Event	Cloud Native Logging	FailedToScaleUpGroup
	Node pool scale-in failed	Check whether an error occurred during a node pool scale-in.	Event	Cloud Native Logging	ScaleDownFailed
Cluster status rule set	Unavailable cluster	Check whether the cluster is available.	Event	Cloud Native Logging	N/A
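
To make the thresholds in Table 1 more concrete, the following sketch re-implements two of the listed expressions in plain Python over hypothetical sample values. It does not query Prometheus, the function names are invented for this illustration, and it is only an approximation of how the PromQL is evaluated for a single time series.

```python
from typing import Sequence


def node_memory_alarm(mem_available_bytes: float, mem_total_bytes: float) -> bool:
    """Mirror of the "Available node memory less than 10%" expression:
    node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    """
    return mem_available_bytes / mem_total_bytes * 100 < 10


def abnormal_pod_alarm(phase_samples: Sequence[int]) -> bool:
    """Approximation of the "Abnormal pod" rule for one pod and one phase.

    phase_samples holds the 0/1 values of kube_pod_status_phase for an
    abnormal phase (Pending, Unknown, or Failed) collected during the last
    10 minutes. The rule fires only if the minimum over the window is above 0
    (the pod was abnormal on every sample) and the window contains more than
    18 samples.
    """
    if not phase_samples:
        return False
    return min(phase_samples) > 0 and len(phase_samples) > 18


GIB = 1024 ** 3
print(node_memory_alarm(1 * GIB, 16 * GIB))  # 6.25% available
print(abnormal_pod_alarm([1] * 20))          # abnormal on all 20 samples
```

Under these assumptions, a pod that recovers even once within the 10-minute window, or a window with 18 or fewer samples, does not trigger the "Abnormal pod" alarm.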

Configuring Alarm Notification Recipients

A contact group, based on Simple Message Notification (SMN), enables message publishers and subscribers to contact each other. A contact group contains one or more endpoints. You can configure contact groups to manage the endpoints that have subscribed to alarm messages. After creating a contact group, associate an alarm rule set with the group. When an alarm is triggered, the subscription endpoints in the contact group receive the alarm messages.

Log in to the CCE console.
On the cluster list page, click the cluster name to access the cluster console.
In the navigation pane, choose Alarm Center. Then, click the Default Contact Groups tab.
Click Create Contact Group and configure parameters.
- Contact Group Name: Enter the name of the contact group, which cannot be changed after the contact group is created. The name can contain 1 to 255 characters and must start with a letter or digit. Only letters, digits, hyphens (-), and underscores (_) are allowed.
- Alarm Message Display Name: Enter the title of the message received by the specified subscription endpoint. For example, if you set Terminal Type to Email and specify a display name, the name you specified will be displayed as the alarm message sender. If no alarm message display name is specified, the sender will be username@example.com. The alarm message display name can be changed after a contact group is created.
- Add Subscription Terminal: Add one or more endpoints to receive alarm messages. The endpoint type can be SMS or Email. If you select SMS, enter a valid mobile number. If you select Email, enter a valid email address.
Click OK.

You will be redirected to the contact group list. The subscription endpoint is in the Unconfirmed state. Send a subscription request to the endpoint to verify the validity of the endpoint.
Click Request Confirmation in the Operation column to send a subscription request to the endpoint. After the endpoint receives and confirms the request, the subscription endpoint status changes to Confirmed.
Enable the contact group so that it is bound to the alarm rule set.

An alarm rule set can be bound to a maximum of five contact groups.
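
The naming rule for contact groups and the five-group limit can be checked before you enter values in the console. The following is a minimal sketch under two assumptions: "letters" is taken to mean ASCII letters, and all function and constant names are hypothetical rather than part of any CCE or SMN API.

```python
import re
from typing import Sequence

# Derived from the naming rule above: 1 to 255 characters, starting with a
# letter or digit, followed only by letters, digits, hyphens, or underscores.
# Assumption: "letters" means ASCII letters.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{0,254}")

# Limit stated above: an alarm rule set can be bound to at most five groups.
MAX_CONTACT_GROUPS_PER_RULE_SET = 5


def is_valid_contact_group_name(name: str) -> bool:
    """Return True if the name satisfies the documented naming rule."""
    return _NAME_PATTERN.fullmatch(name) is not None


def can_bind_contact_group(bound_groups: Sequence[str], new_group: str) -> bool:
    """Return True if binding new_group keeps the rule set within the limit."""
    return len(set(bound_groups) | {new_group}) <= MAX_CONTACT_GROUPS_PER_RULE_SET


print(is_valid_contact_group_name("ops-team_01"))
print(is_valid_contact_group_name("-ops"))
print(can_bind_contact_group(["g1", "g2", "g3", "g4", "g5"], "g6"))
```

In this sketch, "ops-team_01" passes the check, "-ops" fails because it starts with a hyphen, and a sixth group cannot be bound to a rule set that already has five.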

Viewing Alarms

You can view the latest historical alarms on the Alarms tab.

Log in to the CCE console.
On the cluster list page, click the cluster name to access the cluster console.
In the navigation pane, choose Alarm Center. Then, click the Alarms tab.

By default, the list displays all alarms that have not been cleared. You can query alarms by keyword, severity, or time. You can also view how the alarms that meet the specified criteria are distributed across different periods.

If an alarm that has not been cleared is not triggered again within 10 minutes, it is considered cleared by default and becomes a historical alarm. If you have confirmed that an alarm has already been handled, you can also clear it manually by clicking Clear in the Operation column. The cleared alarm then appears in the historical alarm list.
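
The clearing behavior described above can be modeled as a simple state check. This is a sketch only: the function name and state labels are hypothetical, and treating an alarm as cleared at exactly 10 minutes is an assumption, because the text does not specify the boundary.

```python
from datetime import datetime, timedelta

# Window after which an alarm that is not triggered again counts as cleared.
AUTO_CLEAR_WINDOW = timedelta(minutes=10)


def alarm_state(last_triggered: datetime, now: datetime,
                manually_cleared: bool = False) -> str:
    """Return "historical" for cleared alarms and "active" otherwise."""
    # A manual Clear moves the alarm to the historical list immediately.
    if manually_cleared:
        return "historical"
    # Otherwise the alarm is cleared once the window elapses without a
    # new trigger (boundary handling is an assumption).
    if now - last_triggered >= AUTO_CLEAR_WINDOW:
        return "historical"
    return "active"


t0 = datetime(2024, 1, 1, 12, 0, 0)
print(alarm_state(t0, t0 + timedelta(minutes=3)))
print(alarm_state(t0, t0 + timedelta(minutes=12)))
```

With these sample timestamps, the alarm is still active 3 minutes after the last trigger and is historical after 12 minutes.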

Figure 1 Querying alarms