Configuring Alarms in Alarm Center
Working with AOM (Application Operations Management), Alarm Center promptly detects cluster faults and generates alarms to help keep your services stable. Alarm Center provides built-in alarm rules, freeing you from manually configuring alarm rules on AOM. These rules are based on the extensive cluster O&M experience of the Huawei Cloud container team and cover container service exceptions, key metric alarms for basic cluster resources, and metric alarms for applications in a cluster, meeting routine O&M requirements.
Constraints
- The cluster version must be v1.17 or later.
- Only Huawei Cloud accounts, HUAWEI IDs, or IAM users with CCE Administrator or CCE FullAccess permissions can perform all operations in Alarm Center. IAM users with the CCE ReadOnlyAccess permission can only view resources.
Enabling Alarm Center
Alarm Center can be enabled for CCE standard clusters and CCE Turbo clusters.
- Log in to the CCE console. On the cluster list page, click the cluster name to access the cluster console. In the navigation pane, choose Alarm Center.
- On the Alarm Rules tab, click Enable Alarm Center. In the window that slides out from the right, select one or more contact groups to manage subscription endpoints and receive alarm messages by group. If no contact group is available, create one by referring to Configuring Alarm Notification Recipients.
- Click OK.
Metric alarm rules can be created in Alarm Center only after the Cloud Native Cluster Monitoring add-on has been installed and interconnected with an AOM Prometheus instance. For details about how to enable Monitoring Center, see Enabling Cluster Monitoring.
The alarm rules that use the problem_gauge metric in Table 1 depend on the CCE Node Problem Detector add-on. Before using these alarm rules, ensure that the add-on has been installed and is running properly.
Event alarms in Table 1 can be reported only when Kubernetes event collection is enabled in Logging. For details, see Collecting Kubernetes Events.
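If you want to confirm these prerequisites programmatically, the following is a minimal sketch that checks whether the metrics the default rules depend on are actually being collected. It assumes your interconnected Prometheus instance exposes the standard Prometheus HTTP API (/api/v1/query); the endpoint URL and access token are placeholders that you must replace with values for your own environment.

```python
# Minimal prerequisite check: verify that the metrics used by the default
# alarm rules are being collected. Assumes a Prometheus-compatible HTTP
# API; PROM_URL and the token below are placeholders, not real endpoints.
import requests

PROM_URL = "https://<your-prometheus-endpoint>"  # placeholder
HEADERS = {"Authorization": "Bearer <your-token>"}  # placeholder

def metric_available(metric: str) -> bool:
    """Return True if the metric currently has at least one series."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": f"count({metric})"},
        headers=HEADERS,
        timeout=10,
    )
    resp.raise_for_status()
    return bool(resp.json()["data"]["result"])

# kube_pod_status_phase comes from Cloud Native Cluster Monitoring;
# problem_gauge is reported only when CCE Node Problem Detector is running.
for metric in ("kube_pod_status_phase", "problem_gauge"):
    print(metric, "OK" if metric_available(metric) else "MISSING")
```

If problem_gauge is missing, the node-level rules in Table 1 that rely on it will never fire, so check the add-on status first.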
Configuring Alarm Rules
After Alarm Center is enabled for CCE standard clusters and CCE Turbo clusters, you can configure and manage alarm rules.
- Log in to the CCE console.
- On the cluster list page, click the cluster name to access the cluster console.
- In the navigation pane, choose Alarm Center. Then, click the Alarm Rules tab to configure and manage alarm rules.
By default, Alarm Center generates alarm rules for containers, covering both event alarms and metric alarms for exceptions. Alarm rules are grouped into alarm rule sets. You can associate an alarm rule set with multiple contact groups and enable or disable individual alarm items. An alarm rule set consists of multiple alarm rules, and each alarm rule corresponds to the check item for a single exception. Table 1 lists the default alarm rules.
Table 1 Default alarm rules

| Rule Type | Alarm Item | Description | Alarm Type | Dependency Item | PromQL/Event Name |
|---|---|---|---|---|---|
| Load rule set | Abnormal pod | Check whether the pod is running normally. | Metric | Cloud Native Cluster Monitoring | `sum(min_over_time(kube_pod_status_phase{phase=~"Pending\|Unknown\|Failed"}[10m]) and count_over_time(kube_pod_status_phase{phase=~"Pending\|Unknown\|Failed"}[10m]) > 18) by (namespace, pod, phase, cluster_name, cluster) > 0` |
| | Frequent pod restarts | Check whether the pod frequently restarts. | Metric | Cloud Native Cluster Monitoring | `increase(kube_pod_container_status_restarts_total[5m]) > 3` |
| | Unexpected number of Deployment replicas | Check whether the number of Deployment replicas matches the expected value. | Metric | Cloud Native Cluster Monitoring | `(kube_deployment_spec_replicas != kube_deployment_status_replicas_available) and (changes(kube_deployment_status_replicas_updated[5m]) == 0)` |
| | Unexpected number of StatefulSet replicas | Check whether the number of StatefulSet replicas matches the expected value. | Metric | Cloud Native Cluster Monitoring | `(kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas) and (changes(kube_statefulset_status_replicas_updated[5m]) == 0)` |
| | Container CPU usage higher than 80% | Check whether the container CPU usage is higher than 80%. | Metric | Cloud Native Cluster Monitoring | `100 * (sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD"}[1m])) by (cluster_name, pod, node, namespace, container, cluster) / sum(kube_pod_container_resource_limits{resource="cpu"}) by (cluster_name, pod, node, namespace, container, cluster)) > 80` |
| | Container memory usage higher than 80% | Check whether the container memory usage is higher than 80%. | Metric | Cloud Native Cluster Monitoring | `(sum(container_memory_working_set_bytes{image!="", container!="POD"}) by (cluster_name, node, container, pod, namespace, cluster) / sum(container_spec_memory_limit_bytes > 0) by (cluster_name, node, container, pod, namespace, cluster) * 100) > 80` |
| | Abnormal container | Check whether the container is running normally. | Metric | Cloud Native Cluster Monitoring | `sum by (namespace, pod, container, cluster_name, cluster) (kube_pod_container_status_waiting_reason) > 0` |
| | Load balancer update failed | Check whether a load balancer fails to be updated. | Event | Cloud Native Logging | N/A |
| | Pod OOM | Check whether OOM occurs in the pod. | Event | CCE Node Problem Detector (1.18.41 or later), Cloud Native Logging (1.3.2 or later) | PodOOMKilling |
| Node resource rule set | High usage of Kubernetes PV | Check whether the PV usage on a node is too high. | Metric | Cloud Native Cluster Monitoring | `(kubelet_volume_stats_available_bytes{job="kubelet"} / kubelet_volume_stats_capacity_bytes{job="kubelet"}) < 0.03 and kubelet_volume_stats_used_bytes{job="kubelet"} > 0` |
| | Abnormal Kubernetes PVC | Check whether the PVC is normal. | Metric | Cloud Native Cluster Monitoring | `kube_persistentvolumeclaim_status_phase{phase=~"Failed\|Pending\|Lost"} > 0` |
| | Abnormal Kubernetes PV | Check whether the PV is normal. | Metric | Cloud Native Cluster Monitoring | `kube_persistentvolume_status_phase{phase=~"Failed\|Pending"} > 0` |
| | Node CPU usage higher than 80% | Check whether the node CPU usage is higher than 80%. | Metric | Cloud Native Cluster Monitoring | `100 - (avg by (node, cluster_name, cluster) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80` |
| | Available node memory less than 10% | Check whether the available node memory is less than 10%. | Metric | Cloud Native Cluster Monitoring | `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10` |
| | Available node disk space less than 10% | Check whether the available node disk space is less than 10%. | Metric | Cloud Native Cluster Monitoring | `avg((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes) by (device, node, cluster_name, cluster) < 10` |
| | Insufficient node disk space | Check whether the node disk space is sufficient. | Event | Cloud Native Logging | N/A |
| | emptyDir storage pool error | Check whether the node's EV storage pool is functional. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="EmptyDirVolumeGroupStatusError"} >= 1` |
| | Insufficient node memory | Check whether the overall node memory is sufficient. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="MemoryProblem"} >= 1` |
| | PV storage pool error | Check whether the node's PV storage pool is functional. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="LocalPvVolumeGroupStatusError"} >= 1` |
| | Abnormal node mount point | Check whether the node's mount point is available. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="MountPointProblem"} >= 1` |
| | Insufficient node file handles | Check whether the node's file handles (FDs) are sufficient. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="FDProblem"} >= 1` |
| | Node disk I/O suspension | Check whether I/O suspension occurs on the node disk. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="DiskHung"} >= 1` |
| | Node disk read-only | Check whether the node disk is read-only. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="DiskReadonly"} >= 1` |
| | Abnormal node disk | Check the usage of the node's system disk and CCE data disks (including Docker and kubelet logical disks). | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="DiskProblem"} >= 1` |
| | Slow node disk I/O | Check whether slow I/O occurs on the node disk. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="DiskSlow"} >= 1` |
| | Insufficient node PIDs | Check whether the node's PIDs are sufficient. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="PIDProblem"} >= 1` |
| | Node conntrack table full | Check whether the node's conntrack table space is sufficient. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="ConntrackFullProblem"} >= 1` |
| Node status rule set | ResolvConf error | Check whether the ResolvConf configuration file is available. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="ResolvConfFileProblem"} >= 1` |
| | Abnormal node CNI component | Check whether the node's CNI component is running properly. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="CNIProblem"} >= 1` |
| | Abnormal node CRI component | Check whether the key CRI component (Docker or containerd) is running properly. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="CRIProblem"} >= 1` |
| | Node kube-proxy error | Check whether kube-proxy is running properly. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="KUBEPROXYProblem"} >= 1` |
| | Abnormal node kubelet | Check whether kubelet is running normally. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="KUBELETProblem"} >= 1` |
| | Scheduled event on the node | Check whether there is a scheduled event on the node. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="ScheduledEvent"} >= 1` |
| | Unstable node status | Check whether the node status alternates between normal and abnormal. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (cluster_name, node, cluster) > 2` |
| | Frequent node containerd restarts | Check whether containerd frequently restarts. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="FrequentContainerdRestart"} >= 1` |
| | Node task suspended | Check whether a task is suspended on the node. | Event | Cloud Native Logging | TaskHung |
| | Incorrect node storage pool configuration | Check whether the node's EV and PV storage pools are correctly configured. | Event | Cloud Native Logging | InvalidStoragePool |
| | Abnormal node | Check whether the node is running normally. | Event | Cloud Native Logging | NodeNotReady |
| | Abnormal node process D | Check whether there is a process in the D state on the node. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="ProcessD"} >= 1` |
| | Abnormal node process Z | Check whether there is a process in the Z state on the node. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="ProcessZ"} >= 1` |
| | Frequent node CRI restarts | Check whether CRI frequently restarts. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="FrequentCRIRestart"} >= 1` |
| | Frequent node Docker restarts | Check whether Docker frequently restarts. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="FrequentDockerRestart"} >= 1` |
| | Frequent node kubelet restarts | Check whether kubelet frequently restarts. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="FrequentKubeletRestart"} >= 1` |
| | Node NTP service error | Check whether the node clock synchronization service (ntpd or chronyd) is running properly. | Metric | Cloud Native Cluster Monitoring, CCE Node Problem Detector | `problem_gauge{type="NTPProblem"} >= 1` |
| | Processes forcibly stopped due to node OOM | Check whether an OOM event occurred on the node. | Event | CCE Node Problem Detector | OOMKilling |
| Node scaling rule set | Node pool resources sold out | Check whether the node pool resources are sufficient. | Event | Cloud Native Logging | NodePoolSoldOut |
| | Scale-out timed out | Check whether adding nodes to the node pool timed out. | Event | Cloud Native Logging | ScaleUpTimedOut |
| | Node pool scale-out failed | Check whether an error occurred during a node pool scale-out. | Event | Cloud Native Logging | FailedToScaleUpGroup |
| | Node pool scale-in failed | Check whether an error occurred during a node pool scale-in. | Event | Cloud Native Logging | ScaleDownFailed |
| Cluster status rule set | Unavailable cluster | Check whether the cluster is available. | Event | Cloud Native Logging | N/A |
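The PromQL expressions in Table 1 can also be evaluated ad hoc to preview which series would currently trigger an alarm. The following is a minimal sketch, again assuming a Prometheus-compatible HTTP API; the endpoint URL and token are placeholders, and the expression shown is the "Frequent pod restarts" rule from Table 1.

```python
# Preview which series currently satisfy an alarm expression from Table 1.
# PROM_URL and the token are placeholders for your own Prometheus endpoint.
import requests

PROM_URL = "https://<your-prometheus-endpoint>"  # placeholder
HEADERS = {"Authorization": "Bearer <your-token>"}  # placeholder

# "Frequent pod restarts" expression from Table 1.
EXPR = "increase(kube_pod_container_status_restarts_total[5m]) > 3"

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": EXPR},
    headers=HEADERS,
    timeout=10,
)
resp.raise_for_status()

# Each result is a series whose labels identify the offending pod.
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    value = series["value"][1]  # instant vector value: [timestamp, value]
    print(f'{labels.get("namespace")}/{labels.get("pod")}: {value} restarts in 5m')
```

An empty result means no pod currently exceeds the restart threshold, so the corresponding alarm would not fire.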
Configuring Alarm Notification Recipients
A contact group, backed by Simple Message Notification (SMN), enables message publishers and subscribers to reach each other. A contact group contains one or more subscription endpoints. You can configure contact groups to manage the endpoints that subscribe to alarm messages. After creating a contact group, associate an alarm rule set with it. When an alarm is triggered, the subscription endpoints in the associated contact group receive the alarm messages.
- Log in to the CCE console.
- On the cluster list page, click the cluster name to access the cluster console.
- In the navigation pane, choose Alarm Center. Then, click the Default Contact Groups tab.
- Click Create Contact Group and configure parameters.
- Contact Group Name: Enter the name of the contact group, which cannot be changed after the contact group is created. The name can contain 1 to 255 characters and must start with a letter or digit. Only letters, digits, hyphens (-), and underscores (_) are allowed. (A validation sketch follows this procedure.)
- Alarm Message Display Name: Enter the title of the message received by the specified subscription endpoint. For example, if you set Terminal Type to Email and specify a display name, the name you specified will be displayed as the alarm message sender. If no alarm message display name is specified, the sender will be username@example.com. The alarm message display name can be changed after a contact group is created.
- Add Subscription Terminal: Add one or more endpoints to receive alarm messages. The endpoint type can be SMS or Email. If you select SMS, enter a valid mobile number. If you select Email, enter a valid email address.
- Click OK.
You will be redirected to the contact group list. The subscription endpoint is in the Unconfirmed state. Send a subscription request to the endpoint to verify the validity of the endpoint.
- Click Request Confirmation in the Operation column to send a subscription request to the endpoint. After the endpoint receives and confirms the request, the subscription endpoint status changes to Confirmed.
- Click the toggle switch to enable the contact group so that it is bound to the alarm rule set.
An alarm rule set can be bound to a maximum of five contact groups.
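As a quick reference for the naming rule described in the procedure above, the following is a minimal sketch that validates a contact group name before you submit it on the console. The helper function is illustrative only and not part of any CCE API.

```python
# Validate a contact group name per the documented rule: 1 to 255
# characters, starting with a letter or digit, containing only letters,
# digits, hyphens (-), and underscores (_).
import re

NAME_PATTERN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_-]{0,254}$")

def is_valid_group_name(name: str) -> bool:  # illustrative helper
    return NAME_PATTERN.fullmatch(name) is not None

assert is_valid_group_name("ops-alerts_01")
assert not is_valid_group_name("-starts-with-hyphen")  # must start with letter/digit
assert not is_valid_group_name("a" * 256)              # exceeds 255 characters
```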
Viewing Alarms
You can view the latest alarms and historical alarms on the Alarms tab.
- Log in to the CCE console.
- On the cluster list page, click the cluster name to access the cluster console.
- In the navigation pane, choose Alarm Center. Then, click the Alarms tab.
By default, all uncleared alarms are displayed in the list. You can filter alarms by keyword, severity, or time range, and view the distribution of matching alarms over different periods.
If an uncleared alarm is not triggered again within 10 minutes, it is automatically considered cleared and becomes a historical alarm. If you have already handled an alarm, you can also click Clear in the Operation column. Cleared alarms can be viewed in the historical alarm list.
Figure 1 Querying alarms