Configuring Alarms in Alarm Center
With Application Operations Management (AOM), Alarm Center promptly detects cluster faults and generates alarms to help keep services stable. Alarm Center provides built-in alarm rules so that you do not need to configure alarm rules manually on AOM. These rules are based on the extensive cluster O&M experience of the Huawei Cloud container team. They cover container service exceptions, key metric alarms of basic cluster resources, and metric alarms of applications in a cluster to meet routine O&M requirements.
Constraints
Only Huawei Cloud accounts, HUAWEI IDs, or IAM users with CCE administrator or FullAccess permissions can perform all operations in Alarm Center. IAM users with the CCE ReadOnlyAccess permission can view all resources but cannot modify them.
Enabling Alarm Center
- Click the cluster name to access the cluster console. In the navigation pane on the left, choose Alarm Center.
- On the Alarm Rules tab, click Enable Alarm Center. In the window that slides out from the right, select one or more contact groups to manage subscription terminals and receive alarm messages by group. If no contact group is available, create one by referring to Configuring Alarm Notification Recipients.
- Click OK.
Metric alarm rules can be created in Alarm Center only after the Cloud Native Cluster Monitoring add-on is installed and the AOM Prometheus instance is interconnected. For details about how to enable Monitoring Center, see Enabling Cluster Monitoring.
Event alarms in Table 1 can be reported only when Kubernetes event collection is enabled in Logging. For details, see Collecting Kubernetes Events.
Configuring Alarm Rules
After Alarm Center is enabled for clusters, you can configure and manage alarm rules.
- Log in to the CCE console.
- On the cluster list page, click the name of the target cluster to go to the details page.
- In the navigation pane on the left, choose Alarm Center. Then, click the Alarm Rules tab and configure and manage alarm rules.
By default, Alarm Center generates alarm rules for containers. These rules cover both event alarms and metric alarms for exceptions. Alarm rules are grouped into rule sets: an alarm rule set consists of multiple alarm rules, and each alarm rule corresponds to the check items for a single exception. You can associate an alarm rule set with multiple contact groups and enable or disable individual alarm items. Table 1 lists the default alarm rules.
Table 1 Default alarm rules
Rule Type | Alarm Item | Description | Alarm Type | Dependency Item | PromQL/Event Name |
---|---|---|---|---|---|
Load rule set | Abnormal pod | Check whether the pod is running normally. | Metric | Cloud Native Cluster Monitoring | `sum(min_over_time(kube_pod_status_phase{phase=~"Pending\|Unknown\|Failed"}[10m]) and count_over_time(kube_pod_status_phase{phase=~"Pending\|Unknown\|Failed"}[10m]) > 18 )by (namespace,pod, phase, cluster_name, cluster) > 0` |
Load rule set | Frequent pod restarts | Check whether the pod frequently restarts. | Metric | Cloud Native Cluster Monitoring | `increase(kube_pod_container_status_restarts_total[5m]) > 3` |
Load rule set | Unexpected number of Deployment replicas | Check whether the number of Deployment replicas is the same as the expected value. | Metric | Cloud Native Cluster Monitoring | `(kube_deployment_spec_replicas != kube_deployment_status_replicas_available ) and ( changes(kube_deployment_status_replicas_updated[5m]) == 0)` |
Load rule set | Unexpected number of StatefulSet replicas | Check whether the number of StatefulSet replicas is the same as the expected value. | Metric | Cloud Native Cluster Monitoring | `(kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas) and (changes(kube_statefulset_status_replicas_updated[5m]) == 0)` |
Load rule set | Container CPU usage higher than 80% | Check whether the container CPU usage is higher than 80%. | Metric | Cloud Native Cluster Monitoring | `100 * (sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD"}[1m])) by (cluster_name,pod,node,namespace,container, cluster) / sum(kube_pod_container_resource_limits{resource="cpu"}) by (cluster_name,pod,node,namespace,container, cluster)) > 80` |
Load rule set | Container memory usage higher than 80% | Check whether the container memory usage is higher than 80%. | Metric | Cloud Native Cluster Monitoring | `(sum(container_memory_working_set_bytes{image!="", container!="POD"}) BY (cluster_name, node,container, pod , namespace, cluster) / sum(container_spec_memory_limit_bytes > 0) BY (cluster_name, node, container, pod , namespace, cluster) * 100) > 80` |
Load rule set | Abnormal container | Check whether the container is running normally. | Metric | Cloud Native Cluster Monitoring | `sum by (namespace, pod, container, cluster_name, cluster) (kube_pod_container_status_waiting_reason) > 0` |
Load rule set | UpdateLoadBalancerFailed | Check whether a load balancer is updated. | Event | Cloud Native Logging | N/A |
Load rule set | Pod OOM | Check whether an OOM occurs in the pod. | Event | CCE Node Problem Detector, Cloud Native Logging | PodOOMKilling |
Cluster status rule set | Unavailable cluster | Check whether the cluster is available. | Event | Cloud Native Logging | N/A |
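To see which series a metric rule from Table 1 currently matches, the expression can be run as an instant query through the standard Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch, not an official procedure. It uses the "Frequent pod restarts" expression from Table 1. The base URL is a placeholder, authentication is omitted, and whether and how your AOM Prometheus instance exposes this endpoint is an assumption you need to verify. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# PromQL copied from the "Frequent pod restarts" rule in Table 1.
FREQUENT_RESTARTS = "increase(kube_pod_container_status_restarts_total[5m]) > 3"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def matching_series(response: dict) -> list:
    """Return (labels, value) pairs from a decoded instant-query response."""
    if response.get("status") != "success":
        raise ValueError("query failed: " + str(response.get("error", "unknown error")))
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got " + data["resultType"])
    # Each sample is {"metric": {labels}, "value": [timestamp, "value-as-string"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def run_rule(base_url: str, promql: str, timeout: float = 10.0) -> list:
    """Send the query and return the matching series.

    Assumption: the endpoint is reachable without authentication.
    """
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return matching_series(json.load(resp))
```

For example, `run_rule("http://prometheus.example:9090", FREQUENT_RESTARTS)` (a placeholder address) would return one entry per container that restarted more than three times in the last five minutes. An empty list means the rule matches nothing at the moment.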
Configuring Alarm Notification Recipients
A contact group, based on Simple Message Notification (SMN), enables message publishers and subscribers to contact each other. A contact group contains one or more terminals. You can configure contact groups to manage terminals that have subscribed to alarm messages. After creating a contact group, associate an alarm rule set with the group. When an alarm is triggered, the subscription terminals in the contact group receive the alarm messages.
- Log in to the CCE console.
- On the cluster list page, click the name of the target cluster to go to the details page.
- In the navigation pane on the left, choose Alarm Center. Then, click the Default Contact Groups tab.
- Click Create Contact Group and configure parameters.
  - Contact Group Name: Enter the name of the contact group, which cannot be changed after the contact group is created. The name can contain 1 to 255 characters and must start with a letter or digit. Only letters, digits, hyphens (-), and underscores (_) are allowed.
  - Alarm Message Display Name: Enter the title of the message received by the specified subscription terminal. For example, if you set Terminal Type to Email and specify a display name, the name you specified will be displayed as the alarm message sender. If no alarm message display name is specified, the sender will be username@example.com. The alarm message display name can be changed after a contact group is created.
  - Add Subscription Terminal: Add one or more terminals to receive alarm messages. The terminal type can be SMS or Email. If you select SMS, enter a valid mobile number. If you select Email, enter a valid email address.
- Click OK.
You will be redirected to the contact group list. The subscription terminal is in the Unconfirmed state. Send a subscription request to the terminal to verify its validity.
- Click Request Confirmation in the Operation column to send a subscription request to the terminal. After the terminal receives and confirms the request, the subscription terminal status changes to Confirmed.
- Enable the contact group so that it is bound to the alarm rule set.
An alarm rule set can be bound to a maximum of five contact groups.
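Because a contact group name cannot be changed after creation, it can be worth checking a candidate name against the naming rules above before submitting it. The following is a minimal sketch of such a check. The function name is hypothetical, and it assumes "letters" and "digits" mean ASCII characters, which the console may interpret differently.

```python
import re

# 1 to 255 characters; the first must be a letter or digit; the rest may be
# letters, digits, hyphens (-), or underscores (_).
# Assumption: only ASCII letters and digits are accepted.
_CONTACT_GROUP_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{0,254}")


def is_valid_contact_group_name(name: str) -> bool:
    """Return True if name satisfies the documented contact group naming rules."""
    return _CONTACT_GROUP_NAME.fullmatch(name) is not None
```

For example, `is_valid_contact_group_name("ops-team_1")` returns `True`, while `"-ops"` (starts with a hyphen) and `"ops team"` (contains a space) are rejected.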
Viewing Alarms
You can view the latest alarms and historical alarms on the Alarms tab.
- Log in to the CCE console.
- On the cluster list page, click the name of the target cluster to go to the details page.
- In the navigation pane on the left, choose Alarm Center. Then, click the Alarms tab.
By default, the list displays all alarms that have not been cleared. You can query alarms by alarm keyword, alarm severity, or alarm time. In addition, you can view the distribution of alarms that meet the specified criteria in different periods.
If you confirm that an alarm has been handled, click Clear in the Operation column. After the alarm is cleared, you can view it in the historical alarm list.
Figure 1 Querying alarms