Updated on 2024-07-05 GMT+08:00

Configuring Alarms in Alarm Center

Alarm Center uses Application Operations Management (AOM) to detect cluster faults promptly and generate alarms, which helps keep services stable. It provides built-in alarm rules, so you do not need to configure alarm rules manually on AOM. These rules are based on the cluster O&M experience of the Huawei Cloud container team. They cover container service exceptions, key metrics of basic cluster resources, and metrics of applications in a cluster to meet routine O&M requirements.

Constraints

Only Huawei Cloud accounts, HUAWEI IDs, and IAM users with CCE administrator or FullAccess permissions can perform all operations in Alarm Center. IAM users with the CCE ReadOnlyAccess permission can only view resources.

Enabling Alarm Center

  1. Click the cluster name to access the cluster console. In the navigation pane on the left, choose Alarm Center.
  2. On the Alarm Rules tab, click Enable Alarm Center. In the window that slides out from the right, select one or more contact groups to manage subscription endpoints and receive alarm messages by group. If no contact group is available, create one by referring to Configuring Alarm Notification Recipients.
  3. Click OK.

    Metric alarm rules can be created in Alarm Center only after the Cloud Native Cluster Monitoring add-on is installed and interconnected with an AOM Prometheus instance. For details about how to enable Monitoring Center, see Enabling Cluster Monitoring.

    Event alarms in Table 1 can be reported only when Kubernetes event collection is enabled in Logging. For details, see Collecting Kubernetes Events.

Configuring Alarm Rules

After Alarm Center is enabled for clusters, you can configure and manage alarm rules.

  1. Log in to the CCE console.
  2. On the cluster list page, click the name of the target cluster to go to the details page.
  3. In the navigation pane on the left, choose Alarm Center. Then, click the Alarm Rules tab and configure and manage alarm rules.

    By default, Alarm Center generates alarm rules for containers. These rules include event alarms and metric alarms for exceptions. Alarm rules are grouped into alarm rule sets. An alarm rule set consists of multiple alarm rules, and each alarm rule corresponds to the check item for a single exception. You can associate an alarm rule set with multiple contact groups and enable or disable individual alarm items. Table 1 lists the default alarm rules.

Table 1 Default alarm rules

Rule Type: Load rule set

    • Alarm Item: Abnormal pod
      Description: Check whether the pod is running normally.
      Alarm Type: Metric
      Dependency Item: Cloud Native Cluster Monitoring
      PromQL/Event Name: sum(min_over_time(kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}[10m]) and count_over_time(kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}[10m]) > 18 )by (namespace,pod, phase, cluster_name, cluster) > 0

    • Alarm Item: Frequent pod restarts
      Description: Check whether the pod frequently restarts.
      Alarm Type: Metric
      Dependency Item: Cloud Native Cluster Monitoring
      PromQL/Event Name: increase(kube_pod_container_status_restarts_total[5m]) > 3

    • Alarm Item: Unexpected number of Deployment replicas
      Description: Check whether the number of Deployment replicas is the same as the expected value.
      Alarm Type: Metric
      Dependency Item: Cloud Native Cluster Monitoring
      PromQL/Event Name: (kube_deployment_spec_replicas != kube_deployment_status_replicas_available ) and ( changes(kube_deployment_status_replicas_updated[5m]) == 0)

    • Alarm Item: Unexpected number of StatefulSet replicas
      Description: Check whether the number of StatefulSet replicas is the same as the expected value.
      Alarm Type: Metric
      Dependency Item: Cloud Native Cluster Monitoring
      PromQL/Event Name: (kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas) and (changes(kube_statefulset_status_replicas_updated[5m]) == 0)

    • Alarm Item: Container CPU usage higher than 80%
      Description: Check whether the container CPU usage is higher than 80%.
      Alarm Type: Metric
      Dependency Item: Cloud Native Cluster Monitoring
      PromQL/Event Name: 100 * (sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD"}[1m])) by (cluster_name,pod,node,namespace,container, cluster) / sum(kube_pod_container_resource_limits{resource="cpu"}) by (cluster_name,pod,node,namespace,container, cluster)) > 80

    • Alarm Item: Container memory usage higher than 80%
      Description: Check whether the container memory usage is higher than 80%.
      Alarm Type: Metric
      Dependency Item: Cloud Native Cluster Monitoring
      PromQL/Event Name: (sum(container_memory_working_set_bytes{image!="", container!="POD"}) BY (cluster_name, node,container, pod , namespace, cluster) / sum(container_spec_memory_limit_bytes > 0) BY (cluster_name, node, container, pod , namespace, cluster) * 100) > 80

    • Alarm Item: Abnormal container
      Description: Check whether the container is running normally.
      Alarm Type: Metric
      Dependency Item: Cloud Native Cluster Monitoring
      PromQL/Event Name: sum by (namespace, pod, container, cluster_name, cluster) (kube_pod_container_status_waiting_reason) > 0

    • Alarm Item: UpdateLoadBalancerFailed
      Description: Check whether a load balancer is updated.
      Alarm Type: Event
      Dependency Item: Cloud Native Logging
      PromQL/Event Name: N/A

    • Alarm Item: Pod OOM
      Description: Check whether an OOM occurs in the pod.
      Alarm Type: Event
      Dependency Item: CCE Node Problem Detector, Cloud Native Logging
      PromQL/Event Name: PodOOMKilling

Rule Type: Cluster status rule set

    • Alarm Item: Unavailable cluster
      Description: Check whether the cluster is available.
      Alarm Type: Event
      Dependency Item: Cloud Native Logging
      PromQL/Event Name: N/A
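
The PromQL expressions in Table 1 are ordinary Prometheus queries, so one way to preview which series a rule would match is to run the expression as an instant query. The following is a minimal Python sketch, assuming a Prometheus-compatible endpoint that serves the standard GET /api/v1/query API. The base URL is a placeholder, the helper names build_instant_query_url and matching_series are hypothetical, and the sketch does not cover any authentication that an AOM Prometheus instance may require.

```python
from urllib.parse import urlencode

# PromQL copied from the "Frequent pod restarts" row of Table 1.
FREQUENT_POD_RESTARTS = "increase(kube_pod_container_status_restarts_total[5m]) > 3"


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL of a Prometheus instant query (GET /api/v1/query)."""
    # urlencode percent-encodes the brackets, parentheses, and operators in the PromQL.
    query_string = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def matching_series(response_body: dict) -> list:
    """Return the label sets of the series in an instant-query response.

    A non-empty list means the expression currently matches at least one series.
    """
    if response_body.get("status") != "success":
        error = response_body.get("error", "unknown error")
        raise ValueError("query failed: " + str(error))
    return [sample["metric"] for sample in response_body["data"]["result"]]


if __name__ == "__main__":
    # Placeholder endpoint (assumption); replace it with your own instance.
    print(build_instant_query_url("http://prometheus.example.com:9090", FREQUENT_POD_RESTARTS))
```

Sending the request and decoding the JSON body are left to the HTTP client of your choice. matching_series only interprets the documented response shape, in which data.result holds one entry per matching series.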

Configuring Alarm Notification Recipients

A contact group is based on Simple Message Notification (SMN), which connects message publishers with subscribers. A contact group contains one or more endpoints, and you can use contact groups to manage the endpoints that subscribe to alarm messages. After creating a contact group, associate an alarm rule set with it. When an alarm is triggered, the subscription endpoints in the contact group receive the alarm message.

  1. Log in to the CCE console.
  2. On the cluster list page, click the name of the target cluster to go to the details page.
  3. In the navigation pane on the left, choose Alarm Center. Then, click the Contact Groups tab.
  4. Click Create Contact Group and configure parameters.

    • Contact Group Name: Enter the name of the contact group, which cannot be changed after the contact group is created. The name can contain 1 to 255 characters and must start with a letter or digit. Only letters, digits, hyphens (-), and underscores (_) are allowed.
    • Alarm message display name: Enter the title of the message received by the specified subscription endpoint. For example, if you set Terminal Type to Email and specify a display name, the name you specified will be displayed as the alarm message sender. If you do not specify Alarm message display name, the sender will be username@example.com. The display name of an alarm message can be changed after the contact group is created.
    • Add Subscription Terminal: Add one or more endpoints to receive alarm messages. The endpoint type can be SMS or Email. If you select SMS, enter a valid mobile number. If you select Email, enter a valid email address.

  5. Click OK.

    You will be redirected to the contact group list. The subscription endpoint is in the Unconfirmed state. Send a subscription request to the endpoint to verify the validity of the endpoint.

  6. Click Request Confirmation in the Operation column to send a subscription request to the endpoint. When the endpoint receives the request, confirm it as prompted. After the confirmation is complete, the status of the subscription endpoint changes to Confirmed.
  7. Enable the contact group so that it is bound to the alarm rule set.

    An alarm rule set can be bound to a maximum of five contact groups.
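
The naming constraints in step 4 and the binding limit above can be checked before you fill in the console form. The following is a minimal Python sketch and not part of any CCE API: the function and class names are hypothetical, and "letters" is assumed to mean ASCII letters.

```python
import re
from dataclasses import dataclass, field

# 1 to 255 characters: the first is a letter or digit, and the rest are
# letters, digits, hyphens, or underscores (ASCII letters are an assumption).
_NAME_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{0,254}")

# Stated in this section: a rule set can be bound to at most five contact groups.
MAX_GROUPS_PER_RULE_SET = 5


def is_valid_contact_group_name(name: str) -> bool:
    """Return True if name satisfies the Contact Group Name constraints."""
    return _NAME_PATTERN.fullmatch(name) is not None


@dataclass
class AlarmRuleSet:
    """Hypothetical local model of an alarm rule set and its bound contact groups."""
    name: str
    contact_groups: list = field(default_factory=list)

    def bind(self, group_name: str) -> None:
        """Bind a contact group, enforcing the name rules and the five-group limit."""
        if not is_valid_contact_group_name(group_name):
            raise ValueError("invalid contact group name: " + repr(group_name))
        if group_name in self.contact_groups:
            return  # already bound, nothing to do
        if len(self.contact_groups) >= MAX_GROUPS_PER_RULE_SET:
            raise ValueError("a rule set can be bound to at most five contact groups")
        self.contact_groups.append(group_name)
```

For example, is_valid_contact_group_name("ops-team_1") returns True, whereas "-ops" is rejected because it starts with a hyphen.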

Viewing Alarms

You can view the latest alarms and historical alarms on the Alarms tab.

  1. Log in to the CCE console.
  2. On the cluster list page, click the name of the target cluster to go to the details page.
  3. In the navigation pane on the left, choose Alarm Center. Then, click the Alarms tab.

    By default, the list displays all alarms that have not been cleared. You can query alarms by keyword, severity, or time. You can also view how the alarms that meet the specified criteria are distributed across different periods.

    If you confirm that an alarm has been handled, click Clear in the Operation column. After the alarm is cleared, you can view it in the historical alarm list.

    Figure 1 Querying alarms
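
The default view and the query criteria described in this section can be thought of as a filter over alarm records. The following is a minimal Python sketch of that idea, not the console's implementation: the Alarm record, its field names, and the severity values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Alarm:
    """Hypothetical alarm record; the field names are assumptions."""
    name: str
    severity: str
    raised_at: datetime
    cleared: bool = False


def query_alarms(
    alarms: List[Alarm],
    keyword: Optional[str] = None,
    severity: Optional[str] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    include_cleared: bool = False,
) -> List[Alarm]:
    """Filter alarms by keyword, severity, and time range.

    By default only uncleared alarms are returned, mirroring the default list.
    """
    matches = []
    for alarm in alarms:
        if alarm.cleared and not include_cleared:
            continue
        if keyword is not None and keyword.lower() not in alarm.name.lower():
            continue
        if severity is not None and alarm.severity != severity:
            continue
        if start is not None and alarm.raised_at < start:
            continue
        if end is not None and alarm.raised_at > end:
            continue
        matches.append(alarm)
    return matches
```

Passing include_cleared=True corresponds to looking at the historical alarm list, which also contains alarms that have been cleared.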