Updated on 2024-06-26 GMT+08:00

Configuring Alarms on Alarm Center

Alarm Center works with AOM to promptly detect cluster faults and generate alarms, helping you keep services stable. It provides built-in alarm rules, so you do not need to configure alarm rules manually on AOM. These rules are based on the extensive cluster O&M experience of the Huawei Cloud container team. They cover container service exceptions, key metrics of basic cluster resources, and metrics of applications in a cluster, meeting routine O&M requirements.

Constraints

  • The cluster must be v1.17 or later.
  • Only Huawei Cloud accounts, HUAWEI IDs, or IAM users with CCE administrator or FullAccess permissions can perform all operations in Alarm Center. IAM users with the CCE ReadOnlyAccess permission can only view resources.
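
The version constraint above can be checked with a short script. The following is a minimal sketch that assumes the cluster version is available as a string such as v1.17 or v1.28.3-r0. The function names are hypothetical and are not part of any CCE API.

```python
import re
from typing import Tuple

# Minimum cluster version stated in the constraints above.
MIN_VERSION = (1, 17)


def parse_cluster_version(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a string such as 'v1.17' or 'v1.28.3-r0'.

    The string format is an assumption made for this sketch.
    """
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized cluster version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_alarm_center(version: str) -> bool:
    """Return True if the cluster version is v1.17 or later."""
    return parse_cluster_version(version) >= MIN_VERSION
```

Comparing the versions as integer tuples avoids the string-comparison pitfall where "1.9" would sort after "1.17".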

Enabling Alarm Center

Alarm Center can be enabled for CCE standard clusters and CCE Turbo clusters.

  1. Click the cluster name to access the cluster console. In the navigation pane on the left, choose Alarm Center.
  2. On the Alarm Rules tab, click Enable Alarm Center. In the window that slides out from the right, select one or more contact groups to manage subscription endpoints and receive alarm messages by group.
  3. Click OK.

    Metric alarm rules can be created on Alarm Center only after Cloud Native Cluster Monitoring is installed and connected to an AOM Prometheus instance. For details about how to enable Monitoring Center, see Enabling Cluster Monitoring.

    The alarm rules that use the problem_gauge metric in Table 1 depend on CCE Node Problem Detector. To use these alarm rules, ensure that CCE Node Problem Detector has been installed and is running normally.

    Event alarms in Table 1 can be reported only when Kubernetes event collection is enabled in Logging. For details, see Collecting Kubernetes Events.
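
The three prerequisites above can be summarized as a simple lookup. The following is a minimal sketch that assumes an alarm rule is described only by its alarm type and, for metric alarms, its PromQL expression. The function name is hypothetical, and Table 1 remains the reference for the exact dependencies of each rule.

```python
from typing import List


def required_addons(alarm_type: str, expression: str = "") -> List[str]:
    """Return the add-ons a rule needs, following the notes above.

    alarm_type is "Metric" or "Event"; expression is the PromQL text
    of a metric rule. This is a simplification of Table 1.
    """
    if alarm_type == "Metric":
        addons = ["Cloud Native Cluster Monitoring"]
        # Rules built on problem_gauge also need CCE Node Problem Detector.
        if "problem_gauge" in expression:
            addons.append("CCE Node Problem Detector")
        return addons
    if alarm_type == "Event":
        # Event alarms need Kubernetes event collection enabled in Logging.
        return ["Cloud Native Logging (Kubernetes event collection)"]
    raise ValueError(f"unknown alarm type: {alarm_type!r}")
```

Some rules in Table 1 list additional dependencies, for example Pod OOM, so treat the result as a starting point rather than a complete check.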

Configuring Alarm Rules

After Alarm Center is enabled for CCE standard clusters and CCE Turbo clusters, you can configure and manage alarm rules.

  1. Log in to the CCE console.
  2. On the cluster list page, click the cluster name to access the cluster console.
  3. In the navigation pane on the left, choose Alarm Center. Then, click the Alarm Rules tab and configure and manage alarm rules.

    By default, Alarm Center generates alarm rules for containers, covering both event alarms and metric alarms for exceptions. Alarm rules are grouped into alarm rule sets. An alarm rule set consists of multiple alarm rules, and each alarm rule corresponds to the check item for a single exception. You can associate an alarm rule set with multiple contact groups and enable or disable individual alarm items. Table 1 lists the default alarm rules.
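
The relationship between alarm rule sets, alarm rules, and contact groups can be pictured as a small data model. The following is a minimal sketch, not the Alarm Center implementation. The class and method names are hypothetical, and the limit of five contact groups per rule set is taken from the contact group section of this document.

```python
from dataclasses import dataclass, field
from typing import List

# Limit stated in "Configuring Alarm Notification Recipients".
MAX_CONTACT_GROUPS = 5


@dataclass
class AlarmRule:
    name: str
    alarm_type: str  # "Metric" or "Event"
    enabled: bool = True


@dataclass
class AlarmRuleSet:
    name: str
    rules: List[AlarmRule] = field(default_factory=list)
    contact_groups: List[str] = field(default_factory=list)

    def bind_contact_group(self, group: str) -> None:
        """Associate a contact group with this rule set."""
        if group in self.contact_groups:
            return
        if len(self.contact_groups) >= MAX_CONTACT_GROUPS:
            raise ValueError("a rule set can be bound to at most five contact groups")
        self.contact_groups.append(group)

    def set_enabled(self, rule_name: str, enabled: bool) -> None:
        """Enable or disable a single alarm item in the set."""
        for rule in self.rules:
            if rule.name == rule_name:
                rule.enabled = enabled
                return
        raise KeyError(rule_name)

    def active_rules(self) -> List[str]:
        """Return the names of the alarm items that are enabled."""
        return [rule.name for rule in self.rules if rule.enabled]
```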

Table 1 Default alarm rules

Rule Type: Load rule set

  • Alarm Item: Abnormal pod
    Description: Check whether the pod is running normally.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: sum(min_over_time(kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}[10m]) and count_over_time(kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}[10m]) > 18 )by (namespace,pod, phase, cluster_name, cluster) > 0
  • Alarm Item: Frequent pod restarts
    Description: Check whether the pod frequently restarts.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: increase(kube_pod_container_status_restarts_total[5m]) > 3
  • Alarm Item: Unexpected number of Deployment replicas
    Description: Check whether the number of Deployment replicas is the same as the expected value.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: (kube_deployment_spec_replicas != kube_deployment_status_replicas_available ) and ( changes(kube_deployment_status_replicas_updated[5m]) == 0)
  • Alarm Item: Unexpected number of StatefulSet replicas
    Description: Check whether the number of StatefulSet replicas is the same as the expected value.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: (kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas) and (changes(kube_statefulset_status_replicas_updated[5m]) == 0)
  • Alarm Item: Container CPU usage higher than 80%
    Description: Check whether the container CPU usage is higher than 80%.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: 100 * (sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD"}[1m])) by (cluster_name,pod,node,namespace,container, cluster) / sum(kube_pod_container_resource_limits{resource="cpu"}) by (cluster_name,pod,node,namespace,container, cluster)) > 80
  • Alarm Item: Container memory usage higher than 80%
    Description: Check whether the container memory usage is higher than 80%.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: (sum(container_memory_working_set_bytes{image!="", container!="POD"}) BY (cluster_name, node,container, pod , namespace, cluster) / sum(container_spec_memory_limit_bytes > 0) BY (cluster_name, node, container, pod , namespace, cluster) * 100) > 80
  • Alarm Item: Abnormal container
    Description: Check whether the container is running normally.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: sum by (namespace, pod, container, cluster_name, cluster) (kube_pod_container_status_waiting_reason) > 0
  • Alarm Item: Load balancer update failed
    Description: Check whether a load balancer is updated.
    Alarm Type: Event
    Dependency Item: Cloud Native Logging
    PromQL/Event Name: N/A
  • Alarm Item: Pod OOM
    Description: Check whether OOM occurs on the pod.
    Alarm Type: Event
    Dependency Item: CCE Node Problem Detector (1.18.41 or later); Cloud Native Logging (1.3.2 or later)
    PromQL/Event Name: PodOOMKilling

Rule Type: Node resource rule set

  • Alarm Item: High usage of Kubernetes PV
    Description: Check whether the PV usage on a node is too high.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: (kubelet_volume_stats_available_bytes{job="kubelet"} / kubelet_volume_stats_capacity_bytes{job="kubelet"}) < 0.03 and kubelet_volume_stats_used_bytes{job="kubelet"} > 0
  • Alarm Item: Abnormal Kubernetes PVC
    Description: Check whether the PVC is normal.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: kube_persistentvolumeclaim_status_phase{phase=~"Failed|Pending|Lost"} > 0
  • Alarm Item: Abnormal Kubernetes PV
    Description: Check whether the PV is normal.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: kube_persistentvolume_status_phase{phase=~"Failed|Pending"} > 0
  • Alarm Item: Node CPU usage higher than 80%
    Description: Check whether the node CPU usage is higher than 80%.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: 100 - (avg by(node, cluster_name, cluster) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
  • Alarm Item: Available node memory less than 10%
    Description: Check whether the available node memory is less than 10%.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
  • Alarm Item: Available node disk space less than 10%
    Description: Check whether the available node disk space is less than 10%.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: avg((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes) by (device, node, cluster_name, cluster) < 10
  • Alarm Item: Insufficient node disk space
    Description: Check whether the node disk space is sufficient.
    Alarm Type: Event
    Dependency Item: Cloud Native Logging
    PromQL/Event Name: N/A
  • Alarm Item: emptyDir storage pool error
    Description: Check whether the node's EV storage pool is functional.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="EmptyDirVolumeGroupStatusError"} >= 1
  • Alarm Item: Insufficient node memory
    Description: Check whether the overall node memory is sufficient.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="MemoryProblem"} >= 1
  • Alarm Item: PV storage pool error
    Description: Check whether the node's PV storage pool is functional.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="LocalPvVolumeGroupStatusError"} >= 1
  • Alarm Item: Abnormal node mount point
    Description: Check whether the node's mount point is available.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="MountPointProblem"} >= 1
  • Alarm Item: Insufficient node file handles
    Description: Check whether the FD file handles are sufficient.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="FDProblem"} >= 1
  • Alarm Item: Node disk I/O suspension
    Description: Check whether I/O suspension occurs on the node disk.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="DiskHung"} >= 1
  • Alarm Item: Node disk read-only
    Description: Check whether the node disk is read-only.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="DiskReadonly"} >= 1
  • Alarm Item: Abnormal node disk
    Description: Check the usage of the node's system disk and CCE data disks (including Docker and kubelet logical disks).
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="DiskProblem"} >= 1
  • Alarm Item: Slow node disk I/O
    Description: Check whether slow I/O occurs on the node disk.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="DiskSlow"} >= 1
  • Alarm Item: Insufficient node PIDs
    Description: Check whether the PIDs are sufficient.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="PIDProblem"} >= 1
  • Alarm Item: Node conntrack table full
    Description: Check whether the node's conntrack table space is sufficient.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="ConntrackFullProblem"} >= 1

Rule Type: Node status rule set

  • Alarm Item: ResolvConf error
    Description: Check whether the ResolvConf configuration file is available.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="ResolvConfFileProblem"} >= 1
  • Alarm Item: Abnormal node CNI component
    Description: Check whether the CNI component of the node is running properly.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="CNIProblem"} >= 1
  • Alarm Item: Abnormal node CRI component
    Description: Check the running of the key component CRI (Docker or containerd).
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="CRIProblem"} >= 1
  • Alarm Item: Node kube-proxy error
    Description: Check whether kube-proxy is running properly.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="KUBEPROXYProblem"} >= 1
  • Alarm Item: Abnormal node kubelet
    Description: Check whether kubelet is running normally.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="KUBELETProblem"} >= 1
  • Alarm Item: Scheduled event on the node
    Description: Check whether there is a scheduled event on the node.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="ScheduledEvent"} >= 1
  • Alarm Item: Unstable node status
    Description: Check whether the node status alternates between normal and abnormal.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (cluster_name, node, cluster) > 2
  • Alarm Item: Frequent node containerd restarts
    Description: Check whether containerd frequently restarts.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="FrequentContainerdRestart"} >= 1
  • Alarm Item: Node task suspended
    Description: Check whether a task is suspended on the node.
    Alarm Type: Event
    Dependency Item: Cloud Native Logging
    PromQL/Event Name: TaskHung
  • Alarm Item: Incorrect node storage pool configuration
    Description: Check whether the node's EV and PV storage pools are correctly configured.
    Alarm Type: Event
    Dependency Item: Cloud Native Logging
    PromQL/Event Name: InvalidStoragePool
  • Alarm Item: Abnormal node
    Description: Check whether the node is running normally.
    Alarm Type: Event
    Dependency Item: Cloud Native Logging
    PromQL/Event Name: NodeNotReady
  • Alarm Item: Abnormal node process D
    Description: Check whether there is a D state process on the node.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="ProcessD"} >= 1
  • Alarm Item: Abnormal node process Z
    Description: Check whether there is a Z state process on the node.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="ProcessZ"} >= 1
  • Alarm Item: Frequent node CRI restarts
    Description: Check whether CRI frequently restarts.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="FrequentCRIRestart"} >= 1
  • Alarm Item: Frequent node Docker restarts
    Description: Check whether Docker frequently restarts.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="FrequentDockerRestart"} >= 1
  • Alarm Item: Frequent node kubelet restarts
    Description: Check whether kubelet frequently restarts.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="FrequentKubeletRestart"} >= 1
  • Alarm Item: Node NTP service error
    Description: Check whether the node clock synchronization service ntpd or chronyd is running properly.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring; CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="NTPProblem"} >= 1
  • Alarm Item: Processes forcibly stopped due to node OOM
    Description: Check whether an OOM event occurred on the node.
    Alarm Type: Event
    Dependency Item: CCE Node Problem Detector
    PromQL/Event Name: OOMKilling

Rule Type: Node scaling rule set

  • Alarm Item: Node pool sold out
    Description: Check whether the node pool resources are sufficient.
    Alarm Type: Event
    Dependency Item: Cloud Native Logging
    PromQL/Event Name: NodePoolSoldOut
  • Alarm Item: Scale-out timed out
    Description: Check whether adding nodes to the node pool timed out.
    Alarm Type: Event
    Dependency Item: Cloud Native Logging
    PromQL/Event Name: ScaleUpTimedOut
  • Alarm Item: Node pool scale-out failed
    Description: Check whether an error occurred during a node pool scale-out.
    Alarm Type: Event
    Dependency Item: Cloud Native Logging
    PromQL/Event Name: FailedToScaleUpGroup
  • Alarm Item: Node pool scale-in failed
    Description: Check whether an error occurred during a node pool scale-in.
    Alarm Type: Event
    Dependency Item: Cloud Native Logging
    PromQL/Event Name: ScaleDownFailed

Rule Type: Cluster status rule set

  • Alarm Item: Unavailable cluster
    Description: Check whether the cluster is available.
    Alarm Type: Event
    Dependency Item: Cloud Native Logging
    PromQL/Event Name: N/A
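
The threshold arithmetic behind several metric rules in Table 1 can be worked through with plain numbers. The following is a minimal sketch that mirrors only the final comparison of three PromQL expressions. It does not query Prometheus, and the function and parameter names are hypothetical.

```python
def container_cpu_alarm(cpu_cores_used: float, cpu_limit_cores: float) -> bool:
    """Container CPU usage higher than 80%: 100 * usage / limit > 80."""
    return 100 * cpu_cores_used / cpu_limit_cores > 80


def node_memory_alarm(mem_available_bytes: int, mem_total_bytes: int) -> bool:
    """Available node memory less than 10%: available / total * 100 < 10."""
    return mem_available_bytes / mem_total_bytes * 100 < 10


def pv_usage_alarm(available_bytes: int, capacity_bytes: int, used_bytes: int) -> bool:
    """High usage of Kubernetes PV: available / capacity < 0.03 and used > 0."""
    return available_bytes / capacity_bytes < 0.03 and used_bytes > 0


GIB = 2 ** 30

# A container using 0.9 cores against a 1-core limit is at 90%, above the threshold.
print(container_cpu_alarm(0.9, 1.0))
# A node with 1 GiB available out of 16 GiB has 6.25% available, below 10%.
print(node_memory_alarm(1 * GIB, 16 * GIB))
# A PV with 2 GiB left out of 100 GiB has a 0.02 available ratio, below 0.03.
print(pv_usage_alarm(2 * GIB, 100 * GIB, 98 * GIB))
```

Note that the PV rule compares the available ratio with 0.03, so it reports an alarm only when less than 3% of the volume capacity remains.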

Configuring Alarm Notification Recipients

A contact group, based on Simple Message Notification, enables message publishers and subscribers to contact each other. A contact group contains one or more endpoints. You can configure contact groups to manage the endpoints that have subscribed to alarm messages. After creating a contact group, associate an alarm rule set with the group. When an alarm is triggered, the subscription endpoints in the contact group receive the alarm messages.

  1. Log in to the CCE console.
  2. On the cluster list page, click the cluster name to access the cluster console.
  3. In the navigation pane on the left, choose Alarm Center. Then, click the Contact Group tab.
  4. Click Create Contact Group and configure parameters.

    • Contact Group Name: Enter the name of the contact group, which cannot be changed after the contact group is created. The name can contain 1 to 255 characters and must start with a letter or digit. Only letters, digits, hyphens (-), and underscores (_) are allowed.
    • Alarm message display name: Enter the title of the message received by the specified subscription endpoint. For example, if you set Terminal Type to Email and specify a display name, the name you specified will be displayed as the alarm message sender. If you do not specify Alarm message display name, the sender will be username@example.com. The display name of an alarm message can be changed after the contact group is created.
    • Add Subscription Terminal: Add one or more endpoints to receive alarm messages. The endpoint type can be SMS or Email. If you select SMS, enter a valid mobile number. If you select Email, enter a valid email address.

  5. Click OK.

    You will be redirected to the contact group list. The subscription endpoint is in the Unconfirmed state. Send a subscription request to the endpoint to verify the validity of the endpoint.

  6. Click Request Confirmation in the Operation column to send a subscription request to the endpoint. If the endpoint receives the request, confirm the request as prompted. After the confirmation is complete, the subscription endpoint changes to Confirmed.
  7. Click to enable the contact group so that the contact group is bound to the alarm rule set.

    An alarm rule set can be bound to a maximum of five contact groups.
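
The naming rule for Contact Group Name in step 4 can be expressed as a regular expression. The following is a minimal sketch that assumes "letters" means ASCII letters. The function name is hypothetical, and the console performs its own validation.

```python
import re

# 1 to 255 characters, starting with a letter or digit, containing only
# letters, digits, hyphens (-), and underscores (_).
# Assumption: "letters" is interpreted as ASCII letters.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{0,254}")


def is_valid_contact_group_name(name: str) -> bool:
    """Return True if the name satisfies the rule described in step 4."""
    return _NAME_PATTERN.fullmatch(name) is not None
```

Because the name cannot be changed after the contact group is created, checking it before submission can save having to delete and recreate the group.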

Viewing Alarms

You can view the latest historical alarms on the Alarms tab.

  1. Log in to the CCE console.
  2. On the cluster list page, click the cluster name to access the cluster console.
  3. In the navigation pane on the left, choose Alarm Center. Then, click the Alarms tab.

    By default, all alarms to be cleared are displayed in the list. You can query alarms by alarm keyword, alarm severity, or alarm time. In addition, you can view the distribution of alarms that meet the specified criteria in different periods.

    If an alarm to be cleared is not triggered again within 10 minutes, it is considered cleared by default and converted to a historical alarm. If you have confirmed that an alarm has already been handled, you can also click Clear in the Operation column. The cleared alarm then appears in the historical alarm list.

    Figure 1 Querying alarms
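
The clearing behavior described in this section can be sketched as a small state function. This is a minimal illustration, not the Alarm Center implementation. The function name and state labels are hypothetical, and treating exactly 10 minutes as cleared is an assumption.

```python
from datetime import datetime, timedelta

# Window after which an alarm that is no longer triggered is considered cleared.
AUTO_CLEAR_AFTER = timedelta(minutes=10)


def alarm_state(last_triggered: datetime, now: datetime,
                manually_cleared: bool = False) -> str:
    """Classify an alarm as 'to be cleared' or 'historical'.

    An alarm becomes historical when it is cleared manually or when it has
    not been triggered for 10 minutes (boundary handling is an assumption).
    """
    if manually_cleared or now - last_triggered >= AUTO_CLEAR_AFTER:
        return "historical"
    return "to be cleared"
```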