Configuring Alarms in Alarm Center

Updated on 2025-02-18 GMT+08:00

Alarm Center works with Application Operations Management (AOM) to detect cluster faults promptly and generate alarms, which helps keep services stable. It provides built-in alarm rules, so you do not need to configure alarm rules manually on AOM. These rules are based on the cluster O&M experience of the Huawei Cloud container team. They cover container service exceptions, key metrics of basic cluster resources, and metrics of applications in a cluster, which is sufficient for routine O&M.

Constraints

  • The cluster version must be v1.17 or later.
  • Only Huawei Cloud accounts, HUAWEI IDs, and IAM users with CCE administrator or FullAccess permissions can perform all operations in Alarm Center. IAM users with the CCE ReadOnlyAccess permission can only view resources.
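
The version constraint can be checked in a script before you enable Alarm Center. The following is a minimal sketch, assuming the cluster version is available as a string such as v1.17 or v1.28.3. The function names are hypothetical and are not part of any CCE SDK.

```python
import re
from typing import Tuple

MIN_VERSION = (1, 17)  # minimum cluster version stated in the constraints


def parse_cluster_version(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a string such as 'v1.17' or 'v1.28.3'."""
    match = re.match(r"^v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized cluster version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_alarm_center(version: str) -> bool:
    """Return True if the cluster version is v1.17 or later."""
    # Tuples compare numerically, so v1.9 correctly sorts before v1.17.
    return parse_cluster_version(version) >= MIN_VERSION
```

Comparing integer tuples rather than strings avoids treating v1.9 as newer than v1.17.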

Enabling Alarm Center

Alarm Center can be enabled for CCE standard clusters and CCE Turbo clusters.

  1. Click the cluster name to access the cluster console. In the navigation pane, choose Alarm Center.
  2. On the Alarm Rules tab, click Enable Alarm Center. In the window that slides out from the right, select one or more contact groups to manage subscription endpoints and receive alarm messages by group. If no contact group is available, create one by referring to Binding Contact Groups.
  3. Click OK.

    NOTE:

    Metric alarm rules can be created in Alarm Center only after the Cloud Native Cluster Monitoring add-on is installed and connected to an AOM Prometheus instance. For details about how to enable Monitoring Center, see Enabling Monitoring Center.

    The alarm rules in Table 1 that use the problem_gauge metric depend on the CCE Node Problem Detector add-on. To use these rules, ensure that the add-on is installed and running normally.

    Event alarms in Table 1 can be reported only when Kubernetes event collection is enabled in Logging. For details, see Collecting Kubernetes Events.

Configuring Alarm Rules

After Alarm Center is enabled for CCE standard clusters and CCE Turbo clusters, you can configure and manage alarm rules.

  1. Log in to the CCE console.
  2. On the cluster list page, click the cluster name to access the cluster console.
  3. In the navigation pane, choose Alarm Center. Then, click the Alarm Rules tab and configure and manage alarm rules.

    By default, Alarm Center generates alarm rules for containers. These rules cover both event alarms and metric alarms for exceptions. Alarm rules are grouped into rule sets. Each rule set consists of multiple alarm rules, and each alarm rule corresponds to the check items for a single exception. You can associate a rule set with multiple contact groups and enable or disable individual alarm items. Table 1 lists the default alarm rules.
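
The relationship between rule sets, alarm rules, and contact groups can be pictured as a small data model. The sketch below is illustrative only: the class names, fields, and the contact group name are hypothetical and do not reflect how CCE stores these objects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlarmRule:
    """One alarm item: the check for a single exception."""
    name: str
    alarm_type: str  # "Metric" or "Event"
    enabled: bool = True


@dataclass
class AlarmRuleSet:
    """A named set of alarm rules that shares contact groups."""
    name: str
    rules: List[AlarmRule] = field(default_factory=list)
    contact_groups: List[str] = field(default_factory=list)

    def set_enabled(self, rule_name: str, enabled: bool) -> None:
        """Enable or disable a single alarm item by name."""
        for rule in self.rules:
            if rule.name == rule_name:
                rule.enabled = enabled
                return
        raise KeyError(rule_name)

    def enabled_rules(self) -> List[str]:
        """Return the names of the alarm items that are currently enabled."""
        return [rule.name for rule in self.rules if rule.enabled]


load_rules = AlarmRuleSet(
    name="Load rule set",
    rules=[AlarmRule("Abnormal pod", "Metric"), AlarmRule("Pod OOM", "Event")],
    contact_groups=["ops-oncall"],  # hypothetical contact group name
)
load_rules.set_enabled("Pod OOM", False)
```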

Table 1 Default alarm rules

Rule Type: Load rule set

  • Alarm Item: Abnormal pod
    Description: Check whether the pod is running normally.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: sum(min_over_time(kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}[10m]) and count_over_time(kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}[10m]) > 18 )by (namespace,pod, phase, cluster_name, cluster) > 0
  • Alarm Item: Frequent pod restarts
    Description: Check whether the pod frequently restarts.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: increase(kube_pod_container_status_restarts_total[5m]) > 3
  • Alarm Item: Unexpected number of Deployment replicas
    Description: Check whether the number of Deployment replicas is the same as the expected value.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: (kube_deployment_spec_replicas != kube_deployment_status_replicas_available ) and ( changes(kube_deployment_status_replicas_updated[5m]) == 0)
  • Alarm Item: Unexpected number of StatefulSet replicas
    Description: Check whether the number of StatefulSet replicas is the same as the expected value.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: (kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas) and (changes(kube_statefulset_status_replicas_updated[5m]) == 0)
  • Alarm Item: Container CPU usage higher than 80%
    Description: Check whether the container CPU usage is higher than 80%.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: 100 * (sum(rate(container_cpu_usage_seconds_total{image!="", container!="POD"}[1m])) by (cluster_name,pod,node,namespace,container, cluster) / sum(kube_pod_container_resource_limits{resource="cpu"}) by (cluster_name,pod,node,namespace,container, cluster)) > 80
  • Alarm Item: Container memory usage higher than 80%
    Description: Check whether the container memory usage is higher than 80%.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: (sum(container_memory_working_set_bytes{image!="", container!="POD"}) BY (cluster_name, node,container, pod , namespace, cluster) / sum(container_spec_memory_limit_bytes > 0) BY (cluster_name, node, container, pod , namespace, cluster) * 100) > 80
  • Alarm Item: Abnormal container
    Description: Check whether the container is running normally.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: sum by (namespace, pod, container, cluster_name, cluster) (kube_pod_container_status_waiting_reason) > 0
  • Alarm Item: Load balancer update failed
    Description: Check whether a load balancer is updated.
    Alarm Type: Event
    Dependency Item: Cloud Native Log Collection
    PromQL/Event Name: N/A
  • Alarm Item: Pod OOM
    Description: Check whether OOM occurs on the pod.
    Alarm Type: Event
    Dependency Item: CCE Node Problem Detector (1.18.41 or later), Cloud Native Log Collection (1.3.2 or later)
    PromQL/Event Name: PodOOMKilling

Rule Type: Node resource rule set

  • Alarm Item: High usage of Kubernetes PV
    Description: Check whether the PV usage on a node is too high.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: (kubelet_volume_stats_available_bytes{job="kubelet"} / kubelet_volume_stats_capacity_bytes{job="kubelet"}) < 0.03 and kubelet_volume_stats_used_bytes{job="kubelet"} > 0
  • Alarm Item: Abnormal Kubernetes PVC
    Description: Check whether the PVC is normal.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: kube_persistentvolumeclaim_status_phase{phase=~"Failed|Pending|Lost"} > 0
  • Alarm Item: Abnormal Kubernetes PV
    Description: Check whether the PV is normal.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: kube_persistentvolume_status_phase{phase=~"Failed|Pending"} > 0
  • Alarm Item: Node CPU usage higher than 80%
    Description: Check whether the node CPU usage is higher than 80%.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: 100 - (avg by(node, cluster_name, cluster) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
  • Alarm Item: Available node memory less than 10%
    Description: Check whether the available node memory is less than 10%.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
  • Alarm Item: Available node disk space less than 10%
    Description: Check whether the available node disk space is less than 10%.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring
    PromQL/Event Name: avg((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes) by (device, node, cluster_name, cluster) < 10
  • Alarm Item: Insufficient node disk space
    Description: Check whether the node disk space is sufficient.
    Alarm Type: Event
    Dependency Item: Cloud Native Log Collection
    PromQL/Event Name: N/A
  • Alarm Item: emptyDir storage pool error
    Description: Check whether the node's EV storage pool is functional.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="EmptyDirVolumeGroupStatusError"} >= 1
  • Alarm Item: Insufficient node memory
    Description: Check whether the overall node memory is sufficient.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="MemoryProblem"} >= 1
  • Alarm Item: PV storage pool error
    Description: Check whether the node's PV storage pool is functional.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="LocalPvVolumeGroupStatusError"} >= 1
  • Alarm Item: Abnormal node mount point
    Description: Check whether the node's mount point is available.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="MountPointProblem"} >= 1
  • Alarm Item: Insufficient node file handles
    Description: Check whether the FD file handles are sufficient.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="FDProblem"} >= 1
  • Alarm Item: Node disk I/O suspension
    Description: Check whether I/O suspension occurs on the node disk.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="DiskHung"} >= 1
  • Alarm Item: Node disk read-only
    Description: Check whether the node disk is read-only.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="DiskReadonly"} >= 1
  • Alarm Item: Abnormal node disk
    Description: Check the usage of the node's system disk and CCE data disks (including Docker and kubelet logical disks).
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="DiskProblem"} >= 1
  • Alarm Item: Slow node disk I/O
    Description: Check whether slow I/O occurs on the node disk.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="DiskSlow"} >= 1
  • Alarm Item: Insufficient node PIDs
    Description: Check whether the PIDs are sufficient.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="PIDProblem"} >= 1
  • Alarm Item: Node conntrack table full
    Description: Check whether the node's conntrack table space is sufficient.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="ConntrackFullProblem"} >= 1

Rule Type: Node status rule set

  • Alarm Item: ResolvConf error
    Description: Check whether the ResolvConf configuration file is available.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="ResolvConfFileProblem"} >= 1
  • Alarm Item: Abnormal node CNI component
    Description: Check whether the CNI component of the node is running properly.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="CNIProblem"} >= 1
  • Alarm Item: Abnormal node CRI component
    Description: Check the running of the key component CRI (Docker or containerd).
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="CRIProblem"} >= 1
  • Alarm Item: Node kube-proxy error
    Description: Check whether kube-proxy is running properly.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="KUBEPROXYProblem"} >= 1
  • Alarm Item: Abnormal node kubelet
    Description: Check whether kubelet is running normally.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="KUBELETProblem"} >= 1
  • Alarm Item: Scheduled event on the node
    Description: Check whether there is a scheduled event on the node.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="ScheduledEvent"} >= 1
  • Alarm Item: Unstable node status
    Description: Check whether the node status alternates between normal and abnormal.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (cluster_name, node, cluster) > 2
  • Alarm Item: Frequent node containerd restarts
    Description: Check whether containerd frequently restarts.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="FrequentContainerdRestart"} >= 1
  • Alarm Item: Node task suspended
    Description: Check whether a task is suspended on the node.
    Alarm Type: Event
    Dependency Item: Cloud Native Log Collection
    PromQL/Event Name: TaskHung
  • Alarm Item: Incorrect node storage pool configuration
    Description: Check whether the node's EV and PV storage pools are correctly configured.
    Alarm Type: Event
    Dependency Item: Cloud Native Log Collection
    PromQL/Event Name: InvalidStoragePool
  • Alarm Item: Abnormal node
    Description: Check whether the node is running normally.
    Alarm Type: Event
    Dependency Item: Cloud Native Log Collection
    PromQL/Event Name: NodeNotReady
  • Alarm Item: Abnormal node process D
    Description: Check whether there is a D state process on the node.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="ProcessD"} >= 1
  • Alarm Item: Abnormal node process Z
    Description: Check whether there is a Z state process on the node.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="ProcessZ"} >= 1
  • Alarm Item: Frequent node CRI restarts
    Description: Check whether CRI frequently restarts.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="FrequentCRIRestart"} >= 1
  • Alarm Item: Frequent node Docker restarts
    Description: Check whether Docker frequently restarts.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="FrequentDockerRestart"} >= 1
  • Alarm Item: Frequent node kubelet restarts
    Description: Check whether kubelet frequently restarts.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="FrequentKubeletRestart"} >= 1
  • Alarm Item: Node NTP service error
    Description: Check whether the node clock synchronization service ntpd or chronyd is running properly.
    Alarm Type: Metric
    Dependency Item: Cloud Native Cluster Monitoring, CCE Node Problem Detector
    PromQL/Event Name: problem_gauge{type="NTPProblem"} >= 1
  • Alarm Item: Processes forcibly stopped due to node OOM
    Description: Check whether an OOM event occurred on the node.
    Alarm Type: Event
    Dependency Item: CCE Node Problem Detector
    PromQL/Event Name: OOMKilling

Rule Type: Node scaling rule set

  • Alarm Item: Node pool resources sold out
    Description: Check whether the node pool resources are sufficient.
    Alarm Type: Event
    Dependency Item: Cloud Native Log Collection
    PromQL/Event Name: NodePoolSoldOut
  • Alarm Item: Scale-out timed out
    Description: Check whether adding nodes to the node pool timed out.
    Alarm Type: Event
    Dependency Item: Cloud Native Log Collection
    PromQL/Event Name: ScaleUpTimedOut
  • Alarm Item: Node pool scale-out failed
    Description: Check whether an error occurred during a node pool scale-out.
    Alarm Type: Event
    Dependency Item: Cloud Native Log Collection
    PromQL/Event Name: FailedToScaleUpGroup
  • Alarm Item: Node pool scale-in failed
    Description: Check whether an error occurred during a node pool scale-in.
    Alarm Type: Event
    Dependency Item: Cloud Native Log Collection
    PromQL/Event Name: ScaleDownFailed

Rule Type: Cluster status rule set

  • Alarm Item: Unavailable cluster
    Description: Check whether the cluster is available.
    Alarm Type: Event
    Dependency Item: Cloud Native Log Collection
    PromQL/Event Name: N/A
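
The PromQL expressions in Table 1 can be run against a Prometheus-compatible HTTP API to see which series currently match a rule. The following is a minimal sketch that uses only the Python standard library. The base URL is a hypothetical placeholder: the actual endpoint and authentication of the AOM Prometheus instance are not described on this page. The /api/v1/query path and the response layout follow the standard Prometheus HTTP API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Expression copied from the "Frequent pod restarts" rule in Table 1.
FREQUENT_POD_RESTARTS = "increase(kube_pod_container_status_restarts_total[5m]) > 3"


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a Prometheus-compatible HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def matching_series(response_body: str) -> list:
    """Return the label sets of the series returned by an instant query."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return [sample["metric"] for sample in payload["data"]["result"]]


def run_query(base_url: str, promql: str) -> list:
    """Execute the query over HTTP (needs network access to the endpoint)."""
    with urlopen(build_instant_query_url(base_url, promql), timeout=10) as resp:
        return matching_series(resp.read().decode("utf-8"))
```

Because every rule in Table 1 ends with a comparison, an empty result means that no series currently satisfies the alarm condition.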

Binding Contact Groups

NOTE:

An alarm rule can be bound to a maximum of five contact groups.

A contact group, which is based on Simple Message Notification (SMN), connects message publishers with subscribers. A contact group contains one or more subscription endpoints. You can bind a contact group to an alarm rule to manage the endpoints that subscribe to its alarm messages.

  1. Log in to the CCE console.
  2. On the cluster list page, click the cluster name to access the cluster console.
  3. In the navigation pane, choose Alarm Center. Then, click the Default Contact Groups tab.
  4. Click Bind Contact Group. You can select a contact group created in SMN or create a contact group. The parameters for creating a contact group are described as follows:

    • Contact Group Name: Enter the name of the contact group, which cannot be changed after the contact group is created. The name can contain 1 to 255 characters and must start with a letter or digit. Only letters, digits, hyphens (-), and underscores (_) are allowed.
    • Alarm Message Display Name: Enter the title of the message received by the specified subscription endpoint. For example, if you set Terminal Type to Email and specify a display name, the name you specified will be displayed as the alarm message sender. If no alarm message display name is specified, the sender will be username@example.com. The alarm message display name can be changed after a contact group is created.
    • Add Subscription Terminal: Add one or more endpoints to receive alarm messages. The endpoint type can be SMS or Email. If you select SMS, enter a valid mobile number. If you select Email, enter a valid email address.

  5. Click OK.

    You will be redirected to the contact group list. The subscription endpoint is in the Unconfirmed state. Send a subscription request to the endpoint to verify the validity of the endpoint.

  6. Click Request Confirmation in the Operation column to send a subscription request to the endpoint. After the endpoint receives and confirms the request, the subscription endpoint status changes to Confirmed.
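
The naming rule for Contact Group Name in step 4 can be expressed as a regular expression, which is handy if contact group names are generated by a script. This is a minimal sketch, assuming that "letters" means ASCII letters. The function name is hypothetical.

```python
import re

# 1 to 255 characters, starting with a letter or digit, followed by
# letters, digits, hyphens (-), or underscores (_).
_NAME_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{0,254}")


def is_valid_contact_group_name(name: str) -> bool:
    """Return True if the name satisfies the contact group naming rule."""
    return _NAME_PATTERN.fullmatch(name) is not None
```

For example, ops-team_1 passes the check, while -ops (starts with a hyphen) and ops team (contains a space) do not.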

Viewing Alarms

You can view the latest historical alarms on the Alarms tab.

  1. Log in to the CCE console.
  2. On the cluster list page, click the cluster name to access the cluster console.
  3. In the navigation pane, choose Alarm Center. Then, click the Alarms tab.

    By default, the list displays all alarms that have not been cleared. You can query alarms by alarm keyword, alarm severity, or alarm time. In addition, you can view the distribution of alarms that meet the specified criteria in different periods.

    If an uncleared alarm is not triggered again within 10 minutes, it is considered cleared by default and becomes a historical alarm. If you have already handled an alarm, you can also click Clear in the Operation column. The cleared alarm then appears in the historical alarm list.

    Figure 1 Querying alarms
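
The clearing behavior described above can be sketched as a small state function. This is an illustration only, not the Alarm Center implementation. It assumes that an alarm counts as cleared once exactly 10 minutes have passed since it was last triggered; the behavior at that exact boundary is not specified on this page.

```python
from datetime import datetime, timedelta

AUTO_CLEAR_AFTER = timedelta(minutes=10)  # window stated in the text above


def alarm_state(last_triggered: datetime, now: datetime,
                manually_cleared: bool = False) -> str:
    """Return 'cleared' (historical alarm) or 'active' (not yet cleared)."""
    # Assumption: the boundary (exactly 10 minutes) counts as cleared.
    if manually_cleared or now - last_triggered >= AUTO_CLEAR_AFTER:
        return "cleared"
    return "active"
```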
