Updated on 2026-06-05 GMT+08:00

Using AI CLI to Configure, Query, and Manage CCE AOM Alarms

Scenarios

After a CCE cluster is brought online, the O&M team usually needs to initialize alarm rules as soon as possible and then keep track of active alarms, historical alarms, recovery records, and notification rules during routine troubleshooting. By connecting AI CLI Agent to the alarm-correlation-engine Skill, you can use natural language to configure CCE AOM alarm rules, query alarms, aggregate and analyze alarms, and trace the root causes of major alarms.

Rather than reviewing a single flat alarm list, this practice recommends that you analyze cluster alarms by time window. AI CLI automatically merges alarms of the same type, marks alarm severities, and distinguishes between normal alarms, burst alarms, and persistent alarms. This helps you quickly determine which alarms are the most urgent, which resources are affected, and what to check next.

AI CLI can be used to:

  • Create recommended rules in the AOM alarm center for a specified CCE cluster in one click.
  • Automatically create a default cluster-level notification rule if no notification rule is specified.
  • Preview the number of rules, rule types, notification methods, and missing parameters before any rule is created, and create rules only after customer confirmation.
  • Automatically query the alarm rule list to confirm that 50 rules have been created.
  • Query active alarms, historical alarms, and clearance records by cluster and time window.
  • Deduplicate, group, and mark the severity of alarms, identify burst, persistent, and normal alarms, and analyze root cause clues.

This practice uses the test-ai-diagnoses cluster in CN North-Beijing4 as an example to demonstrate how to configure, query, aggregate, and analyze alarms using natural language. In this example, CCE AOM alarm rules are configured in batches for the cluster, and the alarms raised in the last four hours are analyzed.

Constraints

  • You do not need to specify an existing notification rule. If no notification rule is specified, AI CLI automatically creates a default cluster-level notification rule.
  • When a notification rule is automatically created, an available SMN topic name or topic URN must be provided.
  • Metric alarms depend on the AOM CCE Prometheus instance associated with the target cluster.
  • Event alarms depend on the CCE event reporting link. You are advised to ensure that the log collection and node fault detection add-ons are running properly.
  • Alarm rules can be created only after being confirmed by the target customer.
  • Severity marking is only an auxiliary tool for troubleshooting and cannot replace the production change approval and manual confirmation mechanisms.
  • You are advised to explicitly specify a time window for alarm analysis, such as the last 30 minutes, last 4 hours, or a specific start and end time.
  • Do not write AKs/SKs, tokens, certificates, or real project IDs into documents, code, or dialog output.
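
The time windows mentioned in the constraints ("last 30 minutes", "last 4 hours", or a specific start and end time) can be made unambiguous by resolving them to concrete start and end timestamps before you phrase a request. The following is a minimal sketch under stated assumptions: the function names are hypothetical, GMT+08:00 is assumed because the example cluster is in CN North-Beijing4, and the epoch-millisecond conversion is shown only as a common interchange format, not as the format the Skill requires.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# GMT+08:00 is an assumption that matches the CN North-Beijing4 example.
LOCAL_TZ = timezone(timedelta(hours=8))


def relative_window(now: datetime, hours: float = 0, minutes: float = 0) -> Tuple[datetime, datetime]:
    """Return (start, end) for a 'last N hours/minutes' window ending at now."""
    span = timedelta(hours=hours, minutes=minutes)
    if span <= timedelta(0):
        raise ValueError("the window must have a positive length")
    return now - span, now


def fixed_window(day: datetime, start_hm: str, end_hm: str) -> Tuple[datetime, datetime]:
    """Return (start, end) for an explicit 'HH:MM to HH:MM' window on the given day."""
    def at(hm: str) -> datetime:
        hour, minute = (int(part) for part in hm.split(":"))
        return day.replace(hour=hour, minute=minute, second=0, microsecond=0)

    start, end = at(start_hm), at(end_hm)
    if end <= start:
        raise ValueError("end time must be later than start time")
    return start, end


def to_epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(moment.timestamp() * 1000)


now = datetime(2026, 6, 5, 12, 0, tzinfo=LOCAL_TZ)
print(relative_window(now, hours=4))
print(fixed_window(now, "09:00", "12:00"))
```

Resolving the window once and reusing the same start and end values in follow-up requests keeps the alarm summary and the later root cause analysis aligned to the same period.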

Prerequisites

  • You have registered the alarm-correlation-engine Skill in AI CLI.
  • You have configured Huawei Cloud access credentials. You are advised to inject them through environment variables or security credential files to avoid exposing AKs/SKs in dialogs or documents.
  • You have associated an AOM CCE Prometheus instance with the target CCE cluster.
  • You have prepared an available SMN topic, for example, a topic named test. If an AOM notification rule already exists, you can reuse it.
  • The execution account has the permissions to query and create AOM alarm rules, and query and create notification rules.
  • The creation of alarm rules has been confirmed by the target customer.
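
The recommendation to inject credentials through environment variables can be illustrated with a short sketch. It is not part of AI CLI. The variable names HUAWEICLOUD_SDK_AK and HUAWEICLOUD_SDK_SK are an assumption based on a common Huawei Cloud SDK convention, and the helper names are hypothetical. Adjust them to whatever your tooling actually reads.

```python
import os
from typing import Mapping, Optional

# Assumed variable names; change them to match your own tooling.
AK_VAR = "HUAWEICLOUD_SDK_AK"
SK_VAR = "HUAWEICLOUD_SDK_SK"


def load_credentials(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the AK/SK pair from the environment and fail fast if either is missing."""
    env = os.environ if env is None else env
    missing = [name for name in (AK_VAR, SK_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"ak": env[AK_VAR], "sk": env[SK_VAR]}


def mask(secret: str, visible: int = 4) -> str:
    """Return a log-safe form that shows only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Placeholder values only; never hard-code real credentials.
demo_env = {AK_VAR: "AKEXAMPLE1234", SK_VAR: "placeholder-secret"}
print(mask(load_credentials(demo_env)["ak"]))
```

Masking before logging means that a dialog transcript or document excerpt never contains a complete key.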

Involved Skills

| Skill | Function |
| --- | --- |
| alarm-correlation-engine | Queries AOM alarms and active alarms, merges and analyzes alarms, creates alarm rules, and queries notification rules. |
| huawei-cloud-cce-cluster-management | Queries the CCE cluster list and confirms the cluster name, ID, and status. |
| observability-context-builder | Continues to aggregate metrics, logs, and Kubernetes event contexts after alarm analysis. |

Procedure

Step 1: Create CCE Cluster Alarm Rules in One Click

You can use AI CLI to intelligently configure default AOM alarm rules for the target cluster based on the CCE cluster alarm best practices. You only need to specify the region, cluster name, and notification topic. AI CLI will automatically understand the configuration intent, supplement cluster information, generate a creation plan, and complete rule creation and result verification after your confirmation.

Enter the following in the OpenClaw dialog box:

Help me create CCE AOM alarm rules in batches for the test-ai-diagnoses cluster in CN North-Beijing4. Use the test topic for subscription.

AI CLI generates an alarm rule configuration plan to help confirm the creation scope and notification method.

| Intelligent Processing Item | Description |
| --- | --- |
| Identifying a cluster | AI CLI queries a CCE cluster by region and cluster name, and checks the cluster ID, status, and configurability. |
| Supplementing notification configuration | AI CLI preferentially reuses existing AOM notification rules. If no notification rule is specified, it creates a default cluster-level notification rule based on the SMN topic. |
| Generating a rule plan | AI CLI generates 50 default alarm rules based on the recommended practices for CCE clusters, distinguishing between metric alarms and event alarms. |
| Previewing impact scope | AI CLI displays the number of rules, rule types, notification rules, and missing parameters. No rule is directly created. |
| Waiting for customer confirmation | AI CLI creates rules only after the customer confirms them. |
| Automatically creating rules | AI CLI creates default alarm rules in batches. If a rule with the same name already exists, it automatically skips the creation. |
| Automatically performing result acceptance | After the rule creation is complete, AI CLI queries the alarm rule list immediately to confirm the number of rules, rule types, and number of failed rules. |

The 50 rules created by default include:

| Type | Quantity | Example |
| --- | --- | --- |
| Prometheus metric alarms | 38 | Abnormal pod status, frequent pod restarts, node CPU usage, node disk availability, and abnormal node kubelet |
| CCE event alarms | 12 | Pod memory OOM, insufficient node disk space, abnormal node status, node scale-out timeout, and unavailable cluster |
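
The preview-then-create behavior described above (count the rules by type, skip rules whose names already exist, and report missing parameters without creating anything) can be sketched as follows. This is an illustration only, not the Skill's implementation. RuleSpec, build_plan, and the rule names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class RuleSpec:
    name: str
    kind: str  # "metric" or "event"


def build_plan(candidates: Iterable[RuleSpec], existing_names: Set[str],
               smn_topic: Optional[str]) -> Dict[str, object]:
    """Build a preview only; nothing is created here."""
    candidates = list(candidates)
    to_create = [rule for rule in candidates if rule.name not in existing_names]
    skipped = [rule.name for rule in candidates if rule.name in existing_names]
    return {
        "to_create": [rule.name for rule in to_create],
        "skipped_existing": skipped,
        "count_by_kind": dict(Counter(rule.kind for rule in to_create)),
        "missing_parameters": [] if smn_topic else ["smn_topic"],
    }


# Hypothetical rule names, used only for illustration.
candidates = [
    RuleSpec("pod-restart-frequent", "metric"),
    RuleSpec("node-cpu-usage-high", "metric"),
    RuleSpec("pod-oom", "event"),
]
plan = build_plan(candidates, existing_names={"node-cpu-usage-high"}, smn_topic="test")
print(plan)
```

Keeping plan generation free of side effects is what makes a confirmation step meaningful: the creation call runs only after the preview has been reviewed.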

After AI CLI displays the preview, you need to confirm that the number of rules, notification method, and target cluster are correct, and then reply:

Start the creation.

After the execution is complete, AI CLI returns the creation result and automatically queries and confirms that the rules have been created. Focus on the following acceptance items:

| Check Item | Expected Result |
| --- | --- |
| Total number of rules for test-ai-diagnoses | 50 |
| Metric alarms | 38 |
| Event alarms | 12 |
| Creation failed | 0 |
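
If you export the rule list and want to repeat this acceptance check yourself, the comparison is a simple count by type. The sketch below assumes each exported rule is a dictionary with a "type" field whose value is "metric" or "event". That field name and its values are hypothetical, so map them from the real query output.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Expected values taken from the acceptance items in this step.
EXPECTED = {"total": 50, "metric": 38, "event": 12, "failed": 0}


def check_acceptance(rules: List[dict], failed: int) -> Dict[str, Tuple[int, int]]:
    """Return {item: (expected, actual)} for every item that does not match."""
    kinds = Counter(rule["type"] for rule in rules)
    actual = {
        "total": len(rules),
        "metric": kinds["metric"],
        "event": kinds["event"],
        "failed": failed,
    }
    return {key: (EXPECTED[key], actual[key]) for key in EXPECTED if EXPECTED[key] != actual[key]}


# Synthetic data for illustration only.
rules = [{"type": "metric"}] * 38 + [{"type": "event"}] * 12
print(check_acceptance(rules, failed=0))      # an empty dict means every check passed
print(check_acceptance(rules[1:], failed=0))  # reports the total and metric mismatches
```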

If some rules already exist, AI CLI skips the rules with the same names to avoid duplicate creation. You can also enter the following content at any time to check:

Query the alarm rules of the test-ai-diagnoses cluster in CN North-Beijing4.

Step 2: Aggregate and Analyze Alarms by Time Window and Trace Root Causes

To check whether a cluster has risks, you are advised to specify a time window so that AI CLI can query active alarms, historical alarms, and clearance records at the same time, and aggregate and analyze alarms by severity.

Analyze the AOM alarms of the test-ai-diagnoses cluster in CN North-Beijing4 in the last 4 hours, aggregate the alarms by severity, mark normal and burst alarms, and tell me the three issues that need to be handled first.

You can also specify an exact time range to analyze the change window, fault window, or shift handover window.

Analyze the AOM alarms of the test-ai-diagnoses cluster in CN North-Beijing4 from 09:00 to 12:00 today, aggregate the alarms by severity, and output the first occurrence time and handling priority of burst alarms.

AI CLI returns the alarm situation summary of the current time window. Pay attention to the following information:

| Output Item | Example |
| --- | --- |
| Total alarms | 42 alarms detected in the last 4 hours, 7 of which are not cleared. |
| Alarms by severity | Critical: 1, High: 3, Medium: 9, Low: 29 |
| Alarms by type | 3 groups of burst alarms, 2 groups of uncleared alarms, and 5 groups of frequent alarms |
| First occurrence time | The earliest burst alarm occurred at 09:17, and the burst occurred from 09:20 to 09:35. |
| Major affected objects | kube-system, default/nginx-demo, and node 192.168.0.12 |
| Handling priority suggestion | The current risks are mainly related to node resource pressure and service pod restart. You are advised to first analyze the recent changes of node disks and pods. |

Then, AI CLI aggregates multiple original alarms into alarm groups and sorts them by severity. You can handle the alarms in the following sequence:

| Priority | Severity | Alarm Group | Alarm Characteristic | First Occurrence Time | Judgment | Handling Suggestion |
| --- | --- | --- | --- | --- | --- | --- |
| P0 | Critical | Unavailable cluster or abnormal core component | Burst alarms, not cleared | 09:17 | The cluster control plane or service scheduling capability may be affected. | Analyze the root cause immediately and check CCE events, core component pods, and AOM metrics. |
| P1 | High | Node disk space insufficient, associated with multiple pod eviction events | Persistent alarms, not cleared | 09:24 | Node resource pressure may cause unstable service replicas. | Check the pods, eviction events, and disk metrics on the associated node. |
| P2 | Medium | Frequent pod restarts, mainly in default/nginx-demo | Burst alarms, some cleared | 10:06 | Possibly related to recent releases, probe configurations, or resource limits. | Query pod logs, events, and Deployment version history. |
| P3 | Low | Short-term CPU threshold alarms | Frequent alarms, automatically cleared | Multiple occurrences in the last seven days | No continuous impact. The traffic may fluctuate for a short period of time. | Observe the trend and adjust the threshold or HPA policy if necessary. |
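
The grouping and ordering shown in this step can be approximated with a short sketch: group raw alarms by name and resource, take the worst severity of each group, label the group's pattern, and sort by severity and then by first occurrence time. The Alarm structure, the field names, and the burst heuristic (three or more occurrences within 15 minutes) are assumptions made for illustration. This document does not specify how AI CLI classifies alarms.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

SEVERITY_ORDER = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}


@dataclass
class Alarm:
    name: str
    resource: str
    severity: str
    started_at: datetime
    cleared_at: Optional[datetime] = None


def classify(group: List[Alarm], burst_span: timedelta = timedelta(minutes=15),
             burst_count: int = 3) -> str:
    """Hypothetical heuristic for labeling an alarm group."""
    times = sorted(alarm.started_at for alarm in group)
    if len(times) >= burst_count and times[-1] - times[0] <= burst_span:
        return "burst"
    if any(alarm.cleared_at is None for alarm in group):
        return "persistent"
    return "frequent" if len(times) > 1 else "cleared"


def aggregate(alarms: List[Alarm]) -> List[dict]:
    """Group alarms by (name, resource) and sort groups by severity, then first occurrence."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[(alarm.name, alarm.resource)].append(alarm)
    summary = []
    for (name, resource), members in groups.items():
        worst = min(members, key=lambda a: SEVERITY_ORDER[a.severity]).severity
        summary.append({
            "group": name,
            "resource": resource,
            "severity": worst,
            "count": len(members),
            "first_seen": min(a.started_at for a in members),
            "uncleared": sum(a.cleared_at is None for a in members),
            "pattern": classify(members),
        })
    summary.sort(key=lambda g: (SEVERITY_ORDER[g["severity"]], g["first_seen"]))
    return summary


def at(hour: int, minute: int) -> datetime:
    return datetime(2026, 6, 5, hour, minute)


# Synthetic alarms for illustration only.
alarms = [
    Alarm("NodeDiskLow", "node/192.168.0.12", "High", at(9, 24)),
    Alarm("PodRestart", "default/nginx-demo", "Medium", at(10, 6), at(10, 8)),
    Alarm("PodRestart", "default/nginx-demo", "Medium", at(10, 9), at(10, 10)),
    Alarm("PodRestart", "default/nginx-demo", "Medium", at(10, 12)),
    Alarm("ClusterUnavailable", "cluster", "Critical", at(9, 17)),
]
summary = aggregate(alarms)
for group in summary:
    print(group["severity"], group["group"], group["pattern"], group["count"])
```

Sorting on the (severity, first occurrence) pair puts the most severe and earliest group at the top of the handling queue.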

For critical or high-severity alarm groups, or any other group you are particularly concerned about, you can continue to use AI CLI to analyze the root causes in the same time window.

Continue to analyze the P1 alarm group. Keep the time window to the last 4 hours. Help me associate the pods on this node, recent events, related metrics, and possible root causes.

AI CLI continues to query the context of the alarm group and provides root cause clues.

| Root Cause Analysis Item | Available Information |
| --- | --- |
| Related resources | Alarm node, affected pods, namespaces, workloads, and Services |
| Related events | Events such as eviction, scheduling failure, probe failure, image pull failure, and node exception |
| Related metrics | Trends of CPU, memory, disk, and network resources, and changes before and after the alarm is triggered |
| Timeline | First occurrence time, burst time, recovery time, and related event occurrence time of the alarm |
| Preliminary root cause | For example, node disk pressure, workload release exception, core component exception, or insufficient capacity |
| Suggestions for the next step | Continue troubleshooting, clear resources, scale out, adjust thresholds, roll back the release, or keep observing. |
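
The timeline item is easiest to read when alarms, events, and metric changes are merged into one chronological list limited to the analysis window. A minimal sketch, assuming each source yields (time, summary) pairs, is shown below. The function name, the source labels, and the sample records are hypothetical.

```python
from datetime import datetime
from typing import Iterable, List, Tuple

Entry = Tuple[datetime, str, str]  # (time, source, summary)


def build_timeline(start: datetime, end: datetime,
                   **sources: Iterable[Tuple[datetime, str]]) -> List[Entry]:
    """Merge records from several sources and keep those inside [start, end]."""
    merged = [
        (when, source, summary)
        for source, records in sources.items()
        for when, summary in records
        if start <= when <= end
    ]
    return sorted(merged)


def at(hour: int, minute: int) -> datetime:
    return datetime(2026, 6, 5, hour, minute)


# Synthetic records for illustration only.
timeline = build_timeline(
    at(9, 0), at(12, 0),
    alarms=[(at(9, 24), "Node disk space insufficient")],
    events=[(at(9, 31), "Pod evicted"), (at(8, 50), "Image pulled")],
    metrics=[(at(9, 20), "Disk usage rising")],
)
for when, source, summary in timeline:
    print(when.strftime("%H:%M"), source, summary)
```

Reading the merged list in time order shows whether a metric change preceded the alarm and whether events such as evictions followed it.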

You can also ask AI CLI to output only the high- and critical-severity alarm groups that are not cleared:

View only the high- and critical-severity alarms that are not cleared in the last 4 hours, group them by affected resource, and provide the first occurrence time and root cause analysis suggestions for each group.
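
The filter expressed in this request (high or critical severity, not cleared, grouped by affected resource, with the first occurrence time of each group) corresponds to logic like the following sketch. The dictionary fields and sample alarms are hypothetical and do not reflect the actual AOM response schema.

```python
from datetime import datetime
from typing import Dict, List

URGENT = {"Critical", "High"}


def urgent_by_resource(alarms: List[dict]) -> Dict[str, dict]:
    """Group uncleared Critical/High alarms by resource, ordered by first occurrence."""
    groups: Dict[str, dict] = {}
    for alarm in alarms:
        if alarm["severity"] not in URGENT or alarm["cleared"]:
            continue
        group = groups.setdefault(
            alarm["resource"], {"first_seen": alarm["first_seen"], "alarms": []}
        )
        group["first_seen"] = min(group["first_seen"], alarm["first_seen"])
        group["alarms"].append(alarm["name"])
    return dict(sorted(groups.items(), key=lambda item: item[1]["first_seen"]))


def at(hour: int, minute: int) -> datetime:
    return datetime(2026, 6, 5, hour, minute)


# Synthetic alarms for illustration only.
sample = [
    {"name": "PodEvicted", "resource": "node/192.168.0.12", "severity": "High",
     "cleared": False, "first_seen": at(9, 31)},
    {"name": "NodeDiskLow", "resource": "node/192.168.0.12", "severity": "High",
     "cleared": False, "first_seen": at(9, 24)},
    {"name": "ClusterUnavailable", "resource": "cluster", "severity": "Critical",
     "cleared": False, "first_seen": at(9, 17)},
    {"name": "CpuHigh", "resource": "default/nginx-demo", "severity": "Low",
     "cleared": True, "first_seen": at(10, 2)},
]
urgent = urgent_by_resource(sample)
for resource, group in urgent.items():
    print(resource, group["first_seen"].strftime("%H:%M"), group["alarms"])
```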

If the alarm is related to service unavailability, you can ask AI CLI to output a complete root cause analysis and recovery suggestions:

Continue root cause analysis based on these critical alarms, keep the time window from 09:00 to 12:00, and provide recovery suggestions.

To view the original details, enter the following:

Display the alarm details of the last 4 hours, including the alarm name, status, severity, resource, first occurrence time, and description.

Expected Results

After completing this practice, you can use AI CLI to complete the following closed-loop operations:

  • Create 50 AOM alarm rules for the target cluster in one click based on the CCE cluster alarm best practices.
  • Automatically identify the target cluster, prepare notification rules, preview the creation plan, and execute the creation after user confirmation.
  • Automatically query the alarm rule list to confirm that 38 metric alarm rules and 12 event alarm rules have been created and that no rule failed to be created.
  • Query active alarms, historical alarms, and clearance records by cluster and time window.
  • Automatically aggregate, deduplicate, and mark the severity of alarms to distinguish between normal alarms, burst alarms, and alarms that remain uncleared.
  • Output the first occurrence time, burst time, affected resources, and handling queue sorted by priority.
  • For critical and high-severity alarms, associate events, logs, metrics, and resource statuses within the same time window, and output possible root causes and the next diagnosis path.

FAQs

Prometheus Instance Not Found

  • Symptom

    Example error:

    {
      "success": false,
      "error": "The Prometheus instance corresponding to the target cluster is not found."
    }
  • Handling suggestion
    1. Query the AOM Prometheus instances and check whether a Prometheus instance for CCE is associated with the target cluster.
    2. Query the CCE add-on and check whether the components related to cloud native monitoring have been installed.
    3. If monitoring is just enabled on the console, wait until the instance association information is synchronized and try again.
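
The "wait and try again" suggestion can be automated with a bounded retry when you script around a result of the shape shown in the symptom. The sketch below assumes only the "success" and "error" fields from that example. The callable that produces the response and the default delay are hypothetical.

```python
import json
import time
from typing import Callable


def parse_result(raw: str) -> dict:
    """Parse a JSON result and verify that it carries a 'success' field."""
    result = json.loads(raw)
    if not isinstance(result, dict) or "success" not in result:
        raise ValueError("unexpected response shape")
    return result


def retry_until_success(call: Callable[[], str], attempts: int = 5,
                        delay_seconds: float = 30.0,
                        sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call until 'success' is true or attempts run out; return the last result."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    result: dict = {}
    for attempt in range(1, attempts + 1):
        result = parse_result(call())
        if result["success"]:
            break
        if attempt < attempts:
            sleep(delay_seconds)  # give the instance association time to synchronize
    return result


# Simulated responses for illustration; a real caller would query the Skill instead.
fake = iter([
    '{"success": false, "error": "The Prometheus instance corresponding to the target cluster is not found."}',
    '{"success": true}',
])
print(retry_until_success(lambda: next(fake), sleep=lambda seconds: None))
```

Bounding the number of attempts matters here: if the Prometheus instance was never associated with the cluster, waiting does not help and the first two checks apply instead.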

Why Is the Number of Queried Notification Rules Inconsistent with That on the Console?

You are advised to use AI CLI to query the AOM notification rules in the current region again. If the inconsistency persists, check whether the current credential, project, and region match those on the console.

Why Does Automatic Notification Rule Creation Fail?

Check whether the SMN topic exists and whether the current account has permission to access AOM notification rules and SMN topics. If both checks pass, create the notification rule again.

There Are Many Alarms, But I Don't Know Which One to Handle First

You are advised to use AI CLI to re-aggregate and analyze alarms by time window and output the severity and handling priority. For example:

Aggregate AOM alarms of the test-ai-diagnoses cluster in the last 4 hours by resource and severity, and output the five alarm groups that need to be handled first.

If there are still a large number of high- or critical-severity alarms after aggregation, narrow down the scope and analyze the alarm groups that are not cleared, affect core namespaces, are associated with multiple resources, or occur in bursts.

Follow-up Suggestions

  • After the rules are created, you are advised to query the alarm rules of the target cluster immediately to ensure that the number of rules, notification method, and enabling status meet your expectations.
  • In the early stage after a new cluster goes live, you are advised to review active alarms and high-frequency historical alarms every day to identify alarms with overly strict thresholds, repeated notifications, or alarms that are not cleared for a long time.
  • You are advised to use a fixed time window for analysis during on-duty troubleshooting, for example, "last 30 minutes", "last 4 hours", or "after the change". This avoids interference from irrelevant historical alarms.
  • For critical and high-severity alarm groups, you are advised to continue associating logs, events, metrics, and workload versions to form a root cause analysis link.
  • For high-frequency alarms that have no impact on services, you are advised to analyze the triggering objects and time periods, and then evaluate whether to adjust the threshold or the notification scope.
  • For production clusters, you are advised to periodically export the alarm rule list as an important material for change audit and fault review.
  • After alarms are triggered, you are advised to check the alarm aggregation and severity marking results, and then query related logs, events, and metrics. Do not perform recovery actions based on a single alarm.
