Using AI CLI to Configure, Query, and Manage CCE AOM Alarms
Scenarios
After a CCE cluster is brought online, the O&M team usually needs to initialize alarm rules as soon as possible and keep track of active alarms, historical alarms, recovery records, and notification rules during routine troubleshooting. By interconnecting AI CLI Agent with the alarm-correlation-engine Skill, you can use natural language to configure CCE AOM alarm rules, query alarms, aggregate and analyze alarms, and trace the root causes of major alarms.
Rather than reviewing a single alarm list, this practice recommends analyzing cluster alarms by time window. AI CLI automatically merges alarms of the same type, marks alarm severities, and distinguishes between normal alarms, burst alarms, and persistent alarms. This helps you quickly determine which alarms are the most urgent, which resources are affected, and what to check next.
AI CLI can be used to:
- Create recommended rules in the AOM alarm center for a specified CCE cluster in one click.
- Automatically create a default cluster-level notification rule if no notification rule is specified.
- Preview the number of rules, rule types, notification methods, and missing parameters before rules are created, and create the rules only after customer confirmation.
- Automatically query the alarm rule list to confirm that 50 rules have been created.
- Query active alarms, historical alarms, and clearance records by cluster and time window.
- Deduplicate, group, and mark the severity of alarms, identify burst, persistent, and normal alarms, and analyze root cause clues.
This practice uses the test-ai-diagnoses cluster in CN North-Beijing4 as an example to demonstrate how to configure, query, and aggregate and analyze alarms using natural language. In this example, CCE AOM alarm rules are configured in batches for the cluster, and the alarm situation in the last four hours is analyzed.
Constraints
- You do not need to specify an existing notification rule. If no notification rule is specified, AI CLI automatically creates a default cluster-level notification rule.
- When a notification rule is automatically created, an available SMN topic name or topic URN must be provided.
- Metric alarms depend on the AOM CCE Prometheus instance associated with the target cluster.
- Event alarms depend on the CCE event reporting link. You are advised to ensure that the log collection and node fault detection add-ons are running properly.
- Alarm rules can be created only after being confirmed by the target customer.
- Severity marking is only an auxiliary tool for troubleshooting and cannot replace the production change approval and manual confirmation mechanisms.
- You are advised to explicitly specify a time window for alarm analysis, such as the last 30 minutes, last 4 hours, or a specific start and end time.
- Do not write AKs/SKs, tokens, certificates, or real project IDs into documents, code, or dialog output.
Prerequisites
- You have registered the alarm-correlation-engine Skill in AI CLI.
- You have configured Huawei Cloud access credentials. You are advised to inject them through environment variables or security credential files to avoid exposing AKs/SKs in dialogs or documents.
- You have associated an AOM CCE Prometheus instance with the target CCE cluster.
- You have prepared an available SMN topic, for example, a topic named test. If an AOM notification rule already exists, you can reuse it.
- The execution account has the permissions to query and create AOM alarm rules, and query and create notification rules.
- The creation of alarm rules has been confirmed by the target customer.
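The credential prerequisite above can be illustrated with a minimal sketch. This is not how AI CLI itself loads credentials. The environment variable names `HUAWEICLOUD_SDK_AK` and `HUAWEICLOUD_SDK_SK` and the helpers `load_credentials` and `mask` are assumptions made for this example.

```python
import os


def load_credentials(env=os.environ):
    """Read the AK/SK from environment variables (variable names are assumptions)."""
    ak = env.get("HUAWEICLOUD_SDK_AK")
    sk = env.get("HUAWEICLOUD_SDK_SK")
    missing = [
        name
        for name, value in (("HUAWEICLOUD_SDK_AK", ak), ("HUAWEICLOUD_SDK_SK", sk))
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return ak, sk


def mask(secret, visible=4):
    """Return a masked form that is safe to show in dialogs or logs."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return secret[:visible] + "*" * (len(secret) - visible)
```

Failing fast on a missing variable and printing only masked values keeps real AKs/SKs out of dialog output and documents.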
Involved Skills
| Skill | Function |
|---|---|
| alarm-correlation-engine | Queries AOM alarms and active alarms, merges and analyzes alarms, creates alarm rules, and queries notification rules. |
| huawei-cloud-cce-cluster-management | Queries the CCE cluster list and confirms the cluster name, ID, and status. |
| observability-context-builder | Continues to aggregate metrics, logs, and Kubernetes event contexts after alarm analysis. |
Procedure
Step 1: Create CCE Cluster Alarm Rules in One Click
You can use AI CLI to intelligently configure default AOM alarm rules for the target cluster based on the CCE cluster alarm best practices. You only need to specify the region, cluster name, and notification topic. AI CLI will automatically understand the configuration intent, supplement cluster information, generate a creation plan, and complete rule creation and result verification after your confirmation.
Enter the following in the OpenClaw dialog box:
Help me create CCE AOM alarm rules in batches for the test-ai-diagnoses cluster in CN North-Beijing4. Use the test topic for subscription.
AI CLI generates an alarm rule configuration plan to help confirm the creation scope and notification method.
| Intelligent Processing Item | Description |
|---|---|
| Identifying a cluster | AI CLI queries a CCE cluster by region and cluster name, and checks the cluster ID, status, and configurability. |
| Supplementing notification configuration | AI CLI preferentially reuses existing AOM notification rules. If no notification rule is specified, it creates a default cluster-level notification rule based on the SMN topic. |
| Generating a rule plan | AI CLI generates 50 default alarm rules based on the recommended practices for CCE clusters, distinguishing between metric alarms and event alarms. |
| Previewing impact scope | AI CLI displays the number of rules, rule types, notification rules, and missing parameters. No rule is directly created. |
| Waiting for customer confirmation | AI CLI creates rules only after the customer confirms them. |
| Automatically creating rules | AI CLI creates default alarm rules in batches. If a rule with the same name already exists, it automatically skips the creation. |
| Automatically performing result acceptance | After the rule creation is complete, AI CLI queries the alarm rule list immediately to confirm the number of rules, rule types, and number of failed rules. |
The 50 rules created by default include:
| Type | Quantity | Example |
|---|---|---|
| Prometheus metric alarms | 38 | Abnormal pod status, frequent pod restarts, node CPU usage, node disk availability, and abnormal node kubelet |
| CCE event alarms | 12 | Pod memory OOM, insufficient node disk space, abnormal node status, node scale-out timeout, and unavailable cluster |
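The preview and skip-if-exists behavior described above can be sketched as follows. This is an illustration under stated assumptions, not AI CLI's internal logic: `build_plan`, the rule dictionaries, and the placeholder rule names are hypothetical, and only the 38/12 split comes from this document.

```python
from collections import Counter


def build_plan(candidate_rules, existing_names, smn_topic):
    """Preview a batch creation without creating anything."""
    to_create = [r["name"] for r in candidate_rules if r["name"] not in existing_names]
    skipped = [r["name"] for r in candidate_rules if r["name"] in existing_names]
    return {
        "total": len(candidate_rules),
        "by_type": dict(Counter(r["type"] for r in candidate_rules)),
        "to_create": to_create,
        "skipped": skipped,  # rules whose names already exist
        "missing_parameters": [] if smn_topic else ["smn_topic"],
        "confirmed": False,  # creation must wait for explicit confirmation
    }


# Placeholder rule names; the real default rule names are not listed here.
candidates = (
    [{"name": f"metric-rule-{i}", "type": "metric"} for i in range(38)]
    + [{"name": f"event-rule-{i}", "type": "event"} for i in range(12)]
)
plan = build_plan(candidates, existing_names={"metric-rule-0"}, smn_topic="test")
print(plan["total"], plan["by_type"], len(plan["to_create"]), plan["skipped"])
```

Keeping the plan as plain data makes it easy to show the scope to the customer before anything is created.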
After AI CLI displays the preview, you need to confirm that the number of rules, notification method, and target cluster are correct, and then reply:
Start the creation.
After the execution is complete, AI CLI returns the creation result and automatically queries and confirms that the rules have been created. Focus on the following acceptance items:
| Check Item | Expected Result |
|---|---|
| Total number of rules for test-ai-diagnoses | 50 |
| Metric alarms | 38 |
| Event alarms | 12 |
| Creation failed | 0 |
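If you export the queried rule list, the acceptance items above can be checked mechanically. The following is a minimal sketch: the `check_acceptance` helper and the `type` field on each rule are assumptions for illustration, and only the expected counts come from the table.

```python
EXPECTED = {"total": 50, "metric": 38, "event": 12, "failed": 0}


def check_acceptance(rules, failed_count, expected=EXPECTED):
    """Compare queried rules with the expected counts and list any mismatches."""
    actual = {
        "total": len(rules),
        "metric": sum(1 for r in rules if r["type"] == "metric"),
        "event": sum(1 for r in rules if r["type"] == "event"),
        "failed": failed_count,
    }
    return [
        f"{key}: expected {expected[key]}, got {actual[key]}"
        for key in expected
        if actual[key] != expected[key]
    ]
```

An empty list means every acceptance item matches. Any returned string names the item that needs a closer look.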
If some rules already exist, AI CLI skips the rules with the same names to avoid duplicate creation. You can also enter the following content at any time to check:
Query the alarm rules of the test-ai-diagnoses cluster in CN North-Beijing4.
Step 2: Aggregate and Analyze Alarms by Time Window and Trace Root Causes
To check whether a cluster has risks, you are advised to specify a time window so that AI CLI can query active alarms, historical alarms, and clearance records at the same time, and aggregate and analyze alarms by severity.
Analyze the AOM alarms of the test-ai-diagnoses cluster in CN North-Beijing4 in the last 4 hours, aggregate the alarms by severity, mark normal and burst alarms, and tell me the three issues that need to be handled first.
You can also specify an exact time range to analyze the change window, fault window, or shift handover window.
Analyze the AOM alarms of the test-ai-diagnoses cluster in CN North-Beijing4 from 09:00 to 12:00 today, aggregate the alarms by severity, and output the first occurrence time and handling priority of burst alarms.
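Both kinds of time window used in the prompts above (relative, such as the last 4 hours, and absolute, such as 09:00 to 12:00) resolve to a start and end timestamp. The sketch below shows one way to compute them. The function names are hypothetical, and how AI CLI parses time expressions internally is not described in this document.

```python
from datetime import date, datetime, timedelta, timezone


def relative_window(hours, now=None):
    """Return (start, end) for 'the last N hours', ending at `now`."""
    end = now or datetime.now(timezone.utc)
    return end - timedelta(hours=hours), end


def absolute_window(day, start_hm, end_hm):
    """Return naive (start, end) datetimes for a fixed range on `day`."""
    start = datetime.combine(day, datetime.strptime(start_hm, "%H:%M").time())
    end = datetime.combine(day, datetime.strptime(end_hm, "%H:%M").time())
    if end <= start:
        raise ValueError("end time must be later than start time")
    return start, end


def to_epoch_ms(moment):
    """Convert a datetime to epoch milliseconds (naive values use local time)."""
    return int(moment.timestamp() * 1000)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # arbitrary example time
start, end = relative_window(4, now)
print(start.isoformat(), end.isoformat())
print(absolute_window(date(2024, 1, 1), "09:00", "12:00"))
```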
AI CLI returns the alarm situation summary of the current time window. Pay attention to the following information:
| Output Item | Example |
|---|---|
| Total alarms | 42 alarms detected in the last 4 hours, 7 of which are not cleared. |
| Alarms by severity | Critical: 1, High: 3, Medium: 9, Low: 29 |
| Alarms by type | 3 groups of burst alarms, 2 groups of uncleared alarms, and 5 groups of frequent alarms |
| First occurrence time | The earliest burst alarm occurred at 09:17, and the burst occurred from 09:20 to 09:35. |
| Major affected objects | kube-system, default/nginx-demo, and node 192.168.0.12 |
| Handling priority suggestion | The current risks are mainly related to node resource pressure and service pod restarts. You are advised to first analyze recent changes to node disks and pods. |
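The totals in the summary above are simple counts over the alarms in the window. The sketch below reproduces the example figures from synthetic data. The `summarize` helper and the `severity` and `cleared` fields are assumptions for illustration, not the AOM alarm schema.

```python
from collections import Counter

SEVERITIES = ("Critical", "High", "Medium", "Low")


def summarize(alarms):
    """Count alarms by severity and count those not yet cleared."""
    counts = Counter(a["severity"] for a in alarms)
    return {
        "total": len(alarms),
        "uncleared": sum(1 for a in alarms if not a["cleared"]),
        "by_severity": {s: counts.get(s, 0) for s in SEVERITIES},
    }


# Synthetic alarms that match the example figures in the table above.
sample = (
    [{"severity": "Critical", "cleared": False}]
    + [{"severity": "High", "cleared": False}] * 3
    + [{"severity": "Medium", "cleared": i >= 3} for i in range(9)]
    + [{"severity": "Low", "cleared": True}] * 29
)
summary = summarize(sample)
print(summary)
```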
Then, AI CLI aggregates multiple original alarms into alarm groups and sorts them by severity. You can handle the alarms in the following sequence:
| Priority | Severity | Alarm Group | Alarm Characteristic | First Occurrence Time | Judgment | Handling Suggestion |
|---|---|---|---|---|---|---|
| P0 | Critical | Unavailable cluster or abnormal core component | Burst alarms, not cleared | 09:17 | The cluster control plane or service scheduling capability may be affected. | Analyze the root cause immediately and check CCE events, core component pods, and AOM metrics. |
| P1 | High | Node disk space insufficient, associated with multiple pod eviction events | Persistent alarms, not cleared | 09:24 | Node resource pressure may cause unstable service replicas. | Check the pods, eviction events, and disk metrics on the associated node. |
| P2 | Medium | Frequent pod restarts, mainly in default/nginx-demo | Burst alarms, some cleared | 10:06 | The restarts may be related to recent releases, probe configurations, or resource limits. | Query pod logs, events, and Deployment version history. |
| P3 | Low | Short-term CPU threshold alarms | Frequent alarms, automatically cleared | Multiple occurrences in the last seven days | No continuous impact. The traffic may fluctuate for a short period of time. | Observe the trend and adjust the threshold or HPA policy if necessary. |
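The merge-then-sort idea behind the table above can be sketched as follows. This is an illustration only: grouping by alarm name and resource, the tie-breaking order, and the alarm names in the sample are all assumptions, since the exact grouping rules used by AI CLI are not described here.

```python
from collections import defaultdict

SEVERITY_RANK = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}


def group_alarms(alarms):
    """Merge alarms that share a name and resource into one group."""
    buckets = defaultdict(list)
    for alarm in alarms:
        buckets[(alarm["name"], alarm["resource"])].append(alarm)
    return [
        {
            "name": name,
            "resource": resource,
            "count": len(items),
            # The most severe member decides the severity of the group.
            "severity": min((a["severity"] for a in items), key=SEVERITY_RANK.get),
            "first_seen": min(a["time"] for a in items),
            "uncleared": any(not a["cleared"] for a in items),
        }
        for (name, resource), items in buckets.items()
    ]


def prioritize(groups):
    """Sort by severity, then uncleared before cleared, then earliest first."""
    return sorted(
        groups,
        key=lambda g: (SEVERITY_RANK[g["severity"]], not g["uncleared"], g["first_seen"]),
    )


# Hypothetical alarm names; times are HH:MM strings within one day.
raw = [
    {"name": "PodRestart", "resource": "default/nginx-demo", "severity": "Medium", "time": "10:06", "cleared": True},
    {"name": "PodRestart", "resource": "default/nginx-demo", "severity": "Medium", "time": "10:09", "cleared": False},
    {"name": "NodeDiskLow", "resource": "192.168.0.12", "severity": "High", "time": "09:24", "cleared": False},
    {"name": "ClusterUnavailable", "resource": "test-ai-diagnoses", "severity": "Critical", "time": "09:17", "cleared": False},
]
queue = prioritize(group_alarms(raw))
for g in queue:
    print(g["severity"], g["name"], g["count"], g["first_seen"])
```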
For critical, high, or user-concerned alarm groups, you can continue to use AI CLI to analyze the root causes in the same time window.
Continue to analyze the P1 alarm group. Keep the time window at the last 4 hours. Help me associate the pods on this node, recent events, related metrics, and possible root causes.
AI CLI continues to query the context of the alarm group and provides root cause clues.
| Root Cause Analysis Item | Available Information |
|---|---|
| Related resources | Alarm node, affected pods, namespaces, workloads, and Services |
| Related events | Events such as eviction, scheduling failure, probe failure, image pull failure, and node exception |
| Related metrics | Trends of CPU, memory, disk, and network resources, and changes before and after the alarm is triggered |
| Timeline | First occurrence time, burst time, recovery time, and related event occurrence time of the alarm |
| Preliminary root cause | For example, node disk pressure, workload release exception, core component exception, or insufficient capacity |
| Suggestions for the next step | Continue troubleshooting, clear resources, scale out, adjust thresholds, roll back the release, or keep observing. |
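The timeline item in the table above amounts to merging alarms and Kubernetes events from the same window into one ordered list. A minimal sketch, assuming hypothetical record fields (`time`, `name`, `reason`) and a hypothetical alarm name:

```python
from datetime import datetime


def build_timeline(alarms, events, start, end):
    """Merge alarms and events that fall inside [start, end], oldest first."""
    entries = [("alarm", a["time"], a["name"]) for a in alarms]
    entries += [("event", e["time"], e["reason"]) for e in events]
    in_window = [entry for entry in entries if start <= entry[1] <= end]
    return sorted(in_window, key=lambda entry: entry[1])


day = (2024, 1, 1)  # arbitrary example date
alarms = [{"time": datetime(*day, 9, 24), "name": "NodeDiskLow"}]
events = [
    {"time": datetime(*day, 9, 31), "reason": "FailedScheduling"},
    {"time": datetime(*day, 9, 26), "reason": "Evicted"},
    {"time": datetime(*day, 8, 0), "reason": "Pulled"},  # outside the window
]
timeline = build_timeline(alarms, events, datetime(*day, 9, 0), datetime(*day, 12, 0))
for kind, moment, label in timeline:
    print(moment.strftime("%H:%M"), kind, label)
```

Reading alarms and events in one sequence makes it easier to see whether an event preceded or followed the alarm it is suspected of causing.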
You can also ask AI CLI to output only high- and critical-severity alarm groups that are not cleared:
View only the high- and critical-severity alarms that are not cleared in the last 4 hours, group them by affected resource, and provide the first occurrence time and root cause analysis suggestions for each group.
If the alarm is related to service unavailability, you can ask AI CLI to output a complete root cause analysis and recovery suggestions:
Continue root cause analysis based on these critical alarms, keep the time window from 09:00 to 12:00, and provide recovery suggestions.
To view the original details, enter the following:
Display the alarm details of the last 4 hours, including the alarm name, status, severity, resource, first occurrence time, and description.
Expected Results
After completing this practice, you can use AI CLI to complete the following closed-loop operations:
- Create 50 AOM alarm rules for the target cluster in one click based on the CCE cluster alarm best practices.
- Automatically identify the target cluster, prepare notification rules, preview the creation plan, and execute the creation after user confirmation.
- Automatically query the alarm rule list to confirm that 38 metric alarm rules and 12 event alarm rules have been created, with 0 creation failures.
- Query active alarms, historical alarms, and clearance records by cluster and time window.
- Automatically aggregate, deduplicate, and mark the severity of alarms to distinguish between normal alarms, burst alarms, and alarms that remain uncleared.
- Output the first occurrence time, burst time, affected resources, and handling queue sorted by priority.
- For critical and high-severity alarms, associate events, logs, metrics, and resource statuses within the same time window, and output possible root causes and the next diagnosis path.
FAQs
Prometheus Instance Not Found
- Symptom
  { "success": false, "error": "The Prometheus instance corresponding to the target cluster is not found." }
- Handling suggestion
  - Query the AOM Prometheus instances and check whether a Prometheus instance for CCE is associated with the target cluster.
  - Query the CCE add-ons and check whether the components related to cloud native monitoring have been installed.
  - If monitoring was only recently enabled on the console, wait until the instance association information is synchronized and try again.
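The wait-and-retry suggestion can be scripted when you wrap the query yourself. The sketch below is an assumption-laden illustration: `call` stands for any function of your own that returns the JSON response as a string, and the attempt count and delay are arbitrary defaults rather than documented values.

```python
import json
import time

NOT_FOUND = "The Prometheus instance corresponding to the target cluster is not found."


def is_instance_not_found(raw):
    """Return True if the response matches the symptom shown above."""
    body = json.loads(raw)
    return body.get("success") is False and body.get("error") == NOT_FOUND


def retry_until_synced(call, attempts=5, delay_seconds=30, sleep=time.sleep):
    """Call `call()` until it stops reporting the not-found error or attempts run out."""
    raw = None
    for attempt in range(1, attempts + 1):
        raw = call()
        if not is_instance_not_found(raw):
            return raw
        if attempt < attempts:
            sleep(delay_seconds)  # give the association time to synchronize
    return raw
```

If the last attempt still returns the not-found error, go back to the first two checks instead of raising the retry count.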
Why Is the Number of Queried Notification Rules Inconsistent with That on the Console?
You are advised to use AI CLI to query the AOM notification rules in the current region again. If the inconsistency persists, check whether the current credential, project, and region match those on the console.
Why Does Automatic Notification Rule Creation Fail?
Check whether the SMN topic exists and whether the current account has permission to access AOM notification rules and SMN topics. After confirming both, create the notification rule again.
There Are Many Alarms, But I Don't Know Which One to Handle First
You are advised to use AI CLI to re-aggregate and analyze alarms by time window and output the severity and handling priority. For example:
Aggregate AOM alarms of the test-ai-diagnoses cluster in the last 4 hours by resource and severity, and output the five alarm groups that need to be handled first.
If there are still a large number of high- or critical-severity alarms after aggregation, narrow down the scope and analyze the alarm groups that are not cleared, affect core namespaces, are associated with multiple resources, or burst.
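The narrowing conditions above can be expressed as a simple filter over alarm groups. This is a minimal sketch: the group fields and the contents of `CORE_NAMESPACES` are assumptions that you would adapt to your own cluster.

```python
CORE_NAMESPACES = {"kube-system"}  # assumption: adjust to your own core namespaces


def needs_attention(group):
    """Return True when a group matches any of the narrowing conditions."""
    return (
        group["uncleared"]
        or bool(CORE_NAMESPACES & set(group["namespaces"]))
        or len(group["resources"]) > 1
        or group["burst"]
    )


def narrow(groups):
    """Keep High/Critical groups that match at least one condition."""
    return [
        g for g in groups
        if g["severity"] in ("High", "Critical") and needs_attention(g)
    ]
```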
Follow-up Suggestions
- After the rules are created, you are advised to query the alarm rules of the target cluster immediately to ensure that the number of rules, notification method, and enabling status meet your expectations.
- In the early stage after a new cluster goes live, you are advised to review active alarms and high-frequency historical alarms every day to identify alarms with overly strict thresholds, repeated notifications, or alarms that are not cleared for a long time.
- You are advised to use a fixed time window for analysis during on-duty troubleshooting, for example, "last 30 minutes", "last 4 hours", or "after the change". This avoids interference from irrelevant historical alarms.
- For critical and high-severity alarm groups, you are advised to continue associating logs, events, metrics, and workload versions to form a root cause analysis link.
- For high-frequency alarms that have no impact on services, you are advised to analyze the triggering objects and time periods, and then evaluate the threshold or notification scope optimization solution.
- For production clusters, you are advised to periodically export the alarm rule list as an important material for change audit and fault review.
- After alarms are triggered, you are advised to check the alarm aggregation and severity marking results, and then query related logs, events, and metrics. Do not perform recovery actions based on a single alarm.
Helpful Links
- Configuring Custom Alarms on AOM: Learn how to configure custom alarms on AOM for the CCE alarm center and key configuration items such as Prometheus instances.
- Configuring AOM Alarm Rules: Learn how to configure AOM alarm rules, rule parameters, and alarm triggering logic.
- Monitoring CCE Metrics: Learn how to use AOM to monitor CCE metrics and understand the relationships between metrics, events, and alarm notifications.
- Creating a Topic: Learn how to create a topic in SMN before notification rules are automatically created or associated.
- Cloud Container Engine (CCE) documentation: Learn about CCE clusters, add-ons, O&M, and workloads.