Updated on 2026-06-05 GMT+08:00

Building an Intelligent O&M Agent for the CCE Production Environment Based on Hermes and Lark

This document describes how to build a ChatOps on-duty Agent for the CCE production environment by connecting Hermes to Lark. The Agent periodically scans alarms on the live network, merges and analyzes them automatically, generates recovery solutions, and executes recovery actions after users confirm them in the Lark mobile app. The high CPU usage alarm in this document is only a simple example for verifying the closed-loop process. Based on the same approach, you can extend custom capabilities such as pod restart diagnosis, node exception handling, scheduling failure recovery, capacity inspection, release change association, and daily report generation.

Scenarios

In a production environment, a CCE cluster may continuously generate a large number of alarms that cover workloads, pods, nodes, networks, storage, auto scaling, resource capacity, and availability risks. The larger the live network, the more alarms there are and the more complex the alarm sources and handling paths become. On-duty personnel need to switch frequently between Lark alarms, AOM metrics, Kubernetes events, pod logs, workload configurations, HPA statuses, node capacity, and the service ticket system. This can lead to alarm fatigue, slow response, incomplete evidence, unreviewed recovery actions, and difficulty in accumulating review materials.

By building an intelligent O&M Agent for the CCE production environment, you can consolidate alarm detection, alarm merging, context collection, root cause analysis, recovery preview, user confirmation, recovery execution, effect verification, and result archiving into a set of reusable ChatOps on-duty capabilities. The Agent can use Hermes or your existing ChatOps, AIOps, service ticket assistant, or self-developed on-duty robot. Huawei Cloud cloud-native Skills provide standard query, analysis, and controlled recovery capabilities for resources such as CCE, AOM, LTS, node pools, and workloads.

Solution Architecture

This solution uses the "Agent Runtime + cloud-native Skills + Lark confirmation" architecture. Agent Runtime is responsible for task scheduling, alarm distribution, context orchestration, and Lark interaction. Cloud-native Skills provide capabilities such as alarms, metrics, logs, events, root cause analysis, and recovery actions. Lark carries alarm notification, user review, and closed-loop results.

You are advised to split Agent permissions based on the capability boundary.

| Capability Domain | Typical Action | Recommended Control Mode |
|---|---|---|
| Alarm governance | Query, merge, classify, and route active and historical AOM alarms. | Read-only permission, allowing scheduled automatic execution. |
| Diagnosis analysis | Aggregate metrics, events, logs, workloads, and node statuses. | Read-only permission, allowing automatic orchestration by the Agent. |
| Recovery preview | Generate recovery solutions such as those for scaling, rollback, HPA, and node pools. | Only a preview is generated, and resources are not modified. |
| Recovery execution | Execute the change action. | Must be confirmed via Lark. |
| Audit data archiving | Save alarm, analysis, confirmation, execution, and verification records. | You are advised to integrate with the service ticket system, OBS, or an internal knowledge base. |
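
The permission split above can be expressed as a small policy check. The following is a minimal sketch, not part of Hermes or the Huawei Cloud Skills: the domain keys, the `ControlMode` enum, and `is_allowed` are hypothetical names chosen for illustration, and audit archiving is left out because the table gives it no control mode.

```python
from enum import Enum


class ControlMode(Enum):
    READ_ONLY_AUTO = "read-only, may run automatically"
    PREVIEW_ONLY = "preview only, no resource changes"
    CONFIRM_REQUIRED = "requires confirmation in Lark"


# Hypothetical domain keys that mirror the rows of the table above.
POLICY = {
    "alarm_governance": ControlMode.READ_ONLY_AUTO,
    "diagnosis_analysis": ControlMode.READ_ONLY_AUTO,
    "recovery_preview": ControlMode.PREVIEW_ONLY,
    "recovery_execution": ControlMode.CONFIRM_REQUIRED,
}


def is_allowed(domain: str, confirmed: bool = False) -> bool:
    """Return True if an action in this domain may run right now."""
    mode = POLICY.get(domain)
    if mode is None:
        # Unknown domains are denied (fail closed).
        return False
    if mode is ControlMode.CONFIRM_REQUIRED:
        return confirmed
    return True
```

Denying unknown domains by default keeps a newly added capability from running before someone has decided which control mode it belongs to.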

Constraints

  • Use the recommended configurations of cloud services for alarm rules and metric specifications, and perform cross-verification based on real-time metrics.
  • The preview and confirmation mechanisms must be retained for recovery actions, especially production operations such as scaling, rollback, and node pool changes.
  • Lark messages should be readable to on-duty personnel. The conclusion, evidence, affected objects, optional solutions, and confirmation entry should be provided first.
  • For complex recovery links, you are advised to make alarm fingerprints, solution IDs, target resources, and execution records persistent for confirmation and audit.
  • Do not expose sensitive information such as AKs/SKs, tokens, certificates, and project IDs in prompts, Lark messages, screenshots, or documents.
  • In the pilot phase, you are advised to start with read-only inspection and manual confirmation for recovery, and then gradually expand to more automatic recovery policies.
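
One possible way to make alarm fingerprints and solution records persistent is sketched below. The field names (`alarm_name`, `cluster`, `namespace`, `workload`) and the record layout are assumptions for illustration only and are not the AOM alarm schema. The fingerprint deliberately ignores the trigger time so that repeated firings of the same alarm map to the same record.

```python
import hashlib
import json

# Assumed stable fields; adjust to the alarm schema you actually receive.
STABLE_FIELDS = ("alarm_name", "cluster", "namespace", "workload")


def alarm_fingerprint(alarm: dict) -> str:
    """Hash only the stable fields so repeated firings share one fingerprint."""
    stable = {key: alarm.get(key, "") for key in STABLE_FIELDS}
    payload = json.dumps(stable, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def plan_record(alarm: dict, plan_id: str, target: str, action: str) -> dict:
    """Build a record that can be stored for later confirmation and audit."""
    return {
        "fingerprint": alarm_fingerprint(alarm),
        "plan_id": plan_id,
        "target": target,
        "action": action,
        "status": "pending_confirmation",
    }
```

The record contains no credentials or project IDs, in line with the constraint on sensitive information.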

Prerequisites

Before performing this practice, you are advised to prepare the Agent running environment and CCE observability objects, and then gradually expand the automation scope.

  • You have prepared the CCE clusters, namespaces, or service scope for inspection and diagnosis.
  • AOM alarms, cloud native monitoring metrics, or existing inspection objects have been connected.
  • You have prepared Hermes or your own Agent Runtime and connected it to Lark or the service ticket system.
  • You have imported Huawei Cloud cloud-native Skills related to CCE to the Agent for querying alarms, metrics, events, workloads, pods, and nodes, and performing controlled recovery.
  • You have prepared access credentials and have configured them securely to ensure that sensitive information is not exposed in prompts or documents.

Orchestratable Capabilities

You can combine the following Skills as needed to build an intelligent O&M process that covers a wide range of alarms.

| Capability | Representative Skill | Function |
|---|---|---|
| Alarm detection and merging | alarm-correlation-engine | Queries active and historical alarms, merges duplicate alarms, and identifies alarms that need to be handled. |
| Observability context | observability-context-builder | Aggregates metrics, logs, events, and resource statuses. |
| Metric analysis | metric-analyzer | Analyzes trends of CPU, memory, network, and disk usages. |
| Pod diagnosis | pod-failure-diagnoser | Analyzes pod statuses, restarts, logs, and events. |
| Workload diagnosis | workload-failure-diagnoser | Analyzes Deployments, ReplicaSets, HPAs, Services, and endpoints. |
| Node diagnosis | node-failure-diagnoser | Analyzes node statuses, resource watermarks, and scheduling capabilities. |
| Change association | change-impact-analyzer | Associates release, configuration, and resource changes before and after an alarm is generated. |
| Root cause analysis | root-cause-analyzer | Summarizes evidence and outputs root causes, confidence, and suggestions. |
| Controlled recovery | auto-remediation-runner | Generates a recovery preview and executes the recovery action after confirmation. |
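
As an illustration of how these Skills might be combined, the sketch below builds an ordered Skill chain for a given alarm type. The Skill names come from the table above. The alarm type keys, the grouping of diagnosers, and `build_skill_chain` are hypothetical, and the sketch only plans the order without invoking any Skill, because the invocation interface depends on your Agent Runtime.

```python
# Skills that run for every alarm, before and after the type-specific diagnosers.
BASE_CHAIN = ["alarm-correlation-engine", "observability-context-builder"]
TAIL_CHAIN = ["change-impact-analyzer", "root-cause-analyzer", "auto-remediation-runner"]

# Hypothetical alarm type keys mapped to the diagnosers that seem most relevant.
DIAGNOSERS = {
    "high_cpu": ["metric-analyzer", "workload-failure-diagnoser", "node-failure-diagnoser"],
    "pod_restart": ["pod-failure-diagnoser", "workload-failure-diagnoser"],
    "node_exception": ["node-failure-diagnoser"],
}


def build_skill_chain(alarm_type: str) -> list:
    """Return the ordered list of Skill names to orchestrate for an alarm type."""
    # Unknown alarm types fall back to generic metric analysis.
    middle = DIAGNOSERS.get(alarm_type, ["metric-analyzer"])
    return BASE_CHAIN + middle + TAIL_CHAIN
```

For example, `build_skill_chain("high_cpu")` starts with alarm correlation and ends with `auto-remediation-runner`, which only generates a preview until the user confirms.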

For live-network alarm governance, Agent capabilities can be classified into the following levels.

| Level | Purpose | Example |
|---|---|---|
| Alarm entry | Receives and discovers alarms from different sources. | AOM alarms, inspection tasks, Lark messages, and service ticket events |
| Alarm governance | Reduces alarm noise and determines handling priorities. | Deduplication, merging, classification, routing, silence, and summary |
| Intelligent diagnosis | Locates candidate causes from multi-source data. | Alarms, metrics, logs, events, changes, and resource statuses |
| Controlled recovery | Converts recommended actions into reviewable recovery solutions. | Scaling, HPA adjustment, node pool scale-out, and rollback |
| Closed-loop operations | Consolidates handling results into reusable experience. | Lark notifications, service ticket archiving, daily reports, and review materials |
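
The alarm governance level (deduplication, merging, and prioritization) can be sketched as follows. This is a simplified illustration: the alarm dictionary keys and the severity names and ordering are assumptions, not the AOM data model, and a real implementation would also handle routing and silence rules.

```python
from collections import defaultdict

# Assumed severity ordering; lower rank means higher priority.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "info": 3}


def merge_alarms(alarms: list) -> list:
    """Merge alarms with the same name and resource, then sort by priority."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[(alarm["name"], alarm["resource"])].append(alarm)

    merged = []
    for (name, resource), items in groups.items():
        # Keep the most severe level seen in the group.
        worst = min(items, key=lambda a: SEVERITY_RANK.get(a["severity"], 99))
        merged.append({
            "name": name,
            "resource": resource,
            "severity": worst["severity"],
            "count": len(items),
        })

    # Most severe first; among equals, the noisiest group first.
    merged.sort(key=lambda m: (SEVERITY_RANK.get(m["severity"], 99), -m["count"]))
    return merged
```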

Procedure

This practice is based on the high CPU usage alarm of the default/chat-app workload in the demo-recovery cluster. It aims to verify the end-to-end capabilities of the intelligent O&M Agent in the CCE production environment. The high CPU usage alarm is just one type of alarm on the live network. The process demonstrates the complete link from routine inspection through alarm detection, automatic analysis, user confirmation, and controlled recovery to closed-loop verification.

Step 1: Start the Inspection Robot

After initializing the Agent, load CCE-related Skills and configure the inspection period and Lark notification recipient.

The inspection robot should be able to provide clear results in both "no alarm" and "alarm" states so that on-duty personnel can determine whether the inspection link is normal.

In the normal state, the robot displays a message indicating that the demo-recovery cluster and its environment are normal.
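
A minimal sketch of a message builder that produces an explicit result in both states is shown below. The message wording and the alarm dictionary keys are illustrative assumptions. Sending the text to Lark is left to your Agent Runtime.

```python
def inspection_message(cluster: str, alarms: list) -> str:
    """Build the inspection summary for both the no-alarm and alarm states."""
    if not alarms:
        # Never exit silently: a healthy result is still reported.
        return f"[Inspection] Cluster {cluster}: no active alarms, environment normal."

    lines = [f"[Inspection] Cluster {cluster}: {len(alarms)} active alarm(s)."]
    for alarm in alarms:
        lines.append(f"- {alarm['name']} ({alarm['severity']}) on {alarm['resource']}")
    return "\n".join(lines)
```

Because the no-alarm branch still returns a message, on-duty personnel can tell a healthy cluster apart from a broken inspection link.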

Step 2: Receive a High CPU Usage Alarm

When AOM generates a high CPU usage alarm, the inspection robot outputs an alarm summary in Lark and adds the alarm to the automatic analysis process. In the production environment, the same entry can also receive other types of alarms, such as pod restart, node exception, scheduling failure, no backend for a Service, and HPA not taking effect.

After receiving an alarm, the Agent should pay attention to both the AOM alarm status and real-time metrics to avoid making decisions based on a single signal.
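
The cross-check between the alarm status and real-time metrics could look like the sketch below. The 80% threshold matches the alarm used in this example. The `min_ratio` parameter, the returned labels, and the function name are assumptions for illustration.

```python
def cross_check(alarm_active: bool, cpu_samples: list,
                threshold: float = 80.0, min_ratio: float = 0.6) -> str:
    """Combine the alarm status with recent CPU samples (in percent)."""
    if not cpu_samples:
        # No metric data: do not decide on the alarm signal alone.
        return "insufficient-data"

    over = sum(1 for sample in cpu_samples if sample > threshold)
    metric_high = over / len(cpu_samples) >= min_ratio

    if alarm_active and metric_high:
        return "analyze"   # both signals agree, start automatic analysis
    if alarm_active or metric_high:
        return "observe"   # signals disagree, keep watching and report
    return "normal"
```

Returning "observe" when the two signals disagree keeps a stale alarm or a short metric spike from triggering the recovery flow on its own.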

Step 3: Automatically Analyze Alarms and Generate a Recovery Preview

After detecting high CPU usage, the inspection robot automatically collects alarms, pod metrics, node watermarks, workload statuses, and affected objects to generate a diagnosis report. The report should highlight observable facts, evidence chains, and optional recovery solutions.

The recovery preview should include the following items:

| Item | Content |
|---|---|
| Alarm summary | Alarm name, severity, status, and trigger time |
| Affected objects | Cluster, namespace, workload, pod, and node |
| Key evidence | CPU watermark, node watermark, pod status, and related alarms |
| Candidate solutions | Replica scale-out, resource adjustment, node pool scale-out, manual handling, and other solutions |
| Change impact | Resource usage, scheduling conditions, cost changes, and rollback methods |
| User confirmation | Must provide clear confirmation statements or buttons. |
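
The preview items above map naturally onto a small data structure. The following sketch uses hypothetical class and field names, and the rendered text layout is only one possible format for a Lark message. The sample values reuse the objects named in this practice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Solution:
    option: str    # solution number shown to the user, for example "A"
    action: str
    impact: str
    rollback: str


@dataclass
class RecoveryPreview:
    alarm_summary: str
    affected_objects: List[str]
    key_evidence: List[str]
    solutions: List[Solution]

    def render(self) -> str:
        """Render the preview as plain text; nothing is changed in the cluster."""
        lines = [
            f"Alarm: {self.alarm_summary}",
            "Affected: " + ", ".join(self.affected_objects),
            "Evidence:",
        ]
        lines += [f"  - {item}" for item in self.key_evidence]
        lines.append("Candidate solutions:")
        for s in self.solutions:
            lines.append(f"  {s.option}. {s.action} (impact: {s.impact}; rollback: {s.rollback})")
        options = "/".join(s.option for s in self.solutions)
        lines.append(f'Reply "Confirm execution of solution {options}" to proceed.')
        return "\n".join(lines)


# Illustrative instance based on the example in this practice.
preview = RecoveryPreview(
    alarm_summary="High CPU usage on default/chat-app",
    affected_objects=["cluster demo-recovery", "namespace default", "workload chat-app"],
    key_evidence=["container CPU usage is greater than 80%"],
    solutions=[
        Solution("A", "add replicas (2 to 4)", "more CPU requested on nodes", "scale back to 2"),
        Solution("B", "upgrade resource specifications", "pods are recreated", "restore previous spec"),
    ],
)
```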

Step 4: Confirm the Recovery Solution in Lark

This is a key review step in the production change process. In this phase, the Agent only waits for user confirmation and does not perform any operations that change the live network status, such as scaling, rollback, restart, or node pool change. The Agent will only execute the recovery operation based on the confirmed solution after the user clearly replies with a confirmation statement such as "Confirm execution" or "Confirm execution of solution A/B" in Lark.

The confirmation statement must match the solution number in the recovery preview. For example, "Confirm execution of solution A" corresponds to adding replicas, and "Confirm execution of solution B" corresponds to upgrading resource specifications or expanding capacity. You can also replace the confirmation action with a Lark card button, service ticket approval, or enterprise approval process. However, the core requirement remains unchanged: Without manual confirmation, the Agent only performs analysis and preview and does not make changes to the live network.
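
The matching rule between the confirmation statement and the solution number can be sketched as a strict parser. The regular expression and function name are assumptions for illustration. The sketch deliberately rejects a bare "Confirm execution" when more than one solution was offered, so that an ambiguous reply never triggers a change.

```python
import re
from typing import List, Optional

# Accepts "Confirm execution" or "Confirm execution of solution X" only.
CONFIRM_RE = re.compile(
    r"^\s*confirm execution(?: of solution ([A-Za-z]))?\s*$", re.IGNORECASE
)


def parse_confirmation(reply: str, offered: List[str]) -> Optional[str]:
    """Return the confirmed solution number, or None if nothing may be executed."""
    match = CONFIRM_RE.match(reply)
    if match is None:
        return None

    option = match.group(1)
    if option is None:
        # A bare confirmation is unambiguous only when one solution was offered.
        return offered[0] if len(offered) == 1 else None

    option = option.upper()
    return option if option in offered else None
```

Any reply that does not return a solution number should leave the Agent in the waiting state.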

Step 5: Perform Recovery and Continuous Review

In this case, the workload is first scaled out by increasing the number of replicas of chat-app from 2 to 4. After the scale-out, the Agent continues to review the pod statuses and node capacity and finds that one of the new pods is pending due to insufficient CPU resources on the node.

This branch highlights an important aspect of production recovery: recovery actions require continuous verification. A successful scale-out request does not necessarily mean that all pods have been successfully scheduled or that the alarm has been cleared. The Agent should continue to provide the verification results to the user and propose the next steps.
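
The verification after a scale-out can be sketched as a check over the pod list. The pod dictionaries here are simplified stand-ins for what a Skill or the Kubernetes API would return: `phase` uses the standard Kubernetes pod phases, while `reason` and the function name are assumptions for illustration.

```python
def verify_scale_out(expected_replicas: int, pods: list) -> dict:
    """Check whether every expected replica is actually running."""
    running = [p for p in pods if p["phase"] == "Running"]
    pending = [p for p in pods if p["phase"] == "Pending"]
    return {
        # An accepted scale request is not enough: all replicas must be Running.
        "complete": len(running) == expected_replicas and not pending,
        "running": len(running),
        "pending": [
            {"name": p["name"], "reason": p.get("reason", "unknown")} for p in pending
        ],
    }
```

In the case described above, four replicas are expected, three are running, and one is pending, so `complete` is false and the Agent should report the pending pod and propose a next step.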

Step 6: Add Capacity and Complete Closed-Loop Operations

When the scale-out is limited by the node capacity, the Agent can generate a new capacity recovery solution, such as adding a node pool, adjusting workload requests, optimizing the HPA upper limit, or integrating with CCI for auto scaling. In this example, a node pool is added for scale-out. After the new nodes are online, Kubernetes automatically schedules the pending pods.

After the recovery action is performed, the Agent inspects the alarms, pods, nodes, and CPU metrics again and sends the closed-loop result to Lark.
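
The final re-inspection can be reduced to a small closure check over the four signals mentioned above. The parameter names, the 80% default threshold taken from this example, and the wording of the remaining risks are assumptions for illustration.

```python
def closed_loop_result(active_alarms: int, pending_pods: int,
                       not_ready_nodes: int, cpu_percent: float,
                       threshold: float = 80.0) -> dict:
    """Decide whether the incident can be closed and list what is still open."""
    remaining = []
    if active_alarms > 0:
        remaining.append(f"{active_alarms} alarm(s) still active")
    if pending_pods > 0:
        remaining.append(f"{pending_pods} pod(s) still pending")
    if not_ready_nodes > 0:
        remaining.append(f"{not_ready_nodes} node(s) not ready")
    if cpu_percent > threshold:
        remaining.append(f"CPU usage {cpu_percent:.0f}% is above {threshold:.0f}%")
    return {"closed": not remaining, "remaining_risks": remaining}
```

The `remaining_risks` list can be included in the Lark message as is, so that an incident that is not yet closed still produces a clear status.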

Hermes Task Prompt Reference

The following prompt can be used as a task template for the Hermes ChatOps on-duty Agent. It omits specific commands and environment variables and retains only the role, process, output structure, and security boundaries.

You are the intelligent O&M Agent of the CCE production environment. You are responsible for performing alarm inspection, alarm merging, automatic analysis, recovery preview, recovery after user confirmation, recovery verification, and Lark closed-loop notification in the target CCE environment.

Prerequisites:
- CCE-related cloud-native Skills have been imported to the current Agent in advance.
- The target cluster, inspection scope, notification channel, and access credentials have been provided by the runtime environment.
- All notifications are sent to Lark or the on-duty channel specified by the customer.

Purposes:
1. Periodically scan active alarms and recent historical alarms in the target CCE environment.
2. Deduplicate, merge, classify, route, and summarize the impact scope of alarms.
3. When the inspection is normal, output a concise health summary and do not exit silently.
4. When alarms need to be handled, automatically aggregate the context, including AOM alarms, real-time metrics, Kubernetes events, pod/workload/node statuses, log summaries, and recent changes.
5. Generate a diagnosis report for on-duty personnel, including the alarm summary, affected objects, key evidence, possible causes, recommended solutions, and actions to be confirmed.
6. For actions involving resource changes, only a recovery preview is generated, and no action is directly performed on the live network.
7. The recovery action is executed only after the user clearly replies with a confirmation statement such as "Confirm execution" or confirms a specific solution in Lark.
8. After the execution, verify the alarm statuses, pod statuses, workload replicas, node capacity, and key metrics, and send the closed-loop result to Lark.

Recommended structure of the inspection report:
- Inspection summary: cluster, time window, number of active alarms, and key resource statuses
- Alarm discovery: alarm name, severity, status, affected objects, and current observed value
- System analysis: alarm statuses, real-time metrics, pod/workload/node statuses, related events, and recent changes
- Recovery solution: Provide two to three optional solutions, including the scenarios, impact scope, rollback method, and verification method.
- User confirmation: Prompt the user to reply "Confirm" or select a specific solution.

Security boundaries:
- Only read-only operations are allowed during alarm scanning, evidence collection, and root cause analysis.
- Recovery cannot be performed based on a single alarm.
- For all write operations, the recovery preview, impact scope, rollback method, and verification method must be provided first.
- The solution number and meaning in the same alarm must be consistent. After the user confirms the solution, the solution is executed.
- Before user confirmation, do not perform any operations that change the live network status, such as scaling, rollback, restart, or node pool change.
- Do not expose sensitive information such as AKs/SKs, tokens, certificates, and project IDs in the output.
- If the evidence is insufficient, list possible causes and information that requires manual review. Do not make judgments for the user without sufficient evidence.

Lark output requirements:
- When the inspection is normal, output a concise health summary.
- When an alarm is detected, output the alarm summary and affected objects first, and then output the key evidence and candidate solutions.
- If recovery is required, clearly prompt the user to reply "Confirm" or select a specific solution. Before receiving manual confirmation, wait for confirmation and do not perform the change.
- After the recovery is complete, output the execution actions, verification results, remaining risks, and subsequent suggestions.

Diagnosis Results

The point of this case is not that high CPU usage always calls for capacity expansion. Instead, it demonstrates a transferable method: Alarm detection by the Agent → Evidence aggregation by the Skill → Manual confirmation and recovery → System execution and verification. You can replace any of the phases with your own tools, approval processes, and business rules.

| Phase | Result |
|---|---|
| Alarm detection | The container CPU usage is greater than 80%. |
| Automatic analysis | AOM alarms, pod metrics, node watermarks, and workload statuses are aggregated. |
| Recovery preview | Optional solutions such as adding replicas are provided, and the Agent waits for confirmation in Lark. |
| First recovery | The workload is scaled from two replicas to four. |
| Process review | A new pod is pending due to insufficient node resources. |
| Supplementary action | A node pool is added to expand the scheduling capacity. |
| Closed-loop verification | Active alarms are cleared, the pods and nodes are normal, and the result is output to Lark. |

Extended Scenarios

You can extend this practice from multiple dimensions based on your requirements and existing tools to adapt to different O&M scenarios and service requirements. The following provides suggestions and examples for extending this practice from different dimensions.

| Extension Direction | Example |
|---|---|
| Replacing the Agent | Use Hermes, OpenClaw, AI CLI, enterprise ChatOps robot, or self-developed Agent. Different Agents provide different functions and integration capabilities. You can select an appropriate Agent based on your technology stack and requirements. |
| Changing the entry | Triggered by AOM alarms, Lark messages, service tickets, scheduled tasks, release events, or manual inquiries. Different entries can adapt to different alarm sources and trigger modes, improving the flexibility and response speed of alarm handling. |
| Changing the Skill combination | Orchestrate different Skills for scenarios such as pods, nodes, networks, storage, HPAs, and costs. By combining different Skills, you can provide more accurate diagnosis and recovery capabilities for specific O&M scenarios. |
| Changing approval methods | Use Lark reply, Lark card button, service ticket approval, and change approval flow. Different approval methods can adapt to different enterprise approval processes and security requirements, ensuring the compliance and security of recovery actions. |
| Changing recovery actions | Scaling, HPA adjustment, node pool scale-out, rollback, node isolation, and stopping abnormal tasks. Different recovery actions can address different fault types and recovery requirements, improving the flexibility and effectiveness of recovery. |
| Changing the archiving mode | Output to Lark, service tickets, OBS, daily reports, knowledge bases, or audit systems. Different archiving modes can meet different recording and audit requirements, ensuring that the alarm handling process can be traced and reviewed. |

The following table lists the typical extension scenarios.

| Scenario | Orchestration Approach |
|---|---|
| Frequent pod restarts | Aggregate the number of restarts, previous logs, OOM events, probe configurations, and events to generate a preview of rollback or resource adjustment. This helps quickly locate and resolve frequent pod restarts. |
| Pod Pending | Analyze the node capacity, taints and tolerations, affinity, PVC, image pull, and quota to generate scheduling recovery suggestions. This helps resolve pod scheduling failures. |
| Abnormal nodes | Associate node statuses, resource watermarks, component statuses, and events to generate a preview of isolation, migration, or node pool scale-out. This helps quickly handle node exceptions. |
| No backend for the Service | Analyze the Deployment, Endpoint, Service Selector, and release statuses to locate release or selector issues and generate corresponding recovery suggestions. |
| HPA not taking effect | Analyze metric collection, request configuration, HPA upper and lower limits, and scaling events to diagnose the causes and generate corresponding recovery suggestions. |
| Periodic inspection | Periodically output alarms, resource watermarks, abnormal pods, node risks, and cost optimization suggestions. This helps promptly identify and address potential issues and optimize resource utilization. |
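
As an example of one extension scenario, the sketch below classifies the likely cause of a pending pod from a scheduling event message. The substrings resemble messages commonly emitted by the Kubernetes scheduler, but the exact wording varies between versions, so treat both the hint list and the labels as assumptions to be tuned against the events in your clusters.

```python
# (substring to look for in the lowercased message, human-readable cause)
PENDING_HINTS = [
    ("insufficient cpu", "node capacity (CPU)"),
    ("insufficient memory", "node capacity (memory)"),
    ("taint", "taints and tolerations"),
    ("affinity", "affinity or node selector"),
    ("persistentvolumeclaim", "PVC binding"),
    ("quota", "resource quota"),
]


def classify_pending(message: str) -> list:
    """Return the candidate causes found in a scheduling event message."""
    text = message.lower()
    matched = [label for needle, label in PENDING_HINTS if needle in text]
    # Without a match, hand the case over for manual review instead of guessing.
    return matched or ["unclassified, needs manual review"]
```

The result is only a candidate cause list for the diagnosis report. It does not replace the evidence aggregation done by the diagnosis Skills.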

Expected Results

After completing this practice, you can obtain the following benefits:

  1. After a CCE alarm is reported to Lark, the Agent automatically starts the diagnosis link.
  2. O&M personnel can view the alarm summary, evidence, and candidate solutions without repeatedly switching between multiple systems.
  3. Recovery actions are confirmed on Lark before execution, reducing misoperations.
  4. After the recovery is executed, the alarm convergence, pod status, node capacity, and metric trends are automatically verified.
  5. The alarm handling process can be archived, audited, and reviewed.
  6. The same Agent + Skill orchestration approach can be extended to more CCE O&M scenarios.