Updated on 2026-06-05 GMT+08:00

Using OpenClaw to Perform Periodic Inspection on CCE Clusters

Scenarios

In production environments, you need to continuously monitor node health, pod status, core add-ons, resource usage, Kubernetes events, AOM alarms, and service ingress status of CCE clusters. By connecting OpenClaw Agent to cluster inspection, you can configure periodic inspection tasks in natural language. The Agent then automatically performs cluster health checks, aggregates alarms, analyzes exceptions, classifies risks, generates reports, and sends notifications.

In this practice, you are advised to perform a quick inspection first and then an in-depth inspection if any exception is detected.

  • When the cluster is normal, the Agent outputs a concise health summary to reduce noise.
  • If an exception is detected during the quick inspection, the Agent automatically extends the inspection to dimensions such as the pod, node, event, AOM, ELB, and resource usage.
  • The Agent queries AOM alarms generated in the last 24 hours and aggregates them by alarm type, severity, current status, and repetition frequency. It distinguishes active alarms, cleared alarms, burst alarms, and recurring alarms.
  • An in-depth inspection adds top N historical pod metrics and top N node CPU, memory, and disk metrics, which helps determine whether exceptions are related to resource watermarks.
  • AI grades risks based on inspection evidence, and the report shows the impact scope, possible causes, and suggested next steps. The grading result is used for the summary and suggestions. The tool is not required to return fixed fields.
  • During the inspection, only read-only queries and report generation are performed. Change actions such as scale-out, deletion, restart, and drain are not automatically executed.
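
The alarm aggregation described in the list above can be pictured with a small sketch. It is illustrative only: the `Alarm` record, its field names, and the thresholds for "burst" and "recurring" are assumptions made for this example and do not reflect the AOM API or the Agent's internal logic.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alarm:
    """Hypothetical alarm record; real AOM responses use their own field names."""
    name: str
    severity: str
    fired_at: datetime
    cleared: bool


def aggregate_alarms(alarms, recurring_min=3, burst_min=3,
                     burst_window=timedelta(minutes=10)):
    """Group alarms by (name, severity) and label each group."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[(alarm.name, alarm.severity)].append(alarm)

    summary = []
    for (name, severity), items in groups.items():
        times = sorted(a.fired_at for a in items)
        # Burst: at least `burst_min` firings inside one sliding window.
        burst = any(
            times[i + burst_min - 1] - times[i] <= burst_window
            for i in range(len(times) - burst_min + 1)
        )
        # A group is active as long as one of its alarms is not cleared.
        labels = ["active" if any(not a.cleared for a in items) else "cleared"]
        if burst:
            labels.append("burst")
        elif len(items) >= recurring_min:
            labels.append("recurring")
        summary.append({"name": name, "severity": severity,
                        "count": len(items), "labels": labels})
    # Most frequent groups first.
    return sorted(summary, key=lambda g: g["count"], reverse=True)
```

Grouping before labeling keeps the report short: 140 raw alarms typically collapse into a handful of groups, each with a count and a status.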

OpenClaw Agent can be used to:

  • Perform scheduled CCE cluster inspection every day or week.
  • Automatically generate inspection reports in Markdown and HTML formats.
  • Push the inspection summary and report link to the O&M team via email.
  • Archive historical inspection reports for trend comparison and review.
  • Hand major risks over to related diagnosis capabilities for in-depth analysis when they are detected.

Constraints

  • During the inspection, only read-only queries and report generation are performed. No automatic rectification actions are taken.
  • An inspection task invokes the APIs of cloud services such as CCE, AOM, LTS, and ELB, which may incur small charges for API invocations or log queries.
  • Reports are stored in OBS, which incurs storage charges.
  • The email sending frequency is limited by the quota of SMTP or Huawei Cloud SES. You are advised to set a proper inspection frequency.
  • Do not write AKs/SKs, tokens, certificates, or real project IDs into documents, code, or dialog output.

Precautions

  • An in-depth inspection collects more metrics and context, including AOM alarms in the last 24 hours, top N historical pod metrics, and top N node CPU, memory, and disk metrics. The execution time may be significantly longer than that of a quick inspection.
  • Top N pod resources are queried based on the historical metric time window. The query result may contain pods that existed in the query time window but do not exist now. You can check whether the object still exists based on the current pod list.
  • Risk levels are generated by AI based on the factual evidence returned by the tool. You are advised to view the associated events, logs, and metrics before deciding whether to proceed with the recovery process.

Prerequisites

  • You have created a CCE cluster, and its status is Running.
  • You have enabled the OpenClaw service and have initialized the Agent.
  • You have connected the Agent to Huawei Cloud cloud-native capabilities.
  • You have installed the Cloud Native Cluster Monitoring add-on in the target CCE cluster.
  • You have configured AOM alarm rules for the target CCE cluster based on best practices. For details, see Using AI CLI to Configure, Query, and Manage CCE AOM Alarms.
  • You have configured Huawei Cloud access credentials. You are advised to use OpenClaw key management or environment variable injection to avoid exposing AKs/SKs in documents, scripts, or dialogs.
  • The inspection account has read-only permissions on CCE, AOM, LTS, ELB, and other related resources.
  • If email notifications are required, you have prepared SMTP or Huawei Cloud SES.
  • If you need to archive reports, you have prepared an OBS bucket or another storage location for reports.
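
For the credential prerequisite above, the following is a minimal sketch of environment variable injection. The variable names `HUAWEICLOUD_SDK_AK` and `HUAWEICLOUD_SDK_SK` and the `mask` helper are assumptions for illustration; use the names provided by your key management or injection mechanism.

```python
import os


def load_credentials(env=os.environ):
    """Read the AK/SK from environment variables instead of hard-coding them.

    The variable names are an assumption made for this sketch.
    """
    names = ("HUAWEICLOUD_SDK_AK", "HUAWEICLOUD_SDK_SK")
    missing = [n for n in names if not env.get(n)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]


def mask(secret, visible=4):
    """Return a log-safe form that shows only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Failing early when a variable is missing avoids a half-configured inspection run, and masking keeps the values out of logs, reports, and dialog output.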

Recommended Input

You can directly describe the target cluster, inspection period, inspection scope, and notification method in the OpenClaw dialog.

Scenarios and recommended input

| Scenario | Recommended Input |
| --- | --- |
| Creating a daily inspection task | Create a daily inspection task for the test-ai-diagnoses cluster in CN North-Beijing4. The task should be executed at 9:00 a.m. every day. Perform a quick inspection first. If any exception is detected, perform an in-depth inspection. Send the report to the O&M team. |
| Executing the inspection task immediately | Execute the daily inspection task for the test-ai-diagnoses cluster in CN North-Beijing4 immediately. Perform a quick inspection first. If any exception is detected, perform an in-depth inspection. |
| Viewing the latest report | View the latest inspection report of the test-ai-diagnoses cluster and list the risks by severity. |
| Analyzing the trend of the last 7 days | Summarize the inspection results of the test-ai-diagnoses cluster in the last 7 days and tell me whether the number of risks has increased. |
| Performing in-depth analysis on exceptions | Continue to analyze the high-risk node issues in the inspection report and associate events, metrics, and related pods. |

You are advised to focus on the following items in the inspection result:

| Output | Focus |
| --- | --- |
| Inspection result | Whether the cluster is healthy and whether there are high-risk exceptions. |
| Exception group | Whether exceptions are concentrated on pods, nodes, events, AOM, ELB, or resources. |
| Impact scope | Which namespaces, nodes, workloads, or service ingresses are affected. |
| Risk trend | Whether issues are new, have spread, or have been resolved compared with the previous day or the last seven days. |
| Recommended action | Continue to observe, enter special diagnosis, perform scale-out evaluation, optimize rules, or transfer to the recovery process. |

Procedure

Step 1: Create a Periodic Inspection Task for a CCE Cluster

Ask the Agent to create a periodic inspection task for a CCE cluster. The Agent identifies the region, cluster name, inspection time, report format, and notification method from the input and generates an inspection plan.

  1. Enter the following content in the OpenClaw dialog:
    Create a daily inspection task for the test-ai-diagnoses cluster in CN North-Beijing4. The task should be executed at 9:00 a.m. every day. Perform a quick inspection first. If any exception is detected, perform an in-depth inspection. Generate Markdown and HTML inspection reports and send them to ops-team@company.com.
  2. The Agent automatically generates an inspection plan. Confirm the following information:

    | Configuration Item | Example | Description |
    | --- | --- | --- |
    | Region | cn-north-4 | Region where the target CCE cluster is located |
    | Cluster name | test-ai-diagnoses | CCE cluster to be inspected |
    | Inspection time | 09:00 every day | Off-peak hours or before shift handover is recommended. |
    | Inspection policy | Quick inspection first, then in-depth inspection if exceptions are found | Reduce unnecessary heavy checks by default. |
    | Report format | Markdown and HTML | Easy to read and archive via email. |
    | Recipient | ops-team@company.com | Used for receiving the inspection summary and report link. |
    | Storage location | obs://your-bucket/reports/ | Used for saving historical inspection reports. |

  3. After the task is generated, the Agent performs the inspection as planned. You are advised to trigger an inspection immediately to verify the configuration.
    Perform an inspection on the test-ai-diagnoses cluster immediately and send the test report.
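
The plan confirmed in this step can be thought of as a small record plus a rule for the next run time. The sketch below is illustrative only: `InspectionPlan` and `next_run` are hypothetical names, and the fixed GMT+08:00 offset is an assumption. It does not describe how the Agent stores or schedules tasks.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone

CST = timezone(timedelta(hours=8))  # GMT+08:00 as a fixed offset (assumption)


@dataclass(frozen=True)
class InspectionPlan:
    """Hypothetical container for the plan fields confirmed in this step."""
    region: str
    cluster: str
    run_at: time
    report_formats: tuple
    recipient: str
    storage: str


def next_run(plan, now):
    """Return the next daily run time strictly after `now` (timezone-aware)."""
    local_now = now.astimezone(CST)
    candidate = datetime.combine(local_now.date(), plan.run_at, tzinfo=CST)
    if candidate <= local_now:
        candidate += timedelta(days=1)
    return candidate


plan = InspectionPlan(
    region="cn-north-4",
    cluster="test-ai-diagnoses",
    run_at=time(9, 0),
    report_formats=("markdown", "html"),
    recipient="ops-team@company.com",
    storage="obs://your-bucket/reports/",
)
```

For example, `next_run(plan, datetime(2026, 5, 31, 8, 30, tzinfo=CST))` returns 09:00 on the same day, while a call made at or after 09:00 returns 09:00 on the following day.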

Step 2: View the Inspection Report and Identify Risks

  1. After the inspection is complete, the Agent generates an inspection summary and a complete report. You can directly view the latest inspection result.
    View the latest inspection report of the test-ai-diagnoses cluster and list the risks by severity.
  2. The Agent returns the inspection summary first.

    | Inspection Item | Example Result | Focus |
    | --- | --- | --- |
    | Inspection result | Warning | Whether the cluster has risks that need to be handled |
    | Check item quantity | 12 items: 9 passed, 2 warnings, and 1 failed | Whether there are new risks |
    | Node health | 3/3 nodes are normal. | Whether there are NotReady, resource pressure, or node events |
    | Pod statuses | Two pods are abnormal. | Whether there are CrashLoopBackOff, Pending, or Evicted pods |
    | AOM alarms | 140 alarms in the last 24 hours, 4 of which are not cleared | Whether there are continuous, burst, or recurring alarms |
    | Core add-ons | Normal | Whether CoreDNS, network, and storage add-ons are healthy |
    | Top N pod resources | Top N CPU/memory metrics in the last 24 hours | Whether there are metrics of historical high-watermark pods or disappeared pods |
    | Top N node resources | Top N node CPU/memory/disk metrics | Whether there are risks in the node CPU, memory, and disk capacity |

  3. If an exception is detected during the inspection, the Agent determines the severity based on the evidence returned by the tool and outputs a list of issues.

    | Severity | Type | Resource | Issue | Evidence | Suggestion |
    | --- | --- | --- | --- | --- | --- |
    | High | Pod health | default/test | Available replicas: 0 | The Deployment expects two replicas, but the number of ready replicas is 0. Related pods are in the Pending state. | Check pod events and image pull status, and restore workload availability first. |
    | High | Node resources | 192.168.32.2 | Node CPU remaining high | Node CPU usage is 100%, and remains high in the last 24 hours. | Locate the high-CPU process or pod on the node, and evaluate migration or scale-out if necessary. |
    | Medium | AOM alarms | default/test-* | Repeated image pull failure alarms | FailedPullImage and BackOffPullImage alarms occurred in the last 24 hours, and some alarms are not cleared. | Correct the image path or tag, and trigger the rolling update of the Deployment again. |

  4. For major or recurring issues, you can ask the Agent to continue the analysis.
    Continue to analyze the unavailability of the default/test replica, and associate AOM alarms, Kubernetes events, pod statuses, and related metrics generated in the last 24 hours.

    The Agent continues to aggregate context based on the inspection report and outputs root cause clues, the impact scope, and suggested next steps. For example, for major alarms or high-risk resource exceptions, you can ask the Agent to further analyze the related pods, node metrics, events, and logs within the corresponding time window to determine whether the exceptions share the same root cause.
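
Listing risks by severity, as requested in this step, amounts to a stable sort over the issue list. The following sketch assumes each issue is a plain dictionary with hypothetical keys (`severity`, `type`, `resource`, `issue`). The Agent's actual output is not required to use fixed fields.

```python
SEVERITY_ORDER = {"High": 0, "Medium": 1, "Low": 2}


def sort_issues(issues):
    """Order issues by severity, then by resource name; unknown levels go last."""
    return sorted(
        issues,
        key=lambda i: (SEVERITY_ORDER.get(i["severity"], len(SEVERITY_ORDER)),
                       i["resource"]),
    )


def to_markdown(issues):
    """Render the sorted issue list as a Markdown table."""
    lines = ["| Severity | Type | Resource | Issue |", "| --- | --- | --- | --- |"]
    for issue in sort_issues(issues):
        lines.append("| {severity} | {type} | {resource} | {issue} |".format(**issue))
    return "\n".join(lines)


issues = [
    {"severity": "Medium", "type": "AOM alarms", "resource": "default/test-*",
     "issue": "Repeated image pull failure alarms"},
    {"severity": "High", "type": "Pod health", "resource": "default/test",
     "issue": "Available replicas: 0"},
    {"severity": "High", "type": "Node resources", "resource": "192.168.32.2",
     "issue": "Node CPU remaining high"},
]
```

Calling `to_markdown(issues)` places both High rows before the Medium row, which matches the order used in the example issue list of this step.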

Step 3: View Historical Trends and Archive Reports

The value of periodic inspection lies not only in detecting exceptions on the current day, but also in observing whether risks persist, escalate, or recover. You can ask the Agent to summarize the inspection results over a period of time.

Summarize the inspection results of the test-ai-diagnoses cluster in the last 7 days and list the changes and new issues of high, medium, and low risks by date.
The Agent can output a trend summary.

| Date | Execution Status | Total Check Items | High Risks | Medium Risks | Low Risks | New Issues | Remarks |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2026-05-31 | Successful | 12 | 1 | 2 | 1 | 1 | Pod restart issues added |
| 2026-05-30 | Successful | 12 | 1 | 1 | 1 | 0 | Continuous node memory pressure |
| 2026-05-29 | Successful | 12 | 2 | 2 | 2 | 2 | Core add-on exceptions |
| 2026-05-28 | Successful | 12 | 0 | 0 | 0 | 0 | Cluster healthy |
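
A trend summary of this kind boils down to comparing the issue set of each run with the previous one. The sketch below shows the idea. The issue keys (`"type:resource"` strings) and the sample data are hypothetical, and the code is not the Agent's implementation.

```python
def compare_runs(previous, current):
    """Return new, resolved, and persisting issue keys between two runs."""
    previous, current = set(previous), set(current)
    return {
        "new": sorted(current - previous),
        "resolved": sorted(previous - current),
        "persisting": sorted(current & previous),
    }


def trend(runs):
    """Summarize runs given as (date, issue_keys) pairs ordered oldest first."""
    rows, previous = [], set()
    for day, issue_keys in runs:
        diff = compare_runs(previous, issue_keys)
        rows.append({"date": day, "total": len(set(issue_keys)),
                     "new": len(diff["new"]), "resolved": len(diff["resolved"])})
        previous = set(issue_keys)
    return rows


# Hypothetical sample data for illustration only.
runs = [
    ("2026-05-28", []),
    ("2026-05-29", ["addon:coredns", "node-memory:192.168.32.2"]),
    ("2026-05-30", ["node-memory:192.168.32.2"]),
    ("2026-05-31", ["node-memory:192.168.32.2", "pod-restart:default/test"]),
]
```

An issue that appears in the "persisting" set for several consecutive days is a candidate for in-depth analysis, even if its severity has not changed.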

You can also view the report archive location.
List the Markdown and HTML links of the inspection reports for the test-ai-diagnoses cluster in the last seven days.
Each report is expected to contain the following content.

| Report Content | Description |
| --- | --- |
| Inspection summary | Overall cluster status, number of check items, and number of high/medium/low risks |
| Exception list | Displayed by pod, node, event, AOM, ELB, and resource |
| Risk trend | Comparison with the previous inspection or the trend in the last seven days |
| Root cause clues | Providing entries to related logs, events, and metrics for major exceptions |
| Recommended action | Continue to observe, enter special diagnosis, perform capacity evaluation, or transfer to the recovery process. |

Expected Results

After completing this practice, you can use OpenClaw Agent to complete the following closed-loop operations:

  1. Automatically perform a periodic inspection for the target CCE cluster.
  2. Perform a quick inspection first by default. If any exception is detected, perform an in-depth diagnosis or a parallel inspection.
  3. Summarize exceptions by pod, node, event, AOM, ELB, and resource.
  4. Have AI mark the risk severity and impact scope based on the inspection evidence.
  5. Automatically generate Markdown and HTML inspection reports and push them to the O&M team via email.
  6. Archive historical inspection reports and support daily viewing and trend comparison.
  7. For major or persistent risks, continue to collaborate with logs, events, metrics, and related diagnosis capabilities for root cause analysis.

FAQs

Is In-Depth Inspection Required for Each Inspection?

Not recommended. By default, perform a quick inspection first, and perform an in-depth diagnosis or a parallel inspection only if an exception is detected. This reduces unnecessary API invocations, log queries, and report noise, and improves inspection efficiency.
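
The quick-first policy can be summarized in a few lines of control flow. The sketch below is a simplified illustration. `run_inspection`, `quick_check`, and `deep_checks` are hypothetical names, and the real Agent decides which dimensions to extend based on the evidence it collects.

```python
def run_inspection(quick_check, deep_checks):
    """Run the quick check; run the deep checks only if it reports findings.

    quick_check: callable returning a list of findings (empty when healthy).
    deep_checks: mapping of dimension name to a callable taking the findings.
    """
    findings = quick_check()
    if not findings:
        # Healthy cluster: skip the expensive queries and keep the report short.
        return {"status": "healthy", "findings": [], "deep_results": {}}
    deep_results = {name: check(findings) for name, check in deep_checks.items()}
    return {"status": "abnormal", "findings": findings, "deep_results": deep_results}
```

Because the deep checks receive the quick findings as input, they can narrow their queries to the affected pods, nodes, or time window instead of scanning the whole cluster.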

Will High-Risk Issues Found in the Inspection Be Automatically Rectified?

No. In this practice, OpenClaw Agent is used only for inspection and report generation and does not perform rectification actions. If rectification is required, the Agent can hand the issue over to the corresponding diagnosis or rectification capability, and the action is confirmed before any change is made.

Why Are Pods That Are Currently Inaccessible Displayed in the Top N Pod Resources?

Top N pod resources are used to analyze resource usage in a historical time window. By default, historical metrics of the last 24 hours are queried, so pods that have since been deleted or rebuilt may be displayed. You can ask the Agent to query the current pod list and describe historical metric objects and current inventory objects separately to better understand resource usage.
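
Separating historical metric objects from current inventory objects is a simple set lookup. In the sketch below, the record layout (`namespace`, `name`, `cpu_cores`), the function name, and the sample pod names are hypothetical.

```python
def split_top_pods(top_pods, current_pods):
    """Split top N metric records into pods that still exist and pods that are gone.

    top_pods: list of dicts with "namespace" and "name" keys (historical metrics).
    current_pods: iterable of (namespace, name) tuples from the current pod list.
    """
    current = set(current_pods)
    existing, gone = [], []
    for pod in top_pods:
        key = (pod["namespace"], pod["name"])
        (existing if key in current else gone).append(pod)
    return existing, gone


# Hypothetical sample data for illustration only.
top_pods = [
    {"namespace": "default", "name": "test-7c9d-abcde", "cpu_cores": 1.8},
    {"namespace": "default", "name": "test-5f6b-xyz12", "cpu_cores": 1.2},
]
current_pods = [("default", "test-7c9d-abcde")]
existing, gone = split_top_pods(top_pods, current_pods)
```

Here `gone` contains the rebuilt pod `test-5f6b-xyz12`. Its metrics remain useful for judging historical watermarks, but it should not be reported as a current object.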

Why Can't I Receive the Email?

You are advised to check the email recipient, SMTP or SES configuration, sending records, email service quota, and enterprise email interception policy. If a report has been generated but the email fails to be sent, you can view the report on the OpenClaw console or in the report archive path.
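
When troubleshooting delivery, it can help to reproduce the notification outside the Agent. The following sketch builds a summary email with a plain-text part and an HTML alternative using the Python standard library. The function name, sender address, subject format, and report URL are assumptions for illustration, and the message is only built here, not sent.

```python
from email.message import EmailMessage


def build_report_email(sender, recipient, cluster, summary_text, summary_html, report_url):
    """Build a multipart/alternative message carrying the inspection summary."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"[CCE inspection] {cluster}"
    # Plain-text body first, then the HTML alternative for mail clients that render it.
    msg.set_content(f"{summary_text}\n\nFull report: {report_url}\n")
    msg.add_alternative(
        f'{summary_html}<p><a href="{report_url}">Full report</a></p>',
        subtype="html",
    )
    return msg


msg = build_report_email(
    sender="inspector@example.com",          # hypothetical sender
    recipient="ops-team@company.com",
    cluster="test-ai-diagnoses",
    summary_text="Result: Warning (9 passed, 2 warnings, 1 failed)",
    summary_html="<p>Result: <b>Warning</b></p>",
    report_url="https://example.com/reports/latest.html",  # hypothetical link
)
```

To test your own SMTP settings, the message can be passed to `smtplib.SMTP(...).send_message(msg)` with the host, port, and authentication of your mail service. Sending a link rather than the full report as an attachment keeps the message small and less likely to be blocked by interception policies.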

How Long Should Historical Reports Be Retained?

You are advised to retain inspection reports of a production cluster for at least 30 days. If you need to perform monthly stability reviews, SLA statistics, or capacity trend analysis, you can retain the reports for 90 days or longer and configure OBS lifecycle policies to control storage costs.
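
Before applying a lifecycle policy, you may want to preview which archived reports a given retention period would expire. The sketch below does this offline. The object keys and the function name are hypothetical, and the actual expiration is performed by OBS according to the lifecycle rule you configure.

```python
from datetime import date, timedelta


def expired_reports(report_dates, today, retention_days=30):
    """Return the keys of reports older than the retention period.

    report_dates: mapping of object key to the report's date.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(key for key, day in report_dates.items() if day < cutoff)


# Hypothetical object keys for illustration only.
reports = {
    "reports/2026-06-04.md": date(2026, 6, 4),
    "reports/2026-05-06.md": date(2026, 5, 6),
    "reports/2026-04-30.md": date(2026, 4, 30),
}
```

With `today=date(2026, 6, 5)` and the default 30 days, only `reports/2026-04-30.md` falls before the cutoff. Raising `retention_days` to 90 keeps all three reports.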
