Updated on 2026-06-05 GMT+08:00

Using OpenClaw to Perform Periodic Inspection on CCE Clusters

Scenarios

In a production environment, you need to continuously monitor the node health, pod status, core add-ons, resource usage, Kubernetes events, AOM alarms, and service ingress status of CCE clusters. By connecting OpenClaw Agent to cluster inspection, you can configure periodic inspection tasks in natural language so that the Agent automatically performs cluster health checks, aggregates alarms, analyzes exceptions, classifies risks, generates reports, and sends notifications.

In this practice, you are advised to perform a quick inspection first and then an in-depth inspection only if an exception is detected.

When the cluster is normal, the Agent outputs a concise health summary to reduce noise.
If an exception is detected during the quick inspection, the Agent automatically extends the inspection to dimensions such as the pod, node, event, AOM, ELB, and resource usage.
The Agent queries AOM alarms generated in the last 24 hours and aggregates them by alarm type, severity, current status, and repetition frequency. It distinguishes active alarms, cleared alarms, burst alarms, and recurring alarms.
An in-depth inspection adds top N historical pod metrics and top N node CPU, memory, and disk metrics, which help determine whether exceptions are related to high resource usage.
AI grades risks based on inspection evidence and shows the impact scope, possible causes, and suggested next steps in the report. The grading is used only for the summary and suggestions. The inspection tool is not required to return fixed fields.
During the inspection, only read-only queries and report generation are performed. Change actions such as scale-out, deletion, restart, and drain are not automatically executed.
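
The alarm aggregation described above can be sketched as follows. This is a minimal illustration, not the Agent's actual logic: the `Alarm` record, the `NodeCPUHigh` alarm name, and both thresholds are hypothetical and chosen only to show how alarms might be grouped by type and labeled as active or cleared and as burst or recurring.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    """Hypothetical, simplified alarm record. Real AOM alarms carry more fields."""
    alarm_type: str
    severity: str
    cleared: bool
    timestamp: int  # seconds since the start of the 24-hour window


# Assumed thresholds, chosen only for illustration.
RECURRING_MIN_COUNT = 3     # an alarm type seen at least this many times
BURST_WINDOW_SECONDS = 300  # ...all within this many seconds counts as a burst


def aggregate(alarms):
    """Group alarms by type and label each group by status and pattern."""
    by_type = {}
    for alarm in alarms:
        by_type.setdefault(alarm.alarm_type, []).append(alarm)

    summary = {}
    for alarm_type, group in by_type.items():
        times = sorted(a.timestamp for a in group)
        span = times[-1] - times[0]
        if len(group) >= RECURRING_MIN_COUNT and span <= BURST_WINDOW_SECONDS:
            pattern = "burst"
        elif len(group) >= RECURRING_MIN_COUNT:
            pattern = "recurring"
        else:
            pattern = "isolated"
        summary[alarm_type] = {
            "count": len(group),
            "severities": dict(Counter(a.severity for a in group)),
            # A group is active as long as one of its alarms is not cleared.
            "status": "cleared" if all(a.cleared for a in group) else "active",
            "pattern": pattern,
        }
    return summary


sample = [
    Alarm("FailedPullImage", "Major", False, 0),
    Alarm("FailedPullImage", "Major", True, 3600),
    Alarm("FailedPullImage", "Major", True, 7200),
    Alarm("NodeCPUHigh", "Critical", True, 100),
]
result = aggregate(sample)
print(result["FailedPullImage"]["pattern"], result["FailedPullImage"]["status"])
```

In this sample, the three `FailedPullImage` alarms are spread over two hours, so the group is labeled recurring rather than burst, and it stays active because one alarm is not cleared.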

OpenClaw Agent can be used to:

Perform scheduled CCE cluster inspection every day or week.
Automatically generate inspection reports in Markdown and HTML formats.
Push the inspection summary and report link to the O&M team via email.
Archive historical inspection reports for trend comparison and review.
Hand over major risks to the related diagnosis capabilities for in-depth analysis.

Constraints

During the inspection, only read-only queries and report generation are performed. No automatic rectification actions are taken.
During an inspection task, APIs of cloud services such as CCE, AOM, LTS, and ELB are invoked, which may incur small API invocation or log query fees.
Reports are stored in OBS, which incurs storage fees.
The email sending frequency is limited by the quota of your SMTP service or Huawei Cloud SES. You are advised to set the inspection frequency accordingly.
Do not write AKs/SKs, tokens, certificates, or real project IDs into documents, code, or dialog output.

Precautions

An in-depth inspection collects more metrics and context, including AOM alarms in the last 24 hours, top N historical pod metrics, and top N node CPU, memory, and disk metrics. The execution time may be significantly longer than that of a quick inspection.
Top N pod resources are queried based on the historical metric time window. The query result may contain pods that existed in the query time window but do not exist now. You can check whether the object still exists based on the current pod list.
Risk levels are generated by AI based on the factual evidence returned by the tool. You are advised to view the associated events, logs, and metrics before deciding whether to proceed with the recovery process.
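
The check on top N pods mentioned above can be sketched as a comparison between the historical top N list and the current pod list. This is only an illustration: the pod names and the `split_by_existence` helper are hypothetical, and both lists are assumed to have been fetched already as `namespace/name` strings.

```python
def split_by_existence(top_n_pods, current_pods):
    """Separate historical top N pods into those that still exist and those that do not."""
    current = set(current_pods)
    still_running = [pod for pod in top_n_pods if pod in current]
    gone = [pod for pod in top_n_pods if pod not in current]
    return still_running, gone


# Hypothetical names: pods seen in the 24-hour metric window vs. pods that exist now.
top_n = ["default/test-7d9f", "default/api-5c2b", "kube-system/coredns-1a2b"]
current = ["default/api-5c2b", "kube-system/coredns-1a2b", "default/test-8e0a"]

still_running, gone = split_by_existence(top_n, current)
print("still running:", still_running)
print("no longer present:", gone)
```

Reporting the two groups separately keeps a rebuilt or deleted pod from being mistaken for a workload that is consuming resources right now.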

Prerequisites

You have created a CCE cluster, and its status is Running.
You have enabled the OpenClaw service and have initialized the Agent.
You have connected the Agent to Huawei Cloud cloud-native capabilities.
You have installed the Cloud Native Cluster Monitoring add-on in the target CCE cluster.
You have configured AOM alarm rules for the target CCE cluster based on best practices. For details, see Using AI CLI to Configure, Query, and Manage CCE AOM Alarms.
You have configured Huawei Cloud access credentials. You are advised to use OpenClaw key management or environment variable injection to avoid exposing AKs/SKs in documents, scripts, or dialogs.
The inspection account has read-only permissions on CCE, AOM, LTS, ELB, and other related resources.
If email notifications are required, you have prepared SMTP or Huawei Cloud SES.
If you need to archive reports, you have prepared an OBS bucket or another storage location for reports.
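
For the credential prerequisite above, environment variable injection can be sketched as follows. The variable names `HUAWEICLOUD_SDK_AK` and `HUAWEICLOUD_SDK_SK` and both helper functions are assumptions made for illustration. Use whatever names your OpenClaw key management or deployment actually injects.

```python
import os

# Assumed variable names. Adjust them to match your environment.
AK_ENV = "HUAWEICLOUD_SDK_AK"
SK_ENV = "HUAWEICLOUD_SDK_SK"


def load_credentials(env=None):
    """Read the AK/SK pair from the environment instead of hardcoding it."""
    env = os.environ if env is None else env
    missing = [name for name in (AK_ENV, SK_ENV) if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[AK_ENV], env[SK_ENV]


def mask(secret, visible=4):
    """Return a masked form that is safe to show in logs or dialog output."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return secret[:visible] + "*" * (len(secret) - visible)


# Demonstration with placeholder values only. Never print real keys.
demo_env = {AK_ENV: "AKEXAMPLE123", SK_ENV: "placeholder-secret"}
ak, sk = load_credentials(demo_env)
print("Using AK", mask(ak))
```

Failing fast on a missing variable and masking values before printing keeps keys out of scripts, reports, and dialogs.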

Recommended Input

You can directly describe the target cluster, inspection period, inspection scope, and notification method in the OpenClaw dialog.

Scenarios and recommended input

Scenario	Recommended Input
Creating a daily inspection task	Create a daily inspection task for the test-ai-diagnoses cluster in CN North-Beijing4. The task should be executed at 9:00 a.m. every day. Perform a quick inspection first. If any exception is detected, perform an in-depth inspection. Send the report to the O&M team.
Executing the inspection task immediately	Execute the daily inspection task for the test-ai-diagnoses cluster in CN North-Beijing4 immediately. Perform a quick inspection first. If any exception is detected, perform an in-depth inspection.
Viewing the latest report	View the latest inspection report of the test-ai-diagnoses cluster and list the risks by severity.
Analyzing the trend of the last 7 days	Summarize the inspection results of the test-ai-diagnoses cluster in the last 7 days and tell me whether the number of risks has increased.
Performing in-depth analysis on exceptions	Continue to analyze the high-risk node issues in the inspection report and associate events, metrics, and related pods.

You are advised to focus on the following items in the inspection result:

Output	Focus
Inspection result	Whether the cluster is healthy and whether there are high-risk exceptions.
Exception group	Whether exceptions are concentrated on pods, nodes, events, AOM, ELB, or resources.
Impact scope	Which namespaces, nodes, workloads, or service ingresses are affected.
Risk trend	Whether issues are new, growing, or resolved compared with the previous day or the last seven days.
Recommended action	Continue to observe, enter special diagnosis, perform scale-out evaluation, optimize rules, or transfer to the recovery process.

Procedure

Step 1: Create a Periodic Inspection Task for a CCE Cluster

Ask the Agent to create a periodic inspection task for a CCE cluster. The Agent identifies the region, cluster name, inspection time, report format, and notification method from your input and generates an inspection plan.

Enter the following content in the OpenClaw dialog:

Create a daily inspection task for the test-ai-diagnoses cluster in CN North-Beijing4. The task should be executed at 9:00 a.m. every day. Perform a quick inspection first. If any exception is detected, perform an in-depth inspection. Generate Markdown and HTML inspection reports and send them to ops-team@company.com.

The Agent automatically generates an inspection plan. Confirm the following information:

Configuration Item	Example	Description
Region	cn-north-4	Region where the target CCE cluster is located
Cluster name	test-ai-diagnoses	CCE cluster to be inspected
Inspection time	09:00 every day	Off-peak hours or before shift handover is recommended.
Inspection policy	Quick inspection first, then in-depth inspection if exceptions are found	Reduce unnecessary heavy checks by default.
Report format	Markdown and HTML	Easy to read and archive via email.
Recipient	ops-team@company.com	Used for receiving the inspection summary and report link.
Storage location	obs://your-bucket/reports/	Used for saving historical inspection reports.
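
The items in the table above can be pictured as a small structured record that is checked before the task is scheduled. The sketch below is hypothetical: the `InspectionPlan` class, its field names, and the five-field cron expression are assumptions for illustration, not the format the Agent actually stores.

```python
from dataclasses import dataclass


@dataclass
class InspectionPlan:
    """Hypothetical representation of the plan the Agent asks you to confirm."""
    region: str
    cluster_name: str
    schedule_cron: str      # assumed five-field cron, e.g. "0 9 * * *" for 09:00 daily
    report_formats: list
    recipient: str
    storage_location: str
    policy: str = "quick-then-deep"

    def problems(self):
        """Return a list of human-readable problems. An empty list means the plan looks sane."""
        issues = []
        if len(self.schedule_cron.split()) != 5:
            issues.append("schedule_cron must have five fields")
        unknown = sorted(set(self.report_formats) - {"markdown", "html"})
        if unknown:
            issues.append("unsupported report formats: " + ", ".join(unknown))
        if "@" not in self.recipient:
            issues.append("recipient does not look like an email address")
        if not self.storage_location:
            issues.append("storage_location is empty")
        return issues


plan = InspectionPlan(
    region="cn-north-4",
    cluster_name="test-ai-diagnoses",
    schedule_cron="0 9 * * *",
    report_formats=["markdown", "html"],
    recipient="ops-team@company.com",
    storage_location="obs://your-bucket/reports/",
)
print(plan.problems())
```

Reviewing the plan field by field in this way mirrors the confirmation step: a wrong recipient or schedule is cheaper to catch before the first scheduled run.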

After the task is generated, the Agent performs the inspection as planned. You are advised to trigger an inspection immediately to verify the configuration.
```
Perform an inspection on the test-ai-diagnoses cluster immediately and send the test report.
```

Step 2: View the Inspection Report and Identify Risks

After the inspection is complete, the Agent generates an inspection summary and a complete report. You can directly view the latest inspection result.
```
View the latest inspection report of the test-ai-diagnoses cluster and list the risks by severity.
```

The Agent returns the inspection summary first.

Inspection Item	Example Result	Focus
Inspection result	Warning	Whether the cluster has risks that need to be handled
Check item quantity	12 items: 9 passed, 2 warnings, and 1 failed	Whether there are new risks
Node health	3/3 nodes are normal.	Whether there are NotReady nodes, resource pressure, or abnormal node events
Pod status	Two pods are abnormal.	Whether there are CrashLoopBackOff, Pending, or Evicted pods
AOM alarms	140 alarms in the last 24 hours, 4 of which are not cleared	Whether there are continuous, burst, or recurring alarms
Core add-ons	Normal	Whether CoreDNS, network, and storage add-ons are healthy
Top N pod resources	Top N CPU/memory metrics in the last 24 hours	Whether any pods show historically high usage or no longer exist
Top N node resources	Top N node CPU/memory/disk metrics	Whether there are risks in the node CPU, memory, and disk capacity

If an exception is detected during the inspection, the Agent determines the severity based on the evidence returned by the tool and outputs a list of issues.

Severity	Type	Resource	Issue	Evidence	Suggestion
High	Pod health	default/test	Available replicas: 0	The Deployment expects two replicas, but the number of ready replicas is 0. Related pods are in the Pending state.	Check pod events and image pull status, and restore workload availability first.
High	Node resources	192.168.32.2	Node CPU usage remains high	Node CPU usage is 100% and has remained high over the last 24 hours.	Locate the high-CPU process or pod on the node, and evaluate migration or scale-out if necessary.
Medium	AOM alarms	default/test-*	Repeated image pull failure alarms	FailedPullImage and BackOffPullImage alarms occurred in the last 24 hours, and some alarms are not cleared.	Correct the image path or tag, and trigger a rolling update of the Deployment again.
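
Listing issues by severity, as in the table above, amounts to a stable sort on a severity ranking. The sketch below is illustrative only: the dictionary layout and the `by_severity` helper are assumptions, and the rows reuse the example values from the table.

```python
from collections import Counter

# Lower rank sorts first. Unknown severities sort last.
SEVERITY_RANK = {"High": 0, "Medium": 1, "Low": 2}

issues = [
    {"severity": "Medium", "type": "AOM alarms", "resource": "default/test-*"},
    {"severity": "High", "type": "Pod health", "resource": "default/test"},
    {"severity": "High", "type": "Node resources", "resource": "192.168.32.2"},
]


def by_severity(items):
    """Return issues ordered High, Medium, Low, keeping input order within a level."""
    return sorted(items, key=lambda item: SEVERITY_RANK.get(item["severity"], len(SEVERITY_RANK)))


ordered = by_severity(issues)
counts = Counter(item["severity"] for item in issues)

for item in ordered:
    print(item["severity"], item["type"], item["resource"])
print(dict(counts))
```

Because Python's `sorted` is stable, issues with the same severity keep the order in which the inspection reported them.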

For major or recurring issues, you can ask the Agent to continue the analysis.
```
Continue to analyze the unavailability of the default/test replica, and associate AOM alarms, Kubernetes events, pod statuses, and related metrics generated in the last 24 hours.
```
The Agent continues to aggregate context based on the inspection report and outputs root cause clues, the impact scope, and suggested next steps. For example, for major alarms or high-risk resource exceptions, you can ask the Agent to further analyze the related pods, node metrics, events, and logs within the corresponding time window to determine whether the exceptions share the same root cause.

Step 3: View Historical Trends and Archive Reports

The value of a periodic inspection lies not only in detecting exceptions on the current day, but also in showing whether risks persist, escalate, or recover. You can ask the Agent to summarize the inspection results over a period of time.

Summarize the inspection results of the test-ai-diagnoses cluster in the last 7 days and list the changes and new issues of high, medium, and low risks by date.

The Agent can output a trend summary.

Date	Execution Status	Total Check Items	High Risks	Medium Risks	Low Risks	New Issues	Remarks
2026-05-31	Successful	12	1	2	1	1	Pod restart issues added
2026-05-30	Successful	12	1	1	1	0	Continuous node memory pressure
2026-05-29	Successful	12	2	2	2	2	Core add-on exceptions
2026-05-28	Successful	12	0	0	0	0	Cluster healthy
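
Whether the number of risks has increased can be read from the day-over-day change in the total risk count. The sketch below uses the example figures from the table above. The tuple layout and the `risk_trend` helper are assumptions for illustration.

```python
# (date, high, medium, low) taken from the example trend summary.
history = [
    ("2026-05-31", 1, 2, 1),
    ("2026-05-30", 1, 1, 1),
    ("2026-05-29", 2, 2, 2),
    ("2026-05-28", 0, 0, 0),
]


def risk_trend(rows):
    """Return per-day totals and the change from the previous day, oldest first."""
    ordered = sorted(rows)  # ISO dates sort chronologically as strings
    totals = [(date, high + medium + low) for date, high, medium, low in ordered]
    deltas = [(cur[0], cur[1] - prev[1]) for prev, cur in zip(totals, totals[1:])]
    return totals, deltas


totals, deltas = risk_trend(history)
for date, change in deltas:
    print(date, f"{change:+d}")
```

For these figures the totals are 0, 6, 3, and 4, so the risk count rose sharply on 2026-05-29, dropped the next day, and rose by one on 2026-05-31.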

You can also view the report archive location.

List the Markdown and HTML links of the inspection reports for the test-ai-diagnoses cluster in the last seven days.

An archived report is expected to contain the following content.

Report Content	Description
Inspection summary	Overall cluster status, number of check items, and number of high/medium/low risks
Exception list	Displayed by pod, node, event, AOM, ELB, and resource
Risk trend	Comparison with the previous inspection or the trend in the last seven days
Root cause clues	Links to related logs, events, and metrics for major exceptions
Recommended action	Continue to observe, enter special diagnosis, perform capacity evaluation, or transfer to the recovery process.
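
To make the report structure above concrete, the following sketch renders a summary and an exception list as Markdown. It is a simplified illustration, not the Agent's report template: the `render_markdown` function and its argument layout are hypothetical, and the sample values come from the examples in this practice.

```python
def render_markdown(cluster, result, counts, issues):
    """Build a small Markdown report: a summary followed by an exception table."""
    lines = [
        f"# Inspection report: {cluster}",
        "",
        f"Overall result: {result}",
        f"Risks: {counts['High']} high, {counts['Medium']} medium, {counts['Low']} low",
        "",
        "| Severity | Resource | Issue |",
        "| --- | --- | --- |",
    ]
    for issue in issues:
        lines.append(f"| {issue['severity']} | {issue['resource']} | {issue['issue']} |")
    return "\n".join(lines)


report = render_markdown(
    cluster="test-ai-diagnoses",
    result="Warning",
    counts={"High": 2, "Medium": 1, "Low": 0},
    issues=[
        {"severity": "High", "resource": "default/test", "issue": "Available replicas: 0"},
        {"severity": "High", "resource": "192.168.32.2", "issue": "Node CPU usage remains high"},
    ],
)
print(report)
```

Keeping the summary at the top and the per-resource rows below it lets the email carry only the first few lines while the archived file holds the full list.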

Expected Results

After completing this practice, you can use OpenClaw Agent to complete the following closed-loop operations:

Automatically perform a periodic inspection for the target CCE cluster.
Perform a quick inspection first by default. If any exception is detected, perform an in-depth or parallel inspection.
Summarize exceptions by pod, node, event, AOM, ELB, and resource.
Have AI mark the risk severity and impact scope based on the inspection evidence.
Automatically generate Markdown and HTML inspection reports and push them to the O&M team via email.
Archive historical inspection reports and support daily viewing and trend comparison.
For major or persistent risks, continue root cause analysis using logs, events, metrics, and related diagnosis capabilities.

FAQs

Is In-Depth Inspection Required for Each Inspection?

No. You are advised to perform a quick inspection first by default and run an in-depth or parallel inspection only if an exception is detected. This reduces unnecessary API invocations, log queries, and report noise, and improves inspection efficiency.
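
The quick-first policy can be sketched as a simple gate in front of the heavier checks. The function names and return shapes below are hypothetical stand-ins for whatever the inspection tool really returns. The point is only that the in-depth step is skipped when the quick step finds nothing.

```python
def run_inspection(quick_check, deep_check):
    """Run the quick check, and run the in-depth check only if exceptions are found."""
    quick = quick_check()
    if not quick["exceptions"]:
        return {"mode": "quick", "summary": "healthy", "details": []}
    details = deep_check(quick["exceptions"])
    return {"mode": "in-depth", "summary": "exceptions found", "details": details}


# Stub checks standing in for real read-only queries.
def healthy_quick():
    return {"exceptions": []}


def failing_quick():
    return {"exceptions": ["default/test: 0 ready replicas"]}


def deep(exceptions):
    return [f"in-depth analysis of {item}" for item in exceptions]


print(run_inspection(healthy_quick, deep)["mode"])
print(run_inspection(failing_quick, deep)["mode"])
```

With the healthy stub the in-depth function is never called, which is where the savings in API invocations and log queries come from.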

Will High-Risk Issues Found in the Inspection Be Automatically Rectified?

No. In this practice, OpenClaw Agent is used only for inspection and report generation and does not perform rectification actions. If rectification is required, the Agent can hand over to the corresponding diagnosis or rectification capability, and the action must be confirmed before any change is made.

Why Are Pods That No Longer Exist Displayed in the Top N Pod Resources?

Top N pod resources are used to analyze resource usage in a historical time window. By default, historical metrics of the last 24 hours are queried. Therefore, pods that have been deleted or rebuilt may be displayed. You can ask the Agent to query the current pod list and to report pods from historical metrics separately from pods that currently exist.

Why Can't I Receive the Email?

You are advised to check the email recipient, SMTP or SES configuration, sending records, email service quota, and enterprise email interception policy. If a report has been generated but the email fails to be sent, you can view the report on the OpenClaw console or in the report archive path.
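
To test the SMTP configuration independently of the Agent, you can build and send a message similar to the inspection notification. The sketch below only constructs the message with Python's standard `email` library. The sender address, subject format, report link, and `build_notification` helper are hypothetical, and the commented-out sending step assumes an SMTP server that supports STARTTLS.

```python
from email.message import EmailMessage


def build_notification(sender, recipient, cluster, summary, report_link):
    """Build a plain-text notification carrying the summary and the report link."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"[CCE inspection] {cluster}: {summary}"
    msg.set_content(f"Summary: {summary}\nFull report: {report_link}\n")
    return msg


msg = build_notification(
    sender="inspection@company.com",  # hypothetical sender
    recipient="ops-team@company.com",
    cluster="test-ai-diagnoses",
    summary="Warning",
    report_link="obs://your-bucket/reports/latest.html",  # hypothetical object name
)
print(msg["Subject"])

# Sending is left commented out. Host, port, and login are placeholders:
# import smtplib
# with smtplib.SMTP("smtp.example.com", 587) as server:
#     server.starttls()
#     server.login(user, password)  # read these from the environment, not from code
#     server.send_message(msg)
```

If a hand-built message like this is delivered but the inspection email is not, the cause is more likely the recipient setting or sending quota than the mail server itself.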

How Long Should Historical Reports Be Retained?

You are advised to retain inspection reports of a production cluster for at least 30 days. If you need to perform monthly stability reviews, SLA statistics, or capacity trend analysis, you can retain the reports for 90 days or longer and configure OBS lifecycle policies to control storage costs.
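
Before configuring a lifecycle policy, it can help to preview which archived reports a given retention period would remove. The sketch below is a local calculation only and does not call OBS. The `expired_reports` helper and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def expired_reports(report_dates, today, retention_days=30):
    """Return the report dates that are older than the retention period."""
    cutoff = today - timedelta(days=retention_days)
    return [d for d in report_dates if d < cutoff]


reports = [date(2026, 4, 1), date(2026, 5, 6), date(2026, 5, 31)]
today = date(2026, 6, 5)

print(expired_reports(reports, today))                     # 30-day retention
print(expired_reports(reports, today, retention_days=90))  # 90-day retention
```

With these dates, a 30-day period would expire only the 2026-04-01 report, while a 90-day period would keep all three.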

Helpful Links

Cloud Container Engine (CCE): Learn how to query CCE clusters, nodes, workloads, add-ons, and O&M products.
Application Operations Management (AOM): Learn how to query AOM metrics, alarms, logs, and application O&M description.
