
Updated on 2026-07-02 GMT+08:00

Using AI CLI to Diagnose and Rectify CCE Workload Faults

Scenarios

In a CCE cluster, workload releases, scaling operations, and configuration changes can leave pods not ready for a long time, suspend a Deployment rolling upgrade, or cause frequent container restarts, image pull failures, scheduling failures, and service endpoint exceptions. Traditional troubleshooting requires O&M personnel to switch repeatedly between Deployments, ReplicaSets, pods, events, container logs, probe configurations, and monitoring data, so fault locating takes a long time and relies on costly manual judgment.

By combining AI CLI with cloud-native Skills, you can describe fault symptoms in natural language. The Agent automatically completes context identification, evidence collection, root cause analysis, recovery solution preview, user confirmation, recovery execution, and result verification. This helps O&M teams standardize workload troubleshooting into a repeatable process.

This practice applies to common scenarios of CCE workload fault diagnosis and recovery, and is not bound to a fixed namespace, workload name, or single test environment. In this document, ai-diagnose-demo is only a demo namespace. Replace the demo names with your target cluster and service namespace.

Constraints

The diagnosis accuracy of AI CLI depends on the cluster access permissions, log retention period, event integrity, and quality of observability data.
All write operations must follow the "preview + confirmation" mode. Rollback, restart, scaling, and configuration modification must not be performed without user confirmation.
In the production environment, you are advised to separate diagnosis permissions from recovery permissions and record all AI CLI operations in audit logs.
Rollback depends on the historical revisions of a workload. If the required historical revision has been cleared, restore the workload by re-releasing the earlier version or restoring its configuration.
If faults involve database changes, data format changes, external dependencies, or message accumulation, evaluate data consistency and service compensation solutions before the rollback.
In scenarios involving multiple clusters, namespaces, or abnormal objects, AI CLI should require users to confirm the target scope to avoid cross-service misoperations.
Do not output sensitive information such as the AK/SK, token, certificate, and real project ID in prompts, diagnosis reports, or recovery previews.
You are advised to verify the Skill process in a demonstration or test namespace to ensure the correctness and security of the process before promoting it to production clusters.
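The constraint on sensitive information can be backed by a redaction pass over any text before it is placed in a prompt, report, or preview. The following is a minimal Python sketch under stated assumptions: the `redact` helper and the key names in the pattern are hypothetical and are not part of AI CLI or any Skill. Adapt the pattern to the credential formats used in your environment.

```python
import re

# Hypothetical pattern for illustration: matches "key=value" or "key: value"
# pairs whose key looks like a credential name.
SENSITIVE_KEY_VALUE = re.compile(
    r"\b(ak|sk|access_key|secret_key|token|password)\b(\s*[=:]\s*)(\S+)",
    re.IGNORECASE,
)


def redact(text: str) -> str:
    """Mask the value of credential-like key/value pairs, keeping the key."""
    return SENSITIVE_KEY_VALUE.sub(
        lambda m: f"{m.group(1)}{m.group(2)}***", text
    )
```

A pattern-based pass like this only catches values that appear next to a recognizable key name. It does not replace permission isolation or secret management.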

Prerequisites

You have created a CCE cluster and have deployed the target workload.
You have installed or accessed AI CLI, and have registered related Skills.
AI CLI has permission to read workloads, pods, events, logs, version history, and service endpoints.
To perform recovery, AI CLI must also have permission to roll back, restart, scale, or modify workload configurations.

Involved Skills

Skill	Function
huawei-cloud-cce-workload-failure-diagnoser	Collects workloads, pods, events, logs, and version history, and outputs diagnosis conclusions.
huawei-cloud-cce-auto-remediation-runner	Generates a recovery preview and performs actions such as rollback, restart, and scaling after user confirmation.

Procedure

Step 1: Start an AI CLI Session

Start the AI CLI based on the actual access mode of your enterprise. If you use the CLI, you can start an interactive session.

hwcloud chat

If AI CLI has been connected to the O&M platform, ChatOps tool, or pipeline, you can directly initiate a natural language request through the corresponding entry.

Step 2: Describe the Fault in Natural Language

You do not need to manually combine multiple Kubernetes commands. Instead, describe the target cluster, namespace, and fault symptom. You are advised to also state whether recovery operations are allowed and whether a preview is required before recovery. For example:

Help me diagnose the workload release exception in the ai-diagnose-demo namespace of the cce-ai-ops-demo cluster in CN North-Beijing4. The symptom is that the pod is not ready for a long time after the latest update. Please analyze the root cause and provide recovery suggestions. Before performing recovery operations, please let me confirm.

If you are not sure about the abnormal object, you can ask AI CLI to scan the namespace first.

Check the abnormal workloads in the ai-diagnose-demo namespace of the cce-ai-ops-demo cluster, find the objects that fail to be released or whose pods are not ready, and output the diagnosis conclusion and recovery suggestions.
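If your team sends such requests from scripts or pipelines, the prompt can be assembled from a few fixed fields so that the scope and recovery boundary are never omitted. The following Python sketch is illustrative only: `build_diagnosis_prompt` and its parameters are hypothetical names, and the wording of the generated prompt is an assumption rather than a format required by AI CLI.

```python
def build_diagnosis_prompt(region: str, cluster: str, namespace: str,
                           symptom: str, allow_recovery: bool = False) -> str:
    """Assemble a request that states scope, symptom, and recovery boundary."""
    parts = [
        f"Diagnose the workload exception in the {namespace} namespace "
        f"of the {cluster} cluster in {region}.",
        f"Symptom: {symptom}.",
        "Analyze the root cause and provide recovery suggestions.",
    ]
    if allow_recovery:
        # Recovery is allowed, but only through preview + confirmation.
        parts.append("Show a recovery preview and wait for my confirmation "
                     "before making any change.")
    else:
        parts.append("Diagnose only; do not perform recovery operations.")
    return " ".join(parts)
```

For example, `build_diagnosis_prompt("CN North-Beijing4", "cce-ai-ops-demo", "ai-diagnose-demo", "pods are not ready after the latest update", allow_recovery=True)` produces a request equivalent to the first sample prompt in this step.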

Step 3: Confirm the Diagnosis Scope

AI CLI identifies the region, cluster, namespace, resource type, fault time window, and fault symptom based on your input. If there are multiple abnormal workloads in the namespace, AI CLI lists the candidate objects and exception summary for you to confirm the diagnosis scope.

Confirm the following information:

Information	Description
Target cluster	Clusters can be identified by cluster name, cluster ID, or region and cluster name.
Namespace	Used to limit the diagnosis scope to avoid cross-service misoperations.
Workload type	Common workloads such as Deployments, StatefulSets, and DaemonSets can be diagnosed.
Fault symptom	For example, the pod is not ready, the rolling upgrade is suspended, the container is restarted, the image fails to be pulled, or the scheduling fails.
Recovery boundary	Whether actions such as rollback, restart, scaling, or configuration modification are allowed.
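The confirmation items in the table can be modeled as a small record whose unset fields are the ones still to be confirmed. The following Python sketch is a hypothetical illustration: the `DiagnosisScope` class and its field names are assumptions and do not reflect the internal data model of AI CLI.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DiagnosisScope:
    """Scope items to confirm before diagnosis starts (hypothetical model)."""
    cluster: Optional[str] = None
    namespace: Optional[str] = None
    workload_type: Optional[str] = None
    symptom: Optional[str] = None
    # None means "not stated yet"; an empty set means "diagnosis only".
    allowed_actions: Optional[frozenset] = None

    def missing(self) -> list:
        """Return the names of fields that still need user confirmation."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]
```

With only the cluster and namespace set, `missing()` returns the workload type, symptom, and recovery boundary, which are the items to ask the user about.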

Step 4: Automatically Collect Diagnosis Evidence

After AI CLI invokes the workload fault diagnosis Skill, the Skill should collect evidence from the control plane down to the data plane, so that the conclusion does not rely on a single log line or a single event.

Diagnosis Dimension	Key Evidence	Diagnosis Value
Workload status	Desired replicas, ready replicas, available replicas, updated replicas, and status conditions	Check whether the release is complete and whether the availability is affected.
Version mapping	Revision history, ReplicaSet status, and replica distribution of old and new versions	Check whether a healthy historical version exists and whether the rollback conditions are met.
Pod lifecycle	Pending, Running, Ready, RestartCount, and container status	Check whether the fault occurs in the scheduling, startup, running, or ready phase.
Kubernetes events	Events such as FailedScheduling, Failed (for example, image pull errors), Unhealthy, and BackOff	Quickly locate scheduling, image, probe, and container startup issues.
Container logs	Recent exception logs, startup logs, and health check logs	Identify internal application errors, dependency exceptions, or configuration issues.
Probe configurations	readinessProbe, livenessProbe, and startupProbe configurations and results	Check whether the health check path, port, protocol, and timeout configurations are proper.
Service endpoints	Service, EndpointSlice, ingress, or load balancing backend	Check whether the fault affects service traffic access.
Observability data	Metric, alarm, and log trends	Identify external factors such as resource pressure, abnormal traffic, and dependency jitter.
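To show how the pod lifecycle evidence narrows down the fault phase, the following Python sketch classifies a pod from its status. It assumes the input is a pod object in the JSON form returned by the Kubernetes API (for example, the output of kubectl get pod -o json loaded as a dictionary). The function name and the returned labels are hypothetical, and the actual logic of the diagnosis Skill is not documented here.

```python
# Container waiting reasons that point to an image problem.
IMAGE_REASONS = {"ErrImagePull", "ImagePullBackOff", "InvalidImageName"}


def classify_pod_phase(pod: dict) -> str:
    """Return a coarse fault phase for one pod (labels are illustrative)."""
    status = pod.get("status", {})
    conditions = {c["type"]: c["status"] for c in status.get("conditions", [])}

    # Not scheduled yet: the fault is in the scheduling phase.
    if status.get("phase") == "Pending" and conditions.get("PodScheduled") == "False":
        return "scheduling"

    for cs in status.get("containerStatuses", []):
        reason = cs.get("state", {}).get("waiting", {}).get("reason")
        if reason in IMAGE_REASONS:
            return "image-pull"
        if reason == "CrashLoopBackOff":
            return "startup"

    # Containers are running but the pod does not pass its readiness check.
    if status.get("phase") == "Running" and conditions.get("Ready") != "True":
        return "readiness"

    return "healthy" if conditions.get("Ready") == "True" else "unknown"
```

A classification like this is only a starting point. Events, logs, and probe configurations are still needed to explain why the pod is stuck in that phase.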

Step 5: Output the Diagnosis Conclusion and Recovery Suggestions

After the diagnosis is complete, AI CLI should output a structured conclusion to help you quickly determine whether recovery operations are required.

Recommended output:

Output Item	Content
Diagnosis conclusion	Specify the fault type, such as probe failure, image pull failure, container startup failure, scheduling failure, resource insufficiency, or configuration exception.
Impact scope	Describe the affected namespaces, workload types, number of unavailable replicas, and impact on service access.
Key evidence	List the status conditions, events, logs, or metrics that support the conclusion. Do not output sensitive information.
Cause analysis	Describe the phase in which the fault occurred in the release link and why the workload became unavailable.
Recovery suggestions	Provide one or more recovery options and describe the scenarios, risk levels, and expected effects.
Whether confirmation is required	Mark change actions, such as rollback, restart, scaling, and configuration modification, as requiring user confirmation.
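The output items in the table can be represented as one structured record in which every change action is marked as requiring confirmation. The following Python sketch is an assumption-based illustration: the `build_report` function, the key names, and the action labels are hypothetical and do not describe the actual report format of AI CLI.

```python
# Hypothetical labels for actions that change the cluster.
CHANGE_ACTIONS = {"rollback", "restart", "scale", "modify-config"}


def build_report(conclusion: str, impact: dict, evidence: list,
                 cause: str, suggestions: list) -> dict:
    """Build a structured diagnosis report (key names are illustrative)."""
    return {
        "diagnosis_conclusion": conclusion,
        "impact_scope": impact,
        "key_evidence": evidence,
        "cause_analysis": cause,
        "recovery_suggestions": [
            # Any suggestion that changes the cluster needs user confirmation.
            {**s, "requires_confirmation": s["action"] in CHANGE_ACTIONS}
            for s in suggestions
        ],
    }
```

Deriving the confirmation flag from the action type, instead of setting it by hand for each suggestion, keeps a change action from being emitted without the flag.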

Step 6: Preview the Recovery Solution

If you want AI CLI to continue rectifying the fault, AI CLI should invoke the automatic recovery Skill to generate a recovery preview. In the preview phase, only the plan is displayed, and the cluster status is not changed.

Common recovery actions:

Recovery Action	Scenario	Risk Control
Rolling back to a healthy revision	The new version is unavailable after release, but the old version can still stably carry services.	Verify the historical version, available replicas, and rollback impact scope.
Restarting an abnormal pod	A single pod enters the abnormal state, and no obvious configuration issue is found in the workload template.	Avoid insufficient available replicas caused by batch restart.
Temporary scale-out	The available replicas are insufficient, and the service capacity needs to be restored first.	Evaluate the resource quota, node capacity, and HPA policy.
Correcting the workload configuration	The probe, environment variable, image tag, Secret, or ConfigMap reference is incorrect.	Display configuration differences and confirm the change window and rollback method.
Suspending or resuming a release	The rolling upgrade is abnormal. You need to prevent the impact from expanding or continue the release.	Specify the release status and subsequent manual actions after the suspension or resumption.

The recovery preview should include the following information:

Actions to be performed and the scope of target resources
Key configuration or status differences before and after a change
Risk level, service impact, and whether a change window is required
Verification method and rollback path
Specific statements that a user needs to confirm
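The key configuration differences before and after a change can be shown as a unified diff, which reviewers already know from code review. The following Python sketch uses the standard difflib module. The `preview_diff` helper and the sample probe configuration in the usage note are hypothetical, and the actual preview format of the recovery Skill may differ.

```python
import difflib


def preview_diff(before: str, after: str, name: str = "workload") -> str:
    """Return a unified diff between the current and proposed configuration."""
    lines = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=f"{name} (current)",
        tofile=f"{name} (proposed)",
        lineterm="",
    )
    return "\n".join(lines)
```

For example, if the only proposed change is a corrected readiness probe path, the diff contains one removed line with the old path and one added line with the new path. If nothing changes, the function returns an empty string.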

Step 7: Confirm and Execute the Recovery

AI CLI can invoke the automatic recovery Skill to perform actions only after you confirm the recovery plan. Use an explicit, unambiguous confirmation statement. For example:

I confirm. Perform the recovery according to the previewed plan.

During the execution, AI CLI should continuously report the change status. If the recovery fails, subsequent changes should be stopped, and the failure cause, actions that have been performed, current cluster status, and recommended manual handling methods should be provided.
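One way to make sure that a confirmation refers to exactly the plan that was previewed is to derive a short identifier from the plan content and require the confirmation to carry it. The following Python sketch illustrates this idea under stated assumptions: `plan_id` and `is_confirmed` are hypothetical helpers, and the document does not state how the recovery Skill actually matches a confirmation to a preview.

```python
import hashlib
import json


def plan_id(plan: dict) -> str:
    """Derive a short, stable identifier from the content of a recovery plan."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def is_confirmed(plan: dict, confirmed_plan_id: str) -> bool:
    """Accept the confirmation only if it matches the plan about to run."""
    return confirmed_plan_id == plan_id(plan)
```

If the plan is modified after the preview, for example with a different target revision or replica count, its identifier changes and the earlier confirmation no longer applies.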

Step 8: Verify Recovery Results

After the recovery is complete, AI CLI needs to read the workload status and related evidence again to confirm whether the fault is actually rectified.

Recommended verification items:

Verification Item	Expected Result
Workload status	The desired replicas, ready replicas, and available replicas meet expectations, and the release status is stable.
Pod status	The pod is in the Running and Ready state, and there is no continuous restart, pull failure, or scheduling failure.
Events and logs	No similar high-frequency abnormal events occur, and no new key errors are recorded in container logs.
Service endpoints	The Service or EndpointSlice has available backends, and service traffic can be correctly forwarded.
Service detection	If health check or external detection has been configured, the detection result is normal.
Alarm status	Related alarms are cleared or enter the convergence state, and no new high-risk alarms are triggered.
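The workload status check in the table can be expressed as a comparison of replica counters. The following Python sketch assumes the input is a Deployment object in the JSON form returned by the Kubernetes API, loaded as a dictionary. The function name is hypothetical, and the check covers only the workload status item, not events, endpoints, or alarms.

```python
def rollout_is_healthy(deployment: dict) -> bool:
    """Check that a Deployment has fully rolled out and is fully available."""
    desired = deployment.get("spec", {}).get("replicas", 1)  # API default is 1
    generation = deployment.get("metadata", {}).get("generation", 0)
    status = deployment.get("status", {})
    return (
        # The controller has observed the latest spec.
        status.get("observedGeneration", 0) >= generation
        and status.get("updatedReplicas", 0) == desired
        and status.get("readyReplicas", 0) == desired
        and status.get("availableReplicas", 0) == desired
        and status.get("unavailableReplicas", 0) == 0
    )
```

A single passing check right after recovery can be misleading. You are advised to repeat the check over a short observation window to confirm that the status is stable.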

Skill Execution Process

Phase	Input	Skill Action	Output
Target identification	Region, cluster, namespace, and fault symptom in natural language	Parse the diagnosis scope, supplement missing information, and request user confirmation if necessary.	Clear diagnosis objectives and boundaries
Evidence collection	Target workload and time window	Query workloads, pods, events, logs, version history, and service endpoints.	Multi-dimensional diagnosis evidence
Cause analysis	Diagnosis evidence	Identify the fault phase, eliminate mismatched causes, and output confidence.	Root cause conclusion and key evidence
Recovery planning	Root cause, impact scope, and permission boundary	Generate a preview of rollback, restart, scaling, or configuration restoration.	Recovery plan, risk level, and verification method
User confirmation	User confirmation statement	Check whether the confirmed content matches the preview plan.	Executable recovery task
Recovery execution	Confirmed recovery task	Invoke Kubernetes APIs or CCE APIs to perform changes.	Execution result and intermediate status
Recovery verification	Workload status after recovery	Review status, events, logs, endpoints, and service probes.	Final conclusion and follow-up suggestions
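The phases in the table run in order, and a later phase should not start if an earlier one does not complete, for example when the user does not confirm the plan. The following Python sketch shows that control flow only. The `run_phases` function and the handler callables are hypothetical and do not represent the implementation of the Skills.

```python
PHASES = [
    "target identification",
    "evidence collection",
    "cause analysis",
    "recovery planning",
    "user confirmation",
    "recovery execution",
    "recovery verification",
]


def run_phases(handlers: dict) -> list:
    """Run phase handlers in order and stop at the first one that fails.

    handlers maps each phase name to a callable returning True on success.
    Returns the list of phases that completed.
    """
    completed = []
    for phase in PHASES:
        if not handlers[phase]():
            break  # Do not start later phases after a failure or refusal.
        completed.append(phase)
    return completed
```

With this ordering, recovery execution can never run unless user confirmation has succeeded first.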

Expected Results

After completing this practice, you can use AI CLI to complete the following closed-loop operations through natural language dialogs:

Identify the target CCE cluster, namespace, and abnormal workload.
Automatically aggregate evidence such as workload status, pod status, version history, events, logs, probes, and service endpoints.
Output explainable root cause analysis instead of just providing the result of a single command.
Preview the recovery actions to clarify the risks, impact scope, and verification methods.
Execute controlled recovery actions after user confirmation.
Automatically verify the recovery result and provide subsequent rectification suggestions.

Follow-up Suggestions

To improve the efficiency of troubleshooting CCE workload faults, you are advised to do the following based on this practice:

Standardize prompts: Standardize the diagnosis prompt template in the team's SOP to specify the cluster, namespace, fault symptom, time window, and recovery boundary.
Complete the pre-release check: Add image startup check, probe path verification, configuration reference verification, and basic smoke tests to the CI/CD pipeline.
Configure proper release policies: Configure proper rolling update policies, PDBs, HPAs, and the number of revisions to be retained for key services to reduce the impact of release exceptions on services.
Enhance observability: Access logs, metrics, events, and alarms so that AI CLI can determine root causes from more dimensions.
Establish change audit: Record the operator, confirmation statement, execution action, execution result, and recovery verification conclusion of the recovery action triggered by AI CLI to facilitate review and compliance audit.
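The audit items listed for change audit can be captured as one JSON record per recovery action. The following Python sketch is a hypothetical illustration: the `audit_record` function and the field names are assumptions, and they do not describe an audit format provided by AI CLI or CCE.

```python
import json
from datetime import datetime, timezone


def audit_record(operator: str, confirmation: str, action: str,
                 result: str, verification: str, now=None) -> str:
    """Serialize one recovery action as a JSON line for an audit log."""
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": now.isoformat(),
            "operator": operator,
            "confirmation_statement": confirmation,
            "execution_action": action,
            "execution_result": result,
            "verification_conclusion": verification,
        },
        sort_keys=True,
    )
```

Writing one self-contained line per action keeps the log easy to append to and easy to search during a review.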

