Updated on 2026-06-05 GMT+08:00

Using AI CLI to Diagnose and Rectify CCE Workload Faults

Scenarios

After a workload in a CCE cluster is released, scaled, or reconfigured, problems may occur, such as pods staying not ready for a long time, a suspended Deployment rolling upgrade, frequent container restarts, image pull failures, scheduling failures, and abnormal Service endpoints. Traditional troubleshooting usually requires O&M personnel to switch repeatedly among Deployments, ReplicaSets, pods, events, container logs, probe configurations, and monitoring data. Fault locating takes a long time, and manual judgment is costly.

By combining AI CLI with cloud-native Skills, you can describe fault symptoms in natural language. The Agent automatically completes context identification, evidence collection, root cause analysis, recovery solution preview, user confirmation, recovery execution, and result verification. This helps O&M teams standardize workload troubleshooting into a repeatable process.

This practice applies to common CCE workload fault diagnosis and recovery scenarios and is not bound to a fixed namespace, workload name, or single test environment. In this document, ai-diagnose-demo is only a demo namespace. Replace it, and the cluster name used in the examples, with your own target cluster and service namespace.

Constraints

  • The diagnosis accuracy of AI CLI depends on the cluster access permissions, log retention period, event integrity, and quality of observability data.
  • All write operations must follow the "preview + confirmation" mode. Rollback, restart, scaling, and configuration modification must not be performed without user confirmation.
  • In the production environment, you are advised to separate diagnosis permissions from recovery permissions and record all AI CLI operations in audit logs.
  • Rollback depends on the historical revisions of a workload. If the required revision has been cleared, restore the workload by re-releasing the earlier version or restoring its configuration.
  • If faults involve database changes, data format changes, external dependencies, or message accumulation, evaluate data consistency and service compensation solutions before the rollback.
  • In scenarios involving multiple clusters, namespaces, or abnormal objects, AI CLI should require users to confirm the target scope to avoid cross-service misoperations.
  • Do not output sensitive information, such as AK/SK pairs, tokens, certificates, and real project IDs, in prompts, diagnosis reports, or recovery previews.
  • You are advised to verify the Skill process in a demonstration or test namespace to ensure the correctness and security of the process before promoting it to production clusters.
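One way to support the constraint on sensitive information is to mask credential-like values before any text is placed in a prompt or report. The following Python sketch illustrates the idea. The function name redact and the patterns are assumptions made for illustration, not part of AI CLI. The patterns are not exhaustive: certificates and project IDs, for example, would need their own rules.

```python
import re

# Illustrative patterns only (assumptions); extend them for your own secret formats.
_PATTERNS = [
    # key=value or key: value pairs whose key looks like a credential
    re.compile(
        r"(?i)\b(ak|sk|access[_-]?key|secret[_-]?key|token|password)\b(\s*[=:]\s*)(\S+)"
    ),
    # HTTP bearer tokens
    re.compile(r"(?i)\b(bearer)(\s+)(\S+)"),
]


def redact(text: str) -> str:
    """Replace credential-like values with a fixed placeholder."""
    for pattern in _PATTERNS:
        # Keep the key and the separator, mask only the value.
        text = pattern.sub(lambda m: f"{m.group(1)}{m.group(2)}***", text)
    return text
```

A filter like this is a last line of defense. It does not replace keeping credentials out of logs and prompts in the first place.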

Prerequisites

  • You have created a CCE cluster and have deployed the target workload.
  • You have installed or accessed AI CLI, and have registered related Skills.
  • AI CLI has permission to read workloads, pods, events, logs, version history, and Service endpoints.
  • To perform recovery, AI CLI must also have permission to roll back, restart, scale, or modify workload configurations.

Involved Skills

| Skill | Function |
| --- | --- |
| huawei-cloud-cce-workload-failure-diagnoser | Collects workloads, pods, events, logs, and version history, and outputs diagnosis conclusions. |
| huawei-cloud-cce-auto-remediation-runner | Generates a recovery preview and performs actions such as rollback, restart, and scaling after user confirmation. |

Procedure

Step 1: Start an AI CLI Session

Start AI CLI in the way your enterprise accesses it. If you use the command line, start an interactive session:

aicli chat

If AI CLI has been connected to the O&M platform, ChatOps tool, or pipeline, you can directly initiate a natural language request through the corresponding entry.

Step 2: Describe the Fault in Natural Language

You do not need to manually combine multiple Kubernetes commands. Instead, describe the target cluster, namespace, and fault symptom. You are advised to state in the input whether recovery operations are allowed and whether a preview is required before recovery. For example:

Help me diagnose the workload release exception in the ai-diagnose-demo namespace of the cce-ai-ops-demo cluster in CN North-Beijing4. The symptom is that the pod is not ready for a long time after the latest update. Please analyze the root cause and provide recovery suggestions. Before performing recovery operations, please let me confirm.

If you are not sure about the abnormal object, you can ask AI CLI to scan the namespace first.

Check the abnormal workloads in the ai-diagnose-demo namespace of the cce-ai-ops-demo cluster, find the objects that fail to be released or whose pods are not ready, and output the diagnosis conclusion and recovery suggestions.
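To keep such requests consistent, the fields recommended above (region, cluster, namespace, symptom, and recovery boundary) can be captured in a small template. The following Python sketch is one possible approach. DiagnosisRequest and build_diagnosis_prompt are hypothetical names used for illustration. They are not an AI CLI API, and the generated wording is only an example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosisRequest:
    region: str
    cluster: str
    namespace: str
    symptom: str
    allow_recovery: bool = False  # recovery boundary: diagnosis only by default


def build_diagnosis_prompt(req: DiagnosisRequest) -> str:
    """Render a natural-language request that always states scope and boundary."""
    for name in ("region", "cluster", "namespace", "symptom"):
        if not getattr(req, name).strip():
            raise ValueError(f"{name} must not be empty")
    boundary = (
        "Before performing any recovery operation, show me a preview "
        "and wait for my confirmation."
        if req.allow_recovery
        else "Do not perform any recovery operation; diagnosis only."
    )
    return (
        f"Diagnose the workload fault in the {req.namespace} namespace of the "
        f"{req.cluster} cluster in {req.region}. Symptom: {req.symptom}. "
        f"Analyze the root cause and provide recovery suggestions. {boundary}"
    )
```

Rejecting empty fields up front means a request with a missing namespace or cluster fails before it reaches the Agent, instead of being interpreted with a guessed scope.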

Step 3: Confirm the Diagnosis Scope

AI CLI identifies the region, cluster, namespace, resource type, fault time window, and fault symptom based on your input. If there are multiple abnormal workloads in the namespace, AI CLI lists the candidate objects and exception summary for you to confirm the diagnosis scope.

Confirm the following information:

| Information | Description |
| --- | --- |
| Target cluster | Clusters can be identified by cluster name, cluster ID, or region and cluster name. |
| Namespace | Used to limit the diagnosis scope to avoid cross-service misoperations. |
| Workload type | Common workloads such as Deployments, StatefulSets, and DaemonSets can be diagnosed. |
| Fault symptom | For example, the pod is not ready, the rolling upgrade is suspended, the container is restarted, the image fails to be pulled, or the scheduling fails. |
| Recovery boundary | Whether actions such as rollback, restart, scaling, or configuration modification are allowed. |

Step 4: Automatically Collect Diagnosis Evidence

After AI CLI invokes the workload fault diagnosis Skill, evidence should be collected from the control plane to the data plane to avoid relying solely on single-point logs or a single event.

| Diagnosis Dimension | Key Evidence | Diagnosis Value |
| --- | --- | --- |
| Workload status | Desired replicas, ready replicas, available replicas, updated replicas, and status conditions | Check whether the release is complete and whether the availability is affected. |
| Version mapping | Revision history, ReplicaSet status, and replica distribution of old and new versions | Check whether a healthy historical version exists and whether the rollback conditions are met. |
| Pod lifecycle | Pending, Running, Ready, RestartCount, and container status | Check whether the fault occurs in the scheduling, startup, running, or ready phase. |
| Kubernetes events | Events such as FailedScheduling, FailedPull, Unhealthy, and BackOff | Quickly locate scheduling, image, probe, and container startup issues. |
| Container logs | Recent exception logs, startup logs, and health check logs | Identify internal application errors, dependency exceptions, or configuration issues. |
| Probe configurations | readinessProbe, livenessProbe, and startupProbe configurations and results | Check whether the health check path, port, protocol, and timeout configurations are proper. |
| Service endpoints | Service, EndpointSlice, ingress, or load balancing backend | Check whether the fault affects service traffic access. |
| Observability data | Metric, alarm, and log trends | Identify external factors such as resource pressure, abnormal traffic, and dependency jitter. |
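As a rough illustration of how pod lifecycle evidence maps to a fault phase, the following Python sketch classifies a pod from its status. It assumes the pod is a dict shaped like the output of kubectl get pod -o json. The function name classify_pod_phase and the returned labels are illustrative, the reason sets are not exhaustive, and the sketch targets long-running workloads rather than completed Jobs. It does not describe how the diagnosis Skill is implemented.

```python
# Container "waiting" reasons reported by the kubelet (not an exhaustive list).
IMAGE_REASONS = {"ErrImagePull", "ImagePullBackOff", "InvalidImageName"}
STARTUP_REASONS = {"CrashLoopBackOff", "CreateContainerConfigError", "RunContainerError"}


def classify_pod_phase(pod: dict) -> str:
    """Return the phase in which a pod is stuck: scheduling, image, startup,
    readiness, or healthy."""
    status = pod.get("status", {})
    conditions = {c["type"]: c for c in status.get("conditions", [])}

    # 1. Scheduling: the scheduler could not place the pod on a node.
    if conditions.get("PodScheduled", {}).get("status") == "False":
        return "scheduling"

    # 2. Image pull and container startup: inspect waiting containers.
    for cs in status.get("containerStatuses", []):
        reason = cs.get("state", {}).get("waiting", {}).get("reason", "")
        if reason in IMAGE_REASONS:
            return "image"
        if reason in STARTUP_REASONS:
            return "startup"

    # 3. Readiness: containers run, but the pod is not Ready (often a probe issue).
    if conditions.get("Ready", {}).get("status") != "True":
        return "readiness"
    return "healthy"
```

The checks run in lifecycle order, so the earliest failing phase is reported. A result such as "readiness" only narrows the search. Events and container logs are still needed to confirm the cause.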

Step 5: Output the Diagnosis Conclusion and Recovery Suggestions

After the diagnosis is complete, AI CLI should output a structured conclusion to help you quickly determine whether recovery operations are required.

Recommended output:

| Output Item | Content |
| --- | --- |
| Diagnosis conclusion | Specify the fault type, such as probe failure, image pull failure, container startup failure, scheduling failure, resource insufficiency, or configuration exception. |
| Impact scope | Describe the affected namespaces, workload types, number of unavailable replicas, and impact on service access. |
| Key evidence | List the status conditions, events, logs, or metrics that support the conclusion. Do not output sensitive information. |
| Cause analysis | Describe the phase in which the fault occurred in the release link and why the workload became unavailable. |
| Recovery suggestions | Provide one or more recovery options and describe the scenarios, risk levels, and expected effects. |
| Whether confirmation is required | Mark change actions, such as rollback, restart, scaling, and configuration modification, as requiring user confirmation. |
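The recommended output items could be represented as a small data structure so that reports are consistent and machine-readable. The following Python sketch is one possible shape. The class names, field names, and action labels are assumptions made for illustration and do not reflect the actual output format of the diagnosis Skill.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

# Actions that change cluster state (labels are illustrative).
CHANGE_ACTIONS = {"rollback", "restart", "scale", "modify_config"}


@dataclass
class Suggestion:
    action: str
    risk: str
    # Derived from the action, never supplied by the caller.
    requires_confirmation: bool = field(init=False)

    def __post_init__(self) -> None:
        self.requires_confirmation = self.action in CHANGE_ACTIONS


@dataclass
class DiagnosisReport:
    conclusion: str
    impact_scope: str
    key_evidence: List[str]
    cause_analysis: str
    suggestions: List[Suggestion]

    def to_json(self) -> str:
        """Serialize the report, including nested suggestions."""
        return json.dumps(asdict(self), indent=2)
```

Deriving requires_confirmation from the action type, instead of accepting it as input, means that a change action cannot be marked as safe to run without confirmation by mistake.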

Step 6: Preview the Recovery Solution

If you want AI CLI to continue rectifying the fault, AI CLI should invoke the automatic recovery Skill to generate a recovery preview. In the preview phase, only the plan is displayed, and the cluster status is not changed.

Common recovery actions:

| Recovery Action | Scenario | Risk Control |
| --- | --- | --- |
| Rolling back to a healthy revision | The new version is unavailable after release, but the old version can still stably carry services. | Verify the historical version, available replicas, and rollback impact scope. |
| Restarting an abnormal pod | A single pod enters the abnormal state, and no obvious configuration issue is found in the workload template. | Avoid insufficient available replicas caused by batch restart. |
| Temporary scale-out | The available replicas are insufficient, and the service capacity needs to be restored first. | Evaluate the resource quota, node capacity, and HPA policy. |
| Correcting the workload configuration | The probe, environment variable, image tag, Secret, or ConfigMap reference is incorrect. | Display configuration differences and confirm the change window and rollback method. |
| Suspending or resuming a release | The rolling upgrade is abnormal. You need to prevent the impact from expanding or continue the release. | Specify the release status and subsequent manual actions after the suspension or resumption. |

The recovery preview should include the following information:

  • Actions to be performed and the scope of target resources
  • Key configuration or status differences before and after a change
  • Risk level, service impact, and whether a change window is required
  • Verification method and rollback path
  • Specific statements that a user needs to confirm

Step 7: Confirm and Execute the Recovery

AI CLI can invoke the automatic recovery Skill to perform actions only after you confirm the recovery plan. Use an explicit confirmation statement. For example:

I confirm that the recovery should be performed according to the previewed plan.

During the execution, AI CLI should continuously report the change status. If the recovery fails, subsequent changes should be stopped, and the failure cause, actions that have been performed, current cluster status, and recommended manual handling methods should be provided.
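The "preview + confirmation" gate and the stop-on-failure behavior described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the plan is a dict with a "steps" list, the confirmation carries a fingerprint of the previewed plan, and apply_step is a caller-supplied function that performs one change. All names are hypothetical and do not describe the internal design of the recovery Skill.

```python
import hashlib
import json
from typing import Callable


def plan_fingerprint(plan: dict) -> str:
    """Short, stable digest of a plan; any change to the plan changes it."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def execute_recovery(
    plan: dict, confirmed_fingerprint: str, apply_step: Callable[[str], None]
) -> dict:
    """Run the confirmed plan step by step and stop at the first failure."""
    # Refuse to run anything that differs from what the user previewed.
    if plan_fingerprint(plan) != confirmed_fingerprint:
        raise PermissionError("confirmation does not match the previewed plan")

    completed = []
    for step in plan["steps"]:
        try:
            apply_step(step)
        except Exception as exc:
            # Stop subsequent changes and report what has already been done.
            return {
                "status": "failed",
                "completed": completed,
                "failed_step": step,
                "error": str(exc),
            }
        completed.append(step)
    return {"status": "succeeded", "completed": completed}
```

Binding the confirmation to a digest of the previewed plan ensures that a plan modified after the preview cannot run under an earlier confirmation.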

Step 8: Verify Recovery Results

After the recovery is complete, AI CLI needs to read the workload status and related evidence again to confirm whether the fault is actually rectified.

Recommended verification items:

| Verification Item | Expected Result |
| --- | --- |
| Workload status | The desired replicas, ready replicas, and available replicas meet expectations, and the release status is stable. |
| Pod status | The pod is in the Running and Ready state, and there is no continuous restart, pull failure, or scheduling failure. |
| Events and logs | No similar high-frequency abnormal events occur, and no new key errors are recorded in container logs. |
| Service endpoints | The Service or EndpointSlice has available backends, and service traffic can be correctly forwarded. |
| Service detection | If health check or external detection has been configured, the detection result is normal. |
| Alarm status | Related alarms are cleared or enter the convergence state, and no new high-risk alarms are triggered. |
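For the workload status item, the following Python sketch shows one way to decide whether a Deployment rollout has finished. It follows the same kind of comparison that kubectl rollout status performs on a Deployment and assumes the input is a dict shaped like the output of kubectl get deployment -o json. The function name rollout_complete and the messages are illustrative. The other verification items (events, logs, endpoints, and alarms) are not covered.

```python
from typing import Tuple


def rollout_complete(deploy: dict) -> Tuple[bool, str]:
    """Return (done, message) for a Deployment object given as a dict."""
    meta = deploy.get("metadata", {})
    spec = deploy.get("spec", {})
    status = deploy.get("status", {})

    # The controller must have processed the latest spec first.
    if status.get("observedGeneration", 0) < meta.get("generation", 0):
        return False, "controller has not observed the latest spec yet"

    for cond in status.get("conditions", []):
        if cond.get("type") == "Progressing" and cond.get("reason") == "ProgressDeadlineExceeded":
            return False, "rollout exceeded its progress deadline"

    desired = spec.get("replicas", 1)  # Kubernetes defaults replicas to 1
    updated = status.get("updatedReplicas", 0)
    total = status.get("replicas", 0)
    available = status.get("availableReplicas", 0)

    if updated < desired:
        return False, f"{updated} of {desired} replicas updated"
    if total > updated:
        return False, f"{total - updated} old replicas pending termination"
    if available < updated:
        return False, f"{available} of {updated} updated replicas available"
    return True, "rollout complete"
```

A result of True only shows that the controller considers the rollout finished. Service endpoints and alarms still need to be checked separately.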

Skill Execution Process

| Phase | Input | Skill Action | Output |
| --- | --- | --- | --- |
| Target identification | Region, cluster, namespace, and fault symptom in natural language | Parse the diagnosis scope, supplement missing information, and request user confirmation if necessary. | Clear diagnosis objectives and boundaries |
| Evidence collection | Target workload and time window | Query workloads, pods, events, logs, version history, and Service endpoints. | Multi-dimensional diagnosis evidence |
| Cause analysis | Diagnosis evidence | Identify the fault phase, eliminate mismatched causes, and output confidence. | Root cause conclusion and key evidence |
| Recovery planning | Root cause, impact scope, and permission boundary | Generate a preview of rollback, restart, scaling, or configuration restoration. | Recovery plan, risk level, and verification method |
| User confirmation | User confirmation statement | Check whether the confirmed content matches the preview plan. | Executable recovery task |
| Recovery execution | Confirmed recovery task | Invoke Kubernetes APIs or CCE APIs to perform changes. | Execution result and intermediate status |
| Recovery verification | Workload status after recovery | Review status, events, logs, endpoints, and service probes. | Final conclusion and follow-up suggestions |

Expected Results

After completing this practice, you can use AI CLI to complete the following closed-loop operations through natural language dialogs:

  1. Identify the target CCE cluster, namespace, and abnormal workload.
  2. Automatically aggregate evidence such as workload status, pod status, version history, events, logs, probes, and service endpoints.
  3. Output explainable root cause analysis instead of just providing the result of a single command.
  4. Preview the recovery actions to clarify the risks, impact scope, and verification methods.
  5. Execute controlled recovery actions after user confirmation.
  6. Automatically verify the recovery result and provide subsequent rectification suggestions.

Follow-up Suggestions

To improve the efficiency of troubleshooting CCE workload faults, you are advised to do the following based on this practice:

  • Standardize prompts.

    Standardize the diagnosis prompt template in the team's SOP to specify the cluster, namespace, fault symptom, time window, and recovery boundary.

  • Complete the pre-release check.

    Add image startup check, probe path verification, configuration reference verification, and basic smoke tests to the CI/CD pipeline.

  • Configure proper release policies.

    Configure proper rolling update policies, PDBs, HPAs, and the number of revisions to be retained for key services to reduce the impact of release exceptions on services.

  • Enhance observability.

    Connect logs, metrics, events, and alarms to your observability system so that AI CLI can determine root causes from more dimensions.

  • Establish change audit.

    For each recovery action triggered by AI CLI, record the operator, confirmation statement, executed action, execution result, and recovery verification conclusion to facilitate review and compliance audits.
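Some of the pre-release and release-policy suggestions above can be automated in a CI/CD pipeline. The following Python sketch checks a Deployment manifest, given as a dict, for two of the conditions mentioned: a missing readinessProbe and a revision history that leaves nothing to roll back to. The function name lint_deployment and the finding messages are illustrative, and the two rules are only a starting point rather than a complete pre-release check.

```python
from typing import List


def lint_deployment(manifest: dict) -> List[str]:
    """Return findings that would make diagnosis or rollback harder."""
    findings = []
    spec = manifest.get("spec", {})

    # Kubernetes keeps 10 old ReplicaSets by default; 0 disables rollback.
    if spec.get("revisionHistoryLimit", 10) == 0:
        findings.append("revisionHistoryLimit is 0: no revision is kept for rollback")

    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    for container in containers:
        name = container.get("name", "<unnamed>")
        # Without a readiness probe, a broken pod can still receive traffic.
        if "readinessProbe" not in container:
            findings.append(f"container {name}: no readinessProbe configured")
    return findings
```

In a pipeline, a non-empty result could fail the build or require manual approval, depending on how strict the team's release policy is.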