Using AI CLI to Diagnose and Rectify CCE Workload Faults
Scenarios
In a CCE cluster, a workload release, scaling operation, or configuration change can leave pods not ready for a long time, suspend a Deployment rolling upgrade, or cause frequent container restarts, image pull failures, scheduling failures, and service endpoint exceptions. Traditional troubleshooting usually requires O&M personnel to switch repeatedly between Deployments, ReplicaSets, pods, events, container logs, probe configurations, and monitoring data. Fault locating takes a long time, and manual judgment is costly.
By combining AI CLI with cloud-native Skills, you can describe fault symptoms in natural language. The Agent automatically completes context identification, evidence collection, root cause analysis, recovery solution preview, user confirmation, recovery execution, and result verification. This helps O&M teams standardize workload troubleshooting into a repeatable process.
This practice applies to common scenarios of CCE workload fault diagnosis and recovery, and is not bound to a fixed namespace, workload name, or single test environment. In this document, cce-ai-ops-demo is only a demo cluster and ai-diagnose-demo is only a demo namespace. Replace them with your target cluster and service namespace.
Constraints
- The diagnosis accuracy of AI CLI depends on the cluster access permissions, log retention period, event integrity, and quality of observability data.
- All write operations must comply with the "preview + confirmation" mode. Rollback, restart, scaling, or configuration modification cannot be performed without confirmation.
- In the production environment, you are advised to separate diagnosis permissions from recovery permissions and record all AI CLI operations in audit logs.
- Rollback depends on the historical revisions of a workload. If the required revision has been cleared, rollback is unavailable; restore the workload by re-releasing the earlier version or restoring its configuration.
- If faults involve database changes, data format changes, external dependencies, or message accumulation, evaluate data consistency and service compensation solutions before the rollback.
- In scenarios involving multiple clusters, namespaces, or abnormal objects, AI CLI should require users to confirm the target scope to avoid cross-service misoperations.
- Do not output sensitive information such as the AK/SK, token, certificate, and real project ID in prompts, diagnosis reports, or recovery previews.
- You are advised to verify the Skill process in a demonstration or test namespace to ensure the correctness and security of the process before promoting it to production clusters.
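The constraint on sensitive information can be illustrated with a small redaction pass that runs before text is placed in a prompt, diagnosis report, or recovery preview. This is a minimal sketch, not part of AI CLI: the function name `redact` and the patterns are assumptions chosen for illustration, and they do not cover every credential format (for example, project IDs are not detected).

```python
import re

# Hypothetical patterns for illustration; adapt them to the credential
# formats that actually appear in your logs and configurations.
KEY_VALUE = re.compile(
    r"(?i)\b(ak|sk|access[_-]?key|secret[_-]?key|token|password)\b(\s*[=:]\s*)(\S+)"
)
PEM_BLOCK = re.compile(
    r"-----BEGIN [A-Z ]+-----.*?-----END [A-Z ]+-----", re.DOTALL
)


def redact(text: str) -> str:
    """Mask credential-like values and PEM blocks in free text."""
    text = PEM_BLOCK.sub("[REDACTED CERTIFICATE]", text)
    # Keep the key name and separator, replace only the value.
    return KEY_VALUE.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text)
```

For example, `redact("token=abc123 ok")` keeps the key name and masks only the value.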
Prerequisites
- You have created a CCE cluster and have deployed the target workload.
- You have installed or accessed AI CLI, and have registered related Skills.
- AI CLI has permission to read workloads, pods, events, logs, version history, and service endpoints.
- To perform restoration, AI CLI must also have permission to roll back, restart, scale, or modify workload configurations.
Involved Skills
| Skill | Function |
|---|---|
| huawei-cloud-cce-workload-failure-diagnoser | Collects workloads, pods, events, logs, and version history, and outputs diagnosis conclusions. |
| huawei-cloud-cce-auto-remediation-runner | Generates a recovery preview and performs actions such as rollback, restart, and scaling after user confirmation. |
Procedure
Step 1: Start an AI CLI Session
Start AI CLI based on the access mode that your enterprise uses. If you use the command line, you can start an interactive session:
aicli chat
If AI CLI has been connected to the O&M platform, ChatOps tool, or pipeline, you can directly initiate a natural language request through the corresponding entry.
Step 2: Describe the Fault in Natural Language
You do not need to manually combine multiple Kubernetes commands. Instead, describe the target cluster, namespace, and fault symptom. In the input, you are advised to state whether recovery operations are allowed and whether a preview is required before recovery. For example:
Help me diagnose the workload release exception in the ai-diagnose-demo namespace of the cce-ai-ops-demo cluster in CN North-Beijing4. The symptom is that the pod is not ready for a long time after the latest update. Please analyze the root cause and provide recovery suggestions. Before performing recovery operations, please let me confirm.
If you are not sure about the abnormal object, you can ask AI CLI to scan the namespace first.
Check the abnormal workloads in the ai-diagnose-demo namespace of the cce-ai-ops-demo cluster, find the objects that fail to be released or whose pods are not ready, and output the diagnosis conclusion and recovery suggestions.
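Teams that send such requests often may want to assemble them from a fixed template so that the scope and recovery boundary are never left out. The following is a minimal sketch under stated assumptions: `build_diagnosis_prompt` and its parameters are hypothetical helpers, and AI CLI does not require any particular wording.

```python
def build_diagnosis_prompt(region, cluster, namespace, symptom,
                           allow_recovery=False, time_window=None):
    """Assemble a request that names the scope and the recovery boundary."""
    parts = [
        f"Diagnose the workload fault in the {namespace} namespace "
        f"of the {cluster} cluster in {region}.",
        f"Symptom: {symptom}.",
    ]
    if time_window:
        parts.append(f"Time window: {time_window}.")
    if allow_recovery:
        parts.append("Recovery operations are allowed, but show a preview "
                     "and wait for my confirmation before changing anything.")
    else:
        parts.append("Do not perform any recovery operation; diagnosis only.")
    return " ".join(parts)


prompt = build_diagnosis_prompt(
    "CN North-Beijing4", "cce-ai-ops-demo", "ai-diagnose-demo",
    "the pod is not ready for a long time after the latest update",
    allow_recovery=True,
)
```

The resulting string can be pasted into the interactive session or submitted through whichever entry your platform exposes.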
Step 3: Confirm the Diagnosis Scope
AI CLI identifies the region, cluster, namespace, resource type, fault time window, and fault symptom based on your input. If there are multiple abnormal workloads in the namespace, AI CLI lists the candidate objects and exception summary for you to confirm the diagnosis scope.
Confirm the following information:
| Information | Description |
|---|---|
| Target cluster | Clusters can be identified by cluster name, cluster ID, or region and cluster name. |
| Namespace | Used to limit the diagnosis scope to avoid cross-service misoperations. |
| Workload type | Common workloads such as Deployments, StatefulSets, and DaemonSets can be diagnosed. |
| Fault symptom | For example, the pod is not ready, the rolling upgrade is suspended, the container is restarted, the image fails to be pulled, or the scheduling fails. |
| Recovery boundary | Whether actions such as rollback, restart, scaling, or configuration modification are allowed. |
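One way to picture the scope confirmation is as a record whose unset fields must be resolved with the user before diagnosis starts. This is an illustrative sketch only; the `DiagnosisScope` type and `missing_fields` helper are assumptions, not AI CLI data structures.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass
class DiagnosisScope:
    cluster: Optional[str] = None
    namespace: Optional[str] = None
    workload_type: Optional[str] = None
    symptom: Optional[str] = None
    # None means "not stated yet"; an empty tuple means "diagnosis only".
    allowed_actions: Optional[Tuple[str, ...]] = None


def missing_fields(scope: DiagnosisScope) -> list:
    """Return the names of fields the user still has to confirm."""
    return [f.name for f in fields(scope) if getattr(scope, f.name) is None]


scope = DiagnosisScope(cluster="cce-ai-ops-demo", namespace="ai-diagnose-demo")
to_confirm = missing_fields(scope)
```

Distinguishing "not stated" from "nothing allowed" keeps an unanswered recovery boundary from being read as permission.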
Step 4: Automatically Collect Diagnosis Evidence
After AI CLI invokes the workload fault diagnosis Skill, the Skill should collect evidence from the control plane to the data plane instead of relying on a single log or a single event.
| Diagnosis Dimension | Key Evidence | Diagnosis Value |
|---|---|---|
| Workload status | Desired replicas, ready replicas, available replicas, updated replicas, and status conditions | Check whether the release is complete and whether the availability is affected. |
| Version mapping | Revision history, ReplicaSet status, and replica distribution of old and new versions | Check whether a healthy historical version exists and whether the rollback conditions are met. |
| Pod lifecycle | Pod phase (such as Pending or Running), Ready condition, restart count, and container status | Check whether the fault occurs in the scheduling, startup, running, or ready phase. |
| Kubernetes events | Events such as FailedScheduling, Failed (for example, with an ErrImagePull message), Unhealthy, and BackOff | Quickly locate scheduling, image, probe, and container startup issues. |
| Container logs | Recent exception logs, startup logs, and health check logs | Identify internal application errors, dependency exceptions, or configuration issues. |
| Probe configurations | readinessProbe, livenessProbe, and startupProbe configurations and results | Check whether the health check path, port, protocol, and timeout configurations are proper. |
| Service endpoints | Service, EndpointSlice, ingress, or load balancing backend | Check whether the fault affects service traffic access. |
| Observability data | Metric, alarm, and log trends | Identify external factors such as resource pressure, abnormal traffic, and dependency jitter. |
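As a rough illustration of how event evidence can be turned into a first fault hypothesis, the sketch below groups events by a coarse category. It assumes each event is a dictionary with `reason`, `message`, and optional `count` keys, and that image pull failures surface as a Failed or BackOff event whose message mentions the image. The substring checks are simple heuristics, not how the diagnosis Skill necessarily works.

```python
def classify_events(events):
    """Count events per coarse fault category based on reason and message."""
    counts = {}
    for event in events:
        reason = event.get("reason", "")
        message = event.get("message", "").lower()
        if reason == "FailedScheduling":
            category = "scheduling"
        elif reason == "Unhealthy":
            category = "probe"
        elif reason in ("Failed", "BackOff") and "image" in message:
            category = "image"
        elif reason == "BackOff":
            category = "container-restart"
        else:
            category = "other"
        # Repeated events are aggregated by Kubernetes into a count field.
        counts[category] = counts.get(category, 0) + (event.get("count") or 1)
    return counts
```

A dominant category only narrows the search; it should still be cross-checked against pod status, logs, and probe configuration.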
Step 5: Output the Diagnosis Conclusion and Recovery Suggestions
After the diagnosis is complete, AI CLI should output a structured conclusion to help you quickly determine whether recovery operations are required.
Recommended output:
| Output Item | Content |
|---|---|
| Diagnosis conclusion | Specify the fault type, such as probe failure, image pull failure, container startup failure, scheduling failure, resource insufficiency, or configuration exception. |
| Impact scope | Describe the affected namespaces, workload types, number of unavailable replicas, and impact on service access. |
| Key evidence | List the status conditions, events, logs, or metrics that support the conclusion. Do not output sensitive information. |
| Cause analysis | Describe the phase in which the fault occurred in the release link and why the workload became unavailable. |
| Recovery suggestions | Provide one or more recovery options and describe the scenarios, risk levels, and expected effects. |
| Whether confirmation is required | Mark change actions, such as rollback, restart, scaling, and configuration modification, as requiring user confirmation. |
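The output items in the table could be represented as a structured record so that reports stay comparable across incidents. The following sketch is an assumption about one possible shape; the class names, field names, and the `CHANGE_ACTIONS` set are illustrative and are not defined by the Skills.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List

# Actions that change cluster state and therefore need user confirmation.
CHANGE_ACTIONS = {"rollback", "restart", "scaling", "configuration modification"}


@dataclass
class RecoverySuggestion:
    action: str
    risk_level: str
    expected_effect: str
    requires_confirmation: bool = field(init=False)

    def __post_init__(self):
        self.requires_confirmation = self.action in CHANGE_ACTIONS


@dataclass
class DiagnosisReport:
    conclusion: str
    impact_scope: str
    key_evidence: List[str]
    cause_analysis: str
    recovery_suggestions: List[RecoverySuggestion]


def to_json(report: DiagnosisReport) -> str:
    """Serialize the report, including nested suggestions, as JSON."""
    return json.dumps(asdict(report), indent=2, ensure_ascii=False)
```

Deriving `requires_confirmation` from the action type, rather than setting it by hand, avoids a change action slipping through unmarked.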
Step 6: Preview the Recovery Solution
If you want AI CLI to continue rectifying the fault, AI CLI should invoke the automatic recovery Skill to generate a recovery preview. In the preview phase, only the plan is displayed, and the cluster status is not changed.
Common recovery actions:
| Recovery Action | Scenario | Risk Control |
|---|---|---|
| Rolling back to a healthy revision | The new version is unavailable after release, but the old version can still stably carry services. | Verify the historical version, available replicas, and rollback impact scope. |
| Restarting an abnormal pod | A single pod enters the abnormal state, and no obvious configuration issue is found in the workload template. | Avoid insufficient available replicas caused by batch restart. |
| Temporary scale-out | The available replicas are insufficient, and the service capacity needs to be restored first. | Evaluate the resource quota, node capacity, and HPA policy. |
| Correcting the workload configuration | The probe, environment variable, image tag, Secret, or ConfigMap reference is incorrect. | Display configuration differences and confirm the change window and rollback method. |
| Suspending or resuming a release | The rolling upgrade is abnormal. You need to prevent the impact from expanding or continue the release. | Specify the release status and subsequent manual actions after the suspension or resumption. |
The recovery preview should include the following information:
- Actions to be performed and the scope of target resources
- Key configuration or status differences before and after a change
- Risk level, service impact, and whether a change window is required
- Verification method and rollback path
- Specific statements that a user needs to confirm
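The preview items listed above can be sketched as a plain-text rendering in which the configuration difference is shown as a unified diff. This is a hypothetical helper built on the Python standard library (`difflib`); the field layout is an assumption and is not the output format of the recovery Skill.

```python
import difflib


def render_preview(action, target, before, after,
                   risk_level, verification, rollback_path):
    """Render a read-only recovery preview; nothing is changed here."""
    diff = "\n".join(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="current", tofile="proposed", lineterm=""))
    return "\n".join([
        f"Action: {action}",
        f"Target: {target}",
        f"Risk level: {risk_level}",
        "Changes:",
        diff or "(no configuration difference)",
        f"Verification: {verification}",
        f"Rollback path: {rollback_path}",
        "Nothing has been changed yet. Reply with an explicit confirmation to proceed.",
    ])


preview = render_preview(
    "Correct readiness probe path", "Deployment in ai-diagnose-demo",
    "path: /health\nport: 8080", "path: /healthz\nport: 8080",
    "low", "pods become Ready", "re-apply the current configuration",
)
```

Showing removed and added lines side by side makes it easier for the user to confirm exactly what will change.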
Step 7: Confirm and Execute the Recovery
AI CLI can invoke the automatic recovery Skill to perform actions only after you confirm the recovery plan. Use an explicit, unambiguous confirmation statement. For example:
Confirm that the recovery is performed based on the preview solution.
During the execution, AI CLI should continuously report the change status. If the recovery fails, subsequent changes should be stopped, and the failure cause, actions that have been performed, current cluster status, and recommended manual handling methods should be provided.
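The confirm-then-execute behavior, including stopping at the first failure, can be sketched as follows. This is an illustration under assumptions: tying the confirmation to a digest of the previewed steps is one possible design, and `plan_digest`, `execute_plan`, and the `run_step` callback are hypothetical names rather than AI CLI interfaces.

```python
import hashlib


def plan_digest(steps):
    """Short fingerprint of the previewed steps, shown to the user to confirm."""
    return hashlib.sha256("\n".join(steps).encode("utf-8")).hexdigest()[:12]


def execute_plan(steps, confirmed_digest, run_step):
    """Run steps in order only if the confirmation matches the preview."""
    if confirmed_digest != plan_digest(steps):
        return {"status": "rejected", "performed": [], "failed": None,
                "error": "confirmation does not match the previewed plan"}
    performed = []
    for step in steps:
        try:
            run_step(step)
        except Exception as exc:
            # Stop immediately and report what was already done.
            return {"status": "failed", "performed": performed,
                    "failed": step, "error": str(exc)}
        performed.append(step)
    return {"status": "succeeded", "performed": performed,
            "failed": None, "error": None}
```

Because the digest changes whenever the plan changes, a confirmation given for an earlier preview cannot authorize a modified plan.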
Step 8: Verify Recovery Results
After the recovery is complete, AI CLI needs to read the workload status and related evidence again to confirm whether the fault is actually rectified.
Recommended verification items:
| Verification Item | Expected Result |
|---|---|
| Workload status | The desired replicas, ready replicas, and available replicas meet expectations, and the release status is stable. |
| Pod status | The pod is in the Running and Ready state, and there is no continuous restart, pull failure, or scheduling failure. |
| Events and logs | No similar high-frequency abnormal events occur, and no new key errors are recorded in container logs. |
| Service endpoints | The Service or EndpointSlice has available backends, and service traffic can be correctly forwarded. |
| Service detection | If health check or external detection has been configured, the detection result is normal. |
| Alarm status | Related alarms are cleared or enter the convergence state, and no new high-risk alarms are triggered. |
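The workload status check in the first row can be sketched against the replica counters that a Deployment reports in its status (`updatedReplicas`, `readyReplicas`, `availableReplicas`, `unavailableReplicas`). The function below is a minimal example that assumes the status has already been fetched as a dictionary; how AI CLI retrieves it is outside the scope of this sketch.

```python
def verify_deployment(status, desired):
    """Compare Deployment status counters with the desired replica count."""
    checks = {
        "updated": status.get("updatedReplicas", 0) == desired,
        "ready": status.get("readyReplicas", 0) == desired,
        "available": status.get("availableReplicas", 0) == desired,
        "no_unavailable": status.get("unavailableReplicas", 0) == 0,
    }
    return all(checks.values()), checks


recovered, details = verify_deployment(
    {"replicas": 3, "updatedReplicas": 3, "readyReplicas": 3,
     "availableReplicas": 3},
    desired=3,
)
```

Returning the individual checks alongside the overall result lets the final report state which condition is still unmet.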
Skill Execution Process
| Phase | Input | Skill Action | Output |
|---|---|---|---|
| Target identification | Region, cluster, namespace, and fault symptom in natural language | Parse the diagnosis scope, supplement missing information, and request user confirmation if necessary. | Clear diagnosis objectives and boundaries |
| Evidence collection | Target workload and time window | Query workloads, pods, events, logs, version history, and service endpoints. | Multi-dimensional diagnosis evidence |
| Cause analysis | Diagnosis evidence | Identify the fault phase, rule out causes that do not match the evidence, and output a confidence level. | Root cause conclusion and key evidence |
| Recovery planning | Root cause, impact scope, and permission boundary | Generate a preview of rollback, restart, scaling, or configuration restoration. | Recovery plan, risk level, and verification method |
| User confirmation | User confirmation statement | Check whether the confirmed content matches the preview plan. | Executable recovery task |
| Recovery execution | Confirmed recovery task | Invoke Kubernetes APIs or CCE APIs to perform changes. | Execution result and intermediate status |
| Recovery verification | Workload status after recovery | Review status, events, logs, endpoints, and service probes. | Final conclusion and follow-up suggestions |
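The phase ordering in the table can be modeled as a simple linear state machine in which the step from user confirmation to recovery execution is gated. This is a conceptual sketch only; the `Phase` enum and `advance` function are assumptions made for illustration and do not describe the internal implementation of the Skills.

```python
from enum import Enum


class Phase(Enum):
    TARGET_IDENTIFICATION = "Target identification"
    EVIDENCE_COLLECTION = "Evidence collection"
    CAUSE_ANALYSIS = "Cause analysis"
    RECOVERY_PLANNING = "Recovery planning"
    USER_CONFIRMATION = "User confirmation"
    RECOVERY_EXECUTION = "Recovery execution"
    RECOVERY_VERIFICATION = "Recovery verification"


ORDER = list(Phase)  # Enum members iterate in definition order.


def advance(current, user_confirmed=False):
    """Return the next phase; execution is reachable only after confirmation."""
    if current is Phase.RECOVERY_VERIFICATION:
        raise ValueError("already at the final phase")
    if current is Phase.USER_CONFIRMATION and not user_confirmed:
        raise PermissionError("recovery execution requires user confirmation")
    return ORDER[ORDER.index(current) + 1]
```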
Expected Results
After completing this practice, you can use AI CLI to complete the following closed-loop operations through natural language dialogs:
- Identify the target CCE cluster, namespace, and abnormal workload.
- Automatically aggregate evidence such as workload status, pod status, version history, events, logs, probes, and service endpoints.
- Output explainable root cause analysis instead of just providing the result of a single command.
- Preview the recovery actions to clarify the risks, impact scope, and verification methods.
- Execute controlled recovery actions after user confirmation.
- Automatically verify the recovery result and provide subsequent rectification suggestions.
Follow-up Suggestions
To improve the efficiency of troubleshooting CCE workload faults, you are advised to do the following based on this practice:
- Standardize prompts.
Standardize the diagnosis prompt template in the team's SOP to specify the cluster, namespace, fault symptom, time window, and recovery boundary.
- Complete the pre-release check.
Add image startup check, probe path verification, configuration reference verification, and basic smoke tests to the CI/CD pipeline.
- Configure proper release policies.
Configure proper rolling update policies, PDBs, HPAs, and the number of revisions to be retained for key services to reduce the impact of release exceptions on services.
- Enhance observability.
Integrate logs, metrics, events, and alarms so that AI CLI can determine root causes from more dimensions.
- Establish change audit.
For each recovery action triggered by AI CLI, record the operator, confirmation statement, executed action, execution result, and recovery verification conclusion to facilitate review and compliance audit.
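A change audit entry with the fields named above could be written as one JSON line per recovery action. The sketch below is an assumption about a workable record shape; `audit_record` and its field names are hypothetical, and the storage destination is left to your audit system.

```python
import json
from datetime import datetime, timezone


def audit_record(operator, confirmation, action, result, verification, now=None):
    """Build one JSON audit line for a recovery action."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps({
        "timestamp": timestamp,
        "operator": operator,
        "confirmation_statement": confirmation,
        "action": action,
        "result": result,
        "verification": verification,
    }, sort_keys=True, ensure_ascii=False)
```

Storing the confirmation statement verbatim next to the executed action makes it possible to check later that what was run is what was approved.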