Using Huawei Cloud Cloud-Native Skills
This section is intended for developers, O&M engineers, and architects who use Cloud Container Engine (CCE) and related cloud services. It describes the positioning and usage of Huawei Cloud cloud-native Skills and provides a reference list of the available Skills.
Skill Overview
What Are Skills?
Skills are open capabilities that convert professional knowledge, operation processes, and best practices into reusable capability units. In AI Agents, Skills are used to extend the professional capabilities of Agents so that Agents can automatically execute complex tasks in specific domains based on predefined processes and rules. The core features of Skills are as follows:
- Intent-driven: An Agent determines when to trigger a Skill by reading the Skill's description. You do not need to explicitly specify when to invoke it.
- Scenario orchestration: A Skill can internally connect multiple steps to automatically collect contexts, analyze them, and output conclusions.
- Reusable: A Skill can run on different Agent platforms (web, CLI, and API). You do not need to adapt the Skill to each platform.
- Composable: Multiple Skills can be combined based on workflows. Agents automatically select and invoke appropriate Skills based on task requirements.
- Security guardrails: Risk constraints are defined within Skills. High-risk operations must be previewed and confirmed by users.
Huawei Cloud cloud-native Skills encapsulate the O&M capabilities of cloud services such as CCE, AOM, LTS, ELB, ECS, and HSS for scenarios such as fault diagnosis, observability analysis, inspection and governance, and automatic recovery. They give AI Agents professional cloud-native O&M capabilities.
Scenarios
- Fault diagnosis: Pod CrashLoopBackOff, node NotReady, Ingress 502, PVC Pending, and other faults
- Observability analysis: aggregation of AOM alarms, LTS logs, Kubernetes events, and pod/node metrics into diagnosis contexts
- Inspection and governance: daily cluster health check, capacity trend prediction, cost optimization suggestions, and availability risk scanning
- Automatic recovery: controlled changes such as scaling, cordoning or draining nodes, restarting ECSs, and fixing HSS vulnerabilities
- Delivery solution: container migration planning, resource stocktaking, and dependency matrix analysis
- Cluster management: CCE cluster upgrade planning, workload management, and UCS cluster management and policy governance
Security Constraints and Risk Levels
- Core security constraints
  - Do not output AKs/SKs in scripts, logs, or reports.
  - Preview all operations, such as deletion, scaling, drain, and reboot, before confirming them.
  - Delete temporary kubeconfig files and certificate files after using them.
  - Use diagnosis, inspection, and migration planning Skills only for read-only queries and report generation.
- Risk levels
| Risk Level | Example | Default Behavior |
|---|---|---|
| R0 | list/get/query/analyze | Direct execution |
| R1 | Generating reports, solutions, and dashboards | Direct execution |
| R2 | Restarting abnormal pods and providing suggestions after query | Preview by default, with configurable automatic execution |
| R3 | Scaling, rollback, cordon, and uncordon | confirm=true |
| R4 | Deleting, draining, and hibernating production clusters | confirm=true and strong risk warning |
| R5 | Clearing data and performing irreversible cross-domain deletion | Forbidden by default |
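The default behaviors for the risk levels above can be sketched as a small policy function. This is a minimal Python sketch under stated assumptions, not the actual implementation used by the Skills: the `decide` function, the `Decision` enum, and the `auto_execute_r2` flag are hypothetical names introduced for illustration. Only the levels R0 to R5 and the `confirm=true` convention come from this document.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    PREVIEW = "preview"
    FORBIDDEN = "forbidden"


def decide(risk_level: str, confirm: bool = False, auto_execute_r2: bool = False) -> Decision:
    """Map a risk level (R0-R5) to a default behavior (hypothetical helper)."""
    if risk_level in ("R0", "R1"):
        # Read-only queries and report generation run directly.
        return Decision.EXECUTE
    if risk_level == "R2":
        # Preview by default; automatic execution is configurable.
        return Decision.EXECUTE if auto_execute_r2 else Decision.PREVIEW
    if risk_level in ("R3", "R4"):
        # Changes stay in preview until the user passes confirm=true.
        # (An R4 caller would additionally show a strong risk warning.)
        return Decision.EXECUTE if confirm else Decision.PREVIEW
    if risk_level == "R5":
        # Forbidden by default, regardless of confirmation.
        return Decision.FORBIDDEN
    raise ValueError(f"unknown risk level: {risk_level}")
```

For example, `decide("R3")` yields a preview, while `decide("R3", confirm=True)` allows execution. Unknown levels raise an error rather than falling through to execution, which keeps the gate fail-closed.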
Usage Constraints
- Currently, Skills are mainly designed for CCE clusters and their associated cloud services (such as AOM, LTS, ELB, ECS, and HSS).
- All change actions are in preview mode by default and are not automatically executed.
Usage Description
Working Principles
A Skill works based on the intent matching mechanism. An Agent reads the description in the header of the SKILL.md file in the Skill directory. When your question matches the description, the Agent automatically triggers the Skill. A Skill defines a complete processing workflow, a list of tools that can be invoked, and risk constraints. The Agent executes tasks step by step based on the Skill's guidance.
For example, when you ask "What should I do if a pod keeps restarting?", the Agent matches the description of pod-failure-diagnoser as follows:
    ---
    name: pod-failure-diagnoser
    description: Diagnose CCE Pod failures such as CrashLoopBackOff, ImagePullBackOff, OOMKilled, Pending, Evicted, restart storms, or workload unavailable.
    ---
The Agent determines that the issue matches the description and automatically triggers pod-failure-diagnoser to execute the diagnosis process.
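To make the matching step more concrete, the following Python sketch reads the `name` and `description` fields from a SKILL.md header and scores a question against the description. It is a toy illustration only: a real Agent interprets the description with a language model, and the keyword-overlap scoring here is a stand-in assumption. The helper names `parse_front_matter` and `keyword_score` are hypothetical.

```python
import re


def parse_front_matter(text: str) -> dict:
    """Extract simple `key: value` pairs from a leading `---` block."""
    match = re.match(r"\s*---\s*\n(.*?)\n---", text, re.DOTALL)
    if not match:
        return {}
    fields = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def _words(text: str) -> set:
    """Lowercase alphabetic tokens of a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_score(question: str, description: str) -> int:
    """Count words shared by question and description (toy stand-in for intent matching)."""
    return len(_words(question) & _words(description))


SKILL_MD = """---
name: pod-failure-diagnoser
description: Diagnose CCE Pod failures such as CrashLoopBackOff, ImagePullBackOff, OOMKilled, Pending, Evicted, restart storms, or workload unavailable.
---
"""

skill = parse_front_matter(SKILL_MD)
score = keyword_score("What should I do if a pod keeps restarting?", skill["description"])
```

Only the word `pod` overlaps literally in this example ("restarting" does not equal "restart"), which shows why intent matching in practice relies on semantic understanding rather than keyword comparison.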
Obtaining Skills
Huawei Cloud cloud-native Skills are provided through an open repository on GitHub.
Each Skill uses a self-contained directory structure that contains the description and auxiliary files required to run the Skill.
    skill-name/
    ├── SKILL.md      # Skill definition file, which is the only entry
    ├── references/   # Reference documents
    ├── scripts/      # Executable scripts
    ├── templates/    # Template files
    └── demo/         # Demonstration examples
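A quick way to check that a downloaded directory follows this layout is sketched below in Python. The helper `inspect_skill_dir` is a hypothetical name, and treating every folder other than SKILL.md as optional is an assumption; this document only states that SKILL.md is the sole entry.

```python
from pathlib import Path
import tempfile

# Folders named in the layout above; assumed optional for this sketch.
OPTIONAL_DIRS = ("references", "scripts", "templates", "demo")


def inspect_skill_dir(path: Path) -> dict:
    """Report whether SKILL.md exists and which optional folders are present."""
    return {
        "valid": (path / "SKILL.md").is_file(),
        "optional_present": [d for d in OPTIONAL_DIRS if (path / d).is_dir()],
    }


# Build a throwaway Skill directory to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "skill-name"
    (skill_dir / "references").mkdir(parents=True)
    (skill_dir / "SKILL.md").write_text("---\nname: skill-name\n---\n", encoding="utf-8")
    report = inspect_skill_dir(skill_dir)
    empty_report = inspect_skill_dir(Path(tmp) / "not-a-skill")
```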
Skill Installation
- Method 1: Using npx
    # Install a single Skill.
    npx skills add huaweicloud/huaweicloud-skills --skill <skill-name>
    # Install all Skills.
    npx skills add huaweicloud/huaweicloud-skills
- Method 2: Using the GitHub repository for manual installation
    git clone https://github.com/huaweicloud/huaweicloud-skills.git
    # Install a specified Skill.
    npx skills add <path>/huaweicloud-skills/skills/<skill-name>
The loading paths and integration methods vary depending on the Agent platform. For details, see Platform Integration Example.
Authentication Configuration
Before using Skills related to Huawei Cloud products, configure authentication information based on the target cloud service.
- Interactive configuration
    Access Key Id: <your AK>
    Secret Access Key: <your SK>
- AccessKey authentication configuration using KooCLI
hcloud configure set --cli-access-key="<your AK>" --cli-secret-key="<your SK>" --cli-mode="AKSK"
- Use plaintext AK/SK authentication only in the trusted local test environment to prevent credential leakage.
- The cloud environment must comply with the principle of least privilege and follow the instructions provided in Identity Authentication and Access Control.
- Do not write AKs/SKs into scripts, logs, reports, or code repositories.
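One way to reduce the risk of credentials reaching logs or reports is to redact known secret values before any text is written out. The following Python sketch assumes the caller already holds the AK/SK values in memory; the `redact` helper and the placeholder string are hypothetical and are not part of KooCLI or the Skills.

```python
from typing import Iterable


def redact(text: str, secrets: Iterable[str], placeholder: str = "***REDACTED***") -> str:
    """Replace every occurrence of each non-empty secret in `text` with a placeholder."""
    # Longest secrets first, so a secret that contains another one is replaced whole.
    for secret in sorted((s for s in secrets if s), key=len, reverse=True):
        text = text.replace(secret, placeholder)
    return text


# Fake values for illustration only; never hard-code real credentials.
log_line = "auth ok ak=EXAMPLE_AK sk=EXAMPLE_SK"
safe_line = redact(log_line, ["EXAMPLE_AK", "EXAMPLE_SK"])
```

Redaction of this kind is a last line of defense. It complements, and does not replace, keeping AKs/SKs out of scripts and repositories in the first place.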
Reference
Overview
Huawei Cloud cloud-native Skills are organized around cloud-native resource management and continuous O&M scenarios, covering capability domains such as resource lifecycle, observability and alarms, fault diagnosis and recovery, inspection and governance, solution and delivery, and multi-cloud and multi-cluster management.
Each Skill is provided as an independent directory, including the capability description, application scenarios, and necessary reference documents. You can select a single Skill or combine multiple Skills to complete cross-service and cross-step O&M tasks based on service requirements. The following lists available Skills by capability domain.
Lifecycle and Resource Management
Lifecycle and resource management covers CCE, CCI, and SWR. The product names are used only for grouping. Each row in the following table represents an independent Skill.
- CCE
| Skill | Directory Path | Function |
|---|---|---|
| huawei-cloud-cce-cluster-management | skills/huawei-cloud-cce-cluster-management | Manages the full lifecycle of CCE clusters, node pools, nodes, add-ons, EIPs, and kubeconfig. |
| cce-cluster-upgrade-planner | skills/cce/cce-cluster-upgrade-planner | Plans the CCE Kubernetes version upgrade and checks the upgrade path, add-on compatibility, version differences, and upgrade window. |
| cce-workload-manager | skills/cce/cce-workload-manager | Manages CCE workloads and Kubernetes resources, including Deployments, StatefulSets, DaemonSets, Jobs, CronJobs, HPAs, Services, Ingresses, and ConfigMaps. |
- CCI
| Skill | Directory Path | Function |
|---|---|---|
| huawei-cloud-cci-instance-management | skills/cci/huawei-cloud-cci-instance-management | Manages CCI, including namespaces, networks, Deployments, StatefulSets, pods, EIPPools, logs, and metrics. |
- SWR
| Skill | Directory Path | Function |
|---|---|---|
| huawei-cloud-swr-image-management | skills/swr/huawei-cloud-swr-image-management | Manages SWR namespaces, repositories, tags, login credentials, and quotas. |
| huawei-cloud-swr-image-governance | skills/swr/huawei-cloud-swr-image-governance | Manages SWR permissions, retention policies, sharing policies, agencies, and immutability rules. |
| huawei-cloud-swr-image-automation | skills/swr/huawei-cloud-swr-image-automation | Manages SWR image synchronization, triggers, and automatic deployment processes. |
| huawei-cloud-swr-enterprise-instance | skills/swr/huawei-cloud-swr-enterprise-instance | Manages SWR Enterprise Edition, namespaces, repositories, artifacts, credentials, endpoints, and domain names. |
Observability and Intelligent Alarms
| Skill | Directory Path | Function |
|---|---|---|
| observability-context-builder | skills/observability-context-builder | Aggregates AOM alarms, metrics, LTS logs, pod logs, and Kubernetes events to form diagnosis contexts. |
| alarm-correlation-engine | skills/alarm-correlation-engine | Performs association analysis on AOM active and historical alarms, deduplicates and merges alarms, groups alarms by severity, and checks alarm rules. |
| log-analyzer | skills/log-analyzer | Queries and analyzes pod standard output, CCE LogConfig application logs, and LTS logs. |
| kubernetes-event-analyzer | skills/kubernetes-event-analyzer | Queries and analyzes Kubernetes warning events, repetition patterns, and pod, node, and workload exceptions. |
| metric-analyzer | skills/metric-analyzer | Queries and analyzes CCE pod, node, and ECS, ELB, EIP, and NAT metrics to identify threshold exceptions. |
Fault Diagnosis and Self-Healing
| Skill | Directory Path | Function |
|---|---|---|
| pod-failure-diagnoser | skills/pod-failure-diagnoser | Diagnoses pod faults such as CrashLoopBackOff, ImagePullBackOff, OOMKilled, Pending, Evicted, and frequent restarts. |
| workload-failure-diagnoser | skills/workload-failure-diagnoser | Diagnoses Deployment, StatefulSet, and DaemonSet release failures, rolling upgrade suspension, insufficient replicas, and probe exceptions. |
| node-failure-diagnoser | skills/node-failure-diagnoser | Diagnoses Node NotReady, resource pressure, NPD, CNI, kubelet, and container runtime exceptions. |
| autoscaling-diagnoser | skills/autoscaling-diagnoser | Diagnoses HPA and Cluster Autoscaler link faults. |
| network-failure-diagnoser | skills/network-failure-diagnoser | Diagnoses Service, DNS, ingress, NetworkPolicy, ELB, EIP, NAT, and VPC network faults. |
| storage-failure-diagnoser | skills/storage-failure-diagnoser | Diagnoses PVC, PV, EVS, SFS, OBS, mounting, capacity, and deletion protection faults. |
| root-cause-analyzer | skills/root-cause-analyzer | Summarizes cross-domain evidence and outputs top root causes, impact scope, confidence, and recovery handover. |
| change-impact-analyzer | skills/change-impact-analyzer | Analyzes the fault impacts caused by release, configuration, network, security policy, and node changes. |
| dependency-impact-analyzer | skills/dependency-impact-analyzer | Analyzes the fault propagation path and upstream and downstream impacts based on the Service, ingress, pod, and node topologies. |
| auto-remediation-runner | skills/auto-remediation-runner | Generates and executes controlled recovery actions. All high-risk changes are previewed by default and require explicit confirmation. |
Inspection, Governance, and Continuous O&M
| Skill | Directory Path | Function |
|---|---|---|
| daily-cluster-inspector | skills/daily-cluster-inspector | Performs periodic CCE health checks, quick inspections, and continuous O&M summaries. |
| availability-risk-scanner | skills/availability-risk-scanner | Scans for HA, AZ distribution, single replica, PDB, probe, affinity, gateway, and resource overcommitment risks. |
| capacity-trend-forecaster | skills/capacity-trend-forecaster | Analyzes periodic capacity trends, predicts resource bottlenecks, and simulates HPA and node scaling policies. |
| cost-optimization-advisor | skills/cost-optimization-advisor | Analyzes idle resources, excessive requests, low-usage nodes, and scaling policy optimization opportunities. |
| ops-report-generator | skills/ops-report-generator | Summarizes inspection, capacity, availability, cost, and on-call contexts to generate weekly, monthly, SLA, capacity, and stability reports. |
Solution and Delivery
| Skill | Directory Path | Function |
|---|---|---|
| cce-cci-bursting-deployer | skills/cce-cci-bursting-deployer | Configures, deploys, and verifies the auto scaling capability from CCE to CCI 2.0, including VPCEP, virtual-kubelet, and smoke testing. |
| container-migration-planner | skills/container-migration-planner | Counts container platform resources and dependencies, and outputs migration batches, risks, and verification solutions. No real migration is performed. |
| Skill for full-link pressure test | skills/pressure-test | Builds a full-link pressure test from the k6 client through ELB and nginx-ingress to the service pod, collects observability data, and outputs a performance report. |
Multi-Cloud and Multi-Cluster Management
UCS-related Skills are placed in this category and are no longer included in CCE lifecycle management.
| Skill | Directory Path | Function |
|---|---|---|
| ucs-cluster-onboarding-manager | skills/ucs/ucs-cluster-onboarding-manager | Manages UCS clusters, lifecycle, fleet groups, kubeconfig, and resource quotas. |
| ucs-policy-governor | skills/ucs/ucs-policy-governor | Manages UCS policy instances, policy definitions, start and stop operations, execution statuses, and fleet compliance audit. |
Usage
An Agent automatically matches capabilities based on the description in the SKILL.md file of each Skill. If manual locating is required, you can find the target Skill according to this document and then go to the corresponding directory to view the complete description and reference documents.
Using Skills in OpenCode
OpenCode is an AI programming assistant for terminals. It allows you to load a Skill through the project directory or user directory.
- Skill types
- Project-level Skill: Place the Skill directory in the skills/ folder in the root directory of the project.
    my-project/
    ├── src/
    ├── skills/
    │   ├── pod-failure-diagnoser/
    │   │   ├── SKILL.md
    │   │   ├── manifest.json
    │   │   ├── skill-profile.yaml
    │   │   └── references/
    │   ├── node-failure-diagnoser/
    │   └── ...
When OpenCode is started, it automatically scans the skills/ folder in the project directory and loads all Skills. You can directly describe the issue in the dialog, and the Agent will automatically match the appropriate Skill based on the description.
- User-level Skill: Place the Skill directory in the user configuration directory. User-level Skills take effect for all projects and are suitable for common O&M Skills.
- Windows: %USERPROFILE%\.opencode\skills\
- Linux/macOS: ~/.opencode/skills/
- Example
    # Go to the project directory.
    cd my-project
    # Start OpenCode. Skills have been automatically loaded.
    opencode
    # Describe the issue in the dialog.
    > My pod keeps restarting. Can you help me check?
    # The Agent automatically triggers pod-failure-diagnoser.
Using Skills in OpenClaw
OpenClaw is an open-source, self-hosted gateway that connects chat applications and channels to AI Agents. You can run the gateway locally or on your own server and extend Agent capabilities through Skills.
OpenClaw can load Skills from the following directories:
| Directory | Description |
|---|---|
| <workspace>/skills/ | Skills in the current workspace, suitable for project-level customization |
| <workspace>/.agents/skills/ | Project-level Skills for Agents in the current workspace |
| ~/.agents/skills/ | Skills that can be shared by multiple Agents |
| ~/.openclaw/skills/ | Skills managed by OpenClaw |
| skills.load.extraDirs | Skill directories that can be added through configuration |
OpenClaw also loads the Skills that come with the installation. You can copy the required Skill directories to the corresponding loading directories. Example:
    mkdir -p ~/.agents/skills
    cp -R ./skills/pod-failure-diagnoser ~/.agents/skills/
    cp -R ./skills/node-failure-diagnoser ~/.agents/skills/
Each Skill directory must contain SKILL.md. After OpenClaw loads Skills, the Agent can select an appropriate Skill based on your intent and execute tasks based on the workflow defined in the Skill.
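The idea of scanning several loading directories for Skill folders can be sketched in Python as follows. This is an illustration under assumptions, not OpenClaw's loader: the `discover_skills` helper is hypothetical, and the rule that earlier directories win when two Skills share a name is an assumption, since the actual loading sequence is defined in the OpenClaw documentation.

```python
from pathlib import Path
import tempfile


def discover_skills(roots) -> dict:
    """Return {skill_name: path} for subdirectories that contain SKILL.md.

    Earlier roots win on name clashes; this precedence is an assumption.
    """
    found = {}
    for root in roots:
        root = Path(root).expanduser()
        if not root.is_dir():
            continue  # Skip loading directories that do not exist.
        for child in sorted(root.iterdir()):
            if (child / "SKILL.md").is_file() and child.name not in found:
                found[child.name] = child
    return found


# Demonstrate with two temporary roots that mimic <workspace>/skills/ and ~/.agents/skills/.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "workspace" / "skills"
    shared = Path(tmp) / "home" / ".agents" / "skills"
    layout = [
        (workspace, "pod-failure-diagnoser"),
        (shared, "pod-failure-diagnoser"),
        (shared, "node-failure-diagnoser"),
    ]
    for root, name in layout:
        (root / name).mkdir(parents=True, exist_ok=True)
        (root / name / "SKILL.md").write_text(f"---\nname: {name}\n---\n", encoding="utf-8")
    skills = discover_skills([workspace, shared, Path(tmp) / "missing"])
    names = sorted(skills)
    pod_from_workspace = skills["pod-failure-diagnoser"].parent == workspace
```

Directories without SKILL.md are ignored, which mirrors the requirement that each Skill directory must contain that file.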
For details about the positioning, Skill loading sequence, and directory description of OpenClaw, see the OpenClaw documentation and OpenClaw Skills.
Using Skills in Hermes
Hermes is a service orchestration platform for enterprise-class AI Agents. It supports Skill integration through declarative configuration.
Common Issues
When describing an issue, you can refer to the following table to quickly locate the recommended Skill.
| Issue Description | Recommended Skill |
|---|---|
| Pod keeps restarting, stays Pending, or is OOMKilled | pod-failure-diagnoser |
| Release failure, rolling upgrade suspension, and insufficient replicas | workload-failure-diagnoser |
| Node NotReady, resource pressure, and node vulnerabilities | node-failure-diagnoser |
| HPA not scaling pods, CA not scaling nodes, and auto scaling not taking effect | autoscaling-diagnoser |
| Ingress 502, Service unreachable, and ELB link exception | network-failure-diagnoser |
| PVC Pending, FailedMount, and capacity exhaustion | storage-failure-diagnoser |
| A large number of CCE alarms, which need to be combined for analysis | alarm-correlation-engine |
| Pod standard output or LTS application log query | log-analyzer |
| Kubernetes event trend analysis | kubernetes-event-analyzer |
| Query of CCE pod/node metrics and rankings by resource usage | metric-analyzer |
| Aggregation of logs, events, metrics, and alarms | observability-context-builder |
| Service unavailability, requiring comprehensive root cause analysis | root-cause-analyzer |
| Faults upon release, configuration, network, security policy, or node changes | change-impact-analyzer |
| Determining entries and upstream and downstream services affected by a service fault | dependency-impact-analyzer |
| Capacity expansion, restart, draining, and vulnerability fixing | auto-remediation-runner |
| Daily inspection or periodic health check | daily-cluster-inspector |
| Cost optimization and excessive request analysis | cost-optimization-advisor |
| Capacity trend prediction and scaling simulation | capacity-trend-forecaster |
| Availability risk scanning and PDB/probe check | availability-risk-scanner |
| Weekly, monthly, and SLA O&M reports | ops-report-generator |
| Container migration solution and resource stocktaking | container-migration-planner |
| Auto scaling configuration for scheduling CCE workloads to CCI | cce-cci-bursting-deployer |
| CCE cluster version upgrade planning | cce-cluster-upgrade-planner |
| CCE/UCS workload management | cce-workload-manager |
| UCS cluster management and fleet management | ucs-cluster-onboarding-manager |
| UCS policy governance and compliance audit | ucs-policy-governor |
| SWR image lifecycle management | huawei-cloud-swr-image-management |
| SWR image governance | huawei-cloud-swr-image-governance |
| SWR image automation | huawei-cloud-swr-image-automation |
| Pressure test solution and execution | Skill for full-link pressure test |
Helpful Links
| Document | Description | Path |
|---|---|---|
| CCE documentation | CCE documentation | |
| Open Skill repository | Huawei Cloud cloud-native Skill code repository | |