Creating an HPA Policy Based on the GPU Usage of an Inference Service
In a CCE standard or Turbo cluster, you can configure HPA policies for workloads that use GPU resources based on GPU monitoring metrics. This enables applications to automatically scale out during peak hours and scale in during off-peak hours, optimizing resource utilization and reducing costs.
Prerequisites
- A cluster is available. The cluster contains GPU nodes and GPU-related services.
- CCE AI Suite (NVIDIA GPU) has been installed in the cluster, and the add-on reports GPU metrics properly. To verify this, log in to a GPU node and run the following command:
curl {Pod IP}:2112/metrics
In the command, {Pod IP} must be the IP address of the nvidia-gpu-device-plugin pod in CCE AI Suite (NVIDIA GPU). Metric results are expected to be returned.
- Cloud Native Cluster Monitoring 3.9.5 or later has been installed in the cluster with Local Data Storage enabled, and Prometheus has been registered as a service of Metrics API. For details, see Providing Basic Resource Metrics Through the Metrics API. Kubernetes Metrics Server provides the Metrics API by default. If it has been installed in the cluster, you do not need to register Metrics API again.
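The endpoint check in the prerequisites can also be scripted. The following is a minimal sketch, not part of the add-on, that parses Prometheus text-format output and extracts the samples of one metric. The helper name find_samples, the sample text, and its label values are hypothetical; the real output of the add-on may carry different labels.

```python
import re

# One sample line of the Prometheus text exposition format looks like:
#   metric_name{label="value",...} 12.5
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def find_samples(metrics_text, metric_name):
    """Return a list of (labels, value) tuples for metric_name (hypothetical helper)."""
    samples = []
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match and match.group("name") == metric_name:
            labels = dict(LABEL_RE.findall(match.group("labels") or ""))
            samples.append((labels, float(match.group("value"))))
    return samples


# Illustrative text only; label names and values are assumptions.
SAMPLE_OUTPUT = """\
# HELP cce_gpu_utilization GPU utilization (illustrative help text)
# TYPE cce_gpu_utilization gauge
cce_gpu_utilization{namespace="default",pod="qwen-8b-0",container="container-1"} 37
cce_gpu_memory_utilization{namespace="default",pod="qwen-8b-0",container="container-1"} 12
"""

samples = find_samples(SAMPLE_OUTPUT, "cce_gpu_utilization")
for labels, value in samples:
    print(labels.get("pod"), value)
```

In practice, the text returned by the curl command would be passed in place of SAMPLE_OUTPUT. An empty result would suggest that the add-on is not reporting the metric.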
Collecting GPU Metrics
- Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose ConfigMaps and Secrets.
- Select the monitoring namespace. On the ConfigMaps tab, locate the row containing user-adapter-config and click Update. Figure 1 Updating a ConfigMap
- In the window that slides out from the right, click Edit in the Operation column of the config.yaml file in the Data area. Then, add a custom metric collection rule under the rules field. Click OK.
You can add multiple collection rules by adding multiple configurations under the rules field. For details, see Metrics Discovery and Presentation Configuration.
An example custom collection rule for cce_gpu_utilization is shown below. For more GPU metrics, see GPU Metrics.
rules:
- seriesQuery: '{__name__=~"cce_gpu_utilization",container!="",namespace!="",pod!=""}'
  seriesFilters: []
  resources:
    overrides:
      namespace:
        resource: namespace
      pod:
        resource: pod
  metricsQuery: sum(last_over_time(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)
Figure 2 Configuring a custom collection rule
- Redeploy the custom-metrics-apiserver workload in the monitoring namespace. Figure 3 Redeploying custom-metrics-apiserver
- After the restart, check whether the metrics of the pod are normal.
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"
Information similar to the following is displayed:
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"custom.metrics.k8s.io/v1beta1","resources":[{"name":"pods/cce_gpu_memory_utilization","singularName":"","namespaced":true,"kind":"MetricValueList","verbs":["get"]},{"name":"namespaces/cce_gpu_memory_utilization","singularName":"","namespaced":false,"kind":"MetricValueList","verbs":["get"]}]}
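The response above can be checked programmatically. The following is a minimal sketch, assuming the JSON has been captured from the kubectl command; the helper name pod_metric_names is hypothetical. The sample response on this page lists cce_gpu_memory_utilization. With a rule written for cce_gpu_utilization, that name would presumably be the one to look for instead.

```python
import json


def pod_metric_names(api_resource_list_json):
    """Return the names of pod-scoped custom metrics in an APIResourceList (hypothetical helper)."""
    doc = json.loads(api_resource_list_json)
    names = []
    for resource in doc.get("resources", []):
        # Resource names have the form "<scope>/<metric>", for example "pods/cce_gpu_utilization".
        scope, _, metric = resource["name"].partition("/")
        if scope == "pods":
            names.append(metric)
    return names


# Sample response copied from the example output on this page.
RESPONSE = (
    '{"kind":"APIResourceList","apiVersion":"v1",'
    '"groupVersion":"custom.metrics.k8s.io/v1beta1","resources":['
    '{"name":"pods/cce_gpu_memory_utilization","singularName":"","namespaced":true,'
    '"kind":"MetricValueList","verbs":["get"]},'
    '{"name":"namespaces/cce_gpu_memory_utilization","singularName":"","namespaced":false,'
    '"kind":"MetricValueList","verbs":["get"]}]}'
)

names = pod_metric_names(RESPONSE)
print(names)
```

If the metric configured in the collection rule is missing from the returned list, the rule or the custom-metrics-apiserver restart is the first thing to recheck.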
Deploying an Inference Service
- Deploy an inference service.
You can store the checkpoint of Qwen3-8B to a specified path in advance and mount the path using hostPath. In this example, the checkpoint is stored in the local /root/wx/checkpoints/qwen/ directory, and the image is vllm/vllm-openai:v0.11.0.
kind: Deployment
apiVersion: apps/v1
metadata:
  name: qwen-8b
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen-8b
      version: v1
  template:
    metadata:
      labels:
        app: qwen-8b
        version: v1
    spec:
      volumes:
        - name: vol-ckpt
          hostPath:
            path: /root/wx/checkpoints/qwen/  # Path of the Qwen3-8B checkpoint stored locally
            type: ''
      containers:
        - name: container-1
          image: vllm/vllm-openai:v0.11.0
          command:
            - /bin/sh
            - '-c'
          args:
            - vllm serve Qwen3-8B --model='/vllm-workspace/Qwen3-8B' --port=8000 --host=0.0.0.0 --max-model-len=20480
          env:
            - name: TRANSFORMERS_OFFLINE
              value: '1'
            - name: HF_DATASET_OFFLINE
              value: '1'
          resources:
            limits:
              cpu: '16'
              memory: 80Gi
              nvidia.com/gpu: '1'
            requests:
              cpu: '16'
              memory: 80Gi
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: vol-ckpt
              mountPath: /vllm-workspace/Qwen3-8B
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      imagePullSecrets:
        - name: default-secret
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
---
apiVersion: v1
kind: Service
metadata:
  name: qwen-8b-svc
  namespace: default
spec:
  selector:
    app: qwen-8b
    version: v1
  type: ClusterIP
  ports:
    - name: http
      port: 8000
      targetPort: 8000
- Check the pod and Service deployment statuses.
kubectl get pod
kubectl get service
The expected results are shown below.

- Verify the inference service. Replace the IP address with the Service IP address.
curl http://10.247.90.143:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen3-8B","prompt": "San Francisco is a","max_tokens": 100,"temperature": 0}'
Information similar to that shown in the figure below is displayed.

Creating an Auto Scaling Policy
- In the navigation pane, choose Workloads. In the workload list, locate the row containing the workload and click Auto Scaling in the Operation column.
- Set Policy Type to HPA+CronHPA and enable HPA.
You can select GPU monitoring parameters in Custom Policy to create an auto scaling policy. An example is shown below.
Figure 4 Selecting a custom metric
cce_gpu_utilization (GPU usage) is used as an example. Configure other HPA parameters as required. For details, see Creating an HPA Policy.
- Return to the Scaling Policies tab and check whether the HPA policy has been created. Figure 5 HPA policy created
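Once the policy exists, the HPA controller derives the replica count from the ratio between the observed metric value and the target value. The following is a minimal sketch of the basic calculation described in the Kubernetes HPA documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change when the ratio is within a tolerance (0.1 by default). The target of 50 and the replica bounds of 1 to 4 are assumptions chosen for illustration, not values from this page. The sketch ignores pod readiness, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_avg, target_avg,
                     min_replicas, max_replicas, tolerance=0.1):
    """Basic HPA replica calculation (simplified sketch, not the controller code)."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured in the HPA policy.
    return max(min_replicas, min(max_replicas, desired))


# Assumed policy: average cce_gpu_utilization target of 50, 1 to 4 replicas.
scale_out = desired_replicas(1, 90, 50, 1, 4)   # one busy pod during the stress test
scale_in = desired_replicas(3, 10, 50, 1, 4)    # three mostly idle pods afterwards
print(scale_out, scale_in)
```

Under these assumed values, a single pod at 90% GPU usage leads to a second replica, and three pods averaging 10% are reduced to one.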
Testing Auto Scaling of the Inference Service
Verifying an HPA Scale-Out
- Download the ApacheBench (ab) tool for stress testing provided by Apache.
# If the OS is Ubuntu:
apt-get install apache2-utils
# If the OS is CentOS or Huawei Cloud EulerOS:
yum install httpd-tools
- Perform a stress test. In the command, ab is the stress test tool, and body.json is the request body.
The content of body.json is as follows:
{
  "model": "Qwen3-8B",
  "prompt": "San Francisco is a",
  "max_tokens": 100,
  "temperature": 0
}
The stress test command is as follows:
ab -n 4000 -c 1 -p body.json -T application/json http://10.247.90.143:8000/v1/completions
- Check the GPU workload and HPA statuses.
Verifying an HPA Scale-In
After the stress test is complete, check the GPU workload and HPA statuses.