Creating an HPA Policy Based on the GPU Usage of an Inference Service

In a CCE standard or Turbo cluster, you can configure HPA policies for workloads that use GPU resources based on GPU monitoring metrics. This enables applications to automatically scale out during peak hours and scale in during off-peak hours, optimizing resource utilization and reducing costs.

Prerequisites

A cluster is available. There are GPU nodes and GPU-related services in the cluster.
CCE AI Suite (NVIDIA GPU) has been installed in the cluster, and the add-on reports GPU metrics properly. To verify this, log in to a GPU node and run the following command:
```
curl {Pod IP}:2112/metrics
```
In the command, {Pod IP} must be the pod IP address of nvidia-gpu-device-plugin in CCE AI Suite (NVIDIA GPU), and metric results are expected to be returned.
Cloud Native Cluster Monitoring 3.9.5 or later has been installed in the cluster with Local Data Storage enabled, and Prometheus has been registered as a Metrics API service. For details, see Providing Basic Resource Metrics Through the Metrics API. Kubernetes Metrics Server provides the Metrics API by default. If it has been installed in the cluster, you do not need to register the Metrics API again.
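
The GPU metric check described in the prerequisites can be scripted. The following is a minimal sketch, assuming that the add-on pods run in the kube-system namespace and that their names contain nvidia-gpu-device-plugin. The function names gpu_plugin_pod_ips and check_gpu_metrics are hypothetical. Adjust the namespace if the add-on is deployed elsewhere.

```shell
# Assumption: the CCE AI Suite (NVIDIA GPU) pods run in the kube-system namespace.
# Print the IP address of every pod whose name contains nvidia-gpu-device-plugin.
gpu_plugin_pod_ips() {
  kubectl get pod -n kube-system \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.podIP}{"\n"}{end}' |
    awk '/nvidia-gpu-device-plugin/ {print $2}'
}

# Query port 2112 on each plugin pod and keep only the GPU utilization series.
check_gpu_metrics() {
  for ip in $(gpu_plugin_pod_ips); do
    curl -s "${ip}:2112/metrics" | grep '^cce_gpu_utilization'
  done
}
```

Run check_gpu_metrics on a GPU node. If the add-on reports metrics properly, lines for cce_gpu_utilization should be printed.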

Collecting GPU Metrics

Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose ConfigMaps and Secrets.
Select the monitoring namespace. On the ConfigMaps tab, locate the row containing user-adapter-config and click Update.

Figure 1 Updating a ConfigMap
In the window that slides out from the right, click Edit in the Operation column of the config.yaml file in the Data area. Then, add a custom metric collection rule under the rules field. Click OK.

You can add multiple collection rules by adding multiple configurations under the rules field. For details, see Metrics Discovery and Presentation Configuration.
An example of custom collection rules for cce_gpu_utilization is shown below. For more GPU metrics, see GPU Metrics.
```
rules:
  - seriesQuery: '{__name__=~"cce_gpu_utilization",container!="",namespace!="",pod!=""}'
    seriesFilters: []
    resources:
      overrides:
        namespace:
          resource: namespace
        pod:
          resource: pod
    metricsQuery: sum(last_over_time(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)
```
Figure 2 Configuring a custom collection rule
Redeploy the custom-metrics-apiserver workload in the monitoring namespace.

Figure 3 Redeploying custom-metrics-apiserver

After the restart, check whether the custom metrics are available through the Metrics API.

```
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"
```

Information similar to the following is displayed. The metric names listed depend on the collection rules you configured.

```
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"custom.metrics.k8s.io/v1beta1","resources":[{"name":"pods/cce_gpu_memory_utilization","singularName":"","namespaced":true,"kind":"MetricValueList","verbs":["get"]},{"name":"namespaces/cce_gpu_memory_utilization","singularName":"","namespaced":false,"kind":"MetricValueList","verbs":["get"]}]}
```
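
Once a GPU workload is running, the per-pod values of a collected metric can also be read from the same API. The following sketch builds the standard custom metrics API path for a pod-level metric. The function names custom_metric_path and get_pod_metric are hypothetical, and the namespace and metric name are passed as arguments.

```shell
# Build the custom metrics API path for a per-pod metric in a namespace.
# $1: namespace, $2: metric name
custom_metric_path() {
  printf '/apis/custom.metrics.k8s.io/v1beta1/namespaces/%s/pods/*/%s' "$1" "$2"
}

# Fetch the current values of the metric for all pods in the namespace.
get_pod_metric() {
  kubectl get --raw "$(custom_metric_path "$1" "$2")"
}
```

For example, get_pod_metric default cce_gpu_utilization queries the GPU usage of all pods in the default namespace. An empty items list usually means that no pod in the namespace is reporting the metric yet.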

Deploying an Inference Service

Deploy an inference service.

You can store the checkpoint of Qwen3-8B to a specified path in advance and mount the path using hostPath. In this example, the checkpoint is stored in the local /root/wx/checkpoints/qwen/ directory, and the image is vllm/vllm-openai:v0.11.0.

```
kind: Deployment
apiVersion: apps/v1
metadata:
  name: qwen-8b
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen-8b
      version: v1
  template:
    metadata:
      labels:
        app: qwen-8b
        version: v1
    spec:
      volumes:
        - name: vol-ckpt
          hostPath:
            path: /root/wx/checkpoints/qwen/ # Path of the Qwen3-8B checkpoint stored locally
            type: ''
      containers:
        - name: container-1
          image: vllm/vllm-openai:v0.11.0
          command:
            - /bin/sh
            - '-c'
          args:
            - vllm serve Qwen3-8B --model='/vllm-workspace/Qwen3-8B' --port=8000 --host=0.0.0.0 --max-model-len=20480
          env:
            - name: TRANSFORMERS_OFFLINE
              value: '1'
            - name: HF_DATASETS_OFFLINE
              value: '1'
          resources:
            limits:
              cpu: '16'
              memory: 80Gi
              nvidia.com/gpu: '1'
            requests:
              cpu: '16'
              memory: 80Gi
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: vol-ckpt
              mountPath: /vllm-workspace/Qwen3-8B
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      imagePullSecrets:
        - name: default-secret
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
---
apiVersion: v1
kind: Service
metadata:
  name: qwen-8b-svc
  namespace: default
spec:
  selector:
    app: qwen-8b
    version: v1
  type: ClusterIP
  ports:
    - name: http
      port: 8000
      targetPort: 8000
```

Check the pod and Service deployment statuses.
```
kubectl get pod
kubectl get service
```
The qwen-8b pod is expected to be in the Running state, and the qwen-8b-svc Service is expected to be listed with a cluster IP address.

Verify the inference service. Replace the IP address with the Service IP address.

```
curl http://10.247.90.143:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen3-8B","prompt": "San Francisco is a","max_tokens": 100,"temperature": 0}'
```

Information similar to that shown in the following figure is displayed.
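
The Service IP address used in the request can be looked up with kubectl instead of being copied manually. The following is a minimal sketch that uses the Service name and namespace from the example manifest. The function names service_ip and test_inference are hypothetical.

```shell
# Look up the ClusterIP of the qwen-8b-svc Service in the default namespace.
service_ip() {
  kubectl get service qwen-8b-svc -n default -o jsonpath='{.spec.clusterIP}'
}

# Send the same completion request to the looked-up address.
test_inference() {
  curl "http://$(service_ip):8000/v1/completions" \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen3-8B","prompt": "San Francisco is a","max_tokens": 100,"temperature": 0}'
}
```

Run test_inference from a node or pod that can reach the cluster IP address.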


Creating an Auto Scaling Policy

In the navigation pane, choose Workloads. In the workload list, locate the row containing the workload and click Auto Scaling in the Operation column.
Set Policy Type to HPA+CronHPA and enable HPA.

You can select GPU monitoring parameters in Custom Policy to create an auto scaling policy. An example is shown below.

Figure 4 Selecting a custom metric

cce_gpu_utilization (GPU usage) is used as an example. Configure other HPA parameters as required. For details, see Creating an HPA Policy.
Return to the Scaling Policies tab and check whether the HPA policy has been created.

Figure 5 HPA policy created
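
For reference, a policy of this kind can also be expressed as a standard Kubernetes HorizontalPodAutoscaler manifest. The following sketch is not necessarily identical to what the console generates. It assumes that the cluster supports the autoscaling/v2 API, and the HPA name qwen-8b-gpu-hpa, the replica range of 1 to 3, and the target value of 20 (the threshold used in the test later in this section) are illustrative assumptions.

```shell
# Write a HorizontalPodAutoscaler manifest that scales the qwen-8b Deployment
# on the per-pod average of the cce_gpu_utilization custom metric.
# Assumptions: HPA name, replica range, and target value are illustrative.
cat > gpu-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen-8b-gpu-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen-8b
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Pods
      pods:
        metric:
          name: cce_gpu_utilization
        target:
          type: AverageValue
          averageValue: "20"
EOF
```

If you manage the policy with kubectl instead of the console, apply the file with kubectl apply -f gpu-hpa.yaml. Avoid creating a second HPA for a workload that already has one, because multiple HPAs targeting the same workload can conflict.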

Testing Auto Scaling of the Inference Service

Verifying an HPA Scale-Out

Install ApacheBench (ab), the stress testing tool provided by Apache.

```
# If the OS is Ubuntu:
apt-get install apache2-utils
# If the OS is CentOS or Huawei Cloud EulerOS:
yum install httpd-tools
```

Perform a stress test. In the command, ab is the stress test tool, and body.json is the file containing the request body.

The content of body.json is as follows:

```
{
    "model": "Qwen3-8B",
    "prompt": "San Francisco is a",
    "max_tokens": 100,
    "temperature": 0
}
```

The stress test command is as follows. Replace the IP address with the Service IP address.

```
ab -n 4000 -c 1 -p body.json -T application/json http://10.247.90.143:8000/v1/completions
```

Check the GPU workload and HPA statuses.
- The GPU usage remains high.
- The HPA detects that the GPU usage exceeds the 20% threshold and starts a scale-out.
- New pods are created for the workload, indicating that the scale-out succeeded.
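
To estimate how many replicas the HPA will request, you can apply the standard Kubernetes HPA formula: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). The following is a simplified sketch of that calculation. The function name desired_replicas is hypothetical, and the sketch ignores the HPA tolerance and the minimum and maximum replica limits.

```shell
# Compute ceil(currentReplicas * currentMetricValue / targetMetricValue).
# $1: current replicas, $2: current average metric value, $3: target value
desired_replicas() {
  awk -v r="$1" -v cur="$2" -v tgt="$3" 'BEGIN {
    x = r * cur / tgt
    n = int(x)
    if (n < x) n++   # round up to the next whole replica
    print n
  }'
}
```

For example, with 1 replica, a hypothetical average GPU usage of 85, and a target of 20, desired_replicas 1 85 20 returns 5, which the HPA would then cap at the configured maximum number of pods. You can watch the actual scaling with kubectl get hpa -w and kubectl get pod -l app=qwen-8b -w.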