Updated on 2026-06-17 GMT+08:00

Creating an HPA Policy Based on the GPU Usage of an Inference Service

In a CCE standard or Turbo cluster, you can configure HPA policies for workloads that use GPU resources based on GPU monitoring metrics. This enables applications to automatically scale out during peak hours and scale in during off-peak hours, optimizing resource utilization and reducing costs.

Prerequisites

  • A cluster is available, and it contains GPU nodes and GPU-related services.
  • CCE AI Suite (NVIDIA GPU) has been installed in the cluster, and the add-on reports GPU metrics properly. To verify this, log in to a GPU node and run the following command:
    curl {Pod IP}:2112/metrics

    In the command, replace {Pod IP} with the IP address of the nvidia-gpu-device-plugin pod of CCE AI Suite (NVIDIA GPU). Metric data is expected in the response.

  • Cloud Native Cluster Monitoring 3.9.5 or later has been installed in the cluster with Local Data Storage enabled, and Prometheus has been registered as a Metrics API service. For details, see Providing Basic Resource Metrics Through the Metrics API. Kubernetes Metrics Server also provides the Metrics API by default. If it is already installed in the cluster, you do not need to register the Metrics API again.
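The GPU metric check in the prerequisites can be scripted roughly as follows. This is a minimal sketch, not part of the add-on: the helper names gpu_plugin_pod_ips and check_gpu_metrics are hypothetical, and it assumes that the add-on pod names start with nvidia-gpu-device-plugin and that kubectl is configured for the cluster.

```shell
# List the pod IP addresses of the nvidia-gpu-device-plugin pods.
# Assumption: the pod names start with "nvidia-gpu-device-plugin".
gpu_plugin_pod_ips() {
  kubectl get pods --all-namespaces --no-headers \
    -o custom-columns=NAME:.metadata.name,IP:.status.podIP |
    awk '$1 ~ /^nvidia-gpu-device-plugin/ { print $2 }'
}

# Fetch the metrics from one pod IP ($1) and keep only the GPU usage series.
# Run this on a GPU node so that the pod IP address is reachable.
check_gpu_metrics() {
  curl -s "http://$1:2112/metrics" | grep '^cce_gpu_utilization'
}

# Example (run manually):
#   for ip in $(gpu_plugin_pod_ips); do check_gpu_metrics "$ip"; done
```

If check_gpu_metrics prints nothing for a pod, the add-on is probably not reporting cce_gpu_utilization on that node yet.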

Collecting GPU Metrics

  1. Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose ConfigMaps and Secrets.
  2. Select the monitoring namespace. On the ConfigMaps tab, locate the row containing user-adapter-config and click Update.

    Figure 1 Updating a ConfigMap

  3. In the window that slides out from the right, click Edit in the Operation column of the config.yaml file in the Data area. Then, add a custom metric collection rule under the rules field. Click OK.

    You can add multiple collection rules by adding multiple configurations under the rules field. For details, see Metrics Discovery and Presentation Configuration.

    An example of custom collection rules for cce_gpu_utilization is shown below. For more GPU metrics, see GPU Metrics.
    rules:
      - seriesQuery: '{__name__=~"cce_gpu_utilization",container!="",namespace!="",pod!=""}'
        seriesFilters: []
        resources:
          overrides:
            namespace:
              resource: namespace
            pod:
              resource: pod
        metricsQuery: sum(last_over_time(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)
    Figure 2 Configuring a custom collection rule

  4. Redeploy the custom-metrics-apiserver workload in the monitoring namespace.

    Figure 3 Redeploying custom-metrics-apiserver

  5. After the restart, check whether the custom metric is exposed through the custom metrics API.

    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"

    Information similar to the following is displayed:

    {"kind":"APIResourceList","apiVersion":"v1","groupVersion":"custom.metrics.k8s.io/v1beta1","resources":[{"name":"pods/cce_gpu_utilization","singularName":"","namespaced":true,"kind":"MetricValueList","verbs":["get"]},{"name":"namespaces/cce_gpu_utilization","singularName":"","namespaced":false,"kind":"MetricValueList","verbs":["get"]}]}
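Beyond listing the registered metrics, the current per-pod values can be read from the same API. The sketch below wraps the standard custom metrics API path in a hypothetical helper named get_pod_metric. Items are returned only when pods that use GPUs are running in the queried namespace.

```shell
# Read the current value of a pod-level custom metric for all pods in a namespace.
# $1: namespace, $2: metric name
get_pod_metric() {
  kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/$1/pods/*/$2"
}

# Example (run manually):
#   get_pod_metric default cce_gpu_utilization
```

An empty items list usually means that no series currently matches the seriesQuery of the collection rule for that namespace.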

Deploying an Inference Service

  1. Deploy an inference service.

    You can store the Qwen3-8B checkpoint in a path on the node in advance and mount that path using hostPath. In this example, the checkpoint is stored in the local /root/wx/checkpoints/qwen/ directory, and the image is vllm/vllm-openai:v0.11.0.

    kind: Deployment
    apiVersion: apps/v1
    metadata:
      name: qwen-8b
      namespace: default
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: qwen-8b
          version: v1
      template:
        metadata:
          labels:
            app: qwen-8b
            version: v1
        spec:
          volumes:
            - name: vol-ckpt
              hostPath:
                path: /root/wx/checkpoints/qwen/ # Path of the Qwen3-8B checkpoint stored locally
                type: ''
          containers:
            - name: container-1
              image: vllm/vllm-openai:v0.11.0
              command:
                - /bin/sh
                - '-c'
              args:
                - vllm serve Qwen3-8B --model='/vllm-workspace/Qwen3-8B' --port=8000 --host=0.0.0.0 --max-model-len=20480 
              env:
                - name: TRANSFORMERS_OFFLINE
                  value: '1'
                - name: HF_DATASETS_OFFLINE
                  value: '1'
              resources:
                limits:
                  cpu: '16'
                  memory: 80Gi
                  nvidia.com/gpu: '1'
                requests:
                  cpu: '16'
                  memory: 80Gi
                  nvidia.com/gpu: '1'
              volumeMounts:
                - name: vol-ckpt
                  mountPath: /vllm-workspace/Qwen3-8B
              imagePullPolicy: IfNotPresent
          restartPolicy: Always
          terminationGracePeriodSeconds: 30
          dnsPolicy: ClusterFirst
          securityContext: {}
          imagePullSecrets:
            - name: default-secret
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 25%
          maxSurge: 25%
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: qwen-8b-svc
      namespace: default
    spec:
      selector:
        app: qwen-8b
        version: v1
      type: ClusterIP   
      ports:
        - name: http
          port: 8000        
          targetPort: 8000  

  2. Check the pod and Service deployment statuses.

    kubectl get pod
    kubectl get service

    The pod is expected to be in the Running state, and the qwen-8b-svc Service is expected to have a cluster IP address.

  3. Verify the inference service. Replace 10.247.90.143 in the command with the cluster IP address of the Service.

    curl http://10.247.90.143:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen3-8B","prompt": "San Francisco is a","max_tokens": 100,"temperature": 0}'

    If the service works properly, a JSON response containing the generated completion is returned.

Creating an Auto Scaling Policy

  1. In the navigation pane, choose Workloads. In the workload list, locate the row containing the workload and click Auto Scaling in the Operation column.
  2. Set Policy Type to HPA+CronHPA and enable HPA.

    You can select GPU monitoring parameters in Custom Policy to create an auto scaling policy. An example is shown below.

    Figure 4 Selecting a custom metric

    cce_gpu_utilization (GPU usage) is used as an example. Configure other HPA parameters as required. For details, see Creating an HPA Policy.

  3. Return to the Scaling Policies tab and check whether the HPA policy has been created.

    Figure 5 HPA policy created
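For reference, a policy like the one configured on the console can also be expressed as a Kubernetes HorizontalPodAutoscaler manifest. The sketch below writes such a manifest to a file. It rests on several assumptions: the name qwen-8b-gpu-hpa and the replica bounds are hypothetical, the 20 target matches the 20% threshold used in the test later in this section, and cce_gpu_utilization is assumed to be reported as a percentage value. Treat it as an alternative to the console policy, not something to apply in addition to it.

```shell
# Write a hypothetical HPA manifest that scales the qwen-8b Deployment
# on the per-pod average of cce_gpu_utilization.
cat > qwen-8b-gpu-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen-8b-gpu-hpa        # Hypothetical name
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen-8b
  minReplicas: 1               # Assumed lower bound
  maxReplicas: 4               # Assumed upper bound
  metrics:
    - type: Pods
      pods:
        metric:
          name: cce_gpu_utilization
        target:
          type: AverageValue
          averageValue: "20"   # Scale out when the average GPU usage exceeds 20
EOF

# Apply manually if you manage the policy with kubectl instead of the console:
#   kubectl apply -f qwen-8b-gpu-hpa.yaml
```

The Pods metric type is used because the collection rule maps the metric to the pod resource, so the HPA averages the value across the pods of the Deployment.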

Testing Auto Scaling of the Inference Service

Verifying an HPA Scale-Out

  1. Install ApacheBench (ab), the stress testing tool provided by Apache.

    # If the OS is Ubuntu:
    apt-get install apache2-utils
    # If the OS is CentOS or Huawei Cloud EulerOS:
    yum install httpd-tools

  2. Run a stress test with ab, using body.json as the request body.

    The content of body.json is as follows:

    {
        "model": "Qwen3-8B",
        "prompt": "San Francisco is a",
        "max_tokens": 100,
        "temperature": 0
    }

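    The stress test command is as follows. It sends 4,000 requests in total (-n) with a concurrency of 1 (-c). Replace the IP address with the cluster IP address of the Service.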

    ab -n 4000 -c 1 -p body.json -T application/json http://10.247.90.143:8000/v1/completions

  3. Check the GPU workload and HPA statuses.

    • The GPU usage remains high.

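    • The HPA detects that the GPU usage exceeds the target configured in the policy (20% in this example) and starts a scale-out.

    • Check the pod status. New pods are created and become ready, which indicates that the scale-out succeeded.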
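The number of pods the HPA aims for follows the standard Kubernetes formula: desired replicas = ceil(current replicas × current metric value / target value). The sketch below works through that arithmetic with integer math. The helper name desired_replicas and the 85% sample value are hypothetical, and the result is still bounded by the minimum and maximum replica counts set in the policy.

```shell
# desired = ceil(current_replicas * current_value / target_value)
# $1: current replicas, $2: current average metric value, $3: target value
desired_replicas() {
  current_replicas=$1
  current_value=$2
  target_value=$3
  # Integer ceiling division: (a + b - 1) / b
  echo $(( (current_replicas * current_value + target_value - 1) / target_value ))
}

# One pod at an average GPU usage of 85 with a target of 20:
desired_replicas 1 85 20   # prints 5
```

By default, Kubernetes skips scaling when the ratio of the current value to the target is within a 10% tolerance of 1.0, so small fluctuations around the target do not change the replica count.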

Verifying an HPA Scale-In

After the stress test is complete, check the GPU workload and HPA statuses.

  • The GPU usage decreases.

  • Wait for about 5 minutes, which is the default scale-in stabilization window. The HPA then triggers a scale-in.

  • Check the pod status. The extra pods are terminated, which indicates that the scale-in succeeded.
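The 5-minute wait corresponds to the default scale-in stabilization window of the Kubernetes HPA (300 seconds). If you manage the HPA object with kubectl, the window can be adjusted through the behavior field of autoscaling/v2. The following is a minimal sketch. The helper name set_scale_in_window is hypothetical, and whether the CCE console preserves a manual change like this is not confirmed here.

```shell
# Set the scale-in stabilization window of an HPA.
# $1: HPA name, $2: namespace, $3: window in seconds
set_scale_in_window() {
  kubectl patch hpa "$1" -n "$2" --type merge \
    -p "{\"spec\":{\"behavior\":{\"scaleDown\":{\"stabilizationWindowSeconds\":$3}}}}"
}

# Example (run manually, with the name of your HPA):
#   set_scale_in_window <hpa-name> default 120
```

A shorter window releases GPU pods sooner after the load drops, at the cost of more frequent scaling if the load fluctuates.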