Enabling Auto Scaling for a GPU Node
If there are not enough GPU resources in a cluster, GPU nodes can be scaled out automatically. This section describes how to create an auto scaling policy for a GPU node.
Prerequisites
- You have installed the CCE AI Suite (NVIDIA GPU) and CCE Cluster Autoscaler in the cluster.
- The Auto Node Scale-Out function is enabled. To do so, go to Settings, click the Auto Scaling tab and enable Auto Node Scale-Out under Node Scale-Out Criteria.
Step 1: Configure a Node Pool
- Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose Nodes.
- In the right pane, click the Node Pools tab and click Create Node Pool in the upper right corner. For details, see Creating a Node Pool.
- After the node pool is created, click Auto Scaling. In the AS Object area, enable Auto Scaling for the target specification and click OK.
Step 2: Create a GPU Workload and Enable Auto Scale-Out
- Use the following YAML to create a GPU workload:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ac-test
  namespace: default
spec:
  replicas: 1               # Number of replicas
  selector:
    matchLabels:
      app: ac-test
  template:
    metadata:
      labels:
        app: ac-test
    spec:
      restartPolicy: Always
      containers:
        - name: container-1
          image: pytorch/pytorch:2.1.1-cuda12.1-cudnn8-devel
          imagePullPolicy: IfNotPresent
          command: ["/bin/bash", "-c"]
          args:
            - "while true; do nvidia-smi; sleep 10; done"
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
              nvidia.com/gpu: 1
            limits:
              cpu: 250m
              memory: 512Mi
              nvidia.com/gpu: 1
      # Node affinity: schedule the pod to the target GPU node pool (the one created previously).
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cce.cloud.com/cce-nodepool
                    operator: In
                    values:
                      - gpu-130-nodepool-67633   # GPU node pool name
```
- Check the pods and nodes in the node pool. The node pool does not contain any nodes, and the pod is in the Pending state.
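The Pending state can also be confirmed from the command line. A minimal sketch, assuming kubectl is configured for this cluster; the `app=ac-test` label selector comes from the manifest above:

```shell
# List the pods created by the ac-test Deployment. With no GPU node in
# the pool yet, the pod stays in the Pending state.
kubectl get pods -n default -l app=ac-test

# Inspect the scheduling events. They typically report that no node can
# satisfy the requested nvidia.com/gpu resource.
kubectl describe pods -n default -l app=ac-test
```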

- Verify that a node pool scale-out has been triggered.

- Check the pods and nodes in the node pool. A new node is created in the node pool, and the pod is in the Running state.
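The same result can be double-checked from the command line. A sketch, assuming kubectl access; the label key `cce.cloud.com/cce-nodepool` and the pool name are the ones used in the affinity rule above:

```shell
# The newly created node should appear with the node pool label.
kubectl get nodes -l cce.cloud.com/cce-nodepool=gpu-130-nodepool-67633

# The pod should now be Running and bound to that node.
kubectl get pods -n default -l app=ac-test -o wide
```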

Step 3: Delete the GPU Workload and Enable Auto Scale-In
If the amount of GPU resources required by workloads decreases and a GPU node becomes idle, you can enable auto node scale-in to save resources.
- Go to Settings, click the Auto Scaling tab, enable Auto Node Scale-In under Auto Scale-In Settings, and configure the scale-in conditions as required. For details, see Auto Scaling.
- Delete the GPU workload and verify that the idle node starts to be scaled in.

- Check whether the node has been removed.
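The steps above can be sketched from the command line as well, assuming kubectl access:

```shell
# Delete the GPU workload so that the node becomes idle.
kubectl delete deployment ac-test -n default

# Once the configured scale-in conditions are met, the node should
# disappear from this list.
kubectl get nodes -l cce.cloud.com/cce-nodepool=gpu-130-nodepool-67633
```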
