Default GPU Scheduling in Kubernetes
CCE supports default GPU scheduling, which allows containers in your workloads to request and use GPUs on GPU nodes.
Prerequisites
- A GPU node has been created. For details, see Creating a Node.
- The CCE AI Suite (NVIDIA GPU) add-on has been installed. During the installation, select the driver corresponding to the GPU model on the node. For details, see CCE AI Suite (NVIDIA GPU).
- When default GPU scheduling is used in clusters of v1.27 or earlier, the CCE AI Suite (NVIDIA GPU) add-on mounts the driver directory to /usr/local/nvidia/lib64. To use GPUs in a container, you must add /usr/local/nvidia/lib64 to the LD_LIBRARY_PATH environment variable. You can skip this step for clusters of v1.28 or later.
You can add environment variables in any of the following ways:
- Configure the LD_LIBRARY_PATH environment variable in the Dockerfile used for creating an image. (Recommended)
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib64:$LD_LIBRARY_PATH
- Configure the LD_LIBRARY_PATH environment variable in the image startup command.
/bin/bash -c "export LD_LIBRARY_PATH=/usr/local/nvidia/lib64:$LD_LIBRARY_PATH && ..."
- Define the LD_LIBRARY_PATH environment variable when creating a workload. (If this variable is already configured in the container image, it will be overwritten by the value defined here.)
...
  env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/nvidia/lib64
...
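For the first (recommended) option, the following Dockerfile sketch shows where the ENV instruction fits. The base image and startup command are placeholders; replace them with the image and application your workload actually uses.

# Placeholder base image; use the image your application is built on.
FROM nvidia/cuda:11.8.0-runtime-ubuntu20.04

# Prepend the driver directory mounted by the CCE AI Suite (NVIDIA GPU) add-on.
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib64:$LD_LIBRARY_PATH

# Placeholder startup command for your GPU application.
CMD ["/bin/bash", "-c", "your-gpu-application"]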
Using GPUs
Create a workload and request GPUs. You can specify the number of GPUs as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
      - image: nginx:perl
        name: container-0
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
            nvidia.com/gpu: 1   # Number of requested GPUs
          limits:
            cpu: 250m
            memory: 512Mi
            nvidia.com/gpu: 1   # Maximum number of GPUs that can be used
      imagePullSecrets:
      - name: default-secret
nvidia.com/gpu specifies the number of GPUs to be requested. The value can be smaller than 1. For example, nvidia.com/gpu: 0.5 indicates that multiple pods share a GPU. In this case, all the requested GPU resources come from the same GPU card.
When you use nvidia.com/gpu to specify the number of GPUs, the values of requests and limits must be the same.
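For example, the following resources block (only the container's resources section is shown) requests half a GPU, with identical requests and limits as required:

resources:
  requests:
    cpu: 250m
    memory: 512Mi
    nvidia.com/gpu: 0.5   # Half a GPU; all requested GPU resources come from one card
  limits:
    cpu: 250m
    memory: 512Mi
    nvidia.com/gpu: 0.5   # Must be the same as the request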
After nvidia.com/gpu is specified, workloads will not be scheduled to nodes without GPUs. If the GPU resources on the nodes are insufficient, Kubernetes events similar to the following are reported:
- 0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
- 0/4 nodes are available: 1 InsufficientResourceOnSingleGPU, 3 Insufficient nvidia.com/gpu.
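To view these events for a pending pod, you can describe the pod with kubectl. The pod name below is a placeholder; replace it with the actual name.

# Replace gpu-test-xxx with the name of the pending pod.
kubectl describe pod gpu-test-xxx -n default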
To use GPU resources on the CCE console, you only need to configure the GPU quota when creating a workload.
GPU Node Labels
CCE will label GPU-enabled nodes after they are created. Different types of GPU-enabled nodes have different labels.
$ kubectl get node -L accelerator
NAME           STATUS   ROLES    AGE     VERSION                                    ACCELERATOR
10.100.2.179   Ready    <none>   8m43s   v1.19.10-r0-CCE21.11.1.B006-21.11.1.B006   nvidia-t4
When using GPUs, you can use these labels to configure affinity between pods and nodes so that pods are scheduled to nodes with the intended GPU type.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      nodeSelector:
        accelerator: nvidia-t4
      containers:
      - image: nginx:perl
        name: container-0
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
            nvidia.com/gpu: 1   # Number of requested GPUs
          limits:
            cpu: 250m
            memory: 512Mi
            nvidia.com/gpu: 1   # Maximum number of GPUs that can be used
      imagePullSecrets:
      - name: default-secret
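To confirm that a node exposes GPU resources to the scheduler, you can check its capacity and allocatable resources. The node name below is taken from the earlier example output; replace it with your own node name.

# nvidia.com/gpu should appear under both Capacity and Allocatable if the add-on is working.
kubectl describe node 10.100.2.179 | grep -A 10 "Capacity"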