Supporting Kubernetes' Default GPU Scheduling
After GPU virtualization is enabled, the target GPU node by default does not support workloads that use Kubernetes' default GPU scheduling, that is, workloads requesting nvidia.com/gpu resources. If such workloads exist in your cluster, you can enable GPU sharing for the GPU node in the gpu-device-plugin configuration so that the node also supports Kubernetes' default GPU scheduling.
- When compatibility is enabled and nvidia.com/gpu is set to a decimal fraction (for example, 0.5), GPU virtualization provides GPU memory isolation based on the specified nvidia.com/gpu quota. Each container receives GPU memory proportional to the quota, for example, 8 GiB (0.5 x 16 GiB). The allocated GPU memory must be a multiple of 128 MiB; otherwise, it is automatically rounded down to the nearest multiple of 128 MiB (see the sketch after this list). If a workload was already using nvidia.com/gpu resources before compatibility was enabled, those resources come from the entire physical GPU, not from GPU virtualization.
- Enabling compatibility is equivalent to enabling GPU memory isolation for workloads that use the nvidia.com/gpu quota. Such a GPU can be shared with workloads in GPU memory isolation mode, but not with workloads in compute and GPU memory isolation mode. In addition, Notes and Constraints on GPU virtualization must be followed.
- If compatibility is disabled, the nvidia.com/gpu quota specified in the workload only affects scheduling and is not restricted by GPU memory isolation. For example, even if you set the nvidia.com/gpu quota to 0.5, you can still access the entire GPU memory within the container. In addition, workloads using nvidia.com/gpu resources and workloads using virtualized GPU memory cannot be scheduled to the same node.
- If you deselect Virtualization nodes are compatible with GPU sharing mode, running workloads are not affected, but new workloads may fail to be scheduled. For example, if compatibility is disabled while workloads that use nvidia.com/gpu resources remain in GPU memory isolation mode, workloads in compute and GPU memory isolation mode cannot be scheduled to that GPU. To schedule them, you must first remove the workloads that use nvidia.com/gpu resources.
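The GPU memory actually allocated for a fractional nvidia.com/gpu quota can be estimated as follows. This is only a minimal sketch of the rounding rule described above; the total GPU memory (16384 MiB) and the quotas are example values, not output from a real node.
# Sketch of the 128 MiB rounding rule: allocated = floor(total x quota / 128) x 128
# TOTAL_MIB and QUOTA are example values, not read from a real node.
TOTAL_MIB=16384   # 16 GiB GPU
QUOTA=0.5         # nvidia.com/gpu quota
awk -v t="$TOTAL_MIB" -v q="$QUOTA" 'BEGIN { printf "%d MiB\n", int(t * q / 128) * 128 }'
# Output: 8192 MiB (0.5 x 16 GiB = 8 GiB, already a multiple of 128 MiB)
# With QUOTA=0.3: 16384 x 0.3 = 4915.2 MiB, rounded down to 38 x 128 MiB = 4864 MiB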
Notes and Constraints
To support Kubernetes' default GPU scheduling on GPU nodes, the CCE AI Suite (NVIDIA GPU) add-on must be v2.0.10 or later, and the Volcano Scheduler add-on must be v1.10.5 or later.
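You can check the versions of the deployed add-on components from their container images before continuing. The following command is only a sketch; it assumes the add-on components run in the kube-system namespace, which may differ in your cluster.
# List the images of GPU- and Volcano-related pods to check the add-on versions.
# Assumes the add-on components run in the kube-system namespace.
kubectl -n kube-system get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' | grep -Ei 'nvidia|volcano'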
Procedure
- Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose Add-ons.
- Locate CCE AI Suite (NVIDIA GPU) on the right and click Install.
If the add-on has been installed, click Edit.
- Configure the add-on. For details, see Installing the add-on.
After GPU virtualization is enabled, you can configure the nvidia.com/gpu field to enable or disable support for Kubernetes' default GPU scheduling.
- Click Install.
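After the add-on is installed or edited, you can optionally confirm that the GPU node advertises nvidia.com/gpu resources. The following is a sketch; <node_name> is a placeholder for the name of your GPU node.
# Check that nvidia.com/gpu appears under the node's allocatable resources.
# Replace <node_name> with the name of your GPU node.
kubectl describe node <node_name> | grep -A 10 "Allocatable:" | grep "nvidia.com/gpu"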
Configuration Example
- Use kubectl to access the cluster.
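If kubectl has already been configured for the cluster, a quick connectivity check might look like the following. This is only a sketch; any kubectl command that reaches the API server serves the same purpose.
# Confirm that kubectl can reach the cluster before creating the workload.
kubectl cluster-info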
- Create a workload that uses nvidia.com/gpu resources.
Create a gpu-app.yaml file. The following shows an example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-app
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      schedulerName: volcano
      containers:
      - image: <your_image_address>    # Replace it with your image address.
        name: container-0
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
            nvidia.com/gpu: 0.1        # Number of requested GPUs
          limits:
            cpu: 250m
            memory: 512Mi
            nvidia.com/gpu: 0.1        # Maximum number of GPUs that can be used
      imagePullSecrets:
      - name: default-secret
- Run the following command to create an application:
kubectl apply -f gpu-app.yaml
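Optionally, verify that the pod has been scheduled and is running before checking its GPU memory. This is a sketch that relies on the app=gpu-app label from the example manifest above.
# Confirm that the gpu-app pod is running and see which node it was scheduled to.
kubectl get pod -l app=gpu-app -o wide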
- Log in to the pod and check the total GPU memory allocated to the pod.
kubectl exec -it deploy/gpu-app -- nvidia-smi
Expected output:
Thu Jul 27 07:53:49 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A30          Off  | 00000000:00:0D.0 Off |                    0 |
| N/A   47C    P0    34W / 165W |      0MiB /  2304MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
The output shows that the total GPU memory that can be used by the pod is 2304 MiB.
In this example, the total GPU memory on the GPU node is 24258 MiB. Because 2425.8 MiB (24258 x 0.1) is not a multiple of 128 MiB, the value is rounded down to 18 x 128 MiB = 2304 MiB.