Overview

CCE uses xGPU virtualization to dynamically divide GPU memory and computing power. A single GPU can be virtualized into up to 20 virtual GPU devices. Compared with static allocation, virtualization is more flexible: you can specify how much GPU resource each workload gets while keeping services running stably, which improves GPU utilization.

Advantages

The GPU virtualization function of CCE has the following advantages:

Flexible: GPU compute and memory can be allocated at a fine granularity. Compute is allocated in increments of 5% of a GPU, and memory is allocated in MiB.
Isolated: You can isolate GPU memory only, or isolate both GPU memory and compute at the same time.
Compatible: There's no need to recompile services or replace the CUDA library.
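
As an illustration of the granularity described above, the following minimal sketch checks whether a requested allocation is expressible in 5% compute steps and whole MiB of memory. The function name, parameter names, and the convention that `None` means memory-only isolation are assumptions made for this example; they are not part of CCE.

```python
from typing import Optional

# Granularity stated in this section: compute in steps of 5% of a GPU.
COMPUTE_STEP_PERCENT = 5


def validate_xgpu_request(memory_mib: int, compute_percent: Optional[int] = None) -> None:
    """Raise ValueError if a request does not fit the stated granularity.

    memory_mib: requested GPU memory as a whole number of MiB.
    compute_percent: requested share of one GPU's compute, or None to
        model a memory-only isolation request (hypothetical convention).
    """
    if not isinstance(memory_mib, int) or memory_mib <= 0:
        raise ValueError("memory must be a positive whole number of MiB")
    if compute_percent is None:
        return  # memory-only isolation: nothing further to check
    if not isinstance(compute_percent, int):
        raise ValueError("compute must be a whole percentage")
    if not 0 < compute_percent <= 100:
        raise ValueError("compute must be between 5% and 100% of a GPU")
    if compute_percent % COMPUTE_STEP_PERCENT != 0:
        raise ValueError("compute must be a multiple of 5%")


# Usage: a request for 4096 MiB and 25% of a GPU passes the check.
validate_xgpu_request(4096, 25)
```

A request such as 12% compute would be rejected by this check because it is not a multiple of 5%.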

Prerequisites

Item	Supported Version
Cluster version	v1.23.8-r0, v1.25.3-r0, or later
OS	Huawei Cloud EulerOS 2.0
GPU type	T4 and V100
Driver version	470.57.02, 510.47.03, and 535.54.03
Runtime	containerd
Add-on	Both of the following add-ons must be installed in the cluster: Volcano Scheduler (1.10.5 or later) and CCE AI Suite (NVIDIA GPU) (2.0.5 or later)
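
The table above can be turned into a simple pre-check. The sketch below is only an illustration: the dictionary keys (`os`, `gpu_type`, `driver`, `runtime`, `volcano`, `gpu_addon`) are hypothetical names chosen for this example, and collecting the actual values from a cluster is out of scope. The cluster version row is omitted because its "or later" rule depends on the release line.

```python
# Values taken from the prerequisites table.
SUPPORTED_OS = "Huawei Cloud EulerOS 2.0"
SUPPORTED_GPU_TYPES = {"T4", "V100"}
SUPPORTED_DRIVERS = {"470.57.02", "510.47.03", "535.54.03"}
SUPPORTED_RUNTIME = "containerd"
# Hypothetical keys mapped to the minimum add-on versions in the table.
MIN_ADDON_VERSIONS = {
    "volcano": (1, 10, 5),    # Volcano Scheduler
    "gpu_addon": (2, 0, 5),   # CCE AI Suite (NVIDIA GPU)
}


def parse_version(text: str) -> tuple:
    """Turn '1.10.5' into (1, 10, 5) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def check_prerequisites(info: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if info.get("os") != SUPPORTED_OS:
        problems.append("unsupported OS")
    if info.get("gpu_type") not in SUPPORTED_GPU_TYPES:
        problems.append("unsupported GPU type")
    if info.get("driver") not in SUPPORTED_DRIVERS:
        problems.append("unsupported driver version")
    if info.get("runtime") != SUPPORTED_RUNTIME:
        problems.append("unsupported runtime")
    for addon, minimum in MIN_ADDON_VERSIONS.items():
        installed = info.get(addon)
        if installed is None or parse_version(installed) < minimum:
            problems.append(f"{addon} add-on missing or too old")
    return problems
```

Versions are compared as integer tuples rather than strings so that, for example, 1.9.9 is correctly treated as older than 1.10.5.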

Notes and Constraints

A single GPU can be virtualized into a maximum of 20 xGPU devices.
xGPUs cannot be used in init containers.
GPU virtualization supports two isolation modes: isolation of GPU memory only, and isolation of both GPU memory and computing power. Only workloads that use the same isolation mode can be scheduled to a single GPU.
Autoscaler cannot be used to automatically scale GPU nodes in or out.
xGPU isolation does not support requesting GPU memory by calling the CUDA API cudaMallocManaged(), which is also known as using UVM. For more information, see the official NVIDIA documentation. Request GPU memory in another way, for example, by calling cudaMalloc().
While a containerized application is initializing, the real-time compute usage reported by nvidia-smi may exceed the upper limit of the compute available to the container.
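
Several of the constraints above can be modeled as a placement check. The sketch below is a simplified illustration, not the scheduler's actual logic: the `Gpu` class, the mode constants, and the assumption that each workload occupies exactly one xGPU device are all invented for this example.

```python
from dataclasses import dataclass, field

MAX_XGPUS_PER_GPU = 20             # limit stated in this section
MEMORY_ONLY = "memory"             # hypothetical label for memory isolation
MEMORY_AND_COMPUTE = "memory+compute"  # hypothetical label for both
VALID_MODES = (MEMORY_ONLY, MEMORY_AND_COMPUTE)


@dataclass
class Gpu:
    # Isolation mode of each xGPU device already placed on this GPU.
    modes: list = field(default_factory=list)


def can_place(gpu: Gpu, mode: str, is_init_container: bool = False) -> bool:
    """Return True if a workload in `mode` may share this physical GPU."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown isolation mode: {mode!r}")
    if is_init_container:
        return False  # xGPUs cannot be used in init containers
    if len(gpu.modes) >= MAX_XGPUS_PER_GPU:
        return False  # at most 20 xGPU devices per GPU
    # All workloads on one GPU must use the same isolation mode.
    return all(existing == mode for existing in gpu.modes)


def place(gpu: Gpu, mode: str) -> bool:
    """Record the workload on the GPU if allowed; report whether it fit."""
    if can_place(gpu, mode):
        gpu.modes.append(mode)
        return True
    return False
```

In this model, an empty GPU accepts either mode, and the first workload placed on it fixes the mode for every later workload on that GPU.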
