Preparing GPU Resources

This section describes how you can plan and prepare basic software and hardware before using GPU capabilities.

Basic Planning

Resource	Version
Cluster types and versions	On-premises clusters: v1.28 or later
OS	Huawei Cloud EulerOS (HCE) 2.0
System architecture	x86
GPU	T4 and V100
Driver version	Only GPU driver 470.57.02, 510.47.03, 535.54.03, or 535.216.03 for GPU virtualization
Container runtime	containerd
Add-ons	The following add-ons must be installed in a cluster: Volcano: 1.10.1 or later gpu-device-plugin: 2.0.0 or later

Step 1: Add GPU Nodes to a Cluster and Label the Nodes

If there are GPU nodes that comply with the basic planning in your cluster, skip this procedure.

Add GPU nodes to your cluster. For details, see Adding Nodes to On-Premises Clusters.
Label the nodes with accelerator: nvidia-{GPU model}. For details, see Adding Labels/Taints to Nodes.

Figure 1 Labeling nodes that support GPU virtualization

Step 2: Install the Add-ons

If the add-ons that comply with the basic planning have been installed in your cluster, you can skip this procedure.

If the driver version is changed, restart the node to apply the change.

Before restarting a node, evict all pods on that node. Make sure to reserve GPU resources to avoid pod scheduling failures during node drainage. Insufficient resources can affect services.

Log in to the UCS console and click the cluster name to access the cluster console. In the navigation pane, choose Add-ons. Select Installed add-ons and check whether Volcano and gpu-device-plugin have been installed.
If the gpu-device-plugin add-on is not installed, install it by referring to gpu-device-plugin.

To enable GPU virtualization, install the Volcano add-on. For details, see Volcano.