Updated on 2024-09-30 GMT+08:00

Preparing xGPU Resources

CCE uses xGPU virtualization to dynamically partition GPU memory and computing power, so a single GPU can be divided into up to 20 virtual GPU devices. This section describes how to prepare GPU nodes for GPU scheduling and isolation.

Prerequisites

Item              Supported Version
Cluster version   v1.23.8-r0, v1.25.3-r0, or later
OS                Huawei Cloud EulerOS 2.0
GPU type          T4 and V100
Driver version    470.57.02, 510.47.03, and 535.54.03
Runtime           containerd
Add-on            CCE AI Suite (NVIDIA GPU) and Volcano Scheduler must be installed in the cluster
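The prerequisites above can be expressed as a simple lookup, which is handy when auditing many nodes. The following is a minimal sketch, not a CCE API: the node dictionary and its keys (os, gpu_type, driver_version, runtime) are hypothetical, and the cluster version and add-on checks are left out because their comparison rules are not spelled out here.

```python
# Supported values copied from the prerequisites listed above.
SUPPORTED = {
    "os": {"Huawei Cloud EulerOS 2.0"},
    "gpu_type": {"T4", "V100"},
    "driver_version": {"470.57.02", "510.47.03", "535.54.03"},
    "runtime": {"containerd"},
}


def unmet_prerequisites(node: dict) -> list:
    """Return the keys of the prerequisites that `node` does not meet.

    `node` is a hypothetical dict describing one GPU node; it is not an
    object returned by any CCE or Kubernetes API.
    """
    return [key for key, allowed in SUPPORTED.items() if node.get(key) not in allowed]


if __name__ == "__main__":
    node = {
        "os": "Huawei Cloud EulerOS 2.0",
        "gpu_type": "T4",
        "driver_version": "535.54.03",
        "runtime": "docker",  # not supported: xGPU requires containerd
    }
    print(unmet_prerequisites(node))  # prints ['runtime']
```

An empty list means that the node passes every check covered by this sketch.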

Step 1: Enable GPU Virtualization

GPU virtualization requires that both the CCE AI Suite (NVIDIA GPU) and Volcano Scheduler add-ons be installed in the cluster.

Step 2: Create a GPU Node

To use GPU virtualization, create nodes in the cluster that support it. For details, see Creating a Node or Creating a Node Pool.

If your cluster already has GPU nodes that meet the Prerequisites, skip this step.

Step 3 (Optional): Modify the Volcano Scheduling Policy

The default scheduling policy of Volcano for GPU nodes is Spread. When nodes have the same configuration, Volcano selects the node running the fewest containers, so that containers are distributed evenly across nodes. In contrast, the bin packing policy tries to place containers on as few nodes as possible, filling one node before using the next, to avoid resource fragmentation.
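The difference between the two policies can be illustrated with a toy model. The sketch below is not Volcano's scoring code: it counts containers only, and the node names and the per-node capacity are hypothetical values chosen for illustration.

```python
from typing import Callable, Dict


def pick_spread(nodes: Dict[str, int]) -> str:
    """Spread: choose the node with the fewest running containers."""
    return min(nodes, key=nodes.get)


def pick_binpack(nodes: Dict[str, int], capacity: int) -> str:
    """Bin packing: choose the fullest node that still has room."""
    candidates = {name: count for name, count in nodes.items() if count < capacity}
    if not candidates:
        raise ValueError("no node has free capacity")
    return max(candidates, key=candidates.get)


def place(count: int, nodes: Dict[str, int], pick: Callable[[Dict[str, int]], str]) -> Dict[str, int]:
    """Place `count` containers one at a time and return the final layout."""
    layout = dict(nodes)  # work on a copy so the input is left unchanged
    for _ in range(count):
        layout[pick(layout)] += 1
    return layout


if __name__ == "__main__":
    empty = {"node-a": 0, "node-b": 0}  # hypothetical nodes
    print(place(4, empty, pick_spread))
    # prints {'node-a': 2, 'node-b': 2}
    print(place(4, empty, lambda n: pick_binpack(n, capacity=4)))
    # prints {'node-a': 4, 'node-b': 0}
```

With Spread, the four containers end up split across both nodes. With bin packing, one node is filled and the other stays free for a workload that needs a whole node.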

To use the bin packing policy with GPU virtualization, modify the policy in the advanced settings of the Volcano add-on as follows:

  1. Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose Add-ons.
  2. Find the Volcano add-on on the right and click Edit.
  3. On the displayed page, modify the advanced settings.

    1. In the nodeorder plugin, add the arguments parameter and set leastrequested.weight to 0. With a weight of 0, the node with the fewest allocated resources is no longer preferred.
    2. Add the binpack plugin and specify the weights of the xGPU custom resources (volcano.sh/gpu-core.percentage and volcano.sh/gpu-mem.128Mi).
    A complete example is as follows. The lines that start with // are annotations only. Standard JSON does not allow comments, so omit them when you enter the configuration.
    {
        "colocation_enable": "",
        "default_scheduler_conf": {
            "actions": "allocate, backfill, preempt",
            "tiers": [
                {
                    "plugins": [
                        {
                            "name": "priority"
                        },
                        {
                            "enablePreemptable": false,
                            "name": "gang"
                        },
                        {
                            "name": "conformance"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "enablePreemptable": false,
                            "name": "drf"
                        },
                        {
                            "name": "predicates"
                        },
                        {
                            "name": "nodeorder",
                            // Set the weight of least-requested scoring to 0 so that the node with the fewest allocated resources is no longer preferred.
                            "arguments": {
                                "leastrequested.weight": 0
                            }
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "cce-gpu-topology-predicate"
                        },
                        {
                            "name": "cce-gpu-topology-priority"
                        },
                        {
                            "name": "xgpu"
                        },
                        // Add the binpack plugin and specify the weights of the xGPU resources.
                        {
                            "name": "binpack",
                            "arguments": {
                                "binpack.resources": "volcano.sh/gpu-core.percentage,volcano.sh/gpu-mem.128Mi",
                                "binpack.resources.volcano.sh/gpu-mem.128Mi": 10,
                                "binpack.resources.volcano.sh/gpu-core.percentage": 10
                            }
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "nodelocalvolume"
                        },
                        {
                            "name": "nodeemptydirvolume"
                        },
                        {
                            "name": "nodeCSIscheduling"
                        },
                        {
                            "name": "networkresource"
                        }
                    ]
                }
            ]
        },
        "tolerations": [
            {
                "effect": "NoExecute",
                "key": "node.kubernetes.io/not-ready",
                "operator": "Exists",
                "tolerationSeconds": 60
            },
            {
                "effect": "NoExecute",
                "key": "node.kubernetes.io/unreachable",
                "operator": "Exists",
                "tolerationSeconds": 60
            }
        ]
    }
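The two changes described in step 3 can also be applied to an existing configuration with a short script, which avoids hand-editing the nested JSON. This is a sketch under assumptions: the function name enable_xgpu_binpack is hypothetical, the script only transforms a configuration that has already been loaded into a dictionary, and it does not call any CCE or Volcano API. Paste the result into the advanced settings manually.

```python
import copy
import json

XGPU_RESOURCES = ("volcano.sh/gpu-core.percentage", "volcano.sh/gpu-mem.128Mi")


def enable_xgpu_binpack(conf: dict, weight: int = 10) -> dict:
    """Return a copy of `conf` with the two changes from step 3 applied."""
    conf = copy.deepcopy(conf)  # never modify the caller's configuration
    tiers = conf["default_scheduler_conf"]["tiers"]
    plugins = [plugin for tier in tiers for plugin in tier["plugins"]]

    # Change 1: set leastrequested.weight to 0 in the nodeorder plugin.
    for plugin in plugins:
        if plugin["name"] == "nodeorder":
            plugin.setdefault("arguments", {})["leastrequested.weight"] = 0

    # Change 2: add binpack next to xgpu, unless it is already configured.
    if any(plugin["name"] == "binpack" for plugin in plugins):
        return conf
    arguments = {"binpack.resources": ",".join(XGPU_RESOURCES)}
    for resource in XGPU_RESOURCES:
        arguments["binpack.resources." + resource] = weight
    for tier in tiers:
        if any(plugin["name"] == "xgpu" for plugin in tier["plugins"]):
            tier["plugins"].append({"name": "binpack", "arguments": arguments})
            return conf
    raise ValueError("no tier contains the xgpu plugin")


if __name__ == "__main__":
    # A trimmed-down configuration; a real one has more tiers and plugins.
    original = {
        "default_scheduler_conf": {
            "actions": "allocate, backfill, preempt",
            "tiers": [
                {"plugins": [{"name": "predicates"}, {"name": "nodeorder"}]},
                {"plugins": [{"name": "xgpu"}]},
            ],
        }
    }
    print(json.dumps(enable_xgpu_binpack(original), indent=4))
```

The function is idempotent: running it on a configuration that already contains the binpack plugin leaves that plugin unchanged. Because json.dumps emits standard JSON, the printed output contains no comments.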