Updated on 2024-09-30 GMT+08:00

How Do I Prevent a Non-GPU or Non-NPU Workload from Being Scheduled to a GPU or NPU Node?

Symptom

If a cluster contains both GPU/NPU nodes and other types of nodes, non-GPU/NPU workloads may be scheduled to the GPU/NPU nodes. As a result, the GPU/NPU resources on those nodes cannot be used effectively.

Possible Causes

GPU/NPU nodes also provide vCPUs and memory, which non-GPU/NPU workloads can use. The scheduler may therefore place non-GPU/NPU workloads on these nodes even though the workloads do not request any GPU/NPU resources. When such workloads occupy a node's vCPUs and memory, the GPU/NPU resources on that node may sit idle.

Solution

Add taints to the GPU/NPU nodes and configure matching tolerations only for the GPU/NPU workloads. This prevents non-GPU/NPU workloads from being scheduled to these nodes.

For GPU/NPU workloads, add tolerations so that they can be scheduled to the tainted GPU/NPU nodes.
For non-GPU/NPU workloads, do not configure tolerations. Without a matching toleration, they cannot be scheduled to the tainted GPU/NPU nodes.
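
The rule behind these two cases can be illustrated with a small model. The following is a minimal sketch of the standard Kubernetes taint and toleration matching rule, not the scheduler's actual code. The class and function names (Taint, Toleration, tolerates, can_schedule) are hypothetical, and accelerator=true:NoSchedule is used as an example taint.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule", or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # an empty key (with operator "Exists") matches every key
    operator: str = "Equal"  # "Equal" (default) or "Exists"
    value: str = ""
    effect: str = ""         # an empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches the given taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True  # "Exists" ignores the taint value
    return toleration.value == taint.value  # "Equal" compares values


def can_schedule(tolerations: List[Toleration], node_taints: List[Taint]) -> bool:
    """A node is eligible only if every blocking taint on it is tolerated.

    "PreferNoSchedule" is a soft preference, so it is not treated as blocking here.
    """
    blocking = [t for t in node_taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in blocking)


gpu_node_taints = [Taint("accelerator", "true", "NoSchedule")]
gpu_workload = [Toleration("accelerator", "Equal", "true", "NoSchedule")]
non_gpu_workload: List[Toleration] = []  # no tolerations configured

print(can_schedule(gpu_workload, gpu_node_taints))      # True
print(can_schedule(non_gpu_workload, gpu_node_taints))  # False
print(can_schedule(non_gpu_workload, []))               # True: untainted node
```

In this model, a toleration only makes a tainted node eligible for a workload. It does not by itself force the workload onto that node.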

The procedure is as follows:

1. Log in to the CCE console and click the cluster name to access the cluster console.
2. In the navigation pane, choose Nodes. On the Nodes tab, select the GPU/NPU nodes and click Labels and Taints above the node list.
3. Under Batch Operation, click Add Operation and select Taint. Enter the taint key and value and select the taint effect. For example, to add the accelerator=true:NoSchedule taint to the GPU or NPU nodes, set the key to accelerator, the value to true, and the effect to NoSchedule.

   Figure 1 Adding a taint

4. When creating a GPU/NPU workload, manually add a toleration that matches the taint in the Advanced Settings area.

   Figure 2 Adding a toleration

5. When creating a non-GPU/NPU workload, do not add any tolerations. Without a matching toleration, the workload will not be scheduled to the tainted GPU/NPU nodes.
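
If workloads are created from manifests rather than on the console, the same toleration goes in the pod spec. The following is a minimal sketch that builds a Pod manifest as JSON, which kubectl accepts in place of YAML. It assumes the accelerator=true:NoSchedule taint from the example above. The function name gpu_pod_manifest, the pod name gpu-demo, and the image example.com/gpu-app:latest are hypothetical placeholders, and the GPU/NPU resource request is omitted because its name depends on the device plugin in use.

```python
import json


def gpu_pod_manifest(name: str, image: str) -> dict:
    """Build a Pod manifest that tolerates the accelerator=true:NoSchedule taint."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{"name": name, "image": image}],
            # Matches the example taint added to the GPU/NPU nodes.
            "tolerations": [
                {
                    "key": "accelerator",
                    "operator": "Equal",
                    "value": "true",
                    "effect": "NoSchedule",
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; replace them with real values.
    manifest = gpu_pod_manifest("gpu-demo", "example.com/gpu-app:latest")
    print(json.dumps(manifest, indent=2))
```

A non-GPU/NPU workload would simply leave out the tolerations field.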