How Do I Prevent a Non-GPU or Non-NPU Workload from Being Scheduled to a GPU or NPU Node?
Symptom
If a cluster contains both GPU/NPU nodes and other types of nodes, non-GPU/NPU workloads may be scheduled to the GPU/NPU nodes. These workloads then occupy the nodes' vCPUs and memory, so the GPU/NPU resources on those nodes cannot be fully used.
Possible Causes
GPU/NPU nodes also provide vCPUs and memory, which non-GPU/NPU workloads can use. The scheduler may therefore place non-GPU/NPU workloads on these nodes even if the workloads do not request any GPU/NPU resources. As a result, the GPU/NPU resources on these nodes may be left idle.
Solution
Add taints to the GPU/NPU nodes and add matching tolerations only to the GPU/NPU workloads. This prevents non-GPU/NPU workloads from being scheduled to these nodes.
- GPU/NPU workloads: Add tolerations that match the taints so that the workloads can be scheduled to the GPU/NPU nodes.
- Non-GPU/NPU workloads: Do not configure tolerations. Without a matching toleration, these workloads cannot be scheduled to the GPU/NPU nodes.
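The matching rule behind this solution can be sketched in a few lines of Python. This is a simplified model of how a scheduler compares tolerations with node taints, not the actual Kubernetes implementation. The names Taint, Toleration, tolerates, and can_schedule are hypothetical and exist only for illustration. The accelerator=true:NoSchedule taint is the example taint used on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # for example "NoSchedule"


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""  # an empty effect matches every taint effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches this taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        # An empty key with "Exists" matches any taint key.
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def can_schedule(tolerations: list, taints: list) -> bool:
    """A pod fits a node only if every hard taint is tolerated.

    "PreferNoSchedule" is a soft preference, so it is ignored here.
    """
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


# Example taint from this page, applied to a GPU/NPU node.
gpu_node_taints = [Taint("accelerator", "true", "NoSchedule")]

# A GPU/NPU workload carries a matching toleration.
gpu_workload = [Toleration("accelerator", "Equal", "true", "NoSchedule")]

# A non-GPU/NPU workload carries no tolerations.
cpu_workload = []

print(can_schedule(gpu_workload, gpu_node_taints))
print(can_schedule(cpu_workload, gpu_node_taints))
```

In this model, the workload without tolerations is rejected by the tainted node but still fits any node that has no taints.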
The procedure is as follows:
- Log in to the CCE console and click the cluster name to access the cluster console.
- In the navigation pane, choose Nodes. Click the Nodes tab, select a GPU/NPU node, and click Labels and Taints above the list.
- Click Add Operation under Batch Operation and add a taint to the node.
Select Taint. Enter the key and value and select the taint effect. The following example shows how to add the accelerator=true:NoSchedule taint to the GPU or NPU nodes.
Figure 1 Adding a taint
- When creating a GPU/NPU workload, manually add a toleration in the Advanced Settings area.
Figure 2 Adding a toleration
- When creating a non-GPU/NPU workload, do not add a toleration for this taint. The workload will then not be scheduled to the GPU/NPU nodes.
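For reference, the toleration configured on the console corresponds to a tolerations entry in the pod specification. The following Python sketch builds such a specification as a plain dictionary and prints it as JSON. It assumes the accelerator=true:NoSchedule taint from the example above. The helper functions, workload names, and image addresses are hypothetical placeholders, not values taken from CCE.

```python
import json


def accelerator_toleration(key="accelerator", value="true", effect="NoSchedule"):
    """Build a toleration that matches the example taint on this page."""
    return {"key": key, "operator": "Equal", "value": value, "effect": effect}


def pod_manifest(name, image, tolerations):
    """Build a minimal pod manifest (hypothetical name and image)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{"name": name, "image": image}],
            "tolerations": tolerations,
        },
    }


# GPU/NPU workload: carries the toleration, so the tainted nodes accept it.
gpu_pod = pod_manifest(
    "gpu-job", "example.com/gpu-app:latest", [accelerator_toleration()]
)

# Non-GPU/NPU workload: no tolerations, so the tainted nodes reject it.
cpu_pod = pod_manifest("web", "example.com/web:latest", [])

print(json.dumps(gpu_pod, indent=2))
```

A toleration only allows a workload onto the tainted nodes. It does not by itself require the workload to run there.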