A Reset Node Cannot Be Used
Symptom
If the CCE cluster of a ModelArts Lite resource pool has only one node and Volcano is set as the default scheduler, the node becomes unusable after it is reset on ModelArts: pods cannot be scheduled to the node.
Possible Causes
After a node is reset on ModelArts, modelarts-os adds an admission taint to the node for node admission. However, Volcano in the cluster does not tolerate this taint, and the cluster has only one node, so the Volcano pods have no node to run on and Volcano cannot start. As a result, the maos-node-agent container, which manages the modelarts-os taints on the node, cannot start either, and the taint is never cleared automatically.
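The cause above comes down to standard Kubernetes taint and toleration matching. The following is a minimal sketch of that matching rule, not the scheduler's actual code. It assumes that A200008 is the taint key and that the taint uses the NoSchedule effect; neither detail is confirmed by this page.

```python
# Sketch of Kubernetes taint/toleration matching, for illustration only.
# ASSUMPTION: "A200008" is the taint key and the effect is NoSchedule.
ADMISSION_TAINT = {"key": "A200008", "effect": "NoSchedule"}


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if one toleration matches one taint."""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if key == "":
        # An empty key is only valid with Exists and matches every key.
        if operator != "Exists":
            return False
    elif key != taint.get("key"):
        return False
    # An empty effect on the toleration matches every effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect"):
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def blocking_taints(pod_tolerations: list, node_taints: list) -> list:
    """Return the node taints that keep the pod off the node."""
    return [
        taint
        for taint in node_taints
        if taint.get("effect") in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, taint) for tol in pod_tolerations)
    ]
```

A pod with no matching toleration, as described for Volcano here, is blocked by the admission taint. When the tainted node is the only node in the cluster, the pod has nowhere else to go and stays unscheduled.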
Solution
- (Recommended) Solution 1 (use the Volcano scheduler only when required):
  - Change the default scheduler to kube-scheduler in the configuration center on the CCE console.
  - Delete the maos-node-agent pod so that it is restarted.
  - Delete taint A200008 from the node on the CCE console.
  - Reset the node on the ModelArts console.

  Disadvantage: To schedule a workload with Volcano, you must manually specify Volcano as the scheduler when creating the workload. For details, see the user guide.
- Solution 2 (keep Volcano as the default scheduler):
  - Change the default scheduler to kube-scheduler in the configuration center on the CCE console.
  - Delete the maos-node-agent pod so that it is restarted.
  - Delete taint A200008 from the node on the CCE console.
  - Reset the node on the ModelArts console.
  - Change the default scheduler back to volcano in the configuration center on the CCE console.

  Disadvantage: If you later perform an operation on the node on ModelArts, such as resetting the node or upgrading the driver, the node may fail to start.
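The solutions above are described as console operations. For readers who prefer the command line, here is a minimal sketch of how the pod restart and taint removal steps could be done with standard kubectl commands, and how a pod manifest can request Volcano explicitly under Solution 1. The node name, pod name, and namespace are hypothetical placeholders, and treating A200008 as the taint key is an assumption. Changing the default scheduler and resetting the node still happen on the consoles.

```python
import copy
import shlex


def recovery_commands(node: str, agent_pod: str, namespace: str,
                      taint_key: str = "A200008") -> list:
    """Build kubectl commands for the pod-restart and taint-removal steps.

    ASSUMPTION: A200008 is the taint key. A trailing "-" tells
    `kubectl taint` to remove all taints that have this key.
    """
    return [
        # Deleting the pod lets its controller recreate (restart) it.
        ["kubectl", "delete", "pod", agent_pod, "-n", namespace],
        ["kubectl", "taint", "nodes", node, f"{taint_key}-"],
    ]


def with_volcano_scheduler(pod_manifest: dict) -> dict:
    """Return a copy of a Pod manifest that requests the Volcano scheduler."""
    patched = copy.deepcopy(pod_manifest)
    patched.setdefault("spec", {})["schedulerName"] = "volcano"
    return patched


if __name__ == "__main__":
    # Hypothetical names: replace them with values from your own cluster.
    for argv in recovery_commands("example-node",
                                  "maos-node-agent-example",
                                  "example-namespace"):
        print(shlex.join(argv))
```

The sketch only prints the commands, so you can review them before running anything against a cluster. For a Deployment or Job, the `schedulerName` field belongs in the pod template spec, not in the top-level spec.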