Updated on 2024-09-04 GMT+08:00

What Should I Do If Pods Cannot Be Rescheduled After the Node Is Stopped?

Symptom

After a node is stopped, the pods on the node are still shown as running and are not rescheduled to other nodes. Running kubectl describe pod <pod-name> shows the following as the latest pod event:

Warning NodeNotReady 17s node-controller Node is not ready

Possible Causes

After a node is stopped, the system automatically adds the following taints to the node:

  • node.kubernetes.io/unreachable:NoExecute
  • node.cloudprovider.kubernetes.io/shutdown:NoSchedule
  • node.kubernetes.io/unreachable:NoSchedule
  • node.kubernetes.io/not-ready:NoExecute

If a pod tolerates these taints without a time limit, it is not evicted from the stopped node and therefore is not rescheduled. Check the tolerations of the pod.

Solution

Check the tolerations by viewing the YAML file of the pod or workload. The tolerations of a workload consist of the following fields:

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"

Or:

tolerations:
- key: "key1"
  operator: "Exists"
  effect: "NoSchedule"

If tolerations are configured incorrectly, pods may not be evicted and rescheduled as expected. For example:

tolerations:
- operator: "Exists"

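In this example, only the operator parameter is configured, and it is set to Exists. When operator is Exists, the value parameter must not be configured. Because the key and effect parameters are also empty, this toleration matches every taint, including the NoExecute taints added to a stopped node, so the pod is never evicted.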

  • If the operator parameter of a toleration is set to Exists but the key parameter is empty, the toleration matches every key, value, and effect. That is, it tolerates any taint.
  • If the effect parameter of a toleration is empty but the key parameter is configured, the toleration matches all effects of the specified key.

For details, see Taints and Tolerations.
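
The matching rules above can be sketched in a few lines of Python. This is a simplified illustration, not the Kubernetes implementation: the function name tolerates and the dictionary layout are assumptions made for this example, and API validation (for example, rejecting a value when operator is Exists) is not modeled.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    # An empty key on the toleration matches every taint key.
    if toleration.get("key") and toleration["key"] != taint.get("key"):
        return False
    # The operator defaults to Equal when it is omitted.
    operator = toleration.get("operator") or "Equal"
    if operator == "Exists":
        return True
    if operator == "Equal":
        return (toleration.get("value") or "") == (taint.get("value") or "")
    return False


# Taints taken from the list in "Possible Causes".
unreachable_ne = {"key": "node.kubernetes.io/unreachable", "effect": "NoExecute"}
unreachable_ns = {"key": "node.kubernetes.io/unreachable", "effect": "NoSchedule"}
shutdown = {"key": "node.cloudprovider.kubernetes.io/shutdown", "effect": "NoSchedule"}

catch_all = {"operator": "Exists"}  # empty key and effect
key_only = {"key": "node.kubernetes.io/unreachable", "operator": "Exists"}
scoped = {"key": "key1", "operator": "Exists", "effect": "NoSchedule"}

print(tolerates(catch_all, unreachable_ne), tolerates(catch_all, shutdown))       # True True
print(tolerates(key_only, unreachable_ne), tolerates(key_only, unreachable_ns))   # True True
print(tolerates(scoped, unreachable_ne))                                          # False
```

The catch-all toleration matches every taint in the list, which is why a pod carrying it stays bound to a stopped node.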

Restore the default tolerations configuration by modifying the YAML file of the workload as follows:

      tolerations:
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 300
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 300

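With these default tolerations, a pod can stay on a node that has the node.kubernetes.io/not-ready or node.kubernetes.io/unreachable NoExecute taint for 300 seconds. After that, the pod is evicted and can be rescheduled to another node.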
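
To see why the default tolerations lead to eviction while a catch-all toleration does not, the following Python sketch estimates how long a pod stays on a node after NoExecute taints are added. It is a simplified model under stated assumptions, not the Kubernetes implementation: the names tolerates and eviction_delay are hypothetical, and the model assumes that the first matching toleration is used for each taint and that the shortest tolerationSeconds among them applies.

```python
import math


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    if toleration.get("key") and toleration["key"] != taint.get("key"):
        return False
    operator = toleration.get("operator") or "Equal"
    if operator == "Exists":
        return True
    return operator == "Equal" and (toleration.get("value") or "") == (taint.get("value") or "")


def eviction_delay(tolerations: list, taints: list) -> float:
    """Seconds before eviction: 0 means immediately, math.inf means never."""
    delay = math.inf
    for taint in taints:
        if taint.get("effect") != "NoExecute":
            continue  # only NoExecute taints evict pods that are already running
        match = next((t for t in tolerations if tolerates(t, taint)), None)
        if match is None:
            return 0  # an untolerated NoExecute taint evicts the pod at once
        seconds = match.get("tolerationSeconds")
        if seconds is not None:
            delay = min(delay, max(seconds, 0))
    return delay


# Taints added to a stopped node, as listed in "Possible Causes".
stopped_node_taints = [
    {"key": "node.kubernetes.io/unreachable", "effect": "NoExecute"},
    {"key": "node.cloudprovider.kubernetes.io/shutdown", "effect": "NoSchedule"},
    {"key": "node.kubernetes.io/unreachable", "effect": "NoSchedule"},
    {"key": "node.kubernetes.io/not-ready", "effect": "NoExecute"},
]

default_tolerations = [
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 300},
    {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 300},
]

print(eviction_delay(default_tolerations, stopped_node_taints))      # 300
print(eviction_delay([{"operator": "Exists"}], stopped_node_taints))  # inf
```

In this model, the catch-all toleration has no tolerationSeconds, so the delay is unbounded and the pod is never evicted from the stopped node.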