What Should I Do If a Cluster Is Available But Some Nodes Are Unavailable?
If the cluster status is available but some nodes in the cluster are unavailable, perform the following operations to rectify the fault:
Fault Locating
The possible causes are listed in descending order of likelihood. Check them in that order to locate the cause of the problem quickly.
If the fault persists after a possible cause is rectified, check other possible causes.
- Check Item 1: Whether the Node Is Overloaded
- Check Item 2: Whether the ECS Is Deleted or Faulty
- Check Item 3: Whether You Can Log In to the ECS
- Check Item 4: Whether the Security Group Is Modified
- Check Item 5: Whether the Security Group Rules Contain the Security Group Policy for the Communication Between the Master Node and the Worker Node
- Check Item 6: Whether the Disk Is Abnormal
- Check Item 7: Whether Internal Components Are Normal
- Check Item 8: Whether the DNS Address Is Correct
- Check Item 9: Whether the vdb Disk on the Node Is Deleted
- Check Item 10: Whether the Docker Service Is Normal
Check Item 1: Whether the Node Is Overloaded
Symptom
The node connection in the cluster is abnormal. Multiple nodes report write errors, but services are not affected.
Fault Locating
- Log in to the CCE console. In the navigation pane, choose Resource Management > Nodes.
- Click the name of an unavailable node to go to the node details page.
- On the Monitoring tab page, click View Monitoring Details to go to the AOM console and view historical monitoring records.
Figure 2 Host Monitoring - View Monitor Graphs
Excessively high CPU or memory usage on the node causes high network latency or triggers a system OOM (out of memory) event. As a result, the node is displayed as unavailable.
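If you can still log in to the node, the load can also be checked locally. The following is a minimal sketch, not part of the CCE tooling: the function name mem_used_percent is hypothetical, and the sketch assumes a Linux kernel that reports MemAvailable in /proc/meminfo.

```shell
# Print the percentage of memory in use, given /proc/meminfo-style text on stdin.
# Assumption: the kernel reports a MemAvailable line (Linux 3.14 and later).
mem_used_percent() {
    awk '/^MemTotal:/     {total = $2}
         /^MemAvailable:/ {avail = $2}
         END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }'
}

# Typical use on the node:
#   mem_used_percent < /proc/meminfo
#   cat /proc/loadavg                  # load averages over 1, 5, and 15 minutes
#   dmesg | grep -i 'out of memory'    # kernel messages left by OOM kills, if any
```

A memory usage close to 100 percent, or OOM messages in the kernel log, points to the overload described above.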
Solution
- You are advised to migrate services to reduce the workloads on the node and set the resource upper limit for the workloads.
- Clear data on the CCE nodes in the cluster.
- Limit the CPU and memory quotas of each container.
- Add more nodes to the cluster.
- You can also restart the node on the ECS console. For details, see How Can I Delete or Restart an ECS?
- Add nodes to deploy memory-intensive containers separately.
- Reset the node. For details, see Resetting a Node.
After the node becomes available, the workload is restored.
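If kubectl is configured for the cluster, the CPU and memory limits of a workload can also be set from the command line. The following sketch assumes the workload is a Deployment. The function name limit_deployment and the example names and values are hypothetical.

```shell
# Set CPU and memory limits on the containers of one Deployment.
# Arguments: namespace, Deployment name, CPU limit, memory limit.
# Assumption: kubectl is installed and points at the affected cluster.
limit_deployment() {
    kubectl -n "$1" set resources deployment "$2" --limits="cpu=$3,memory=$4"
}

# Example with hypothetical names and values:
#   limit_deployment default my-app 500m 512Mi
```

Changing the limits updates the pod template, so the pods of the Deployment are recreated with the new values.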
Check Item 2: Whether the ECS Is Deleted or Faulty
- Check whether the cluster is available.
Log in to the CCE console, and choose Resource Management > Clusters in the navigation pane. On the page displayed, check whether the cluster is available.
- If the cluster is unavailable, contact technical support by submitting a service ticket to rectify the fault.
- If the cluster is available but some nodes in the cluster are unavailable, go to 2.
- Log in to the ECS console. In the navigation pane, choose Elastic Cloud Server to view the ECS status.
- If the ECS status is Deleted, go back to the CCE console, choose Resource Management > Nodes in the navigation pane, delete the corresponding node, and then create another one.
- If the ECS status is Stopped or Frozen, restore the ECS first. It takes about 3 minutes to restore the ECS.
- If the ECS status is Faulty, restart the ECS. If the ECS is still faulty, contact technical support by submitting a service ticket to rectify the fault.
- If the ECS status is Running, log in to the ECS to locate the fault according to Check Item 7: Whether Internal Components Are Normal.
Check Item 3: Whether You Can Log In to the ECS
- Log in to the HUAWEI CLOUD management console. Choose Service List > Computing > Elastic Cloud Server.
- In the ECS list, locate the newly created node (generally named in the format of Cluster name-Random number) in the cluster and click Remote Login in the Operation column.
- Check whether the node name displayed on the page is the same as that on the VM and whether the password or key can be used to log in to the node.
Figure 3 Checking the node name displayed on the page
Figure 4 Checking the node name on the VM and whether the node can be logged in to
If the node names are inconsistent and neither the password nor the key can be used to log in to the node, a Cloud-Init problem occurred when the ECS was created. In this case, restart the node and submit a service ticket to the ECS personnel to locate the root cause.
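If login succeeds, the name comparison can be done on the node itself. The following is a minimal sketch that assumes the node name shown on the console is expected to equal the host name of the VM. The function name node_name_matches and the example node name are hypothetical.

```shell
# Succeed when the host name reported by the OS equals the expected node name.
node_name_matches() {
    [ "$(uname -n)" = "$1" ]
}

# Example with a hypothetical node name copied from the console:
#   node_name_matches mycluster-12345 || echo "host name differs from the console"
```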
Check Item 4: Whether the Security Group Is Modified
- Log in to the management console, and choose Service List > Network > Virtual Private Cloud. In the navigation pane, choose Access Control > Security Groups, and locate the security group of the master node.
The name of this security group is in the format of Cluster name-cce-control-ID, as shown in the following figure.
You can search for the security group by cluster name.
Figure 5 Master node in the cluster
- Click the security group. On the details page displayed, ensure that the security group rules of the master node are the same as those in the following figure.
Figure 6 Viewing inbound rules of the security group
Inbound rule parameter description:
- 4789: used for network access between containers.
- 5443: used by the kubelet of the worker node to listen on the kube-api of the master node.
- 5444: used by kube-dns.
- 4003/9443: used by the canal of the worker node to listen on the canal-api of the master node.
- 8445: used by storage_driver of the node to access csms-storagemgr of the master node.
Figure 7 Viewing outbound rules of the security group
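To confirm from a worker node that the master ports are reachable, a connection probe such as the following can be used. This is a sketch under assumptions: it treats ports 5443, 5444, 9443, and 8445 as TCP, it needs bash and the coreutils timeout command on the node, and the master address is supplied by you. Port 4789 is skipped because VXLAN traffic on that port is UDP, and port 4003 is skipped because its protocol is not stated here. The function names are hypothetical.

```shell
# Succeed if a TCP connection to host:port opens within 3 seconds.
# Relies on bash's /dev/tcp redirection and the coreutils timeout command.
tcp_port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Report the state of each probed port for one master node address.
check_master_ports() {
    for port in 5443 5444 9443 8445; do
        if tcp_port_open "$1" "$port"; then
            echo "$port open"
        else
            echo "$port unreachable"
        fi
    done
}

# Example with a hypothetical master node IP address:
#   check_master_ports 192.168.0.10
```

A port reported as unreachable suggests that the corresponding inbound rule was removed or changed, although a stopped service on the master node gives the same result.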
Check Item 5: Whether the Security Group Rules Contain the Security Group Policy for the Communication Between the Master Node and the Worker Node
Check whether such a security group policy exists.
When a node is added to an existing cluster, if an extended CIDR block is added to the VPC corresponding to the subnet and the subnet is an extended CIDR block, you need to add the following three security group rules to the master node security group (the group name is in the format of Cluster name-cce-control-Random number). These rules ensure that the nodes added to the cluster are available. (This step is not required if an extended CIDR block has been added to the VPC during cluster creation.) In the following figure, the Source column lists the node CIDR blocks.

Check Item 6: Whether the Disk Is Abnormal
After a node is created in a cluster of v1.7.3-r7 or a later version, a 100 GB data disk dedicated to Docker is attached to the node. If the data disk is detached or damaged, the Docker service becomes abnormal and the node becomes unavailable.
Click the node name to check whether the data disk attached to the node has been detached. If it has, attach a data disk to the node again and restart the node. The node can then be recovered.
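The presence of the data disk can also be checked from inside the node. The following sketch assumes the data disk appears as the block device vdb, which depends on the ECS flavor and disk order. The function name disk_listed is hypothetical.

```shell
# Succeed if the named block device appears in the device list read from stdin.
# The list is expected in the format printed by: lsblk -dn -o NAME
disk_listed() {
    grep -qx "$1"
}

# Typical use on the node (vdb as the data disk name is an assumption):
#   lsblk -dn -o NAME | disk_listed vdb || echo "data disk missing"
```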
Check Item 7: Whether Internal Components Are Normal
- Log in to the ECS where the unavailable node is located.
For details, see Logging In to an ECS.
- Run the following command to check whether the PaaS components are normal:
For version 1.13, run the following command:
systemctl status kubelet
If this command fails to be run, contact technical support by submitting a service ticket. If this command is successfully executed, the status of each component is displayed as active, as shown in the following figure.

If the component status is not active, run the following commands (using the faulty component canal as an example):
Run systemctl restart canal to restart the component.
After restarting the component, run systemctl status canal to check the status.
For versions earlier than v1.13, run the following command:
su paas -c '/var/paas/monit/bin/monit summary'
If this command fails to be run, contact technical support by submitting a service ticket. If this command is successfully executed, the status of each component is displayed, as shown in the following figure.

If any service component is not in the Running state, restart the corresponding service. For example, the canal component is abnormal, as shown in the following figure.

Run su paas -c '/var/paas/monit/bin/monit restart canal' to restart the canal component.
After the restart, run su paas -c '/var/paas/monit/bin/monit summary' to query the status of the canal component.
In that case, the status of each component is Running, as shown in the following figure.

- If the restart command fails to be run, run the following command to check the running status of the monitrc process:
ps -ef | grep monitrc
- If the monitrc process exists, run the following command to kill this process. The monitrc process will be automatically restarted after it is killed.
kill -s 9 `ps -ef | grep monitrc | grep -v grep | awk '{print $2}'`
- If the monitrc process does not exist or is not restarted after being killed, contact technical support by submitting a service ticket.
- If the fault persists, collect the logs in the /var/log/messages and /var/paas/sys/log directories and contact Huawei technical support by submitting a service ticket.
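For nodes where components are managed by systemd, the status check and restart can be combined in a short loop. This is a sketch only: the function name inactive_units is hypothetical, and the unit list in the example (kubelet, canal, docker) is illustrative and may not match the units on your node.

```shell
# Print every listed systemd unit that is not currently active.
inactive_units() {
    for unit in "$@"; do
        systemctl is-active --quiet "$unit" || echo "$unit"
    done
}

# Typical use on the node; adjust the unit list to the components on your node:
#   for unit in $(inactive_units kubelet canal docker); do
#       systemctl restart "$unit"
#       systemctl status "$unit" --no-pager
#   done
```

If a unit is still not active after the restart, collect the logs and submit a service ticket as described above.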
Check Item 8: Whether the DNS Address Is Correct
- Log in to the HUAWEI CLOUD management console. In the service list on the top of the page, choose Computing > Elastic Cloud Server.
- In the ECS list, locate the newly created node (generally named in the format of Cluster name-Random number) in the cluster and click Remote Login in the Operation column.
- After logging in to the node, run the following command to check whether the /var/log/cloud-init-output.log file records a domain name resolution failure:
cat /var/log/cloud-init-output.log | grep resolv
If the command output contains the following information, the domain name cannot be resolved:
Could not resolve host: cce-north-4.obs.cn-north-4.myhuaweicloud.com; Unknown error
- Run the following command to check whether the domain name can be resolved on the node:
ping cce-north-4.obs.cn-north-4.myhuaweicloud.com
- If not, the DNS cannot resolve the IP address. Check whether the DNS address in the /etc/resolv.conf file is the same as that configured on the VPC subnet. In most cases, the DNS address in the file is incorrectly configured. As a result, the domain name cannot be resolved. Correct the DNS configuration of the VPC subnet and reset the node.
- If yes, the DNS address configuration is correct. Check whether there are other faults.
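The comparison of DNS addresses can be sketched as follows. The function names list_nameservers and can_resolve are hypothetical. The sketch assumes getent is available on the node, and note that getent hosts also consults /etc/hosts, not only the DNS server.

```shell
# Print the nameserver addresses found in resolv.conf-style text on stdin.
list_nameservers() {
    awk '$1 == "nameserver" {print $2}'
}

# Succeed if the system resolver can resolve the given host name.
can_resolve() {
    getent hosts "$1" >/dev/null
}

# Typical use on the node:
#   list_nameservers < /etc/resolv.conf
#   can_resolve cce-north-4.obs.cn-north-4.myhuaweicloud.com || echo "resolution failed"
```

Compare the printed addresses with the DNS server addresses configured on the VPC subnet.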
Check Item 9: Whether the vdb Disk on the Node Is Deleted
If the vdb disk on a node is deleted, you can refer to this topic to restore the node.
Check Item 10: Whether the Docker Service Is Normal
- Run the following command to check whether the Docker service is running:
systemctl status docker

If the command fails or the Docker service status is not active, locate the cause or contact technical support if necessary.
- Run the following command to check the number of containers on the node:
docker ps -a | wc -l
If the command hangs, takes a long time to run, or there are more than 1000 abnormal containers, check whether workloads are being repeatedly created and deleted. Frequent creation and deletion of a large number of containers can leave many abnormal containers that cannot be cleared in a timely manner.
In this case, stop repeated creation and deletion of the workload or use more nodes to share the workload. Generally, the nodes will be restored after a period of time. If necessary, run the docker rm {container_id} command to manually clear abnormal containers.
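To count containers by state rather than in total, the output of docker ps can be filtered. The following sketch assumes that abnormal containers are those in the exited or dead state, which is an interpretation rather than a definition from this topic. The function name count_containers is hypothetical.

```shell
# Count the containers on the node that are in the given state
# (for example exited, dead, or created).
count_containers() {
    docker ps -a -q --filter "status=$1" | wc -l | tr -d ' '
}

# Typical use on the node:
#   count_containers exited
#   count_containers dead
# To remove all exited containers in one pass (running containers are not touched):
#   docker ps -a -q --filter "status=exited" | xargs -r docker rm
```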