What Should I Do If Container Startup Fails?
Fault Locating
If an event on the workload details page indicates that a container failed to start, perform the following steps to locate the fault:
- Log in to the node where the abnormal workload is located.
- Obtain the ID of the container in the workload pod that exited abnormally.
docker ps -a | grep $podName
- View the logs of the corresponding container.
docker logs $containerID
Rectify the workload fault based on the logs.
- Check the system logs for OOM errors.
cat /var/log/messages | grep $containerID | grep oom
Check the logs to determine whether a system OOM was triggered.
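The first two steps can be combined into a small helper that lists only the exited containers of a pod. This is a minimal sketch, assuming the `docker ps -a` output is requested with `--format '{{.ID}} {{.Names}} {{.Status}}'`; the function name `exited_ids` is hypothetical. Like `grep $podName`, it matches the pod name as a substring of the container name.

```shell
# Hypothetical helper: print IDs of exited containers whose name contains the pod name.
# Expects lines in the form "<ID> <Names> <Status...>", as produced by
#   docker ps -a --format '{{.ID}} {{.Names}} {{.Status}}'
exited_ids() {
  pod_name=$1
  # $2 is the container name; $3 is the first word of the status ("Exited", "Up", ...)
  awk -v pod="$pod_name" 'index($2, pod) > 0 && $3 == "Exited" { print $1 }'
}
```

For example: `docker ps -a --format '{{.ID}} {{.Names}} {{.Status}}' | exited_ids "$podName"` prints the IDs to pass to `docker logs`.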
Troubleshooting Process
Determine the cause based on the event information, as listed in Table 1.
Log or Event | Cause and Solution |
---|---|
The log contains exit(0). | No process keeps running in the container. Check whether the container is running properly. See Check Item 1: Whether There Are Processes that Keep Running in the Container (Exit Code: 0). |
Event information: Liveness probe failed: Get http... The log contains exit(137). | The health check failed. See Check Item 2: Whether Health Check Fails to Be Performed (Exit Code: 137). |
Event information: Thin Pool has 15991 free data blocks which are less than minimum required 16383 free data blocks. Create more free space in thin pool or use dm.min_free_space option to change behavior | The disk space is insufficient. Free up disk space. See Check Item 3: Whether the Container Disk Space Is Insufficient. |
The log contains the keyword OOM. | The memory is insufficient. See Check Item 4: Whether the Upper Limit of Container Resources Has Been Reached and Check Item 5: Whether the Resource Limits Are Improperly Configured for the Container. |
The log contains Address already in use. | Container ports in the same pod conflict with each other. See Check Item 6: Whether the Container Ports in the Same Pod Conflict with Each Other. |
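The table can be read as a simple lookup from a log line or event message to a check item. The sketch below encodes that lookup as a POSIX `case` statement; the function name `classify_failure` is hypothetical, the patterns are taken from the table, and the first matching pattern wins, so it is a rough triage aid rather than a diagnosis.

```shell
# Hypothetical helper: map a log line or event message to a check item in Table 1.
# Patterns are checked top to bottom; the first match wins.
classify_failure() {
  case "$1" in
    *"exit(0)"*)                              echo "Check Item 1" ;;
    *"Liveness probe failed"*|*"exit(137)"*)  echo "Check Item 2" ;;
    *"Thin Pool has"*)                        echo "Check Item 3" ;;
    *OOM*|*oom*)                              echo "Check Items 4 and 5" ;;
    *"Address already in use"*)               echo "Check Item 6" ;;
    *)                                        echo "Other causes (Check Items 7 and 8)" ;;
  esac
}
```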
Other possible causes include:
- Check Item 7: Whether the Container Startup Command Is Correctly Configured
- Check Item 8: Whether the User Service Has a Bug
- The image does not match the node architecture. When you create a workload on an Arm node, use an image built for Arm.
Check Item 1: Whether There Are Processes that Keep Running in the Container (Exit Code: 0)
- Log in to the node where the abnormal workload is located.
- View the container status.
docker ps -a | grep $podName
If no process keeps running in the container, the status Exited (0) is displayed.
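When scripting this check, the exit code can be read from the STATUS column. This is a minimal sketch, assuming the status text has the form `Exited (N) ...` as printed by `docker ps -a`; the function name `exit_code_from_status` is hypothetical. It prints nothing for containers that are still running.

```shell
# Hypothetical helper: extract the exit code from a docker ps STATUS value
# such as "Exited (0) 3 minutes ago". Prints nothing if the status is not "Exited (N)".
exit_code_from_status() {
  # In a basic regular expression, "(" is literal and "\(...\)" is a capture group.
  printf '%s\n' "$1" | sed -n 's/^Exited (\([0-9][0-9]*\)).*/\1/p'
}
```

For example, the status can be obtained with `docker ps -a --filter "name=$podName" --format '{{.Status}}'`.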
Check Item 2: Whether Health Check Fails to Be Performed (Exit Code: 137)
The health check configured for a workload periodically probes the service. If the check fails, the pod reports an event and the container fails to restart.
If a liveness probe is configured for the workload and the number of consecutive health check failures exceeds the threshold, the containers in the pod are restarted. If the Kubernetes events on the workload details page contain Liveness probe failed: Get http..., the health check has failed.
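Exit codes above 128 conventionally mean that the process was terminated by a signal whose number is the exit code minus 128. So 137 = 128 + 9, which is SIGKILL: the signal that ends a container that does not stop within its grace period after a failed liveness probe (an OOM kill produces the same code). A minimal sketch of this arithmetic, using the shell's `kill -l` to name the signal; the function name `decode_exit_code` is hypothetical.

```shell
# Hypothetical helper: explain a container exit code.
# Codes above 128 conventionally mean "terminated by signal (code - 128)".
decode_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # kill -l <number> prints the signal name without the SIG prefix, for example KILL
    echo "signal $sig (SIG$(kill -l "$sig"))"
  else
    echo "exit status $code"
  fi
}
```

For example, `decode_exit_code 137` prints `signal 9 (SIGKILL)`, and `decode_exit_code 143` prints `signal 15 (SIGTERM)`.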
Solution
Click the workload name to go to the workload details page and click the Containers tab. Then select Health Check and check whether the health check policy is appropriate and whether the service is running properly.
Check Item 3: Whether the Container Disk Space Is Insufficient
The following message indicates that the thin pool, which is allocated from the Docker disk selected during node creation, is running out of free space. Run the lvs command as the root user to view the current disk usage.
Thin Pool has 15991 free data blocks which are less than minimum required 16383 free data blocks. Create more free space in thin pool or use dm.min_free_space option to change behavior
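The numbers in the message are block counts: the pool has 15991 free blocks but needs at least 16383, so it is 16383 - 15991 = 392 blocks short. The sketch below extracts that shortfall from the message text; the function name `thinpool_shortfall` is hypothetical and the sed pattern assumes the exact message wording shown above.

```shell
# Hypothetical helper: read the thin pool event message on stdin and print how
# many free data blocks are missing to reach the required minimum.
thinpool_shortfall() {
  sed -n 's/.*has \([0-9][0-9]*\) free data blocks which are less than minimum required \([0-9][0-9]*\) free data blocks.*/\1 \2/p' |
    while read -r free min; do
      echo $((min - free))
    done
}
```

To check usage directly instead, `lvs -o lv_name,data_percent,metadata_percent vgpaas` reports the data and metadata usage of the thin pool as percentages (vgpaas is the volume group name used in the lsblk examples in this section).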
Solution
Solution 1: Clearing images
- Nodes that use containerd
  - Obtain local images on the node.
    crictl images -v
  - Delete the images that are not required by image ID.
    crictl rmi $imageID
- Nodes that use Docker
  - Obtain local images on the node.
    docker images
  - Delete the images that are not required by image ID.
    docker rmi $imageID
Do not delete system images such as the cce-pause image. Otherwise, pods may fail to be created.
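Before running docker rmi, the candidate list can be built with pause images filtered out. This is a minimal sketch, assuming the list comes from `docker images --format '{{.ID}} {{.Repository}}:{{.Tag}}'`; the function name `removable_image_ids` is hypothetical, and matching the substring pause is an assumption that protects only pause images, so other system images still need a manual review.

```shell
# Hypothetical helper: print image IDs that are candidates for deletion,
# skipping any repository whose name contains "pause" (for example, cce-pause).
# Expects lines in the form "<ID> <Repository>:<Tag>", as produced by
#   docker images --format '{{.ID}} {{.Repository}}:{{.Tag}}'
removable_image_ids() {
  awk '$2 !~ /pause/ { print $1 }'
}
```

Review the printed IDs before passing any of them to docker rmi. docker rmi reports a conflict instead of deleting an image that a container still references.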
Solution 2: Expanding the disk capacity
To expand the disk capacity, perform the following steps:
- Expand the capacity of the data disk on the EVS console.
- Log in to the CCE console and click the cluster. In the navigation pane, choose Nodes. Click More > Sync Server Data in the row containing the target node.
- Log in to the target node.
- Run the lsblk command to check the block device information of the node.
How the data disk is divided depends on the container storage Rootfs:
- Overlayfs: No independent thin pool is allocated. Image data is stored in the dockersys disk.
# lsblk
NAME                MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                   8:0    0   50G  0 disk
└─sda1                8:1    0   50G  0 part /
sdb                   8:16   0  200G  0 disk
├─vgpaas-dockersys  253:0    0   90G  0 lvm  /var/lib/docker                # Space used by the container engine
└─vgpaas-kubernetes 253:1    0   10G  0 lvm  /mnt/paas/kubernetes/kubelet   # Space used by Kubernetes
Run the following commands on the node to add the new disk capacity to the dockersys disk:
pvresize /dev/sdb
lvextend -l+100%FREE -n vgpaas/dockersys
resize2fs /dev/vgpaas/dockersys
- Devicemapper: A thin pool is allocated to store image data.
# lsblk
NAME                      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                         8:0    0   50G  0 disk
└─sda1                      8:1    0   50G  0 part /
sdb                         8:16   0  200G  0 disk
├─vgpaas-dockersys        253:0    0   18G  0 lvm  /var/lib/docker
├─vgpaas-thinpool_tmeta   253:1    0    3G  0 lvm
│ └─vgpaas-thinpool       253:3    0   67G  0 lvm                # Thin pool space
│   ...
├─vgpaas-thinpool_tdata   253:2    0   67G  0 lvm
│ └─vgpaas-thinpool       253:3    0   67G  0 lvm
│   ...
└─vgpaas-kubernetes       253:4    0   10G  0 lvm  /mnt/paas/kubernetes/kubelet
- Run the following commands on the node to add the new disk capacity to the thinpool disk:
pvresize /dev/sdb
lvextend -l+100%FREE -n vgpaas/thinpool
- Run the following commands on the node to add the new disk capacity to the dockersys disk:
pvresize /dev/sdb
lvextend -l+100%FREE -n vgpaas/dockersys
resize2fs /dev/vgpaas/dockersys
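The two cases above differ in one step: dockersys holds a mounted file system that resize2fs must grow after lvextend, while the thin pool is not a file system and needs no resize step. This is a minimal sketch that prints, rather than runs, the commands for either volume; the function name `expand_cmds` and its arguments are hypothetical, and /dev/sdb and vgpaas are the names used in the examples above.

```shell
# Hypothetical helper: print (not run) the expansion commands for one logical volume.
# $1: target LV name, "dockersys" or "thinpool"; $2: physical disk, for example /dev/sdb
expand_cmds() {
  lv=$1
  disk=$2
  echo "pvresize $disk"
  echo "lvextend -l+100%FREE -n vgpaas/$lv"
  # dockersys carries a file system that must be grown; the thin pool does not
  if [ "$lv" = "dockersys" ]; then
    echo "resize2fs /dev/vgpaas/$lv"
  fi
}
```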
Check Item 4: Whether the Upper Limit of Container Resources Has Been Reached
If the container has reached its resource upper limit, OOM is displayed in the event details and in the log:
cat /var/log/messages | grep 96feb0a425d6 | grep oom
If the resources requested by a workload exceed the configured upper limit, the system OOM is triggered and the container exits unexpectedly.
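The OOM check can be scripted against any log stream. This is a minimal sketch, assuming the log is passed on stdin; the function name `oom_lines_for` is hypothetical, and matching oom case-insensitively (grep -i) is a deliberate broadening of the command above.

```shell
# Hypothetical helper: print lines from a log on stdin that mention the given
# container ID and an OOM event (matched case-insensitively).
oom_lines_for() {
  grep -F -- "$1" | grep -i oom
}
```

For example: `oom_lines_for 96feb0a425d6 < /var/log/messages`. On Docker nodes, `docker inspect --format '{{.State.OOMKilled}}' $containerID` prints true when Docker recorded an OOM kill for that container.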
Check Item 5: Whether the Resource Limits Are Improperly Configured for the Container
If the resource limits set for the container during workload creation are lower than what the container needs, the container fails to restart.
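To judge whether a memory limit is too low, compare it with the memory the container actually uses, in the same unit. This is a minimal sketch for converting Kubernetes memory quantities to bytes; the function name `quantity_to_bytes` is hypothetical, and it handles only integer values with the suffixes Ki, Mi, Gi, k, M, G, or no suffix (decimal values and other suffixes are not supported).

```shell
# Hypothetical helper: convert a Kubernetes memory quantity such as 512Mi or
# 500M to bytes. Binary suffixes (Ki, Mi, Gi) use powers of 1024; decimal
# suffixes (k, M, G) use powers of 1000. Only integer values are handled.
quantity_to_bytes() {
  q=$1
  case "$q" in
    *Ki) echo $(( ${q%Ki} * 1024 )) ;;
    *Mi) echo $(( ${q%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${q%Gi} * 1024 * 1024 * 1024 )) ;;
    *k)  echo $(( ${q%k} * 1000 )) ;;
    *M)  echo $(( ${q%M} * 1000 * 1000 )) ;;
    *G)  echo $(( ${q%G} * 1000 * 1000 * 1000 )) ;;
    *)   echo $(( q )) ;;
  esac
}
```

For example, `quantity_to_bytes 512Mi` prints 536870912.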
Check Item 6: Whether the Container Ports in the Same Pod Conflict with Each Other
- Log in to the node where the abnormal workload is located.
- Obtain the ID of the container in the workload pod that exited abnormally.
docker ps -a | grep $podName
- View the logs of the corresponding container.
docker logs $containerID
Rectify the workload fault based on the logs. In the following figure, container ports in the same pod conflict, so the container fails to start.
Figure 2 Container restart failure due to a container port conflict
Solution
Re-create the workload and set a port number that is not used by any other container in the same pod.
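Containers in the same pod share one network namespace, so two containers cannot listen on the same port; the second one fails with Address already in use. Before re-creating the workload, the planned container ports can be checked for duplicates. This is a minimal sketch; the function name `duplicate_ports` and passing one argument per container port are assumptions.

```shell
# Hypothetical helper: print every port number that appears more than once
# in the arguments (one argument per container port in the pod).
duplicate_ports() {
  printf '%s\n' "$@" | sort -n | uniq -d
}
```

For example, `duplicate_ports 80 8080 80` prints 80.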
Check Item 7: Whether the Container Startup Command Is Correctly Configured
If the startup command is incorrectly configured, error messages are reported when the container starts.
Solution
Click the workload name to go to the workload details page and click the Containers tab. Choose Lifecycle, click Startup Command, and ensure that the command is correct.
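A startup command often fails because its executable is missing from the image or not on PATH. This is a minimal sketch that checks the first word of a command string with the POSIX `command -v` builtin; the function name `check_startup_cmd` is hypothetical, and because it checks the environment it runs in, it is meaningful only when run inside the image, not on the node.

```shell
# Hypothetical helper: report whether the executable at the start of a
# startup command can be found on PATH in the current environment.
check_startup_cmd() {
  # Word-split the command string on purpose to get its first word.
  set -- $1
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}
```

The same check can be run inside an image that contains sh, for example `docker run --rm --entrypoint sh $image -c 'command -v java'`, where $image and java are placeholders.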
Check Item 8: Whether the User Service Has a Bug
Check whether the workload startup command is correctly executed or whether the workload has a bug.
- Log in to the node where the abnormal workload is located.
- Obtain the ID of the container in the workload pod that exited abnormally.
docker ps -a | grep $podName
- View the logs of the corresponding container.
docker logs $containerID
Note: In the preceding command, containerID indicates the ID of the container that has exited.
Figure 3 Incorrect startup command of the container
As shown in the figure above, the container fails to start because the startup command is incorrect. For other errors, fix the bugs based on the logs.
Solution
Create a new workload and configure a correct startup command.
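When the logs are long, error-like lines can be pulled out first. This is a minimal sketch; the function name `last_errors` is hypothetical, and the keyword list is a heuristic assumption that should be adapted to the application's own log format.

```shell
# Hypothetical helper: show the last N (default 5) error-like lines from logs on stdin.
# The keyword list is a heuristic and is matched case-insensitively.
last_errors() {
  grep -iE 'error|exception|not found|permission denied|no such file' | tail -n "${1:-5}"
}
```

For example: `docker logs $containerID 2>&1 | last_errors 10`. The `2>&1` is needed because docker logs writes the container's stderr output to its own stderr.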