
Updated on 2025-02-18 GMT+08:00


Diagnosis Items and Rectification Solutions

Cluster Diagnosis Items and Rectification Solutions

Clusters

Scenario	Diagnosis Item	Enabling Monitoring Center	Rectification Solution
Cluster resource planning	Whether HA is enabled for the master node	Yes	The cluster has only one master node, or one of its master nodes is abnormal. If another master node becomes faulty, the cluster will be unavailable, which affects service reliability. To improve service resilience, use an HA cluster or rectify the node exceptions. An HA cluster remains available when a single master node is faulty.
	Whether the cluster CPU request has exceeded 80%	Yes	A request is the minimum CPU or memory a workload needs. Plan required resources based on service requirements. For details, see Configuring Container Specifications.
	Whether the cluster memory request has exceeded 80%	Yes
	Whether the cluster version has reached the end of service	No	After a cluster version reaches the end of service, CCE no longer supports the creation of new clusters of that version. Technical support for that version also ends, including new feature updates, vulnerability or issue fixes, new patches, work order guidance, and online checks. The CCE SLA is not valid for such clusters. Go to the Clusters page to upgrade the cluster version. For details, see Process and Method of Upgrading a Cluster.
Cluster O&M	Whether the Cloud Native Cluster Monitoring add-on is normal	No	The Cloud Native Cluster Monitoring add-on provides one-stop cluster monitoring. Go to the Add-ons page to install this add-on and check its status. For details, see Cloud Native Cluster Monitoring.
	Whether the Cloud Native Log Collection add-on is normal	No	The Cloud Native Log Collection add-on collects and manages workload logs. Go to the Add-ons page to install this add-on and check its status.
	Whether the CCE Node Problem Detector add-on is normal	No	The CCE Node Problem Detector add-on monitors the nodes. Go to the Add-ons page to install this add-on and check its status. For details, see CCE Node Problem Detector.
Cluster configuration	Whether the security group is correctly configured	No	Invalid cluster security group configuration makes it impossible for the nodes to communicate with each other. Retain the default security group configuration.
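
The 80% request checks above can be approximated by hand. The following is a minimal sketch, not the logic Health Center itself uses: the function names `cpu_to_millicores` and `request_percent` are hypothetical, and the sample figures are placeholders. On a real cluster, the requested and allocatable values could be read from the output of `kubectl describe node`.

```shell
#!/bin/sh
# Convert a Kubernetes CPU quantity ("500m", "2", "1.5") to millicores.
cpu_to_millicores() {
    case "$1" in
        *m) echo "${1%m}" ;;                                   # already in millicores
        *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;  # cores, rounded
    esac
}

# Print requested/allocatable as a whole percentage (fraction truncated).
# Arguments: requested_millicores allocatable_millicores
request_percent() {
    awk -v r="$1" -v a="$2" 'BEGIN { printf "%d\n", r * 100 / a }'
}

# Placeholder figures: 3300m requested on a cluster with 4 allocatable cores.
requested="$(cpu_to_millicores 3300m)"
allocatable="$(cpu_to_millicores 4)"
echo "CPU request: $(request_percent "$requested" "$allocatable")%"
```

A result above 80 corresponds to the condition that the diagnosis item reports. The same arithmetic applies to memory once both values are expressed in the same unit.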

Core Add-ons

Scenario	Diagnosis Item	Enabling Monitoring Center	Rectification Solution
CoreDNS status	Whether CoreDNS is normal	No	CoreDNS is a mandatory add-on that provides domain name resolution for clusters. If this add-on is not installed or is abnormal, the overall service response of the cluster will be affected. Go to the Add-ons page to install this add-on or check the add-on status.
	Whether the CPU usage of CoreDNS has exceeded 80% in the last 24 hours	Yes	CoreDNS provides domain name resolution for clusters. If the resource usage is too high, the add-on may be overloaded. Domain name resolution will be affected, and the latency is increased. To prevent services from being affected, analyze the recent QPS of CoreDNS. Go to Monitoring Center, click the Dashboard tab, and select the CoreDNS view to view the instance metrics. If the metric values reach the thresholds, adjust the specifications.
	Whether the memory usage of CoreDNS has exceeded 80% in the last 24 hours	Yes
	Whether CoreDNS failed to resolve domain names in the last 24 hours	Yes	If CoreDNS failed to resolve domain names, services are affected.
	Whether the P99 latency of CoreDNS has exceeded 5s in the last 24 hours	Yes	If the latency increases, responses to DNS requests become slow.
CCE Container Storage (Everest) status	Whether Everest is normal	No	Everest is a mandatory add-on that provides cloud storage services for clusters. If this add-on is not installed or is abnormal, the cluster storage capability is affected. Go to the Add-ons page to install this add-on or check the add-on status.
	Whether the CPU usage of everest-controller has exceeded 80% in the last 24 hours	Yes	Everest provides cloud storage services for clusters. If the resource usage is too high, the add-on may be overloaded, and cluster cloud storage is affected. To prevent cloud storage from being affected, analyze the recent load of everest-controller. Choose Monitoring Center > Workloads to view the Everest instance metrics. If the metric values reach the thresholds, adjust the specifications. For details, see CCE Container Storage (Everest).
	Whether the memory usage of everest-controller has exceeded 80% in the last 24 hours	Yes
Cloud Native Cluster Monitoring status	Whether Cloud Native Cluster Monitoring is normal	No	The Cloud Native Cluster Monitoring add-on provides one-stop cluster monitoring. Go to the Add-ons page to install this add-on and check its status.
	Whether the CPU usage of the prometheus workload has exceeded 80% in the last 24 hours	Yes	The Cloud Native Cluster Monitoring add-on provides cluster monitoring. If the resource usage is too high, this add-on may be overloaded, and cluster monitoring is affected. Choose Monitoring Center > Workloads to view the prometheus instance metrics. If the metric values reach the thresholds, adjust the specifications. NOTE: The PVC resource usage is checked when this add-on is deployed with local data storage enabled. In this mode, the collected metrics are stored in the cluster PV.
	Whether the memory usage of the prometheus workload has exceeded 80% in the last 24 hours	Yes
	Whether the PVC usage of prometheus-server has exceeded 80% when the prometheus workload is deployed in server mode	Yes
	Whether OOM has occurred for the prometheus workload in the last 24 hours	No	The Cloud Native Cluster Monitoring add-on provides cluster monitoring. OOM occurs when the memory usage of the add-on instance reaches the limit. As a result, metric reporting will be affected, and non-HA cluster monitoring will be unavailable. Adjust the specifications of the prometheus instance.
CCE Cluster Autoscaler status	Whether the CCE Cluster Autoscaler add-on is available when auto scaling is enabled for node pools	No	The CCE Cluster Autoscaler add-on provides auto scalability for clusters. If this add-on is abnormal, auto scaling that has been enabled for a node pool becomes unavailable. Check the add-on status on the Add-ons page. NOTE: The add-on status is checked only when auto scaling is enabled for node pools.
Cloud Native Log Collection status	Whether the Cloud Native Log Collection add-on is normal	No	The Cloud Native Log Collection add-on collects and manages workload logs. Go to the Add-ons page to install this add-on and check its status.
Cloud Native Log Collection status	Whether the default LTS log group and log streams are created	No	The default event log group and log streams are the basic units for event reporting in Monitoring Center. If the log group and log streams do not exist, event reporting is unavailable. For details about how to create a log group and log streams, see Collecting Container Logs Using Cloud Native Log Collection.

Nodes

Scenario	Diagnosis Item	Enabling Monitoring Center	Rectification Solution
Node status	Whether nodes are ready	Yes	If a node is not ready, services running on the node may be affected. Rectify the fault in a timely manner.
	Whether nodes can be scheduled	Yes	If a node cannot be scheduled, node resources cannot be used. Go to the CCE node management page to check whether the node status meets the expectation.
	Whether kubelet is normal	Yes	kubelet is a key component of the nodes. If kubelet is abnormal, the nodes may be abnormal and the pod status is inconsistent with that on the API server. Run the journalctl -l -u kubelet command on each node to view the kubelet log and locate the cause.
Node configuration	Whether the memory requests of pods on a node have exceeded 80% of the node memory	Yes	The CPU and memory already requested by the pods on a node determine whether new applications can be scheduled to the node. If the request of a new pod is higher than the resources still available on the node, the pod will not be scheduled to that node. A failed check means that the total requests on the node have exceeded 80% of the node resources. Plan required resources for your applications based on the results.
Node configuration	Whether the CPU requests of pods on a node have exceeded 80% of the node CPU	Yes
Resource usage of nodes	Whether the CPU usage of a node has exceeded 80% in the last 24 hours	Yes	If the node CPU usage is too high, the workloads running on the node will be affected. Go to Monitoring Center to view the node CPU usage. Then plan required node resources or expand the node capacity.
	Whether the memory usage of a node has exceeded 80% in the last 24 hours	Yes	If the memory usage of a node is too high, there are OOM risks, affecting service availability on the node. Go to Monitoring Center to view the node memory usage. Then plan required node resources or expand the node capacity.
	Whether the disk usage of a node has exceeded 80%	Yes	If the node disk usage is too high, the pods will be affected. Expand capacity in a timely manner. Run the following commands to view disk details: lsblk (information about all available block devices); df -h (available disk space of each mounted disk); fdisk -l (all partitions).
	Whether the number of PIDs for a node exceeds the limit	Yes	The node is under PID pressure, which makes it unstable. Terminate unnecessary processes on the node or modify the PID limit. Run the following commands to view PID details: sysctl kernel.pid_max (the maximum number of PIDs); ps -eLf | awk '{print $2}' | sort -rn | head -n 1 (the current maximum PID); ps -elT | awk '{print $4}' | sort | uniq -c | sort -k1 -g | tail -5 (the five processes with the most threads, or SPIDs).
	Whether OOM has occurred on a node in the last 24 hours	Yes	If OOM occurs on a node, service functions on the node are affected. Go to Monitoring Center to view the node memory. Then plan required resources or expand the capacity.
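
To see which mount points trigger the disk usage item, the output of df can be filtered against the 80% threshold. This is a minimal sketch under stated assumptions: the function name `disks_over_threshold` is hypothetical, the input is expected in the POSIX `df -P` layout, and mount points that contain spaces are not handled.

```shell
#!/bin/sh
# Print mount points whose usage exceeds a threshold (default 80).
# Reads `df -P` formatted text on stdin, so it can be fed saved output.
disks_over_threshold() {
    threshold="${1:-80}"
    awk -v t="$threshold" 'NR > 1 {        # skip the header line
        use = $5
        sub(/%/, "", use)                  # strip the trailing percent sign
        if (use + 0 > t) print $6, use "%" # mount point and its usage
    }'
}

# Usage on a node:
#   df -P | disks_over_threshold 80
```

Each printed line is a mount point that exceeds the threshold, which narrows down where to expand capacity or delete files.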

Workloads

Scenario	Diagnosis Item	Enabling Monitoring Center	Rectification Solution
Pod status	Whether pods are normal	No	If a pod fails to function normally, the workload performance for that pod may deteriorate. If there are no replicas available, the workload may be inaccessible. Run the following commands to view pod details: kubectl get pod <PodName> -n <Namespace> -o yaml (pod configuration); kubectl describe pod <PodName> -n <Namespace> (pod events); kubectl logs <PodName> -n <Namespace> -c <ContainerName> (container logs).
Pod workload	Whether OOM has occurred on a pod in the last 24 hours	No	If OOM occurs on a pod, service functions of the pod are affected. Go to Monitoring Center to view the pod memory and adjust the workload specifications.
	Whether the CPU usage of a pod has exceeded 80% in the last 24 hours	Yes	If the resource usage is too high, the pod may be overloaded. This increases the latency and slows down service responses. Choose Monitoring Center > Pods to view the instance metrics. If the metric values reach the thresholds, adjust the container specifications.
	Whether the memory usage of a pod has exceeded 80% in the last 24 hours	Yes
Pod configuration	Whether requests are configured for containers in a pod	No	If requests are not configured, the scheduler cannot take the resource needs of the pod into account, and pods may be scheduled to nodes whose resources cannot meet requirements. Excessively high requests, on the other hand, reduce the resource utilization of nodes. Configure requests that match the actual needs of the containers.
Pod probe configuration	Whether liveness probes are configured for containers in a pod	No	If no liveness probes are configured, application exceptions in a pod cannot be detected, and the pod cannot be restarted in a timely manner, which will affect the QoS. Configure liveness probes for the pod so that abnormal applications are detected and the pod is restarted in a timely manner.
Pod probe configuration	Whether readiness probes are configured for containers in a pod	No	If no readiness probes are configured, requests are still sent to the pod even if it becomes abnormal, which will affect the QoS. Configure readiness probes for the pod so that requests are sent only to pods that are ready to handle them.
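
The pod configuration and probe items above can be illustrated with a single manifest. The sketch below writes an example pod definition to a temporary file. The pod name, image, port, probe path, and all resource and timing values are placeholders chosen for illustration, not recommended settings.

```shell
#!/bin/sh
# Write an example pod manifest that sets requests, limits, and both probes.
# Every name and number below is a placeholder; adjust them to the application.
manifest="${TMPDIR:-/tmp}/probe-demo-pod.yaml"
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: app
    image: nginx:alpine
    resources:
      requests:          # used by the scheduler when placing the pod
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi
    livenessProbe:       # failure causes the container to be restarted
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:      # failure stops traffic from being sent to the pod
      httpGet:
        path: /
        port: 80
      periodSeconds: 5
EOF
echo "wrote $manifest"
```

The file can be validated without creating anything by running `kubectl apply --dry-run=client -f` on it, and applied with `kubectl apply -f` once the values suit the workload.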

External Dependencies

Scenario	Diagnosis Item	Enabling Monitoring Center	Rectification Solution
Resource quotas of a node	Whether 90% or more of the EVS disk quota has been used	Yes	Sufficient resource quotas are required for node creation in a cluster. If there are insufficient resource quotas, choose Resources > My Quotas and contact customer service to apply for account quotas.
Resource quotas of a node	Whether 90% or more of the ECS quota has been used	Yes

Pod Diagnosis Items and Rectification Solutions

**Table 1** Pod diagnosis items and rectification solutions
Diagnosis Item		Rectification Solution
FailedScheduling	Insufficient memory	The available memory of the node is insufficient. Expand the memory capacity.
	Insufficient cpu	The available CPU of the node is insufficient. Expand the CPU capacity.
	skip schedule deleting pod	The pod is being deleted.
	Other information	If the pod fails to be scheduled for another reason, view the pod information: kubectl describe pod <pod-name> -n <namespace>
FailedAttachVolume		Check the status of the Everest add-on and node network connection, and ensure that the node has required permissions.
FailedMount		Check the status of the Everest add-on and node network connection, and ensure that the node has required permissions.
InvalidDiskCapacity		Check the disk capacity of the node and the actual available space. Ensure that the disk capacity is correctly set and meets the storage requirements of applications or services. Delete unnecessary files to release disk space. If a dynamic volume is used, ensure that the storage backend configuration is correct and available. Expand the disk capacity or adjust the storage requirements of applications or services as needed.
BackOffPullImage		Ensure that the image tag is correct.
FailedPullImage		Ensure that the image tag is correct.
ErrImageNeverPull		Check the local image. You are advised to set the image pull policy to IfNotPresent or Always.
InspectFailed		Check the integrity of the image.
FailedPostStartHook		Check the configuration and script of the post-start hook to ensure that they are correct. View the hook execution log to obtain the error information and rectify the fault in the hook script based on the error information. If possible, manually execute the post-start hook script to check whether the environment or permissions are correct.
FailedPreStopHook		Check the configuration and script of the pre-stop hook to ensure that they are correct. View the hook execution log to obtain the error information and rectify the fault in the hook script based on the error information. If possible, manually execute the pre-stop hook script to check whether the environment or permissions are correct.
ProbeWarning		Check the probe configuration to ensure that the probe is correctly configured and can correctly evaluate the container health status. View the alarm information to find the possible faults, and adjust the probe configuration or rectify the faults in the container as needed.
Unhealthy		Check the pod or container logs to find error information. Ensure that applications or services are correctly started and running in the container. Check the container resource usage to determine whether resources are insufficient. Take measures based on logs and monitoring information, such as restarting pods or containers to rectify application or service faults.
FailedCreatePodContainer		Check the pod and container configurations to ensure that the YAML file is correct, including the container image, resource request, and limit.
Preempting		You are advised to set proper resource requests and limits for the workload to prevent preemption caused by insufficient resources.
Killing		Check the resource usage and ensure that the resource requests and limits of pods and nodes are properly set to prevent containers from being terminated due to insufficient resources.
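
During troubleshooting, Table 1 can be condensed into a small lookup that maps an event reason to the first thing to check. This is only an illustrative sketch: the function name `first_check_for_reason` is hypothetical, the hints are shortened paraphrases of the table, and only some of the reasons are covered.

```shell
#!/bin/sh
# Map a pod event reason to a short first-check hint paraphrased from Table 1.
first_check_for_reason() {
    case "$1" in
        FailedScheduling)
            echo "Check node CPU and memory, then describe the pod" ;;
        FailedAttachVolume|FailedMount)
            echo "Check the Everest add-on, node network, and node permissions" ;;
        BackOffPullImage|FailedPullImage)
            echo "Check the image tag" ;;
        ErrImageNeverPull)
            echo "Check the local image and the image pull policy" ;;
        InspectFailed)
            echo "Check the integrity of the image" ;;
        Unhealthy|ProbeWarning)
            echo "Check the probe configuration and the container logs" ;;
        *)
            echo "No hint in this sketch; see Table 1" ;;
    esac
}

# Usage, with placeholders for the namespace and pod name:
#   kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
#   first_check_for_reason FailedMount
```

The reason passed to the function is the value shown in the REASON column of the event list.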

