Updated on 2024-07-02 GMT+08:00

Diagnosis Items and Rectification Solutions



Diagnosis Item

Enabling Monitoring Center

Rectification Solution

Cluster resource planning

Whether HA is enabled for master nodes


If the cluster has only one master node and that node becomes faulty, the cluster is unavailable, which affects service reliability. Use an HA cluster to improve service resilience: if one master node becomes faulty, the cluster remains available.

Whether the CPU requests of pods in the cluster have exceeded 80% of the cluster CPU


A request is the minimum CPU or memory a workload needs. Plan required resources based on service requirements. For details, see Configuring Container Specifications.

Whether the memory requests of pods in the cluster have exceeded 80% of the cluster memory


Whether the cluster version has expired


After the cluster version has reached the end of service, new clusters cannot be created, and no technical support will be provided, including new feature updates, vulnerability or issue fixes, new patches, work order guidance, and online checks for the cluster version. Such clusters are not covered in the CCE SLA. Go to the Clusters page to upgrade the cluster version. For details, see Process and Method of Upgrading a Cluster.

Cluster O&M

Whether kube-prometheus-stack is normal


kube-prometheus-stack provides one-stop cluster monitoring. Go to the Add-ons page to install this add-on and check its status. For details, see Cloud Native Cluster Monitoring.

Whether log-agent is normal


log-agent collects and manages workload logs. Go to the Add-ons page to install this add-on and check the add-on status.

Whether npd is normal


node-problem-detector (npd) monitors the nodes. Go to the Add-ons page to install this add-on and check the add-on status. For details, see CCE Node Problem Detector.

Cluster configuration

Whether security groups are correctly configured


Invalid cluster security group configuration makes it impossible for the nodes to communicate with each other. Retain the default security group configuration.

Core add-ons


Diagnosis Item

Enabling Monitoring Center

Rectification Solution

coredns status

Whether coredns is normal


coredns is a mandatory add-on that provides domain name resolution for clusters. If this add-on is not installed or is abnormal, services in the cluster will be affected. Go to the Add-ons page to install this add-on or check the add-on status.

Whether the CPU usage of coredns has exceeded 80% in the last 24 hours


coredns provides domain name resolution for clusters. If the resource usage is too high, the add-on may be overloaded, which affects domain name resolution and increases latency. To prevent services from being affected, analyze the recent QPS of coredns. Go to Monitoring Center, click the Dashboard tab, and select the CoreDNS view to view the instance metrics. If the metric values reach the thresholds, adjust the specifications.

Whether the memory usage of coredns has exceeded 80% in the last 24 hours


Whether coredns failed to resolve domain names in the last 24 hours


If coredns failed to resolve domain names, services are affected.

Whether the P99 latency of coredns has exceeded 5s in the last 24 hours


If the latency increases, responses to DNS requests become slow.

everest status

Whether everest is normal


everest is a mandatory add-on that provides cloud storage services for clusters. If this add-on is not installed or is abnormal, the cluster storage capability is affected. Go to the Add-ons page to install this add-on or check the add-on status.

Whether the CPU usage of everest-controller has exceeded 80% in the last 24 hours


everest provides cloud storage services for clusters. If the resource usage is too high, the add-on may be overloaded, and cluster cloud storage is affected. To prevent cloud storage from being affected, analyze the recent load of everest-controller. Choose Monitoring Center > Workloads to view the instance metrics. If the metric values reach the thresholds, adjust the specifications. For details, see the everest parameters in "Installing the Add-on" of everest.

Whether the memory usage of everest-controller has exceeded 80% in the last 24 hours


kube-prometheus-stack status

Whether kube-prometheus-stack is normal


kube-prometheus-stack provides one-stop cluster monitoring. Go to the Add-ons page to install this add-on or check the add-on status.

Whether the CPU usage of the prometheus workload has exceeded 80% in the last 24 hours


kube-prometheus-stack provides cluster monitoring. If the resource usage is too high, kube-prometheus-stack may be overloaded, and cluster monitoring is affected. Choose Monitoring Center > Workloads to view the instance metrics. If the metric values reach the thresholds, adjust the specifications.


The PVC resource usage is checked when kube-prometheus-stack is deployed in server mode. In server mode, collected metrics data is stored in the cluster PV.

Whether the memory usage of the prometheus workload has exceeded 80% in the last 24 hours


Whether the PVC usage of prometheus-server has exceeded 80% when the prometheus workload is deployed in server mode


Whether OOM has occurred for the prometheus workload in the last 24 hours


kube-prometheus-stack provides cluster monitoring. OOM occurs when the memory usage of the add-on instance reaches the limit. As a result, metric reporting will be affected, and non-HA cluster monitoring will be unavailable. Adjust the specifications of the prometheus instance.

autoscaler status

Whether autoscaler is available when auto scaling is enabled for node pools


autoscaler provides auto scaling for clusters. If autoscaler is abnormal, auto scaling that has been enabled for a node pool becomes unavailable. Check the add-on status on the Add-ons page.


The autoscaler status is checked only when auto scaling is enabled for node pools.

log-agent status

Whether log-agent is normal


log-agent collects and manages workload logs. Go to the Add-ons page to install this add-on or check the add-on status.

Whether the default LTS log group and log streams are created


The default event log group and log streams are the basic units for event reporting in Monitoring Center. If the log group and log streams do not exist, event reporting is unavailable. For details about how to create a log group and log streams, see Collecting Container Logs Using Cloud Native Logging.



Diagnosis Item

Enabling Monitoring Center

Rectification Solution

Node status

Whether nodes are ready


If a node is not ready, services running on the node may be affected. Rectify the fault in a timely manner.

Whether nodes can be scheduled


If a node cannot be scheduled, its resources cannot be used by new pods. Go to the CCE node management page to check whether the node status is as expected.
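The two node status checks above can also be approximated from the command line. The following is a minimal sketch, assuming kubectl access to the cluster; not_plain_ready is a hypothetical helper name, not a CCE or kubectl command. It relies on the STATUS column of kubectl get nodes, which shows NotReady for unready nodes and appends SchedulingDisabled for cordoned nodes.

```shell
# Sketch: print nodes whose STATUS is anything other than plain "Ready".
# Reads the output of `kubectl get nodes --no-headers` on stdin
# (columns: NAME STATUS ROLES AGE VERSION).
# not_plain_ready is a hypothetical helper name.
not_plain_ready() {
  awk '$2 != "Ready" { print $1 ": " $2 }'
}

# Usage (requires kubectl access to the cluster):
#   kubectl get nodes --no-headers | not_plain_ready
```

An empty result suggests that every node is both ready and schedulable at the moment the command runs.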

Whether kubelet is normal


kubelet is a key component of the nodes. If kubelet is abnormal, the nodes may be abnormal and the pod status is inconsistent with that on the API server. Run the journalctl -l -u kubelet command on each node to view the kubelet log and locate the cause.
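When the kubelet log is long, it may help to narrow it down first. The sketch below is one possible approach, assuming the node uses systemd-journald; recent_kubelet_errors is a hypothetical helper name, and the error|fail pattern is only a rough heuristic that can miss or over-match lines.

```shell
# Sketch: keep only log lines that mention an error or failure.
# Reads log text on stdin; the pattern is a heuristic, not an official filter.
# recent_kubelet_errors is a hypothetical helper name.
recent_kubelet_errors() {
  grep -iE 'error|fail' || true   # return success even when nothing matches
}

# Usage on a node (limits the log to the last hour):
#   journalctl -u kubelet --since "1 hour ago" --no-pager | recent_kubelet_errors
```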

Node configuration

Whether the memory requests of pods on a node have exceeded 80% of the node memory


The CPU and memory requested by the pods on a node determine whether new applications can be scheduled to the node. If a new application requests more than the resources still available on the node, it will not be scheduled to the node. A failed check means that the resource requests on the node have exceeded the threshold. Plan required resources for your applications based on the results.
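The request ratio of a single node can be read from the "Allocated resources" section of kubectl describe node. The following is a rough sketch under that assumption; requests_over_threshold is a hypothetical helper name, and the parsing depends on the text layout of kubectl describe, which may differ between kubectl versions.

```shell
# Sketch: report cpu/memory request percentages above a threshold (default 80).
# Reads the output of `kubectl describe node <NodeName>` on stdin and expects
# lines such as "  cpu   3400m (85%)   4 (100%)" under "Allocated resources:".
# requests_over_threshold is a hypothetical helper name.
requests_over_threshold() {
  awk -v limit="${1:-80}" '
    /^Allocated resources:/ { in_section = 1; next }
    /^Events:/              { in_section = 0 }
    in_section && ($1 == "cpu" || $1 == "memory") {
      pct = $3                   # third field looks like "(85%)"
      gsub(/[()%]/, "", pct)     # strip parentheses and percent sign
      if (pct + 0 > limit + 0) print $1 " requests at " pct "%"
    }
  '
}

# Usage (requires kubectl access to the cluster):
#   kubectl describe node <NodeName> | requests_over_threshold 80
```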

Whether the CPU requests of pods on a node have exceeded 80% of the node CPU


Resource requests and limits of nodes

Whether the CPU usage of a node has exceeded 80% in the last 24 hours


If the node CPU usage is too high, the workloads running on the node will be affected. Go to Monitoring Center to view the node CPU usage. Then plan required node resources or expand the node capacity.

Whether the memory usage of a node has exceeded 80% in the last 24 hours


If the memory usage of a node is too high, there are OOM risks, affecting service availability on the node. Go to Monitoring Center to view the node memory usage. Then plan required node resources or expand the node capacity.
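For a quick look outside Monitoring Center, current node usage can be sampled with kubectl top nodes. This is a sketch only: it assumes a metrics API provider is available in the cluster, busy_nodes is a hypothetical helper name, and the command shows usage at the time it runs rather than over the last 24 hours.

```shell
# Sketch: print nodes whose current CPU% or MEMORY% exceeds a threshold (default 80).
# Reads the output of `kubectl top nodes` on stdin
# (columns: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%).
# busy_nodes is a hypothetical helper name.
busy_nodes() {
  awk -v limit="${1:-80}" '
    NR == 1 && $1 == "NAME" { next }   # skip the header row if present
    {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)
      if (cpu + 0 > limit + 0 || mem + 0 > limit + 0)
        print $1 " cpu=" cpu "% mem=" mem "%"
    }
  '
}

# Usage (requires kubectl access and a metrics API provider):
#   kubectl top nodes | busy_nodes 80
```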

Whether the disk usage of a node has exceeded 80%


If the node disk usage is too high, the pods will be affected. Expand capacity in a timely manner. Run the following commands to view disk details:

  • lsblk: information about all available block devices
  • df -h: available disk space of each mounted disk
  • fdisk -l: all partitions
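To apply the 80% threshold to the df output directly, a small filter such as the following can be used. It is a minimal sketch: full_filesystems is a hypothetical helper name, it expects the POSIX layout produced by df -P, and it does not handle mount points that contain spaces.

```shell
# Sketch: print mount points whose usage exceeds a threshold (default 80).
# Reads the output of `df -P` on stdin
# (columns: Filesystem 1024-blocks Used Available Capacity Mounted-on).
# full_filesystems is a hypothetical helper name.
full_filesystems() {
  awk -v limit="${1:-80}" '
    NR > 1 {                      # skip the header row
      pct = $5
      sub(/%/, "", pct)
      if (pct + 0 > limit + 0) print $6 " " pct "%"
    }
  '
}

# Usage on a node:
#   df -P | full_filesystems 80
```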

Whether the number of PIDs for a node exceeds the limit


The node is experiencing PID pressure and may become unstable. Stop unnecessary processes on the node or modify the PID limit in a timely manner. Run the following commands to view PID details:

  • sysctl kernel.pid_max: the maximum number of PIDs
  • ps -eLf | awk '{print $2}' | sort -rn | head -n 1: the current maximum PID
  • ps -elT | awk '{print $4}' | sort | uniq -c | sort -k1 -g | tail -5: the top five processes that occupy the most threads (SPIDs)
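The values from these commands can be combined into a single usage figure. The sketch below assumes that the number of running threads compared with kernel.pid_max is a reasonable proxy for PID pressure; pid_usage_percent is a hypothetical helper name, and the exact rule CCE uses for this check is not stated here.

```shell
# Sketch: integer percentage of the PID space currently in use.
# $1: number of threads (tasks) in use, $2: value of kernel.pid_max.
# pid_usage_percent is a hypothetical helper name.
pid_usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

# Usage on a Linux node:
#   in_use=$(ps -eLf --no-headers | wc -l)   # one line per thread
#   max=$(sysctl -n kernel.pid_max)
#   pid_usage_percent "$in_use" "$max"
```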

Whether OOM has occurred on a node in the last 24 hours


If OOM occurs on a node, service functions on the node are affected. Go to Monitoring Center to view the node memory. Then plan required resources or expand the capacity.



Diagnosis Item

Enabling Monitoring Center

Rectification Solution

Pod status

Whether pods are normal


If a pod fails to function normally, the performance of its workload may deteriorate. If no replicas are available, the workload may be inaccessible. Run the following commands to view pod details:

  • kubectl get pod <PodName> -n <Namespace> -o yaml: pod configuration
  • kubectl describe pod <PodName> -n <Namespace>: pod events
  • kubectl logs <PodName> -n <Namespace> -c <ContainerName>: container logs
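Before inspecting individual pods, it can be useful to list the ones that are not in a healthy phase. The following is a sketch, assuming kubectl access; abnormal_pods is a hypothetical helper name. It only looks at the STATUS column, so a pod that is Running but not ready (for example READY 0/1) is not reported.

```shell
# Sketch: print pods whose STATUS is neither Running nor Completed.
# Reads the output of `kubectl get pods -A --no-headers` on stdin
# (columns: NAMESPACE NAME READY STATUS RESTARTS AGE).
# abnormal_pods is a hypothetical helper name.
abnormal_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Usage (requires kubectl access to the cluster):
#   kubectl get pods -A --no-headers | abnormal_pods
```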

Pod workload

Whether OOM has occurred on a pod in the last 24 hours


If OOM occurs on a pod, service functions of the pod are affected. Go to Monitoring Center to view the pod memory and adjust the workload specifications.
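Pods whose containers were last terminated by the OOM killer can also be listed with kubectl. This is a sketch under stated assumptions: oom_killed_pods is a hypothetical helper name, and the lastState field records only the most recent termination of each container, so it does not correspond exactly to the 24-hour window of this check.

```shell
# Sketch: print pods with a container whose last termination reason is OOMKilled.
# Reads three whitespace-separated columns on stdin: namespace, pod name,
# and the comma-separated last termination reasons of the containers.
# oom_killed_pods is a hypothetical helper name.
oom_killed_pods() {
  awk '$3 ~ /OOMKilled/ { print $1 "/" $2 }'
}

# Usage (requires kubectl access to the cluster):
#   kubectl get pods -A --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,LAST:.status.containerStatuses[*].lastState.terminated.reason' \
#     | oom_killed_pods
```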

Whether the CPU usage of a pod has exceeded 80% in the last 24 hours


If the resource usage is too high, the pod may be overloaded. This increases the latency and slows down service responses. Choose Monitoring Center > Pods to view the instance metrics. If the metric values reach the thresholds, adjust the container specifications.

Whether the memory usage of a pod has exceeded 80% in the last 24 hours


Pod configuration

Whether requests are configured for containers in a pod


If requests are not configured, the scheduler cannot account for the resources a pod needs, and pods may be scheduled to nodes whose resources cannot meet requirements. Requests that are set too high will also reduce the resource utilization of nodes.
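A rough way to find pods without requests is to print the request fields with kubectl custom columns and filter for empty values. The sketch below assumes that kubectl prints <none> for a missing field; pods_without_requests is a hypothetical helper name. For pods with several containers, a pod where only some containers lack requests may not be reported, so treat the result as a first pass.

```shell
# Sketch: print pods that show no CPU or memory request.
# Reads four whitespace-separated columns on stdin: namespace, pod name,
# CPU requests, memory requests ("<none>" when no value is set).
# pods_without_requests is a hypothetical helper name.
pods_without_requests() {
  awk '$3 == "<none>" || $4 == "<none>" { print $1 "/" $2 }'
}

# Usage (requires kubectl access to the cluster):
#   kubectl get pods -A --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,CPU:.spec.containers[*].resources.requests.cpu,MEM:.spec.containers[*].resources.requests.memory' \
#     | pods_without_requests
```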

Pod probe configuration

Whether liveness probes are configured for containers in a pod


If no liveness probes are configured, application exceptions in a pod cannot be detected, and the pod cannot be restarted in a timely manner, which will affect the QoS. Configure liveness probes for the pod so that abnormal applications are detected and restarted in a timely manner.

Whether readiness probes are configured for containers in a pod


If no readiness probes are configured, requests are still sent to the pod even if it becomes abnormal, which will affect the QoS. Configure readiness probes for the pod so that requests are not routed to it while its applications are abnormal.
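As an illustration of the probe and request settings discussed above, the following sketch prints a small Deployment manifest that defines requests, a liveness probe, and a readiness probe. Everything in it is an assumption made for the example: the helper name print_probe_example, the workload name probe-demo, the nginx:alpine image, and the probe paths, ports, and timings. Adjust them to the actual application.

```shell
# Sketch: print an example Deployment with requests and both probe types.
# All names and values below are illustrative assumptions, not CCE defaults.
print_probe_example() {
  cat <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: probe-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: probe-demo
  template:
    metadata:
      labels:
        app: probe-demo
    spec:
      containers:
      - name: web
        image: nginx:alpine
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
        livenessProbe:        # container is restarted if this check keeps failing
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:       # pod receives no Service traffic while this check fails
          httpGet:
            path: /
            port: 80
          periodSeconds: 5
EOF
}

# Usage (validates the manifest locally without creating anything):
#   print_probe_example | kubectl apply --dry-run=client -f -
```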

External dependencies


Diagnosis Item

Enabling Monitoring Center

Rectification Solution

Resource quotas of a node

Whether 90% or more of the EVS disk quota has been used


Sufficient resource quotas are required for node creation in a cluster. If there are insufficient resource quotas, choose Resources > My Quotas and contact customer service to apply for account quotas.

Whether 90% or more of the ECS quota has been used
