Updated on 2024-07-02 GMT+08:00

Diagnosis Items and Rectification Solutions

Clusters

Scenario

Diagnosis Item

Enabling Monitoring Center

Rectification Solution

Cluster resource planning

Whether HA is enabled for master nodes

Yes

If the cluster has only one master node and that node becomes faulty, the entire cluster becomes unavailable, which affects service reliability. Use an HA cluster to improve service resilience: an HA cluster remains available when a master node is faulty.

Whether the CPU requests of pods in the cluster have exceeded 80% of the cluster CPU

Yes

A request is the minimum CPU or memory a workload needs. Plan required resources based on service requirements. For details, see Configuring Container Specifications.

Whether the memory requests of pods in the cluster have exceeded 80% of the cluster memory

Yes
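The 80% check on requests can be reproduced by hand. The following is a minimal sketch, not the logic Monitoring Center actually runs: the function names request_percent and exceeds_threshold are hypothetical, and the input values are assumed to be totals you have already collected (for example, from the "Allocated resources" section printed by kubectl describe node).

```shell
# request_percent REQUESTED ALLOCATABLE
# Prints the integer percentage of ALLOCATABLE that REQUESTED represents.
# Both values must use the same unit (for example, millicores or MiB).
# Hypothetical helper for illustration only.
request_percent() {
  local requested=$1 allocatable=$2
  if [ "$allocatable" -le 0 ]; then
    echo "allocatable must be positive" >&2
    return 1
  fi
  echo $(( requested * 100 / allocatable ))
}

# exceeds_threshold REQUESTED ALLOCATABLE [THRESHOLD]
# Succeeds (exit status 0) when the request percentage is above THRESHOLD.
# THRESHOLD defaults to 80, matching the diagnosis item.
exceeds_threshold() {
  local pct
  pct=$(request_percent "$1" "$2") || return 2
  [ "$pct" -gt "${3:-80}" ]
}

# Example (values are made up): 6800m requested out of 8000m allocatable.
# request_percent 6800 8000
# exceeds_threshold 6800 8000 && echo "CPU requests exceed 80%"
```

Integer division rounds down, so a value such as 80.9% is reported as 80 and does not trip the check. Use a lower threshold if you want earlier warning.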

Whether the cluster version has expired

No

After a cluster version reaches the end of service, clusters of that version can no longer be created, and no technical support is provided for it. This includes new feature updates, vulnerability or issue fixes, new patches, work order guidance, and online checks. Clusters running such a version are not covered by the CCE SLA. Go to the Clusters page to upgrade the cluster version. For details, see Process and Method of Upgrading a Cluster.

Cluster O&M

Whether kube-prometheus-stack is normal

No

kube-prometheus-stack provides one-stop cluster monitoring. Go to the Add-ons page to install this add-on and check its status. For details, see Cloud Native Cluster Monitoring.

Whether log-agent is normal

No

log-agent collects and manages workload logs. Go to the Add-ons page to install this add-on and check the add-on status.

Whether npd is normal

No

node-problem-detector (npd) monitors the nodes. Go to the Add-ons page to install this add-on and check the add-on status. For details, see CCE Node Problem Detector.

Cluster configuration

Whether security groups are correctly configured

No

Invalid cluster security group configuration makes it impossible for the nodes to communicate with each other. Retain the default security group configuration.

Core add-ons

Scenario

Diagnosis Item

Enabling Monitoring Center

Rectification Solution

coredns status

Whether coredns is normal

No

coredns is a mandatory add-on that provides domain name resolution for clusters. If this add-on is not installed or is abnormal, services in the cluster will be affected. Go to the Add-ons page to install this add-on or check the add-on status.

Whether the CPU usage of coredns has exceeded 80% in the last 24 hours

Yes

coredns provides domain name resolution for clusters. If its resource usage is too high, the add-on may be overloaded, which affects domain name resolution and increases latency. To prevent services from being affected, analyze the recent QPS of coredns: go to Monitoring Center, click the Dashboard tab, and select the CoreDNS view to view the instance metrics. If the metric values reach the thresholds, adjust the add-on specifications.

Whether the memory usage of coredns has exceeded 80% in the last 24 hours

Yes

Whether coredns failed to resolve domain names in the last 24 hours

Yes

If coredns failed to resolve domain names, services are affected.

Whether the P99 latency of coredns has exceeded 5s in the last 24 hours

Yes

If the latency increases, responses to DNS requests become slow.

everest status

Whether everest is normal

No

everest is a mandatory add-on that provides cloud storage services for clusters. If this add-on is not installed or is abnormal, the cluster storage capability is affected. Go to the Add-ons page to install this add-on or check the add-on status.

Whether the CPU usage of everest-controller has exceeded 80% in the last 24 hours

Yes

everest provides cloud storage services for clusters. If the resource usage is too high, the add-on may be overloaded, and cluster cloud storage is affected. To prevent cloud storage from being affected, analyze the recent load of everest-controller. Choose Monitoring Center > Workloads to view the instance metrics. If the metric values reach the thresholds, adjust the specifications. For details, see the everest parameters in "Installing the Add-on" of everest.

Whether the memory usage of everest-controller has exceeded 80% in the last 24 hours

Yes

kube-prometheus-stack status

Whether kube-prometheus-stack is normal

No

kube-prometheus-stack provides one-stop cluster monitoring. Go to the Add-ons page to install this add-on or check the add-on status.

Whether the CPU usage of the prometheus workload has exceeded 80% in the last 24 hours

Yes

kube-prometheus-stack provides cluster monitoring. If the resource usage is too high, kube-prometheus-stack may be overloaded, and cluster monitoring is affected. Choose Monitoring Center > Workloads to view the instance metrics. If the metric values reach the thresholds, adjust the specifications.

NOTE:

The PVC resource usage is checked when kube-prometheus-stack is deployed in server mode. In server mode, collected metrics data is stored in the cluster PV.

Whether the memory usage of the prometheus workload has exceeded 80% in the last 24 hours

Yes

Whether the PVC usage of prometheus-server has exceeded 80% when the prometheus workload is deployed in server mode

Yes

Whether OOM has occurred for the prometheus workload in the last 24 hours

No

kube-prometheus-stack provides cluster monitoring. OOM occurs when the memory usage of the add-on instance reaches the limit. As a result, metric reporting will be affected, and non-HA cluster monitoring will be unavailable. Adjust the specifications of the prometheus instance.

autoscaler status

Whether autoscaler is available when auto scaling is enabled for node pools

No

autoscaler provides auto scaling for clusters. If autoscaler is abnormal, auto scaling that has been enabled for a node pool becomes unavailable. Check the add-on status on the Add-ons page.

NOTE:

The autoscaler status is checked only when auto scaling is enabled for node pools.

log-agent status

Whether log-agent is normal

No

log-agent collects and manages workload logs. Go to the Add-ons page to install this add-on or check the add-on status.

Whether the default LTS log group and log streams are created

No

The default event log group and log streams are the basic units for event reporting in Monitoring Center. If there are no log group and log streams, event reporting is unavailable. For details about how to create a log group and log streams, see Collecting Container Logs Using Cloud Native Logging.

Nodes

Scenario

Diagnosis Item

Enabling Monitoring Center

Rectification Solution

Node status

Whether nodes are ready

Yes

If a node is not ready, services running on the node may be affected. Rectify the fault in a timely manner.

Whether nodes can be scheduled

Yes

If a node cannot be scheduled, node resources cannot be used. Go to the CCE node management page to check whether the node status meets the expectation.

Whether kubelet is normal

Yes

kubelet is a key component of the nodes. If kubelet is abnormal, the nodes may be abnormal and the pod status is inconsistent with that on the API server. Run the journalctl -l -u kubelet command on each node to view the kubelet log and locate the cause.
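When the kubelet log is long, it can help to narrow it to error lines first. The sketch below assumes that kubelet writes klog-style lines to the journal, where each message starts with a severity letter (I, W, E, or F) followed by the month and day. The function name kubelet_errors is hypothetical, and the pattern may need adjusting if your nodes use a different log format.

```shell
# kubelet_errors
# Reads journal output on stdin and keeps only kubelet lines whose klog
# severity is E (error) or F (fatal).
# Assumes lines look like:
#   <timestamp> <host> kubelet[<pid>]: E0702 10:00:01.000001 ...
# Hypothetical helper for illustration only.
kubelet_errors() {
  grep -E 'kubelet\[[0-9]+\]: [EF][0-9]{4} '
}

# Example: show kubelet errors from the last hour.
# journalctl -l -u kubelet --since "1 hour ago" --no-pager | kubelet_errors
```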

Node configuration

Whether the memory requests of pods on a node have exceeded 80% of the node memory

Yes

The total CPU and memory requests of the pods on a node determine whether new pods can be scheduled to that node. If the request of a new pod is higher than the resources still available on the node, the pod will not be scheduled to the node. This result indicates that the pod requests on the node have exceeded 80% of the node resources. Plan required resources for your applications based on the results.

Whether the CPU requests of pods on a node have exceeded 80% of the node CPU

Yes

Resource requests and limits of nodes

Whether the CPU usage of a node has exceeded 80% in the last 24 hours

Yes

If the node CPU usage is too high, the workloads running on the node will be affected. Go to Monitoring Center to view the node CPU usage. Then plan required node resources or expand the node capacity.

Whether the memory usage of a node has exceeded 80% in the last 24 hours

Yes

If the memory usage of a node is too high, there are OOM risks, affecting service availability on the node. Go to Monitoring Center to view the node memory usage. Then plan required node resources or expand the node capacity.

Whether the disk usage of a node has exceeded 80%

Yes

If the node disk usage is too high, the pods will be affected. Expand capacity in a timely manner. Run the following commands to view disk details:

  • lsblk: information about all available block devices
  • df -h: available disk space of each mounted disk
  • fdisk -l: all partitions
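To pick out only the file systems that are above the 80% threshold, the df output can be filtered. This is a minimal sketch: the function name over_threshold is hypothetical, it expects the POSIX output format produced by df -P, and it does not handle mount points that contain spaces.

```shell
# over_threshold [THRESHOLD]
# Reads `df -P` output on stdin and prints "<mount point> <use%>" for every
# file system whose usage is above THRESHOLD (default 80).
# Hypothetical helper for illustration only.
over_threshold() {
  awk -v limit="${1:-80}" 'NR > 1 {
    use = $5              # fifth column is the capacity, for example "85%"
    sub(/%/, "", use)     # strip the percent sign before comparing
    if (use + 0 > limit + 0) print $6, $5
  }'
}

# Example: list file systems that are more than 80% full.
# df -P | over_threshold 80
```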

Whether the number of PIDs for a node exceeds the limit

Yes

The node is experiencing PID pressure and may become unstable. Stop unnecessary processes on the node or raise the PID limit in a timely manner. Run the following commands to view PID details:

  • sysctl kernel.pid_max: the maximum number of PIDs
  • ps -eLf | awk '{print $2}' | sort -rn | head -n 1: the current maximum PID
  • ps -elT | awk '{print $4}' | sort | uniq -c | sort -k1 -g | tail -5: the five processes that have the most threads (SPIDs)
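The values from these commands can be combined into a single usage figure. The sketch below is an approximation under stated assumptions: it treats the number of threads reported by ps as the number of PIDs in use, which may differ from the value kubelet uses to detect PID pressure. The function name pid_usage is hypothetical.

```shell
# pid_usage THREADS PID_MAX
# Prints the thread count, the PID limit, and the integer usage percentage.
# Hypothetical helper for illustration only.
pid_usage() {
  local threads=$1 pid_max=$2
  [ "$pid_max" -gt 0 ] || { echo "pid_max must be positive" >&2; return 1; }
  echo "threads=$threads pid_max=$pid_max usage=$(( threads * 100 / pid_max ))%"
}

# Example: count one line per thread (skipping the header) and compare it
# with the kernel limit.
# pid_usage "$(ps -eLf | tail -n +2 | wc -l)" "$(sysctl -n kernel.pid_max)"
```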

Whether OOM has occurred on a node in the last 24 hours

Yes

If OOM occurs on a node, service functions on the node are affected. Go to Monitoring Center to view the node memory. Then plan required resources or expand the capacity.

Workloads

Scenario

Diagnosis Item

Enabling Monitoring Center

Rectification Solution

Pod status

Whether pods are normal

No

If a pod fails to function normally, the performance of the workload that the pod belongs to may deteriorate. If no replicas are available, the workload may be inaccessible. Run the following commands to view pod details:

  • kubectl get pod <PodName> -n <Namespace> -o yaml: pod configuration
  • kubectl describe pod <PodName> -n <Namespace>: pod events
  • kubectl logs <PodName> -n <Namespace> -c <ContainerName>: container logs
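Before inspecting individual pods, it can be useful to list the pods that are in neither the Running nor the Completed state. The following is a minimal sketch: the function name abnormal_pods is hypothetical, and it relies on the default column layout of kubectl get pods -A --no-headers (namespace, name, ready count, status). A pod that is Running but has containers that are not ready is not reported by this filter.

```shell
# abnormal_pods
# Reads `kubectl get pods -A --no-headers` output on stdin and prints
# "<namespace>/<pod name> <status>" for pods whose status is neither
# Running nor Completed.
# Hypothetical helper for illustration only.
abnormal_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Example: list abnormal pods in all namespaces.
# kubectl get pods -A --no-headers | abnormal_pods
```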

Pod workload

Whether OOM has occurred on a pod in the last 24 hours

No

If OOM occurs on a pod, service functions of the pod are affected. Go to Monitoring Center to view the pod memory and adjust the workload specifications.

Whether the CPU usage of a pod has exceeded 80% in the last 24 hours

Yes

If the resource usage is too high, the pod may be overloaded. This increases the latency and slows down service responses. Choose Monitoring Center > Pods to view the instance metrics. If the metric values reach the thresholds, adjust the container specifications.

Whether the memory usage of a pod has exceeded 80% in the last 24 hours

Yes

Pod configuration

Whether requests are configured for containers in a pod

No

If requests are not configured, scheduling is affected, and pods may be scheduled to nodes whose resources cannot meet their requirements. Excessively high requests, on the other hand, reduce the resource utilization of nodes.

Pod probe configuration

Whether liveness probes are configured for containers in a pod

No

If no liveness probes are configured, application exceptions in a pod cannot be detected, and the container cannot be restarted in a timely manner, which will affect the QoS. Configure liveness probes for the pod so that abnormal applications are detected and restarted in a timely manner.

Whether readiness probes are configured for containers in a pod

No

If no readiness probes are configured, requests are still sent to the pod even if it becomes abnormal, which will affect the QoS. Configure readiness probes for the pod so that requests are sent only to pods that are ready to handle them.
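The pod configuration and probe items above can be addressed in a single workload definition. The following sketch writes an example Deployment manifest that sets requests, a liveness probe, and a readiness probe. Everything specific in it is an assumption chosen for illustration: the function name write_example_manifest, the workload name probe-demo, the nginx:alpine image, the probe path and port, and the resource values. Replace them with values that match your application.

```shell
# write_example_manifest FILE
# Writes an example Deployment manifest to FILE.
# All names and values in the manifest are illustrative assumptions.
write_example_manifest() {
  cat > "$1" <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: probe-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: probe-demo
  template:
    metadata:
      labels:
        app: probe-demo
    spec:
      containers:
      - name: web
        image: nginx:alpine
        resources:
          requests:          # used by the scheduler to place the pod
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        livenessProbe:       # the container is restarted if this check fails
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:      # the pod receives traffic only while this passes
          httpGet:
            path: /
            port: 80
          periodSeconds: 5
EOF
}

# Example: write the manifest and validate it locally without creating it.
# write_example_manifest probe-demo.yaml
# kubectl apply --dry-run=client -f probe-demo.yaml
```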

External dependencies

Scenario

Diagnosis Item

Enabling Monitoring Center

Rectification Solution

Resource quotas of a node

Whether 90% or more of the EVS disk quota has been used

Yes

Sufficient resource quotas are required for node creation in a cluster. If the quotas are insufficient, choose Resources > My Quotas to check them, and contact customer service to apply for higher account quotas.

Whether 90% or more of the ECS quota has been used

Yes