Diagnosis Items and Rectification Solutions
Clusters
| Scenario | Diagnosis Item | Monitoring Center Required | Rectification Solution |
| --- | --- | --- | --- |
| Cluster resource planning | Whether HA is enabled for master nodes | Yes | The cluster has only one master node, or one of its master nodes is abnormal. If the remaining master node becomes faulty, the cluster is unavailable, affecting service reliability. To improve service resilience, use an HA cluster or rectify the node exception so that the cluster remains available even when a master node is faulty. |
| Cluster resource planning | Whether the CPU requests of pods in the cluster have exceeded 80% of the cluster CPU | Yes | A request is the minimum CPU or memory a workload needs. Plan required resources based on service requirements. For details, see Configuring Container Specifications. (A hedged example of checking and adjusting requests with kubectl follows this table.) |
| Cluster resource planning | Whether the memory requests of pods in the cluster have exceeded 80% of the cluster memory | Yes | Same as the CPU requests item above. |
| Cluster resource planning | Whether the cluster version has expired | No | After a cluster version reaches the end of service, CCE no longer supports creating new clusters of that version and no longer provides technical support for it, including new feature updates, vulnerability or issue fixes, new patches, work order guidance, and online checks. The CCE SLA does not apply to such clusters. Go to the Clusters page to upgrade the cluster version. For details, see Process and Method of Upgrading a Cluster. |
| Cluster O&M | Whether kube-prometheus-stack is normal | No | kube-prometheus-stack provides one-stop cluster monitoring. Go to the Add-ons page to install this add-on and check its status. For details, see Cloud Native Cluster Monitoring. |
| Cluster O&M | Whether log-agent is normal | No | log-agent collects and manages workload logs. Go to the Add-ons page to install this add-on and check its status. |
| Cluster O&M | Whether npd is normal | No | node-problem-detector (npd) monitors the nodes. Go to the Add-ons page to install this add-on and check its status. For details, see CCE Node Problem Detector. |
| Cluster configuration | Whether security groups are correctly configured | No | An invalid cluster security group configuration prevents the nodes from communicating with each other. Retain the default security group configuration. |
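The request-related items above can also be inspected and adjusted from the command line. The following is a minimal sketch using standard kubectl commands; the deployment name my-app and the request/limit values are placeholders, so replace them with values planned for your own services.

```bash
# Show how much CPU/memory is already requested on each node
# (see the "Allocated resources" section of the output).
kubectl describe nodes | grep -A 8 "Allocated resources"

# Adjust the requests and limits of an example deployment.
# The deployment name and the values below are placeholders.
kubectl set resources deployment my-app \
  --requests=cpu=250m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi
```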
Core add-ons
| Scenario | Diagnosis Item | Monitoring Center Required | Rectification Solution |
| --- | --- | --- | --- |
| coredns status | Whether coredns is normal | No | coredns is a mandatory add-on that provides domain name resolution for clusters. If this add-on is not installed or is abnormal, services in the cluster will be affected. Go to the Add-ons page to install this add-on or check its status. |
| coredns status | Whether the CPU usage of coredns has exceeded 80% in the last 24 hours | Yes | coredns provides domain name resolution for clusters. If its resource usage is too high, the add-on may be overloaded, domain name resolution will be affected, and latency increases. To prevent services from being affected, analyze the recent QPS of coredns. Go to Monitoring Center, click the Dashboard tab, and select the CoreDNS view to view the instance metrics. If the metric values reach the thresholds, adjust the add-on specifications. |
| coredns status | Whether the memory usage of coredns has exceeded 80% in the last 24 hours | Yes | Same as the CPU usage item above. |
| coredns status | Whether coredns failed to resolve domain names in the last 24 hours | Yes | If coredns fails to resolve domain names, services are affected. (An example DNS resolution check follows this table.) |
| coredns status | Whether the P99 latency of coredns has exceeded 5s in the last 24 hours | Yes | If the latency increases, responses to DNS requests become slow. |
| everest status | Whether everest is normal | No | everest is a mandatory add-on that provides cloud storage services for clusters. If this add-on is not installed or is abnormal, the cluster storage capability is affected. Go to the Add-ons page to install this add-on or check its status. |
| everest status | Whether the CPU usage of everest-controller has exceeded 80% in the last 24 hours | Yes | everest provides cloud storage services for clusters. If its resource usage is too high, the add-on may be overloaded and cluster cloud storage is affected. To prevent this, analyze the recent load of everest-controller. Choose Monitoring Center > Workloads to view the instance metrics. If the metric values reach the thresholds, adjust the add-on specifications. For details, see CCE Container Storage (Everest). |
| everest status | Whether the memory usage of everest-controller has exceeded 80% in the last 24 hours | Yes | Same as the CPU usage item above. |
| kube-prometheus-stack status | Whether kube-prometheus-stack is normal | No | kube-prometheus-stack provides one-stop cluster monitoring. Go to the Add-ons page to install this add-on or check its status. |
| kube-prometheus-stack status | Whether the CPU usage of the prometheus workload has exceeded 80% in the last 24 hours | Yes | kube-prometheus-stack provides cluster monitoring. If its resource usage is too high, the add-on may be overloaded and cluster monitoring is affected. Choose Monitoring Center > Workloads to view the instance metrics. If the metric values reach the thresholds, adjust the add-on specifications. NOTE: The PVC usage is checked only when the kube-prometheus-stack add-on is deployed in server mode. In this mode, the collected metrics are stored in the cluster PV. |
| kube-prometheus-stack status | Whether the memory usage of the prometheus workload has exceeded 80% in the last 24 hours | Yes | Same as the CPU usage item above. |
| kube-prometheus-stack status | Whether the PVC usage of prometheus-server exceeded 80% when the prometheus workload is deployed in server mode | Yes | Same as the CPU usage item above. |
| kube-prometheus-stack status | Whether OOM has occurred for the prometheus workload in the last 24 hours | No | kube-prometheus-stack provides cluster monitoring. OOM occurs when the memory usage of an add-on instance reaches its limit. As a result, metric reporting is affected, and non-HA cluster monitoring will be unavailable. Adjust the specifications of the prometheus instance. |
| autoscaler status | Whether autoscaler is available when auto scaling is enabled for node pools | No | autoscaler provides auto scaling for clusters. If autoscaler is abnormal, auto scaling that has been enabled for a node pool becomes unavailable. Check the add-on status on the Add-ons page. NOTE: The autoscaler status is checked only when auto scaling is enabled for node pools. |
| log-agent status | Whether log-agent is normal | No | log-agent collects and manages workload logs. Go to the Add-ons page to install this add-on or check its status. |
| log-agent status | Whether default LTS log group and log streams are created | No | The default event log group and log streams are the basic units for event reporting in Monitoring Center. If there is no log group or log streams, event reporting is unavailable. For details about how to create a log group and log streams, see Collecting Container Logs Using Cloud Native Logging. |
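The add-on status items above can also be checked with kubectl. The commands below are a minimal sketch rather than CCE-specific tooling: system add-ons typically run in the kube-system namespace, the coredns label selector is an assumption that may differ between versions, and the test pod name and image are placeholders.

```bash
# Check that add-on pods (coredns, everest, prometheus, log-agent, npd) are running and ready.
kubectl get pods -n kube-system -o wide

# View recent CPU/memory usage of the coredns pods
# (requires a metrics source; the label selector may differ in your cluster).
kubectl top pods -n kube-system -l k8s-app=kube-dns

# Verify in-cluster domain name resolution with a temporary test pod
# (pod name and image are placeholders).
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local
```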
Nodes
| Scenario | Diagnosis Item | Monitoring Center Required | Rectification Solution |
| --- | --- | --- | --- |
| Node status | Whether nodes are ready | Yes | If a node is not ready, services running on the node may be affected. Rectify the fault in a timely manner. |
| Node status | Whether nodes can be scheduled | Yes | If a node cannot be scheduled, its resources cannot be used. Go to the CCE node management page to check whether the node status meets the expectation. |
| Node status | Whether kubelet is normal | Yes | kubelet is a key component of a node. If kubelet is abnormal, the node may be abnormal and the pod status may be inconsistent with that on the API server. Run the journalctl -l -u kubelet command on the node to view the kubelet logs and locate the cause. |
| Node configuration | Whether the memory requests of pods on a node have exceeded 80% of the node memory | Yes | The CPU and memory requested on a node determine whether new applications can be scheduled to it. If the requests exceed the allocatable resources, no new applications can be scheduled to the node. The diagnosis result shows that the resource requests on the node have exceeded the threshold. Plan required resources for your applications based on the result. |
| Node configuration | Whether the CPU requests of pods on a node have exceeded 80% of the node CPU | Yes | Same as the memory requests item above. |
| Node resource usage | Whether the CPU usage of a node has exceeded 80% in the last 24 hours | Yes | If the node CPU usage is too high, the workloads running on the node will be affected. Go to Monitoring Center to view the node CPU usage. Then plan required node resources or expand the node capacity. |
| Node resource usage | Whether the memory usage of a node has exceeded 80% in the last 24 hours | Yes | If the memory usage of a node is too high, there are OOM risks, affecting service availability on the node. Go to Monitoring Center to view the node memory usage. Then plan required node resources or expand the node capacity. |
| Node resource usage | Whether the disk usage of a node has exceeded 80% | Yes | If the node disk usage is too high, the pods on the node will be affected. Expand the capacity in a timely manner. Run commands on the node to view disk details (see the example commands after this table). |
| Node resource usage | Whether the number of PIDs on a node exceeds the limit | Yes | The node is experiencing PID pressure, making it unstable. Release unnecessary processes on the node or increase the PID limit. Run commands on the node to view PID details (see the example commands after this table). |
| Node resource usage | Whether OOM has occurred on a node in the last 24 hours | Yes | If OOM occurs on a node, service functions on the node are affected. Go to Monitoring Center to view the node memory usage. Then plan required resources or expand the capacity. |
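Several items in this table refer to checking details on the node itself. The commands below are a minimal sketch of generic Linux and kubelet checks, run on the affected node (for example over SSH); they are not CCE-specific tools, and the one-hour window in the journalctl example is arbitrary.

```bash
# Disk usage per file system; look for partitions above 80%.
df -h

# Inode usage, which can exhaust a disk even when free space remains.
df -i

# Current number of processes/threads versus the kernel PID limit.
ps -eLf | wc -l
cat /proc/sys/kernel/pid_max

# Recent kubelet logs, useful when kubelet is reported as abnormal.
journalctl -l -u kubelet --since "1 hour ago" | tail -n 100
```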
Workloads
| Scenario | Diagnosis Item | Monitoring Center Required | Rectification Solution |
| --- | --- | --- | --- |
| Pod status | Whether pods are normal | No | If a pod fails to function normally, the performance of the workload it belongs to may deteriorate. If no replicas are available, the workload may be inaccessible. Check the pod details (see the example commands after this table). |
| Pod workload | Whether OOM has occurred on a pod in the last 24 hours | No | If OOM occurs on a pod, service functions of the pod are affected. Go to Monitoring Center to view the pod memory usage and adjust the workload specifications. |
| Pod workload | Whether the CPU usage of a pod has exceeded 80% in the last 24 hours | Yes | If the resource usage is too high, the pod may be overloaded, which increases latency and slows down service responses. Choose Monitoring Center > Pods to view the instance metrics. If the metric values reach the thresholds, adjust the container specifications. |
| Pod workload | Whether the memory usage of a pod has exceeded 80% in the last 24 hours | Yes | Same as the CPU usage item above. |
| Pod configuration | Whether requests are configured for containers in a pod | No | If requests are not configured, the scheduler cannot make informed placement decisions, and pods may be scheduled to nodes that cannot meet their resource requirements. Excessively high requests, on the other hand, reduce node resource utilization. |
| Pod probe configuration | Whether liveness probes are configured for containers in a pod | No | If no liveness probes are configured, application exceptions in a pod cannot be detected and the pod cannot be restarted in a timely manner, which affects service quality. Configure liveness probes so that the pod is restarted promptly when its applications fail (a hedged kubectl example follows this table). |
| Pod probe configuration | Whether readiness probes are configured for containers in a pod | No | If no readiness probes are configured, requests are still sent to a pod even when it becomes abnormal, which affects service quality. Configure readiness probes so that traffic is forwarded to the pod only when its applications are ready to handle requests. |
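For the pod status and probe items above, the following kubectl commands are a minimal sketch. The pod name my-pod, deployment name my-app, namespace default, port 8080, and path /healthz are all placeholders; adjust the probe parameters to your application rather than treating them as recommended values.

```bash
# List pods in all namespaces and spot any that are not Running or Ready.
kubectl get pods -A -o wide

# Inspect an abnormal pod and, if it restarted, its previous container logs
# (pod name and namespace are placeholders).
kubectl describe pod my-pod -n default
kubectl logs my-pod -n default --previous

# Hedged example: add an HTTP liveness probe to the first container of a deployment.
kubectl patch deployment my-app --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/livenessProbe",
   "value": {"httpGet": {"path": "/healthz", "port": 8080},
             "initialDelaySeconds": 10, "periodSeconds": 10}}
]'
```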
External dependencies
| Scenario | Diagnosis Item | Monitoring Center Required | Rectification Solution |
| --- | --- | --- | --- |
| Resource quotas of a node | Whether 90% or more of the EVS disk quota has been used | Yes | Sufficient resource quotas are required for creating nodes in a cluster. If the quotas are insufficient, choose Resources > My Quotas and contact customer service to apply for higher account quotas. |
| Resource quotas of a node | Whether 90% or more of the ECS quota has been used | Yes | Same as the EVS disk quota item above. |