Health Diagnosis
Overview
An important function of CIA is to diagnose the health of clusters. CIA automatically checks whether clusters, nodes, workloads, core add-ons, and external dependencies are healthy based on cluster configurations and metrics reported by the kube-prometheus-stack add-on to AOM. CIA also provides diagnosis results and rectification suggestions for abnormal items based on best O&M practices of Kubernetes clusters.
Constraints
- The cluster version is later than v1.17.
- The clusters are in the Running state.
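The two constraints above can be expressed as a simple pre-check. The following is a minimal sketch only: the function names and the version and status strings are hypothetical, not a CIA API, and it assumes "later than v1.17" means a minor version of 1.18 or higher.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as 'v1.19.10'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized cluster version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_diagnosable(version: str, status: str) -> bool:
    """Return True if the cluster meets both constraints.

    Assumption: 'later than v1.17' is read as (major, minor) > (1, 17).
    """
    return parse_version(version) > (1, 17) and status == "Running"
```

For example, under these assumptions `is_diagnosable("v1.19.10", "Running")` holds, while a v1.17 cluster or a cluster that is not in the Running state is rejected.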
Viewing Health Diagnosis Results
- Select a fleet or a cluster that is not added to the fleet.
Figure 1 Selecting a fleet or a cluster not in the fleet
- Click the Health Diagnosis tab to view the numbers of normal clusters and risky clusters.
Figure 2 Health diagnosis
- In Diagnosis Result, view the diagnosis results of the current cluster.
Click View Diagnosis Details to access the diagnosis details page and view the diagnosis items and results.
Figure 3 Diagnosis results
Configuring a Scheduled Inspection
- Select a fleet or a cluster that is not added to the fleet.
Figure 4 Selecting a fleet or a cluster not in the fleet
- Choose Container insight > Clusters to view the clusters for which monitoring has been enabled.
- Click Health Diagnosis, enable Scheduled Inspection in the upper right corner, and configure the start time of the inspection.
The inspection starts automatically at the specified time. A scheduled inspection runs at most once a day for each cluster.
Figure 5 Scheduled inspection configuration
You can also go to the inspection details page of a cluster as instructed in Viewing Health Diagnosis Results.
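To illustrate the once-a-day schedule described in this section, the sketch below computes when the next inspection would start for a configured start time. It is an illustration only: `next_inspection` is a hypothetical helper, not part of CIA, and it assumes naive local datetimes.

```python
from datetime import datetime, time, timedelta


def next_inspection(now: datetime, start: time) -> datetime:
    """Return the next daily run at `start` that is strictly after `now`."""
    candidate = datetime.combine(now.date(), start)
    if candidate <= now:
        # Today's slot has already passed, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate
```

With a start time of 02:30, a call made at 10:00 returns 02:30 on the following day, and a call made at 01:00 returns 02:30 on the same day.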
Health Diagnosis
- Go to the inspection details page of a cluster as instructed in Viewing Health Diagnosis Results.
- In the Cluster Inspection area, select a cluster that has not been inspected and click Diagnose Now.
After the diagnosis is complete, the page will be automatically refreshed to display the diagnosis results. Normal items are hidden by default.
Kubernetes problems will be summarized from the abnormal items. Troubleshooting suggestions will also be provided. You can click View Diagnosis Details to view the diagnosis details and rectification suggestions of a specific diagnosis item.
Figure 6 Diagnosis details
Inspection Items
| Dimension | Scenario | Inspection Item |
| --- | --- | --- |
| Cluster | Cluster resource planning | Whether HA is enabled for master nodes |
| | | Whether the CPU requests of pods in the cluster have exceeded 80% of the cluster CPU |
| | | Whether the CPU limits of pods in the cluster have exceeded 150% of the cluster CPU |
| | | Whether the memory requests of pods in the cluster have exceeded 80% of the cluster memory |
| | | Whether the memory limits of pods in the cluster have exceeded 150% of the cluster memory |
| | | Whether the cluster version has expired |
| | Cluster O&M | Whether kube-prometheus-stack is normal |
| | | Whether log-agent is normal |
| | | Whether npd is normal |
| | Cluster configuration | Whether security groups are correctly configured |
| Core add-ons | Whether coredns status is normal | Whether the CPU usage of coredns has exceeded 80% in the last 24 hours |
| | | Whether the memory usage of coredns has exceeded 80% in the last 24 hours |
| | | Whether coredns failed to resolve domain names for more than XX times in the last 24 hours |
| | | Whether the P99 latency of coredns has exceeded 5s in the last 24 hours |
| | | Whether coredns is normal |
| | Whether everest status is normal | Whether everest is normal |
| | | Whether the CPU usage of everest has exceeded 80% in the last 24 hours |
| | | Whether the memory usage of everest has exceeded 80% in the last 24 hours |
| | Whether kube-prometheus-stack status is normal | Whether the CPU usage of kube-prometheus-stack has exceeded 80% in the last 24 hours |
| | | Whether the memory usage of kube-prometheus-stack has exceeded 80% in the last 24 hours |
| | | Whether kube-prometheus-stack is normal |
| | | Whether OOM occurred on kube-prometheus-stack in the last 24 hours |
| | | Whether the PVC usage of prometheus-server has exceeded 80% when kube-prometheus-stack is deployed in server mode |
| | Whether log-agent status is normal | Whether log-agent is normal |
| | | Whether LTS log groups and log streams are created successfully |
| | | Whether log structuring is enabled for LTS log groups |
| | autoscaler status | Whether autoscaler is available when auto scaling is enabled for node pools |
| Node | Node status | Whether nodes are ready |
| | | Whether nodes can be scheduled |
| | | Whether kubelet is normal |
| | Node configuration | Whether the memory requests of pods on a node have exceeded 80% of the node memory |
| | | Whether the CPU requests of pods on a node have exceeded 80% of the node CPU |
| | | Whether the memory limits of pods on a node have exceeded 150% of the node memory |
| | | Whether the CPU limits of pods on a node have exceeded 150% of the node CPU |
| | Resource requests and limits of nodes | Whether the CPU usage of a node has exceeded 80% in the last 24 hours |
| | | Whether the memory usage of a node has exceeded 80% in the last 24 hours |
| | | Whether the disk usage of a node has exceeded 80% |
| | | Whether the number of PIDs for a node exceeds the limit |
| | | Whether OOM has occurred on a node in the last 24 hours |
| Workload | Pod status | Whether pods are normal |
| | Pod workload | Whether OOM has occurred on a pod in the last 24 hours |
| | | Whether the CPU usage of a pod has exceeded 80% in the last 24 hours |
| | | Whether the memory usage of a pod has exceeded 80% in the last 24 hours |
| | Pod configuration | Whether requests are configured for containers in a pod |
| | | Whether limits are configured for containers in a pod |
| | Pod probe configuration | Whether liveness probes are configured for containers in a pod |
| | | Whether readiness probes are configured for containers in a pod |
| External dependency | Resource quotas of a node | Whether 90% or more of the EVS disk quota has been used |
| | | Whether 90% or more of the ECS quota has been used |

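
The resource planning items above compare the summed pod requests and limits with the total capacity, using 80% for requests and 150% for limits. The sketch below shows one way such a comparison could be written. The `ResourceTotals` type and `planning_findings` function are hypothetical, and how CIA actually aggregates these values is not described in this document.

```python
from dataclasses import dataclass
from typing import List

REQUEST_THRESHOLD = 0.80  # requests above 80% of capacity are flagged
LIMIT_THRESHOLD = 1.50    # limits above 150% of capacity are flagged


@dataclass
class ResourceTotals:
    capacity: float  # total CPU cores or memory bytes of the cluster or node
    requests: float  # sum of pod requests for the same resource
    limits: float    # sum of pod limits for the same resource


def planning_findings(resource: str, totals: ResourceTotals) -> List[str]:
    """Return the planning checks that the given totals fail."""
    findings = []
    if totals.requests > REQUEST_THRESHOLD * totals.capacity:
        findings.append(f"{resource} requests exceed 80% of capacity")
    if totals.limits > LIMIT_THRESHOLD * totals.capacity:
        findings.append(f"{resource} limits exceed 150% of capacity")
    return findings
```

The same function applies to cluster-level and node-level checks, because both compare totals against the capacity of the scope being inspected.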

| Dimension | Scenario | Inspection Item |
| --- | --- | --- |
| Cluster | Cluster resource planning | Whether HA is enabled for master nodes |
| | | Whether the CPU requests of pods in the cluster have exceeded 80% of the cluster CPU |
| | | Whether the CPU limits of pods in the cluster have exceeded 150% of the cluster CPU |
| | | Whether the memory requests of pods in the cluster have exceeded 80% of the cluster memory |
| | | Whether the memory limits of pods in the cluster have exceeded 150% of the cluster memory |
| | Cluster O&M | Whether kube-prometheus-stack is normal |
| | | Whether log-agent is normal |
| Core add-ons | Whether kube-prometheus-stack status is normal | Whether the CPU usage of kube-prometheus-stack has exceeded 80% in the last 24 hours |
| | | Whether the memory usage of kube-prometheus-stack has exceeded 80% in the last 24 hours |
| | | Whether kube-prometheus-stack is normal |
| | | Whether OOM occurred on kube-prometheus-stack in the last 24 hours |
| | Whether log-agent status is normal | Whether log-agent is normal |
| | | Whether LTS log groups and log streams are created successfully |
| | | Whether log structuring is enabled for LTS log groups |
| Node | Node status | Whether nodes are ready |
| | | Whether nodes can be scheduled |
| | | Whether kubelet is normal |
| | Node configuration | Whether the memory requests of pods on a node have exceeded 80% of the node memory |
| | | Whether the CPU requests of pods on a node have exceeded 80% of the node CPU |
| | | Whether the memory limits of pods on a node have exceeded 150% of the node memory |
| | | Whether the CPU limits of pods on a node have exceeded 150% of the node CPU |
| | Resource requests and limits of nodes | Whether the CPU usage of a node has exceeded 80% in the last 24 hours |
| | | Whether the memory usage of a node has exceeded 80% in the last 24 hours |
| | | Whether the disk usage of a node has exceeded 80% |
| | | Whether the number of PIDs for a node exceeds the limit |
| | | Whether OOM has occurred on a node in the last 24 hours |
| Workload | Pod status | Whether pods are normal |
| | Pod workload | Whether OOM has occurred on a pod in the last 24 hours |
| | | Whether the CPU usage of a pod has exceeded 80% in the last 24 hours |
| | | Whether the memory usage of a pod has exceeded 80% in the last 24 hours |
| | Pod configuration | Whether requests are configured for containers in a pod |
| | | Whether limits are configured for containers in a pod |
| | Pod probe configuration | Whether liveness probes are configured for containers in a pod |
| | | Whether readiness probes are configured for containers in a pod |
| External dependency | Resource quotas of a node | Whether 90% or more of the EVS disk quota has been used |
| | | Whether 90% or more of the ECS quota has been used |

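
Several items above ask whether a usage ratio has exceeded 80% in the last 24 hours. The sketch below shows one possible reading, in which a single sample above the threshold inside the window is enough to flag the item. This is an assumption: the document does not say whether CIA evaluates peak, average, or sustained usage, and `exceeded_in_window` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def exceeded_in_window(
    samples: Iterable[Tuple[datetime, float]],
    now: datetime,
    threshold: float = 0.80,
    window: timedelta = timedelta(hours=24),
) -> bool:
    """Return True if any sample inside the window is above the threshold.

    Each sample is (timestamp, usage ratio), where 0.5 means 50% usage.
    Samples older than `window` or newer than `now` are ignored.
    """
    start = now - window
    return any(start <= ts <= now and value > threshold for ts, value in samples)
```

In practice the samples would come from the metrics that kube-prometheus-stack reports; the data source is left out here to keep the sketch self-contained.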

| Dimension | Scenario | Inspection Item |
| --- | --- | --- |
| Cluster | Cluster resource planning | Whether HA is enabled for master nodes |
| | | Whether the CPU requests of pods in the cluster have exceeded 80% of the cluster CPU |
| | | Whether the CPU limits of pods in the cluster have exceeded 150% of the cluster CPU |
| | | Whether the memory requests of pods in the cluster have exceeded 80% of the cluster memory |
| | | Whether the memory limits of pods in the cluster have exceeded 150% of the cluster memory |
| | Cluster O&M | Whether kube-prometheus-stack is normal |
| Core add-ons | Whether kube-prometheus-stack status is normal | Whether the CPU usage of kube-prometheus-stack has exceeded 80% in the last 24 hours |
| | | Whether the memory usage of kube-prometheus-stack has exceeded 80% in the last 24 hours |
| | | Whether kube-prometheus-stack is normal |
| | | Whether OOM occurred on kube-prometheus-stack in the last 24 hours |
| Node | Node status | Whether nodes are ready |
| | | Whether nodes can be scheduled |
| | | Whether kubelet is normal |
| | Node configuration | Whether the memory requests of pods on a node have exceeded 80% of the node memory |
| | | Whether the CPU requests of pods on a node have exceeded 80% of the node CPU |
| | | Whether the memory limits of pods on a node have exceeded 150% of the node memory |
| | | Whether the CPU limits of pods on a node have exceeded 150% of the node CPU |
| | Resource requests and limits of nodes | Whether the CPU usage of a node has exceeded 80% in the last 24 hours |
| | | Whether the memory usage of a node has exceeded 80% in the last 24 hours |
| | | Whether the disk usage of a node has exceeded 80% |
| | | Whether the number of PIDs for a node exceeds the limit |
| | | Whether OOM has occurred on a node in the last 24 hours |
| Workload | Pod status | Whether pods are normal |
| | Pod workload | Whether OOM has occurred on a pod in the last 24 hours |
| | | Whether the CPU usage of a pod has exceeded 80% in the last 24 hours |
| | | Whether the memory usage of a pod has exceeded 80% in the last 24 hours |
| | Pod configuration | Whether requests are configured for containers in a pod |
| | | Whether limits are configured for containers in a pod |
| | Pod probe configuration | Whether liveness probes are configured for containers in a pod |
| | | Whether readiness probes are configured for containers in a pod |
| External dependency | Resource quotas of a node | Whether 90% or more of the EVS disk quota has been used |
| | | Whether 90% or more of the ECS quota has been used |
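
The pod configuration and pod probe configuration items check that every container declares requests, limits, a liveness probe, and a readiness probe. The sketch below runs the same four checks on a pod manifest that has already been loaded into a Python dictionary. It relies only on the standard Kubernetes pod fields (`spec.containers[].resources.requests`, `resources.limits`, `livenessProbe`, `readinessProbe`); the function name is hypothetical and this is not how CIA itself is implemented.

```python
from typing import Any, Dict, List, Tuple


def pod_config_findings(pod: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (container name, missing setting) pairs for a pod manifest."""
    findings = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources") or {}
        if not resources.get("requests"):
            findings.append((name, "requests"))
        if not resources.get("limits"):
            findings.append((name, "limits"))
        if not container.get("livenessProbe"):
            findings.append((name, "livenessProbe"))
        if not container.get("readinessProbe"):
            findings.append((name, "readinessProbe"))
    return findings
```

An empty result means every container in the pod would pass these four items under the assumptions above.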