Health Diagnosis

Overview

A key function of CIA is diagnosing cluster health. CIA automatically checks whether clusters, nodes, workloads, core add-ons, and external dependencies are healthy, based on cluster configurations and the metrics that the kube-prometheus-stack add-on reports to AOM. For abnormal items, CIA also provides diagnosis results and rectification suggestions based on O&M best practices for Kubernetes clusters.

Constraints

The cluster version must be later than v1.17.
The cluster must be in the Running state.
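
The two constraints above can be checked before a diagnosis is started. The following is a minimal sketch and not part of CIA: the function names `parse_version` and `meets_diagnosis_constraints` are hypothetical, the version and state are assumed to be available as plain strings, and "later than v1.17" is assumed to mean a higher major.minor version (so v1.17.x patch releases do not qualify).

```python
import re

def parse_version(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a string such as 'v1.19' or 'v1.21.3'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized cluster version: {version!r}")
    return int(match.group(1)), int(match.group(2))

def meets_diagnosis_constraints(version: str, state: str) -> bool:
    """Return True if the version is later than v1.17 and the state is Running."""
    # Assumption: only major.minor is compared; patch numbers are ignored.
    return parse_version(version) > (1, 17) and state == "Running"
```

For example, `meets_diagnosis_constraints("v1.19", "Running")` returns `True`, while a v1.17 cluster or a cluster in any other state returns `False`.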

Viewing Health Diagnosis Results

Select a fleet or a cluster that has not been added to a fleet.

Figure 1 Selecting a fleet or a cluster not in the fleet
Click the Health Diagnosis tab to view the numbers of normal clusters and risky clusters.

Figure 2 Health diagnosis
In Diagnosis Result, view the diagnosis results of the current cluster.

Click the icon, and then click View Diagnosis Details to open the diagnosis details page and view the diagnosis items and results.
Figure 3 Diagnosis results

Configuring a Scheduled Inspection

Select a fleet or a cluster that has not been added to a fleet.

Figure 4 Selecting a fleet or a cluster not in the fleet
Choose Container insight > Clusters to view the clusters for which monitoring has been enabled.
Click Health Diagnosis, enable Scheduled Inspection in the upper right corner, and configure the start time of the inspection.

The inspection starts automatically at the specified time. Each cluster can have only one scheduled inspection per day.

Figure 5 Scheduled inspection configuration

You can also go to the inspection details page of a cluster as instructed in Viewing Health Diagnosis Results.
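
The once-a-day schedule described above can be illustrated with a short calculation of when the next inspection would start. This is a sketch under stated assumptions, not the scheduler that CIA uses: the function name `next_inspection` is hypothetical, times are naive (no time zone), and an inspection whose start time equals the current time is assumed to have already started.

```python
from datetime import datetime, time, timedelta

def next_inspection(now: datetime, start: time) -> datetime:
    """Return the next moment a once-a-day inspection would start."""
    candidate = datetime.combine(now.date(), start)
    # If today's start time has already passed, the next run is tomorrow.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate
```

With a configured start time of 02:00, a call made at 10:00 on a given day returns 02:00 on the following day.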

Health Diagnosis

Go to the inspection details page of a cluster as instructed in Viewing Health Diagnosis Results.
In the Cluster Inspection area, select a cluster that has not been inspected and click Diagnose Now.

After the diagnosis is complete, the page will be automatically refreshed to display the diagnosis results. Normal items are hidden by default.

CIA summarizes the Kubernetes problems found in the abnormal items and provides troubleshooting suggestions. To view the details and rectification suggestions for a specific diagnosis item, click View Diagnosis Details.

Figure 6 Diagnosis details

Inspection Items

**Table 1** Inspection items for CCE clusters
Dimension	Scenario	Inspection Item
Cluster	Cluster resource planning	Whether HA is enabled for master nodes
		Whether the CPU requests of pods in the cluster have exceeded 80% of the cluster CPU
		Whether the CPU limits of pods in the cluster have exceeded 150% of the cluster CPU
		Whether the memory requests of pods in the cluster have exceeded 80% of the cluster memory
		Whether the memory limits of pods in the cluster have exceeded 150% of the cluster memory
		Whether the cluster version has expired
	Cluster O&M	Whether kube-prometheus-stack is normal
		Whether log-agent is normal
		Whether npd is normal
	Cluster configuration	Whether security groups are correctly configured
Core add-ons	Whether coredns status is normal	Whether the CPU usage of coredns has exceeded 80% in the last 24 hours
		Whether the memory usage of coredns has exceeded 80% in the last 24 hours
		Whether coredns failed to resolve domain names for more than XX times in the last 24 hours
		Whether the P99 latency of coredns has exceeded 5s in the last 24 hours
		Whether coredns is normal
	Whether everest status is normal	Whether everest is normal
		Whether the CPU usage of everest has exceeded 80% in the last 24 hours
		Whether the memory usage of everest has exceeded 80% in the last 24 hours
	Whether kube-prometheus-stack status is normal	Whether the CPU usage of kube-prometheus-stack has exceeded 80% in the last 24 hours
		Whether the memory usage of kube-prometheus-stack has exceeded 80% in the last 24 hours
		Whether kube-prometheus-stack is normal
		Whether OOM occurred on kube-prometheus-stack in the last 24 hours
		Whether the PVC usage of prometheus-server has exceeded 80% when kube-prometheus-stack is deployed in server mode
	Whether log-agent status is normal	Whether log-agent is normal
		Whether LTS log groups and log streams are created successfully
		Whether log structuring is enabled for LTS log groups
	autoscaler status	Whether autoscaler is available when auto scaling is enabled for node pools
Node	Node status	Whether nodes are ready
		Whether nodes can be scheduled
		Whether kubelet is normal
	Node configuration	Whether the memory requests of pods on a node have exceeded 80% of the node memory
		Whether the CPU requests of pods on a node have exceeded 80% of the node CPU
		Whether the memory limits of pods on a node have exceeded 150% of the node memory
		Whether the CPU limits of pods on a node have exceeded 150% of the node CPU
	Resource requests and limits of nodes	Whether the CPU usage of a node has exceeded 80% in the last 24 hours
		Whether the memory usage of a node has exceeded 80% in the last 24 hours
		Whether the disk usage of a node has exceeded 80%
		Whether the number of PIDs for a node exceeds the limit
		Whether OOM has occurred on a node in the last 24 hours
Workload	Pod status	Whether pods are normal
	Pod workload	Whether OOM has occurred on a pod in the last 24 hours
		Whether the CPU usage of a pod has exceeded 80% in the last 24 hours
		Whether the memory usage of a pod has exceeded 80% in the last 24 hours
	Pod configuration	Whether requests are configured for containers in a pod
	Pod configuration	Whether limits are configured for containers in a pod
	Pod probe configuration	Whether liveness probes are configured for containers in a pod
	Pod probe configuration	Whether readiness probes are configured for containers in a pod
External dependency	Resource quotas of a node	Whether 90% or more of the EVS disk quota has been used
External dependency	Resource quotas of a node	Whether 90% or more of the ECS quota has been used
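
The resource planning and node configuration items in the table above compare the summed pod requests and limits with the available capacity. The following is a minimal sketch of that comparison using the 80% and 150% thresholds from the table. The class `ResourceTotals` and the function `planning_findings` are hypothetical names introduced for illustration; how CIA computes these values internally is not described in this document.

```python
from dataclasses import dataclass

REQUEST_THRESHOLD = 0.80  # requests above 80% of capacity are flagged
LIMIT_THRESHOLD = 1.50    # limits above 150% of capacity are flagged

@dataclass
class ResourceTotals:
    capacity: float  # total CPU cores (or memory bytes) of the cluster or node
    requests: float  # sum of pod requests, in the same unit as capacity
    limits: float    # sum of pod limits, in the same unit as capacity

def planning_findings(totals: ResourceTotals) -> list[str]:
    """Return a description of each resource planning threshold that is exceeded."""
    if totals.capacity <= 0:
        raise ValueError("capacity must be positive")
    findings = []
    if totals.requests / totals.capacity > REQUEST_THRESHOLD:
        findings.append("requests exceed 80% of capacity")
    if totals.limits / totals.capacity > LIMIT_THRESHOLD:
        findings.append("limits exceed 150% of capacity")
    return findings
```

The same function applies to CPU and memory, and to a whole cluster or a single node, as long as all three values use the same unit.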

**Table 2** Inspection items for on-premises clusters
Dimension	Scenario	Inspection Item
Cluster	Cluster resource planning	Whether HA is enabled for master nodes
		Whether the CPU requests of pods in the cluster have exceeded 80% of the cluster CPU
		Whether the CPU limits of pods in the cluster have exceeded 150% of the cluster CPU
		Whether the memory requests of pods in the cluster have exceeded 80% of the cluster memory
		Whether the memory limits of pods in the cluster have exceeded 150% of the cluster memory
	Cluster O&M	Whether kube-prometheus-stack is normal
	Cluster O&M	Whether log-agent is normal
Core add-ons	Whether kube-prometheus-stack status is normal	Whether the CPU usage of kube-prometheus-stack has exceeded 80% in the last 24 hours
		Whether the memory usage of kube-prometheus-stack has exceeded 80% in the last 24 hours
		Whether kube-prometheus-stack is normal
		Whether OOM occurred on kube-prometheus-stack in the last 24 hours
	Whether log-agent status is normal	Whether log-agent is normal
		Whether LTS log groups and log streams are created successfully
		Whether log structuring is enabled for LTS log groups
Node	Node status	Whether nodes are ready
		Whether nodes can be scheduled
		Whether kubelet is normal
	Node configuration	Whether the memory requests of pods on a node have exceeded 80% of the node memory
		Whether the CPU requests of pods on a node have exceeded 80% of the node CPU
		Whether the memory limits of pods on a node have exceeded 150% of the node memory
		Whether the CPU limits of pods on a node have exceeded 150% of the node CPU
	Resource requests and limits of nodes	Whether the CPU usage of a node has exceeded 80% in the last 24 hours
		Whether the memory usage of a node has exceeded 80% in the last 24 hours
		Whether the disk usage of a node has exceeded 80%
		Whether the number of PIDs for a node exceeds the limit
		Whether OOM has occurred on a node in the last 24 hours
Workload	Pod status	Whether pods are normal
	Pod workload	Whether OOM has occurred on a pod in the last 24 hours
		Whether the CPU usage of a pod has exceeded 80% in the last 24 hours
		Whether the memory usage of a pod has exceeded 80% in the last 24 hours
	Pod configuration	Whether requests are configured for containers in a pod
	Pod configuration	Whether limits are configured for containers in a pod
	Pod probe configuration	Whether liveness probes are configured for containers in a pod
	Pod probe configuration	Whether readiness probes are configured for containers in a pod
External dependency	Resource quotas of a node	Whether 90% or more of the EVS disk quota has been used
External dependency	Resource quotas of a node	Whether 90% or more of the ECS quota has been used
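
The pod configuration and pod probe configuration items in the tables above can be reproduced against a pod manifest. The following is a minimal sketch that assumes the manifest has been loaded into a Python dictionary and uses the standard Kubernetes container fields `resources.requests`, `resources.limits`, `livenessProbe`, and `readinessProbe`. The function name `pod_config_findings` and the wording of the findings are hypothetical and do not reflect CIA output.

```python
def pod_config_findings(pod: dict) -> list[str]:
    """List containers in a pod manifest that lack requests, limits, or probes."""
    findings = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources") or {}
        if not resources.get("requests"):
            findings.append(f"{name}: no requests configured")
        if not resources.get("limits"):
            findings.append(f"{name}: no limits configured")
        if not container.get("livenessProbe"):
            findings.append(f"{name}: no liveness probe configured")
        if not container.get("readinessProbe"):
            findings.append(f"{name}: no readiness probe configured")
    return findings

# Example manifest: requests and a liveness probe are set, the rest is missing.
example_pod = {
    "spec": {
        "containers": [
            {
                "name": "web",
                "resources": {"requests": {"cpu": "100m"}},
                "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
            }
        ]
    }
}
print(pod_config_findings(example_pod))
```

For the example manifest, the function reports that the `web` container has no limits and no readiness probe configured.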

**Table 3** Inspection items for attached clusters, multi-cloud clusters, and partner cloud clusters
Dimension	Scenario	Inspection Item
Cluster	Cluster resource planning	Whether HA is enabled for master nodes
		Whether the CPU requests of pods in the cluster have exceeded 80% of the cluster CPU
		Whether the CPU limits of pods in the cluster have exceeded 150% of the cluster CPU
		Whether the memory requests of pods in the cluster have exceeded 80% of the cluster memory
		Whether the memory limits of pods in the cluster have exceeded 150% of the cluster memory
	Cluster O&M	Whether kube-prometheus-stack is normal
Core add-ons	Whether kube-prometheus-stack status is normal	Whether the CPU usage of kube-prometheus-stack has exceeded 80% in the last 24 hours
		Whether the memory usage of kube-prometheus-stack has exceeded 80% in the last 24 hours
		Whether kube-prometheus-stack is normal
		Whether OOM occurred on kube-prometheus-stack in the last 24 hours
Node	Node status	Whether nodes are ready
		Whether nodes can be scheduled
		Whether kubelet is normal
	Node configuration	Whether the memory requests of pods on a node have exceeded 80% of the node memory
		Whether the CPU requests of pods on a node have exceeded 80% of the node CPU
		Whether the memory limits of pods on a node have exceeded 150% of the node memory
		Whether the CPU limits of pods on a node have exceeded 150% of the node CPU
	Resource requests and limits of nodes	Whether the CPU usage of a node has exceeded 80% in the last 24 hours
		Whether the memory usage of a node has exceeded 80% in the last 24 hours
		Whether the disk usage of a node has exceeded 80%
		Whether the number of PIDs for a node exceeds the limit
		Whether OOM has occurred on a node in the last 24 hours
Workload	Pod status	Whether pods are normal
	Pod workload	Whether OOM has occurred on a pod in the last 24 hours
		Whether the CPU usage of a pod has exceeded 80% in the last 24 hours
		Whether the memory usage of a pod has exceeded 80% in the last 24 hours
	Pod configuration	Whether requests are configured for containers in a pod
	Pod configuration	Whether limits are configured for containers in a pod
	Pod probe configuration	Whether liveness probes are configured for containers in a pod
	Pod probe configuration	Whether readiness probes are configured for containers in a pod
External dependency	Resource quotas of a node	Whether 90% or more of the EVS disk quota has been used
External dependency	Resource quotas of a node	Whether 90% or more of the ECS quota has been used
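
The external dependency items flag a quota once 90% or more of it has been used. The following is a minimal sketch of that comparison. The function name `quota_nearly_exhausted` and its parameters are hypothetical, and integer arithmetic is used so that exactly 90% is counted as reaching the threshold.

```python
def quota_nearly_exhausted(used: int, quota: int, threshold_percent: int = 90) -> bool:
    """Return True if at least threshold_percent of the quota has been used."""
    if quota <= 0:
        raise ValueError("quota must be positive")
    # Compare used/quota >= threshold_percent/100 without floating-point division.
    return used * 100 >= quota * threshold_percent
```

For example, 45 EVS disks used out of a quota of 50 is exactly 90%, so the check returns `True`.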