Updated on 2024-06-17 GMT+08:00

Health Diagnosis

Overview

CIA diagnoses the health of clusters. Based on cluster configurations and the metrics that the kube-prometheus-stack add-on reports to AOM, it automatically checks whether clusters, nodes, workloads, core add-ons, and external dependencies are healthy. For abnormal items, CIA also provides diagnosis results and rectification suggestions based on O&M best practices for Kubernetes clusters.

Constraints

  • The cluster version must be later than v1.17.
  • The cluster must be in the Running state.
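The two constraints can be expressed as a simple pre-check. The following is a minimal sketch, not part of any CIA or UCS API: `parse_version` and `meets_constraints` are hypothetical helper names, the version is assumed to be a string such as `v1.19.10`, and "later than v1.17" is interpreted here as a (major, minor) pair greater than (1, 17).

```python
import re


def parse_version(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a string such as 'v1.19.10' or '1.21'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized cluster version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_constraints(version: str, state: str) -> bool:
    """Return True if the cluster is later than v1.17 and in the Running state.

    Assumption: 'later than v1.17' means (major, minor) > (1, 17).
    """
    return parse_version(version) > (1, 17) and state == "Running"


print(meets_constraints("v1.19.10", "Running"))  # True
print(meets_constraints("v1.17.9", "Running"))   # False: not later than v1.17
```

A cluster in any state other than Running fails the check regardless of its version.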

Viewing Health Diagnosis Results

  1. Select a fleet or a cluster that has not been added to a fleet.

    Figure 1 Selecting a fleet or a cluster not in the fleet

  2. Click the Health Diagnosis tab to view the numbers of normal clusters and risky clusters.

    Figure 2 Health diagnosis

  3. In Diagnosis Result, view the diagnosis results of the current cluster.

    Click View Diagnosis Details to access the diagnosis details page, where you can view the diagnosis items and results.
    Figure 3 Diagnosis results

Configuring a Scheduled Inspection

  1. Select a fleet or a cluster that has not been added to a fleet.

    Figure 4 Selecting a fleet or a cluster not in the fleet

  2. Choose Container insight > Clusters to view the clusters for which monitoring has been enabled.
  3. Click Health Diagnosis, enable Scheduled Inspection in the upper right corner, and configure the start time of the inspection.

    The inspection starts automatically at the specified time. A scheduled inspection can run on a cluster only once per day.

    Figure 5 Scheduled inspection configuration

    You can also go to the inspection details page of a cluster as instructed in Viewing Health Diagnosis Results.
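Because a scheduled inspection runs once per day at the configured start time, the next run can be worked out from the current time. The following is a minimal sketch of that arithmetic only. How CIA schedules inspections internally is not documented here, `next_inspection` is a hypothetical name, and naive (timezone-free) datetimes are assumed.

```python
from datetime import datetime, time, timedelta


def next_inspection(now: datetime, start: time) -> datetime:
    """Return the next daily run at or after `now` for the given start time."""
    candidate = datetime.combine(now.date(), start)
    if candidate < now:
        # Today's start time has already passed, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate


# Start time configured as 02:00; the current time is 10:30 on 2024-06-17.
print(next_inspection(datetime(2024, 6, 17, 10, 30), time(2, 0)))
# 2024-06-18 02:00:00
```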

Running a Health Diagnosis

  1. Go to the inspection details page of a cluster as instructed in Viewing Health Diagnosis Results.
  2. In the Cluster Inspection area, select a cluster that has not been inspected and click Diagnose Now.

    After the diagnosis is complete, the page will be automatically refreshed to display the diagnosis results. Normal items are hidden by default.

    Kubernetes problems will be summarized from the abnormal items. Troubleshooting suggestions will also be provided. You can click View Diagnosis Details to view the diagnosis details and rectification suggestions of a specific diagnosis item.

    Figure 6 Diagnosis details

Inspection Items

Table 1 Inspection items for CCE clusters

Dimension: Cluster

  Scenario: Cluster resource planning
    • Whether HA is enabled for master nodes
    • Whether the CPU requests of pods in the cluster have exceeded 80% of the cluster CPU
    • Whether the CPU limits of pods in the cluster have exceeded 150% of the cluster CPU
    • Whether the memory requests of pods in the cluster have exceeded 80% of the cluster memory
    • Whether the memory limits of pods in the cluster have exceeded 150% of the cluster memory
    • Whether the cluster version has expired

  Scenario: Cluster O&M
    • Whether kube-prometheus-stack is normal
    • Whether log-agent is normal
    • Whether npd is normal

  Scenario: Cluster configuration
    • Whether security groups are correctly configured

Dimension: Core add-ons

  Scenario: coredns status
    • Whether the CPU usage of coredns has exceeded 80% in the last 24 hours
    • Whether the memory usage of coredns has exceeded 80% in the last 24 hours
    • Whether coredns failed to resolve domain names for more than XX times in the last 24 hours
    • Whether the P99 latency of coredns has exceeded 5s in the last 24 hours
    • Whether coredns is normal

  Scenario: everest status
    • Whether everest is normal
    • Whether the CPU usage of everest has exceeded 80% in the last 24 hours
    • Whether the memory usage of everest has exceeded 80% in the last 24 hours

  Scenario: kube-prometheus-stack status
    • Whether the CPU usage of kube-prometheus-stack has exceeded 80% in the last 24 hours
    • Whether the memory usage of kube-prometheus-stack has exceeded 80% in the last 24 hours
    • Whether kube-prometheus-stack is normal
    • Whether OOM occurred on kube-prometheus-stack in the last 24 hours
    • Whether the PVC usage of prometheus-server has exceeded 80% when kube-prometheus-stack is deployed in server mode

  Scenario: log-agent status
    • Whether log-agent is normal
    • Whether LTS log groups and log streams are created successfully
    • Whether log structuring is enabled for LTS log groups

  Scenario: autoscaler status
    • Whether autoscaler is available when auto scaling is enabled for node pools

Dimension: Node

  Scenario: Node status
    • Whether nodes are ready
    • Whether nodes can be scheduled
    • Whether kubelet is normal

  Scenario: Node configuration
    • Whether the memory requests of pods on a node have exceeded 80% of the node memory
    • Whether the CPU requests of pods on a node have exceeded 80% of the node CPU
    • Whether the memory limits of pods on a node have exceeded 150% of the node memory
    • Whether the CPU limits of pods on a node have exceeded 150% of the node CPU

  Scenario: Resource requests and limits of nodes
    • Whether the CPU usage of a node has exceeded 80% in the last 24 hours
    • Whether the memory usage of a node has exceeded 80% in the last 24 hours
    • Whether the disk usage of a node has exceeded 80%
    • Whether the number of PIDs for a node exceeds the limit
    • Whether OOM has occurred on a node in the last 24 hours

Dimension: Workload

  Scenario: Pod status
    • Whether pods are normal

  Scenario: Pod workload
    • Whether OOM has occurred on a pod in the last 24 hours
    • Whether the CPU usage of a pod has exceeded 80% in the last 24 hours
    • Whether the memory usage of a pod has exceeded 80% in the last 24 hours

  Scenario: Pod configuration
    • Whether requests are configured for containers in a pod
    • Whether limits are configured for containers in a pod

  Scenario: Pod probe configuration
    • Whether liveness probes are configured for containers in a pod
    • Whether readiness probes are configured for containers in a pod

Dimension: External dependency

  Scenario: Resource quotas of a node
    • Whether 90% or more of the EVS disk quota has been used
    • Whether 90% or more of the ECS quota has been used
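
The resource planning items compare the total requests and limits of pods against capacity, using thresholds of 80% for requests and 150% for limits. The following minimal sketch illustrates that comparison under stated assumptions: `planning_findings` is a hypothetical helper, not CIA's implementation, and the totals are assumed to be already aggregated in a common unit (for example, CPU cores or GiB of memory).

```python
def planning_findings(capacity: float, requests: float, limits: float,
                      resource: str = "CPU") -> list[str]:
    """Flag requests above 80% and limits above 150% of capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    findings = []
    request_ratio = requests / capacity
    limit_ratio = limits / capacity
    if request_ratio > 0.80:
        findings.append(
            f"{resource} requests are at {request_ratio:.0%} of capacity (threshold 80%)")
    if limit_ratio > 1.50:
        findings.append(
            f"{resource} limits are at {limit_ratio:.0%} of capacity (threshold 150%)")
    return findings


# A cluster with 10 cores, 9 cores requested, and 12 cores of limits.
for finding in planning_findings(capacity=10, requests=9, limits=12):
    print(finding)
```

In this example only the requests threshold is exceeded. Limits at 120% of capacity stay below the 150% threshold, so over-commitment up to that level is not reported.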

Table 2 Inspection items for on-premises clusters

Dimension: Cluster

  Scenario: Cluster resource planning
    • Whether HA is enabled for master nodes
    • Whether the CPU requests of pods in the cluster have exceeded 80% of the cluster CPU
    • Whether the CPU limits of pods in the cluster have exceeded 150% of the cluster CPU
    • Whether the memory requests of pods in the cluster have exceeded 80% of the cluster memory
    • Whether the memory limits of pods in the cluster have exceeded 150% of the cluster memory

  Scenario: Cluster O&M
    • Whether kube-prometheus-stack is normal
    • Whether log-agent is normal

Dimension: Core add-ons

  Scenario: kube-prometheus-stack status
    • Whether the CPU usage of kube-prometheus-stack has exceeded 80% in the last 24 hours
    • Whether the memory usage of kube-prometheus-stack has exceeded 80% in the last 24 hours
    • Whether kube-prometheus-stack is normal
    • Whether OOM occurred on kube-prometheus-stack in the last 24 hours

  Scenario: log-agent status
    • Whether log-agent is normal
    • Whether LTS log groups and log streams are created successfully
    • Whether log structuring is enabled for LTS log groups

Dimension: Node

  Scenario: Node status
    • Whether nodes are ready
    • Whether nodes can be scheduled
    • Whether kubelet is normal

  Scenario: Node configuration
    • Whether the memory requests of pods on a node have exceeded 80% of the node memory
    • Whether the CPU requests of pods on a node have exceeded 80% of the node CPU
    • Whether the memory limits of pods on a node have exceeded 150% of the node memory
    • Whether the CPU limits of pods on a node have exceeded 150% of the node CPU

  Scenario: Resource requests and limits of nodes
    • Whether the CPU usage of a node has exceeded 80% in the last 24 hours
    • Whether the memory usage of a node has exceeded 80% in the last 24 hours
    • Whether the disk usage of a node has exceeded 80%
    • Whether the number of PIDs for a node exceeds the limit
    • Whether OOM has occurred on a node in the last 24 hours

Dimension: Workload

  Scenario: Pod status
    • Whether pods are normal

  Scenario: Pod workload
    • Whether OOM has occurred on a pod in the last 24 hours
    • Whether the CPU usage of a pod has exceeded 80% in the last 24 hours
    • Whether the memory usage of a pod has exceeded 80% in the last 24 hours

  Scenario: Pod configuration
    • Whether requests are configured for containers in a pod
    • Whether limits are configured for containers in a pod

  Scenario: Pod probe configuration
    • Whether liveness probes are configured for containers in a pod
    • Whether readiness probes are configured for containers in a pod

Dimension: External dependency

  Scenario: Resource quotas of a node
    • Whether 90% or more of the EVS disk quota has been used
    • Whether 90% or more of the ECS quota has been used

Table 3 Inspection items for attached clusters, multi-cloud clusters, and partner cloud clusters

Dimension: Cluster

  Scenario: Cluster resource planning
    • Whether HA is enabled for master nodes
    • Whether the CPU requests of pods in the cluster have exceeded 80% of the cluster CPU
    • Whether the CPU limits of pods in the cluster have exceeded 150% of the cluster CPU
    • Whether the memory requests of pods in the cluster have exceeded 80% of the cluster memory
    • Whether the memory limits of pods in the cluster have exceeded 150% of the cluster memory

  Scenario: Cluster O&M
    • Whether kube-prometheus-stack is normal

Dimension: Core add-ons

  Scenario: kube-prometheus-stack status
    • Whether the CPU usage of kube-prometheus-stack has exceeded 80% in the last 24 hours
    • Whether the memory usage of kube-prometheus-stack has exceeded 80% in the last 24 hours
    • Whether kube-prometheus-stack is normal
    • Whether OOM occurred on kube-prometheus-stack in the last 24 hours

Dimension: Node

  Scenario: Node status
    • Whether nodes are ready
    • Whether nodes can be scheduled
    • Whether kubelet is normal

  Scenario: Node configuration
    • Whether the memory requests of pods on a node have exceeded 80% of the node memory
    • Whether the CPU requests of pods on a node have exceeded 80% of the node CPU
    • Whether the memory limits of pods on a node have exceeded 150% of the node memory
    • Whether the CPU limits of pods on a node have exceeded 150% of the node CPU

  Scenario: Resource requests and limits of nodes
    • Whether the CPU usage of a node has exceeded 80% in the last 24 hours
    • Whether the memory usage of a node has exceeded 80% in the last 24 hours
    • Whether the disk usage of a node has exceeded 80%
    • Whether the number of PIDs for a node exceeds the limit
    • Whether OOM has occurred on a node in the last 24 hours

Dimension: Workload

  Scenario: Pod status
    • Whether pods are normal

  Scenario: Pod workload
    • Whether OOM has occurred on a pod in the last 24 hours
    • Whether the CPU usage of a pod has exceeded 80% in the last 24 hours
    • Whether the memory usage of a pod has exceeded 80% in the last 24 hours

  Scenario: Pod configuration
    • Whether requests are configured for containers in a pod
    • Whether limits are configured for containers in a pod

  Scenario: Pod probe configuration
    • Whether liveness probes are configured for containers in a pod
    • Whether readiness probes are configured for containers in a pod

Dimension: External dependency

  Scenario: Resource quotas of a node
    • Whether 90% or more of the EVS disk quota has been used
    • Whether 90% or more of the ECS quota has been used
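
The pod configuration and pod probe configuration items check whether each container declares requests, limits, a liveness probe, and a readiness probe. The following minimal sketch shows an equivalent check, assuming the pod is available as a dictionary shaped like a Kubernetes pod manifest (`spec.containers[].resources`, `livenessProbe`, `readinessProbe`). `pod_config_findings` and the sample pod are hypothetical and do not represent how CIA performs the inspection.

```python
def pod_config_findings(pod: dict) -> list[str]:
    """List missing requests, limits, and probes for each container in a pod."""
    findings = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources") or {}
        if not resources.get("requests"):
            findings.append(f"{name}: no resource requests configured")
        if not resources.get("limits"):
            findings.append(f"{name}: no resource limits configured")
        if "livenessProbe" not in container:
            findings.append(f"{name}: no liveness probe configured")
        if "readinessProbe" not in container:
            findings.append(f"{name}: no readiness probe configured")
    return findings


# Hypothetical pod: requests and a liveness probe are set; limits and a
# readiness probe are missing.
sample_pod = {
    "metadata": {"name": "demo"},
    "spec": {
        "containers": [
            {
                "name": "web",
                "resources": {"requests": {"cpu": "250m", "memory": "128Mi"}},
                "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
            }
        ]
    },
}

for finding in pod_config_findings(sample_pod):
    print(finding)
```

For the sample pod, the sketch reports the missing limits and the missing readiness probe for the `web` container.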