Cluster Monitoring

To observe the resource usage and health of a cluster, choose Monitoring Center > Clusters. The page displays Cluster Health, Health Overview, Top Resource Consumption Statistics, and Data Plane Monitoring.

Navigation Path

Log in to the CCE console and click the cluster name to access the cluster console.
In the navigation pane, choose Monitoring Center. Then, click the Clusters tab.

Cluster Health

Cluster health is evaluated from several dimensions, such as the health score, number of risk items to be processed, risk level, and proportion of diagnosed risk items for master nodes, clusters, worker nodes, workloads, and external dependencies. Abnormal data is displayed in red. For more diagnosis results, go to Health Center.

Figure 1 Cluster health

Health Overview

Resource Overview

Resource Overview displays the percentage of abnormal resources among nodes, workloads, and pods, and the total number of namespaces.

Control Plane Health Overview

Control Plane Health Overview displays the percentage of exceptions on control plane components and master nodes, the total queries per second (QPS) of the API server, and the request error rate of the API server. If the API server (the API service provider of the cluster) on the control plane is abnormal, the cluster may become inaccessible, and workloads that depend on the API server may fail to run normally. The QPS and request error rate help you quickly identify and rectify faults.

Figure 2 Health overview

Top Resource Consumption Statistics

CCE collects statistics on the top 5 nodes, Deployments, StatefulSets, and pods by CPU usage and memory usage, helping you identify the resources that consume the most. To view all data, click the Nodes, Workloads, or Pods tab.
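The ranking described above can be reproduced from exported usage data with a few lines of Python. This is only an illustrative sketch, not how CCE computes the list internally: the function name `top_consumers` and the input format (a mapping from resource name to usage percentage) are assumptions made for the example.

```python
import heapq
from typing import Dict, List, Tuple


def top_consumers(usage_by_name: Dict[str, float], n: int = 5) -> List[Tuple[str, float]]:
    """Return the n entries with the highest usage, highest first.

    usage_by_name is a hypothetical mapping such as {"node-1": 83.0},
    where each value is a CPU or memory usage percentage.
    """
    return heapq.nlargest(n, usage_by_name.items(), key=lambda item: item[1])


# Example with made-up node CPU usage values (percent).
node_cpu = {
    "node-a": 12.0,
    "node-b": 91.5,
    "node-c": 47.0,
    "node-d": 63.2,
    "node-e": 5.1,
    "node-f": 78.9,
}
for name, percent in top_consumers(node_cpu):
    print(f"{name}: {percent:.1f}%")
```

The same function can be applied separately to Deployments, StatefulSets, and pods, and once per metric (CPU or memory), which mirrors how the statistics are grouped on the page.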

Figure 3 Top resource consumption statistics

Metrics

CPU Usage
- Node CPU usage = Average percentage of time that the node CPU is not idle
- Workload CPU usage = Average CPU usage of the pods in the workload
- Pod CPU usage = CPU cores used by the pod / CPU core limits of all service containers in the pod × 100% (If no limits are configured, the total CPU cores of the node are used as the denominator.)
Memory Usage
- Node memory usage = Memory used by the node / Total memory of the node × 100%
- Workload memory usage = Average memory usage of the pods in the workload
- Pod memory usage = Physical memory used by the pod / Memory limits of all containers in the pod × 100% (If no limits are configured, the total memory of the node is used as the denominator.)
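
The formulas above can be expressed as a minimal Python sketch. The function and parameter names are hypothetical and chosen only for illustration. The sketch assumes that the limits of all containers in a pod have already been summed into a single value, and that a missing or zero limit means no limit is configured.

```python
from statistics import mean
from typing import Optional, Sequence


def pod_usage_percent(used: float, limit: Optional[float], node_total: float) -> float:
    """Pod CPU or memory usage as a percentage.

    used:       CPU cores (or memory) used by the pod
    limit:      sum of the container limits in the pod, or None if unset
    node_total: total CPU cores (or memory) of the node, used as the
                denominator when no limit is configured
    """
    # Assumption: a limit of None or 0 is treated as "not configured".
    denominator = limit if limit else node_total
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return used / denominator * 100.0


def workload_usage_percent(pod_percents: Sequence[float]) -> float:
    """Workload usage is the average usage of the pods in the workload."""
    return mean(pod_percents)


def node_memory_usage_percent(used: float, total: float) -> float:
    """Node memory usage = memory used / total memory x 100%."""
    return used / total * 100.0


# A pod using 0.5 cores with a 2-core limit: 25%.
print(pod_usage_percent(0.5, 2.0, node_total=8.0))
# The same pod with no limit on an 8-core node: 6.25%.
print(pod_usage_percent(0.5, None, node_total=8.0))
# A workload with two pods at 25% and 75%: 50%.
print(workload_usage_percent([25.0, 75.0]))
```

Note how the fallback to the node total makes a pod without limits appear to use far less than the same pod with limits, which is worth keeping in mind when comparing pods in the top 5 lists.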

Data Plane Monitoring

By default, resource usage is collected for each dimension over the last hour, last 8 hours, and last 24 hours. To view more monitoring information, click View All Metrics to access the Dashboard page. For details, see Using Dashboard.

You can hover over a chart to view the monitoring data for each minute.

- CPU: the CPU used by a cluster in a specified period
- Memory: the memory used by a cluster in a specified period
- PVC Storage Status: the binding between PVCs and PVs
- Pod Status and Quantity: real-time status and number of pods in a cluster
- Trend of Total Pod Restarts: the total number of pod restarts in the cluster in the last 5 minutes
- Node Status Trend: real-time status of nodes in a cluster