System Check
Scenario
System Steward consists of system check and system hardening. This topic describes the system check function.
System check detects faults or exceptions on nodes in real time.
Prerequisites
- Before using the system check function, you must install the npd add-on, which is used to detect node exceptions.
- Before using the system check function, you must install the prometheus add-on, which is used to obtain abnormal metrics reported by the npd add-on.
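As a rough command-line cross-check of these prerequisites, the sketch below counts the pods whose names contain a given pattern. The helper name `count_addon_pods` and the patterns in the usage comment are assumptions for illustration; the actual pod names of the npd and prometheus add-ons may differ in your cluster.

```shell
# count_addon_pods PATTERN
# Prints how many pods, across all namespaces, have PATTERN in their row.
# Hypothetical helper: the add-on pod names are not given in this topic.
count_addon_pods() {
  # --no-headers drops the column header so that only pod rows are counted.
  kubectl get pods --all-namespaces --no-headers 2>/dev/null | grep -c -- "$1"
}

# Example usage (patterns are assumptions; adjust them to your cluster):
#   count_addon_pods node-problem-detector
#   count_addon_pods prometheus
```

A count of 0 suggests that the add-on is not running under that name, in which case the console prompt remains the reliable way to install it.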
Procedure
- Log in to the CCE console. In the navigation pane on the left, choose System Steward > System Check.
- In the left pane of the System Check page, choose the node for which you want to perform a system check. The Indicator Check, Behavior Statistics, and Kubernetes Events tab pages are displayed.
Required add-ons have not been installed:
If the npd and prometheus add-ons are not installed, install them as prompted.
After the add-ons are installed, select the node again to view the check information.
Figure 1 Installing add-ons required for system check
Required add-ons have been installed:
If the add-ons have been installed, you can click the Indicator Check, Behavior Statistics, and Kubernetes Events tabs to view the system check information.
Figure 2 Viewing the system check information
- On the Indicator Check tab page, view the system resources, system components, abnormal behaviors, and other check items, and then perform operations as prompted.
Table 1 Check items
| Check Item | Check Sub-item | Description |
| --- | --- | --- |
| System resources | Disk | Node disk usage. |
| System resources | Memory | Node memory usage. |
| System resources | PID | Node PID usage. |
| System components | CNI | CNI component running status. |
| System components | Docker | Docker component running status. |
| System components | kubelet | kubelet component running status. |
| System components | kube-proxy | kube-proxy component running status. |
| System components | NTP | NTP component running status. |
| Abnormal behavior | Frequent containerd restart | containerd restarts frequently. |
| Abnormal behavior | Frequent Docker restart | Docker restarts frequently. |
| Abnormal behavior | Frequent kubelet restart | kubelet restarts frequently. |
| Abnormal behavior | Frequent deregistration of network devices | Network devices, such as network adapters, are frequently deregistered. |
| Others | Ready | Whether the node status is Ready. |
- Click the Behavior Statistics tab to view the behavior information and the number of behavior occurrences.
- Click the Kubernetes Events tab to view the event name, event type, number of occurrences, Kubernetes events, first occurrence time, and last occurrence time of the node.
Event data is retained for 1 hour and then automatically deleted.
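The Ready check item and the node events shown on the console can also be inspected with kubectl. The following is a minimal sketch rather than the console's own implementation; the function names `node_ready` and `node_events` are hypothetical, and the node name is passed as the first argument.

```shell
# node_ready NODE
# Prints the status of the node's Ready condition: True, False, or Unknown.
node_ready() {
  kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# node_events NODE
# Lists the events whose involved object is the given node,
# sorted by the time of their last occurrence.
node_events() {
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}

# Example usage (node-1 is a placeholder node name):
#   node_ready node-1
#   node_events node-1
```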
Recovery Suggestion
- If system resources are insufficient, expand the resources on the node or raise the upper limit of the relevant kernel parameters. If the node cannot be recovered, isolate it by adding a taint, which keeps new pods from being scheduled to the node or evicts the pods already running on it.
- A taint can also be added if a system component is abnormal or another exception occurs.
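Isolating a node with a taint might look like the sketch below. The function names and the default taint key `node-fault` are assumptions chosen for illustration; use whatever key your workloads' tolerations expect.

```shell
# isolate_node NODE
# Adds a NoSchedule taint so that new pods without a matching toleration
# are not scheduled to the node. Pods already running are left in place.
# The default key "node-fault" is a hypothetical example.
isolate_node() {
  kubectl taint nodes "$1" "${TAINT_KEY:-node-fault}=true:NoSchedule"
}

# release_node NODE
# Removes the taint again; the trailing "-" tells kubectl to delete it.
release_node() {
  kubectl taint nodes "$1" "${TAINT_KEY:-node-fault}:NoSchedule-"
}

# Example usage (node-1 is a placeholder node name):
#   isolate_node node-1
#   release_node node-1
```

The NoSchedule effect only blocks new scheduling. A taint with the NoExecute effect additionally evicts running pods that do not tolerate it, which corresponds to the eviction case described above.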
Reference
- Adding a taint to a node: Taints and Tolerations
- Safe eviction: Safely Drain a Node while Respecting the PodDisruptionBudget
- The following three commands can be used to smoothly migrate services from a node to another node during node maintenance, ensuring that services are not affected:
Table 2 Marking a node as schedulable or unschedulable
| Command | Function | Usage |
| --- | --- | --- |
| cordon | Mark the node as unschedulable. | kubectl cordon {{node-name}} |
| uncordon | Mark the node as schedulable. | kubectl uncordon {{node-name}} |
| drain | Mark the node as unschedulable and evict the pods on the node. | kubectl drain {{node-name}} |
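A typical maintenance sequence built from these commands could be sketched as follows. The function names are hypothetical, and the `--ignore-daemonsets` flag is an addition not mentioned in the table: drain usually refuses to proceed on nodes that run DaemonSet-managed pods unless this flag is set.

```shell
# start_maintenance NODE
# Drains the node: kubectl drain cordons it first and then evicts its pods.
start_maintenance() {
  # --ignore-daemonsets skips DaemonSet-managed pods, which would be
  # recreated on the node immediately if they were deleted.
  kubectl drain "$1" --ignore-daemonsets
}

# finish_maintenance NODE
# Marks the node as schedulable again once maintenance is complete.
finish_maintenance() {
  kubectl uncordon "$1"
}

# Example usage (node-1 is a placeholder node name):
#   start_maintenance node-1
#   ... perform the maintenance ...
#   finish_maintenance node-1
```

Because drain already cordons the node, a separate cordon call is only needed when you want to stop new scheduling without evicting the pods that are running.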