Configuring Node Fault Detection Policies

Node fault detection depends on the CCE Node Problem Detector (NPD) add-on. An instance of the add-on runs on each node to monitor node faults. This section describes how to enable node fault detection.

Prerequisites

The CCE Node Problem Detector add-on has been installed in the cluster.

Enabling Node Fault Detection

Log in to the CCE console and click the cluster name to access the cluster console.
In the navigation pane, choose Nodes. Then click the Nodes tab. Verify that the CCE Node Problem Detector add-on is installed in the cluster and updated to the latest version. Fault detection will then be available.
When this add-on is running normally, click Node Fault Detection Policy to check the current fault detection items. For more details, see NPD Check Items.
Check the node list for any abnormal metrics.
Click Abnormal metrics and rectify the fault as prompted.

Customized Check Items

Log in to the CCE console and click the cluster name to access the cluster console.
In the navigation pane, choose Nodes and then click the Nodes tab. Then, click Fault Detection Policy.

On the displayed page, view the current check items. Click Edit in the Operation column and edit checks.

Currently, the following configurations are supported:

Enable/Disable: Enable or disable a check item.
Target Node: By default, check items are executed on all nodes. You can add node labels so that a check item runs only on nodes that meet all the specified conditions.
Check Period: The default check period is 30 seconds. You can change the value as required.
Trigger: The CCE Node Problem Detector add-on provides default thresholds that match common fault scenarios. The type of threshold varies with the check item, for example, the number of failures or the resource usage percentage. You can adjust a threshold as required, for example, by changing the resource usage threshold from 90% to 80%.

Troubleshooting Strategy: After a fault occurs, you can select the strategies listed in the following table as needed.

**Table 1** Troubleshooting strategies
Troubleshooting Strategy	Effect
Prompting Exception	Kubernetes events are reported.
Disabling scheduling	Kubernetes events are reported and the NoSchedule taint is added to the node.
Evict Node Load	Kubernetes events are reported and the NoExecute taint is added to the node. This operation will evict workloads on the node and interrupt services. Exercise caution when performing this operation.
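
The target node filter and the strategies in Table 1 can be read together as one small decision. The following Python sketch is illustrative only and is not the add-on's implementation: the function names and the sample labels are hypothetical, and only the strategy names and taint effects are taken from Table 1.

```python
from typing import Dict, Optional

# Taint effect implied by each troubleshooting strategy in Table 1.
# None means that only a Kubernetes event is reported.
STRATEGY_TAINT_EFFECT: Dict[str, Optional[str]] = {
    "Prompting Exception": None,
    "Disabling scheduling": "NoSchedule",
    "Evict Node Load": "NoExecute",
}


def node_matches(node_labels: Dict[str, str], required: Dict[str, str]) -> bool:
    """Return True when the node carries every required label (AND semantics)."""
    return all(node_labels.get(key) == value for key, value in required.items())


def taint_for_fault(
    node_labels: Dict[str, str],
    required_labels: Dict[str, str],
    strategy: str,
) -> Optional[str]:
    """Return the taint effect to add for a detected fault, or None.

    An empty required_labels mapping selects every node, which mirrors the
    default behavior of running check items on all nodes.
    """
    if not node_matches(node_labels, required_labels):
        return None  # the check item does not run on this node
    return STRATEGY_TAINT_EFFECT[strategy]


# Hypothetical labels, used only to exercise the function.
print(taint_for_fault({"pool": "gpu"}, {"pool": "gpu"}, "Evict Node Load"))
```

Keeping the mapping in one table makes the cost of each strategy explicit: only Evict Node Load maps to NoExecute, the effect that evicts running workloads.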

NPD Check Items

Check items are supported only in 1.16.0 and later versions.

Check items cover events and statuses.

Event-related

For event-related check items, when a problem occurs, NPD reports an event to the API server. The event type can be Normal (normal event) or Warning (abnormal event).

**Table 2** Event-related check items
Check Item	Function	Description
OOMKilling	Listen to the kernel logs and check whether OOM events occur and are reported. Typical scenario: The memory used by the process in the container exceeds the limit, triggering OOM and terminating the process.	Warning event Listening object: /dev/kmsg Matching rule: "Killed process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB.*"
TaskHung	Listen to the kernel logs and check whether taskHung events occur and are reported. Typical scenario: Disk I/O suspension causes process suspension.	Warning event Listening object: /dev/kmsg Matching rule: "task \\S+:\\w+ blocked for more than \\w+ seconds\\."
ReadonlyFilesystem	Check whether the Remount root filesystem read-only error occurs in the system kernel by listening to the kernel logs. Typical scenario: A user detaches a data disk from a node by mistake on the ECS, and applications continuously write data to the mount point of the data disk. As a result, an I/O error occurs in the kernel and the disk is remounted as a read-only disk. NOTE: If the rootfs of node pods is of the device mapper type, an error will occur in the thin pool if a data disk is detached. This will affect NPD and NPD will not be able to detect node faults.	Warning event Listening object: /dev/kmsg Matching rule: Remounting filesystem read-only
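
The matching rules in Table 2 are regular expressions applied to kernel log lines. The sketch below shows how they behave in Python. It assumes that the doubled backslashes in the table are the escaped form used in a configuration file, so they are reduced to single regex escapes here. The `classify` helper and the sample log lines are hypothetical and exist only to exercise the patterns.

```python
import re
from typing import Optional

# Matching rules from Table 2, with "\\d" written as the regex escape "\d".
RULES = {
    "OOMKilling": re.compile(
        r"Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.*"
    ),
    "TaskHung": re.compile(r"task \S+:\w+ blocked for more than \w+ seconds\."),
    "ReadonlyFilesystem": re.compile(r"Remounting filesystem read-only"),
}


def classify(line: str) -> Optional[str]:
    """Return the name of the first check item whose rule matches the line."""
    for name, pattern in RULES.items():
        if pattern.search(line):
            return name
    return None


# Hypothetical kernel log lines, not captured from a real node.
SAMPLES = [
    "Killed process 1234 (java) total-vm:2048kB, anon-rss:1024kB, file-rss:0kB",
    "INFO: task jbd2/sda1-8:345 blocked for more than 120 seconds.",
    "EXT4-fs (sda1): Remounting filesystem read-only",
    "usb 1-1: new high-speed USB device",
]

for sample in SAMPLES:
    print(classify(sample), "<-", sample)
```

Note that `(.+)` in the OOMKilling rule is a capture group rather than a pair of literal parentheses, so it captures the process name together with the parentheses that surround it in the log line.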

Status-related

For status-related check items, when a problem occurs, NPD reports an event to the API server and changes the node status synchronously. This function can be used together with Node-problem-controller fault isolation to isolate nodes.

If the check period is not specified in the following check items, the default period is 30 seconds.

**Table 3** Checking system components
Check Item	Function	Description
Container network component error CNIProblem	Check the status of the CNI components (container network components).	None
Container runtime component error CRIProblem	Check the status of the CRI components (container runtime components), that is, Docker and containerd.	Check object: Docker or containerd
Frequent restarts of Kubelet FrequentKubeletRestart	Periodically backtrack system logs to check whether the key component kubelet restarts frequently.	Default threshold: 10 restarts within 10 minutes. If kubelet restarts 10 times within 10 minutes, the restarts are considered frequent and a fault alarm is generated. Listening object: logs in the /run/log/journal directory
Frequent restarts of Docker FrequentDockerRestart	Periodically backtrack system logs to check whether the container runtime Docker restarts frequently.
Frequent restarts of containerd FrequentContainerdRestart	Periodically backtrack system logs to check whether the container runtime containerd restarts frequently.
kubelet error KubeletProblem	Check the status of the key component Kubelet.	None
kube-proxy error KubeProxyProblem	Check the status of the key component kube-proxy.	None
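
A threshold such as "10 restarts within 10 minutes" can be evaluated with a sliding window over restart timestamps. The following Python sketch illustrates the idea under stated assumptions: the function name and parameters are hypothetical, timestamps are taken to be seconds in ascending order, and restarts exactly 10 minutes apart are counted as being within the window. It does not describe how the add-on parses the journal logs.

```python
from collections import deque
from typing import Deque, Iterable


def restarts_too_frequent(
    restart_times: Iterable[float],
    max_restarts: int = 10,
    window_seconds: float = 600.0,
) -> bool:
    """Return True if max_restarts or more restarts fall within one window.

    restart_times must be timestamps in seconds, sorted in ascending order.
    """
    window: Deque[float] = deque()
    for t in restart_times:
        window.append(t)
        # Drop restarts that are older than the window ending at t.
        while t - window[0] > window_seconds:
            window.popleft()
        if len(window) >= max_restarts:
            return True
    return False


# Ten restarts, one per minute: all fall within a single 10-minute window.
print(restarts_too_frequent([i * 60.0 for i in range(10)]))
```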

**Table 4** Checking system metrics
Check Item	Function	Description
Conntrack table full ConntrackFullProblem	Check whether the conntrack table is full.	Default threshold: 90% Usage: nf_conntrack_count Maximum value: nf_conntrack_max
Insufficient disk resources DiskProblem	Check the usage of the system disk and CCE data disks (including the CRI logical disk and kubelet logical disk) on the node.	Default threshold: 90% Source: df -h Currently, additional data disks are not supported.
Insufficient file handles FDProblem	Check if the FD file handles are used up.	Default threshold: 90% Usage: the first value in /proc/sys/fs/file-nr Maximum value: the third value in /proc/sys/fs/file-nr
Insufficient node memory MemoryProblem	Check whether memory is used up.	Default threshold: 80% Usage: MemTotal-MemAvailable in /proc/meminfo Maximum value: MemTotal in /proc/meminfo
Insufficient process resources PIDProblem	Check whether PID process resources are exhausted.	Default threshold: 90% Usage: denominator of the fourth value in /proc/loadavg, which is the total number of kernel scheduling entities (processes and threads) that currently exist on the system Maximum value: the smaller of /proc/sys/kernel/pid_max and /proc/sys/kernel/threads-max
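
Several entries in Table 4 define usage as a ratio of two values read from procfs. The following Python sketch shows how such ratios could be computed from the file contents. It is an illustration, not the add-on's code: the function names are hypothetical, the functions take file text as input so that they can be tried without a live node, and treating a usage equal to the threshold as exceeding it is an assumption.

```python
def memory_usage(meminfo: str) -> float:
    """(MemTotal - MemAvailable) / MemTotal, from /proc/meminfo text."""
    fields = {}
    for line in meminfo.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])  # values are in kB
    total = fields["MemTotal"]
    return (total - fields["MemAvailable"]) / total


def fd_usage(file_nr: str) -> float:
    """First value divided by third value of /proc/sys/fs/file-nr."""
    values = file_nr.split()
    return int(values[0]) / int(values[2])


def pid_usage(loadavg: str, pid_max: int, threads_max: int) -> float:
    """Denominator of the fourth /proc/loadavg field over the smaller limit."""
    existing = int(loadavg.split()[3].split("/")[1])
    return existing / min(pid_max, threads_max)


def exceeds(usage: float, threshold_percent: float) -> bool:
    """Compare a usage ratio (0 to 1) with a percentage threshold."""
    return usage * 100 >= threshold_percent


# Hypothetical file contents, used only to exercise the functions.
sample_meminfo = "MemTotal:       8000000 kB\nMemFree:         500000 kB\nMemAvailable:   1000000 kB\n"
print(exceeds(memory_usage(sample_meminfo), 80))
```

On a real node, the inputs would come from reading the corresponding files, for example `open("/proc/meminfo").read()`.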

**Table 5** Checking the storage
Check Item	Function	Description
Disk read-only DiskReadonly	Periodically perform write tests on the system disk and CCE data disks (including the CRI logical disk and Kubelet logical disk) of the node to check the availability of key disks.	Detection paths: /mnt/paas/kubernetes/kubelet/ /var/lib/docker/ /var/lib/containerd/ /var/paas/sys/log/cceaddon-npd/ The temporary file npd-disk-write-ping is generated in the detection path. Currently, additional data disks are not supported.
emptyDir storage pool error EmptyDirVolumeGroupStatusError	Check whether the ephemeral volume group on the node is normal. Impact: Pods that depend on the storage pool cannot write data to the temporary volume. The temporary volume is remounted as a read-only file system by the kernel due to an I/O error. Typical scenario: When creating a node, a user configures two data disks as an ephemeral volume storage pool. Some data disks are deleted by mistake. As a result, the storage pool becomes abnormal.	Detection period: 30s Source: vgs -o vg_name, vg_attr Principle: Check whether the VG (storage pool) is in the P state. If yes, some PVs (data disks) are lost. Joint scheduling: The scheduler can automatically identify a PV storage pool error and prevent pods that depend on the storage pool from being scheduled to the node. Exceptional scenario: The NPD add-on cannot detect the loss of all PVs (data disks), resulting in the loss of VGs (storage pools). In this case, kubelet automatically isolates the node, detects the loss of VGs (storage pools), and updates the corresponding resources in nodestatus.allocatable to 0. This prevents pods that depend on the storage pool from being scheduled to the node. The damage of a single PV cannot be detected by this check item, but by the ReadonlyFilesystem check item.
PV storage pool error LocalPvVolumeGroupStatusError	Check the PV group on the node. Impact: Pods that depend on the storage pool cannot write data to the persistent volume. The persistent volume is remounted as a read-only file system by the kernel due to an I/O error. Typical scenario: When creating a node, a user configures two data disks as a persistent volume storage pool. Some data disks are deleted by mistake.
Mount point error MountPointProblem	Check the mount point on the node. Definition: You cannot access the mount point by running the cd command. Typical scenario: A network file system (NFS), for example, obsfs or s3fs, is mounted to a node. When the connection is abnormal due to network or peer NFS server exceptions, all processes that access the mount point are suspended. For example, during a cluster upgrade, kubelet is restarted and all mount points are scanned. If an abnormal mount point is detected, the upgrade fails.	Alternatively, you can run the following command: for dir in `df -h | grep -v "Mounted on" | awk "{print \\$NF}"`;do cd $dir; done && echo "ok"
Suspended disk I/O DiskHung	Check whether I/O suspension occurs on all disks on the node, that is, whether I/O read and write operations receive no response. Definition of I/O suspension: The system does not respond to disk I/O requests, and some processes are in the D state. Typical scenario: Disks cannot respond due to abnormal OS hard disk drivers or severe faults on the underlying network.	Check object: all data disks Source: /proc/diskstats Alternatively, you can run the following command: iostat -xmt 1 Thresholds: (All following conditions must be met.) Average usage (ioutil) ≥ 0.99 Average I/O queue length (avgqu-sz) ≥ 1 Average I/O transfer volume ≤ 1 Average I/O transfer volume = Number of writes completed per second (iops, unit: w/s) + Amount of data written per second (ioth, unit: wMB/s) NOTE: In some OSs, no data changes during I/O. In this case, calculate the CPU I/O time usage. The value of iowait should be greater than 0.8.
Slow disk I/O DiskSlow	Check whether all disks on the node have slow I/Os, that is, whether I/Os respond slowly. Typical scenario: EVS disks have slow I/Os due to network fluctuation.	Check object: all data disks Source: /proc/diskstats Alternatively, you can run the following command: iostat -xmt 1 Default threshold: Average I/O latency (await) ≥ 5000 ms NOTE: If I/O requests receive no response and the await data is not updated, this check item is invalid.
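
The DiskHung and DiskSlow thresholds in Table 5 combine several averaged disk statistics. The following Python sketch expresses those conditions as predicates. The `DiskSample` type and the function names are hypothetical, the sample values are invented, and the sketch does not cover how the averages are derived from the kernel's disk statistics.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """Averaged disk statistics; field names follow the terms in Table 5."""

    ioutil: float    # average usage, as a ratio from 0 to 1
    avgqu_sz: float  # average I/O queue length
    iops: float      # writes completed per second (w/s)
    ioth: float      # data written per second (wMB/s)
    await_ms: float  # average I/O latency in milliseconds


def is_disk_hung(sample: DiskSample) -> bool:
    """All three DiskHung conditions must hold at the same time."""
    transfer_volume = sample.iops + sample.ioth
    return (
        sample.ioutil >= 0.99
        and sample.avgqu_sz >= 1
        and transfer_volume <= 1
    )


def is_disk_slow(sample: DiskSample, threshold_ms: float = 5000) -> bool:
    """DiskSlow fires when the average latency reaches the threshold."""
    return sample.await_ms >= threshold_ms


# A busy disk with a queue but almost no completed writes looks hung.
stalled = DiskSample(ioutil=0.995, avgqu_sz=4, iops=0.2, ioth=0.1, await_ms=100)
print(is_disk_hung(stalled), is_disk_slow(stalled))
```

The combination matters: high utilization with a healthy transfer volume indicates a busy disk, not a suspended one, which is why all three DiskHung conditions are required.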

**Table 6** Other check items
Check Item	Function	Description
Abnormal NTP NTPProblem	Check whether the node clock synchronization service ntpd or chronyd is running properly and whether a system time drift is caused.	Default clock offset threshold: 8000 ms
Process D error ProcessD	Check whether the node has processes in the D state.	Default threshold: 10 abnormal processes detected for three consecutive times Source: /proc/{PID}/stat Alternatively, you can run the ps aux command.
Process Z error ProcessZ	Check whether the node has processes in Z state.
ResolvConf error ResolvConfFileProblem	Check whether the ResolvConf file is lost or abnormal. Definition of abnormal: No upstream domain name resolution server (nameserver) is included.	Object: /etc/resolv.conf
Existing scheduled event ScheduledEvent	Check whether scheduled live migration events exist on the node. A live migration plan event is usually triggered by a hardware fault and is an automatic fault rectification method at the IaaS layer. Typical scenario: The host is faulty. For example, the fan is damaged or the disk has bad sectors. As a result, live migration is triggered for VMs.	Source: http://169.254.169.254/meta-data/latest/events/scheduled This check item is an Alpha feature and is disabled by default.
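
Two of the checks in Table 6 come down to parsing small text files. The following Python sketch shows one way to test a resolv.conf file for a nameserver entry and to read the process state from a /proc/{PID}/stat line. The function names and sample inputs are hypothetical, and the sketch is not the add-on's implementation.

```python
from typing import Iterable


def has_nameserver(resolv_conf: str) -> bool:
    """Return True if the text contains at least one nameserver entry."""
    for line in resolv_conf.splitlines():
        stripped = line.strip()
        if stripped.startswith(("#", ";")):
            continue  # comment line
        parts = stripped.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            return True
    return False


def process_state(stat_line: str) -> str:
    """Extract the state field from a /proc/{PID}/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the last closing parenthesis.
    """
    return stat_line[stat_line.rindex(")") + 1:].split()[0]


def count_in_state(stat_lines: Iterable[str], state: str) -> int:
    """Count the processes whose state field equals the given letter."""
    return sum(1 for line in stat_lines if process_state(line) == state)


# Hypothetical inputs, shortened for readability.
print(has_nameserver("search example.com\nnameserver 10.0.0.2\n"))
print(count_in_state(["7 (sh) S 1 7 7 0", "9 (dd) D 1 9 9 0"], "D"))
```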

The kubelet component provides the following default check items, which have known defects. You can address these defects by upgrading the cluster or by using NPD.

**Table 7** Default kubelet check items
Check Item	Function	Description
Insufficient PID resources PIDPressure	Check whether PIDs are sufficient.	Interval: 10 seconds Threshold: 90% Defect: In community version 1.23.1 and earlier versions, this check item becomes invalid when over 65535 PIDs are used. For details, see issue 107107. In community version 1.24 and earlier versions, threads-max is not considered in this check item.
Insufficient memory MemoryPressure	Check whether the allocatable memory for the containers is sufficient.	Interval: 10 seconds Threshold: Maximum value – 100 MiB Allocatable memory = Total memory of a node – Reserved memory of a node Defect: This check item checks only the memory consumed by containers, and does not consider the memory consumed by other elements on the node.
Insufficient disk resources DiskPressure	Check the disk usage and inodes usage of the kubelet and Docker disks.	Interval: 10 seconds Threshold: 90%