CCE Node Problem Detector
Introduction
CCE Node Problem Detector (NPD) is an add-on that monitors abnormal events on cluster nodes and can connect to a third-party monitoring platform. It runs as a daemon on each node, collects node problems from different daemons, and reports them to the API server. The add-on can run as a DaemonSet or as a standalone daemon.
For more information, see node-problem-detector.
Notes and Constraints
- When using this add-on, do not format or partition node disks.
- Each NPD process uses 30m CPU (0.03 cores) and 100 MiB of memory.
- If the NPD version is 1.18.45 or later, the EulerOS version of the host machine must be 2.5 or later.
Permissions
To monitor kernel logs, the NPD add-on needs to read the host /dev/kmsg. Therefore, the privileged mode must be enabled. For details, see privileged.
In addition, CCE mitigates risks by following the principle of least privilege. NPD runs with only the following capabilities:
- cap_dac_read_search: permission to access /run/log/journal.
- cap_sys_admin: permission to access /dev/kmsg.
Installing the Add-on
- Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose Add-ons, locate CCE Node Problem Detector on the right, and click Install.
- On the Install Add-on page, configure the specifications as needed.
You can adjust the number of add-on instances and resource quotas as required. High availability is not possible with a single pod. If an error occurs on the node where the add-on instance runs, the add-on will fail.
- Configure the add-on parameters.
Maximum Number of Isolated Nodes in a Fault: specifies the maximum number of nodes that can be isolated when the same fault occurs on multiple nodes. This limit prevents an avalanche of isolated nodes. You can set it as a percentage or as a number of nodes.
- Configure deployment policies for the add-on pods.
- Scheduling policies do not take effect on add-on instances of the DaemonSet type.
- When configuring multi-AZ deployment or node affinity, ensure that there are nodes meeting the scheduling policy and that resources are sufficient in the cluster. Otherwise, the add-on cannot run.
Table 1 Configurations for add-on scheduling
Parameter | Description |
---|---|
Multi AZ | Preferred: Deployment pods of the add-on are preferentially scheduled to nodes in different AZs. If all the nodes in the cluster are in the same AZ, the pods are scheduled to different nodes in that AZ. Equivalent mode: Deployment pods of the add-on are evenly scheduled across the AZs in the cluster, so the number of add-on pods in any two AZs differs by at most 1. If a new AZ is added, you are advised to add add-on pods for cross-AZ HA deployment. If resources in one of the AZs are insufficient, pods cannot be scheduled to that AZ. Required: Deployment pods of the add-on are forcibly scheduled to nodes in different AZs, with at most one pod in each AZ. If the nodes in the cluster are not in different AZs, some add-on pods cannot run properly. If a node is faulty, add-on pods on it may fail to be migrated. |
Node Affinity | Not configured: Node affinity is disabled for the add-on. Node Affinity: Specify the nodes where the add-on is deployed. If you do not specify the nodes, the add-on is randomly scheduled based on the default cluster scheduling policy. Specified Node Pool Scheduling: Specify the node pool where the add-on is deployed. If you do not specify the node pool, the add-on is randomly scheduled based on the default cluster scheduling policy. Custom Policies: Enter the labels of the nodes where the add-on is to be deployed for more flexible scheduling policies. If you do not specify node labels, the add-on is randomly scheduled based on the default cluster scheduling policy. If multiple custom affinity policies are configured, ensure that the cluster has nodes that meet all of them. Otherwise, the add-on cannot run. |
Toleration | Tolerations allow (but do not force) the add-on Deployment to be scheduled to nodes with matching taints, and control how the Deployment is evicted after the node it runs on is tainted. By default, the add-on adds a toleration for each of the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints, with a toleration time window of 60s. For details, see Configuring Tolerance Policies. |
- Click Install.
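The default tolerations described in the Toleration row can be written out as data. The following is a minimal sketch, not the add-on's actual manifest: the two taint keys and the 60s window come from this page, while the `Exists` operator and the `NoExecute` effect are assumptions modeled on the tolerations that Kubernetes itself adds to pods. The function name `default_tolerations` is hypothetical.

```python
import json

# Taint keys and the 60s window are taken from the Toleration description.
DEFAULT_TAINT_KEYS = (
    "node.kubernetes.io/not-ready",
    "node.kubernetes.io/unreachable",
)
TOLERATION_SECONDS = 60


def default_tolerations(seconds=TOLERATION_SECONDS):
    """Build one Kubernetes toleration entry per default taint key.

    Assumption: operator "Exists" and effect "NoExecute" are illustrative;
    this page does not show the add-on's real pod specification.
    """
    return [
        {
            "key": key,
            "operator": "Exists",
            "effect": "NoExecute",
            "tolerationSeconds": seconds,
        }
        for key in DEFAULT_TAINT_KEYS
    ]


if __name__ == "__main__":
    # Print the entries in the shape used under spec.tolerations.
    print(json.dumps(default_tolerations(), indent=2))
```

With these entries, a pod would stay bound to a not-ready or unreachable node for 60 seconds before being evicted.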
Components
Component | Description | Resource Type |
---|---|---|
node-problem-controller | Isolate faults based on fault detection results. | Deployment |
node-problem-detector | Detect node faults. | DaemonSet |
NPD Check Items
Check items are supported only in 1.16.0 and later versions.
Check items cover events and statuses.
- Event-related
For event-related check items, when a problem occurs, NPD reports an event to the API server. The event type can be Normal (normal event) or Warning (abnormal event).
Table 3 Event-related check items
Check Item | Function | Description |
---|---|---|
OOMKilling | Listen to kernel logs to detect and report OOM events. Typical scenario: When the memory usage of a process in a container exceeds the limit, OOM is triggered and the process is terminated. | Warning event. Listening object: /dev/kmsg. Matching rule: "Killed process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB.*" |
TaskHung | Listen to kernel logs to detect and report taskHung events. Typical scenario: Suspended disk I/O causes a process to hang. | Warning event. Listening object: /dev/kmsg. Matching rule: "task \\S+:\\w+ blocked for more than \\w+ seconds\\." |
ReadonlyFilesystem | Listen to kernel logs to check whether the "Remounting filesystem read-only" error occurs in the system kernel. Typical scenario: A user detaches a data disk from a node by mistake on the ECS while applications keep writing data to the mount point of the data disk. As a result, an I/O error occurs in the kernel and the disk is remounted as read-only. NOTE: If the rootfs of node pods is of the device mapper type, detaching a data disk causes an error in the thin pool. This affects NPD, which can then no longer detect node faults. | Warning event. Listening object: /dev/kmsg. Matching rule: Remounting filesystem read-only |
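To see how the matching rules for the three event-related check items behave, the sketch below applies them to a few log lines. It assumes the doubled backslashes in the rules are an escaping layer (as in a JSON configuration file), so they are reduced to single backslashes here. The sample log lines and the helper name `classify` are hypothetical, written for illustration only.

```python
import re

# Matching rules from the event-related check items, with "\\" reduced to "\".
RULES = {
    "OOMKilling": r"Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.*",
    "TaskHung": r"task \S+:\w+ blocked for more than \w+ seconds\.",
    "ReadonlyFilesystem": r"Remounting filesystem read-only",
}
COMPILED = {name: re.compile(pattern) for name, pattern in RULES.items()}


def classify(line):
    """Return the first check item whose rule matches the line, else None."""
    for name, regex in COMPILED.items():
        if regex.search(line):
            return name
    return None


# Hypothetical kernel log lines, not captured from a real node.
SAMPLES = [
    "Killed process 4321 (java) total-vm:204800kB, anon-rss:102400kB, file-rss:0kB",
    "INFO: task jbd2/vda1-8:321 blocked for more than 120 seconds.",
    "EXT4-fs (vdb1): Remounting filesystem read-only",
    "usb 1-1: new high-speed USB device number 2",
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(classify(sample), "<-", sample)
```

The rules are searched anywhere in the line rather than anchored at its start, because kernel messages usually carry a prefix such as a timestamp or subsystem tag.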
- Status-related
For status-related check items, when a problem occurs, NPD reports an event to the API server and changes the node status synchronously. This function can be used together with Node-problem-controller fault isolation to isolate nodes.
If the check period is not specified in the following check items, the default period is 30 seconds.
Table 5 Checking system metrics
Check Item | Function | Description |
---|---|---|
Conntrack table full (ConntrackFullProblem) | Check whether the conntrack table is full. | Default threshold: 90%. Usage: nf_conntrack_count. Maximum value: nf_conntrack_max. |
Insufficient disk resources (DiskProblem) | Check the usage of the system disk and CCE data disks (including the CRI logical disk and kubelet logical disk) on the node. | Default threshold: 90%. Source: df -h. Currently, additional data disks are not supported. |
Insufficient file handles (FDProblem) | Check whether the file descriptors (FDs) are used up. | Default threshold: 90%. Usage: the first value in /proc/sys/fs/file-nr. Maximum value: the third value in /proc/sys/fs/file-nr. |
Insufficient node memory (MemoryProblem) | Check whether memory is used up. | Default threshold: 80%. Usage: MemTotal-MemAvailable in /proc/meminfo. Maximum value: MemTotal in /proc/meminfo. |
Insufficient process resources (PIDProblem) | Check whether PID process resources are exhausted. | Default threshold: 90%. Usage: nr_threads in /proc/loadavg. Maximum value: the smaller value between /proc/sys/kernel/pid_max and /proc/sys/kernel/threads-max. |
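The usage and maximum values listed for FDProblem, MemoryProblem, and PIDProblem can be turned into ratios as sketched below. This is an illustration under assumptions, not NPD's implementation: the functions take file contents as text so they run anywhere, the sample values are hypothetical, and treating "usage at or above the threshold" as a problem is an assumption about how the threshold is compared.

```python
def fd_usage(file_nr_text):
    """Ratio from /proc/sys/fs/file-nr: first value over third value."""
    fields = file_nr_text.split()
    return int(fields[0]) / int(fields[2])


def memory_usage(meminfo_text):
    """Ratio from /proc/meminfo: (MemTotal - MemAvailable) over MemTotal."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            values[key.strip()] = int(rest.split()[0])  # numeric part, in kB
    total = values["MemTotal"]
    return (total - values["MemAvailable"]) / total


def pid_usage(loadavg_text, pid_max, threads_max):
    """Ratio of the thread count in /proc/loadavg over min(pid_max, threads_max)."""
    # The fourth field has the form "running/total"; the total is the thread count.
    total_threads = int(loadavg_text.split()[3].split("/")[1])
    return total_threads / min(pid_max, threads_max)


def exceeds(ratio, threshold):
    """Assumed comparison: a ratio at or above the threshold counts as a problem."""
    return ratio >= threshold


# Hypothetical file contents for illustration.
SAMPLE_FILE_NR = "900 0 1000"
SAMPLE_MEMINFO = "MemTotal: 8000000 kB\nMemFree: 500000 kB\nMemAvailable: 1000000 kB\n"
SAMPLE_LOADAVG = "0.20 0.18 0.12 2/1200 11206"

if __name__ == "__main__":
    print("FD:", exceeds(fd_usage(SAMPLE_FILE_NR), 0.90))
    print("Memory:", exceeds(memory_usage(SAMPLE_MEMINFO), 0.80))
    print("PID:", exceeds(pid_usage(SAMPLE_LOADAVG, 4194304, 32768), 0.90))
```

On a real node, the text arguments would be read from the files named in the table, for example `open("/proc/meminfo").read()`.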
Table 7 Other check items
Check Item | Function | Description |
---|---|---|
Abnormal NTP (NTPProblem) | Check whether the node clock synchronization service (ntpd or chronyd) is running properly and whether the system time has drifted. | Default clock offset threshold: 8000 ms |
Process D error (ProcessD) | Check whether the node has processes in the D state (uninterruptible sleep). | Default threshold: 10 abnormal processes detected three consecutive times. Source: /proc/{PID}/stat. Alternatively, you can run the ps aux command. Exceptional scenario: The ProcessD check item ignores the resident D processes (heartbeat and update) on which the SDI driver on the BMS node depends. |
Process Z error (ProcessZ) | Check whether the node has processes in the Z state (zombie). | |
ResolvConf error (ResolvConfFileProblem) | Check whether the ResolvConf file is lost. Check whether the ResolvConf file is normal. Exceptional definition: No upstream domain name resolution server (nameserver) is included. | Object: /etc/resolv.conf |
Existing scheduled event (ScheduledEvent) | Check whether scheduled live migration events exist on the node. A scheduled live migration event is usually triggered by a hardware fault and is an automatic fault rectification method at the IaaS layer. Typical scenario: The host is faulty, for example, a fan is damaged or a disk has bad sectors. As a result, live migration is triggered for VMs. | Source: http://169.254.169.254/meta-data/latest/events/scheduled. This check item is an Alpha feature and is disabled by default. |
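Two of the checks in Table 7 read well-known files, and their core logic can be sketched briefly. The code below shows one way to extract the process state from the contents of /proc/{PID}/stat and to test whether resolv.conf text contains a nameserver entry. It is a sketch under assumptions, not NPD's code; the function names and sample strings are hypothetical.

```python
def process_state(stat_text):
    """Return the one-letter state from the contents of /proc/{PID}/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ")", so the state is the first field after the last ")".
    after_comm = stat_text[stat_text.rindex(")") + 1:]
    return after_comm.split()[0]


def count_in_state(stat_texts, state):
    """Count how many stat entries report the given state, such as "D" or "Z"."""
    return sum(1 for text in stat_texts if process_state(text) == state)


def has_nameserver(resolv_conf_text):
    """True if at least one "nameserver <address>" line is present."""
    for line in resolv_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop "#" comments
        if len(fields) >= 2 and fields[0] == "nameserver":
            return True
    return False


# Hypothetical file contents for illustration (stat lines are truncated).
SAMPLE_STATS = [
    "1234 (kworker/0:1) D 2 0 0 0",
    "42 (my (odd) name) Z 1 42 42 0",
    "77 (bash) S 1 77 77 0",
]
SAMPLE_RESOLV_OK = "# generated file\nsearch example.internal\nnameserver 10.0.0.2\n"
SAMPLE_RESOLV_BAD = "# nameserver 10.0.0.2\noptions timeout:1\n"

if __name__ == "__main__":
    print("D processes:", count_in_state(SAMPLE_STATS, "D"))
    print("Z processes:", count_in_state(SAMPLE_STATS, "Z"))
    print("nameserver present:", has_nameserver(SAMPLE_RESOLV_OK))
```

Splitting after the last closing parenthesis matters because a naive whitespace split would misread the state for any process whose name contains a space.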
The kubelet component has the following default check items, which have bugs or defects. You can fix them by upgrading the cluster or using NPD.
Table 8 Default kubelet check items
Check Item | Function | Description |
---|---|---|
Insufficient PID resources (PIDPressure) | Check whether PIDs are sufficient. | Interval: 10 seconds. Threshold: 90%. Defect: In community version 1.23.1 and earlier versions, this check item becomes invalid when over 65535 PIDs are used. For details, see issue 107107. In community version 1.24 and earlier versions, threads-max is not considered in this check item. |
Insufficient memory (MemoryPressure) | Check whether the allocatable memory for the containers is sufficient. | Interval: 10 seconds. Threshold: max. 100 MiB. Allocatable = Total memory of a node - Reserved memory of a node. Defect: This check item checks only the memory consumed by containers and does not consider memory consumed by other elements on the node. |
Insufficient disk resources (DiskPressure) | Check the disk usage and inode usage of the kubelet and Docker disks. | Interval: 10 seconds. Threshold: 90% |
Node-problem-controller Fault Isolation
Fault isolation is supported only by add-ons of 1.16.0 and later versions.
The open source NPD plug-in provides fault detection but not fault isolation. CCE adds the node-problem-controller (NPC) on top of the open source NPD. NPC is implemented based on the Kubernetes node controller. For faults reported by NPD, NPC automatically adds taints to nodes to isolate them.
By default, if multiple nodes become faulty, NPC adds taints to at most 10% of the nodes. You can set npc.maxTaintedNode to raise this limit.
Parameter | Description | Default |
---|---|---|
npc.enable | Whether to enable NPC. This parameter is not supported in 1.18.0 or later versions. | true |
npc.maxTaintedNode | The maximum number of nodes that NPC can add taints to when an individual fault occurs on multiple nodes, which minimizes the impact. The value can be in int or percentage format. | 10% Value range: |
npc.nodeAffinity | Node affinity of the controller | N/A |
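Because npc.maxTaintedNode accepts either an integer or a percentage, the limit it implies depends on the cluster size. The sketch below resolves such a value into a node count. It is an assumption-laden illustration: this page does not say how NPC rounds a fractional result, so the rounding direction is exposed as a parameter, and the function name `max_tainted_nodes` is hypothetical.

```python
def max_tainted_nodes(setting, total_nodes, round_up=False):
    """Resolve an int-or-percentage setting into a number of nodes.

    Assumption: the rounding of fractional results is not documented here,
    so the caller chooses between rounding down (default) and rounding up.
    """
    text = str(setting).strip()
    if text.endswith("%"):
        percent = int(text[:-1])
        scaled = total_nodes * percent
        if round_up:
            return -(-scaled // 100)  # integer ceiling division
        return scaled // 100          # integer floor division
    return int(text)


if __name__ == "__main__":
    for nodes in (8, 25, 50):
        print(nodes, "nodes ->", max_tainted_nodes("10%", nodes), "tainted at most")
```

Under the round-down assumption, the default of 10% would allow no taints at all in a cluster with fewer than 10 nodes, which is one reason to check the behavior on a small cluster before relying on it.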
Viewing NPD Events
Events reported by the NPD add-on can be queried on the Nodes page.
- Log in to the CCE console.
- Click the cluster name to access the cluster console. Choose Nodes in the navigation pane.
- Locate the row that contains the target node, and click View Events.
Figure 1 Viewing node events
Collecting Prometheus Metrics
The NPD daemon pod exposes Prometheus metric data on port 19901. By default, the NPD pod carries the annotation metrics.alpha.kubernetes.io/custom-endpoints: '[{"api":"prometheus","path":"/metrics","port":"19901","names":""}]'. You can build a Prometheus collector that discovers NPD pods and obtains their metrics from http://{{NpdPodIP}}:{{NpdPodPort}}/metrics.
If the NPD add-on version is earlier than 1.16.5, the exposed port of Prometheus metrics is 20257.
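A collector can derive the metrics URL from the pod annotation instead of hard-coding the port, which also covers the older port. The following is a minimal sketch of that step; the annotation key and default value are taken from this page, while the function name `metrics_urls` and the pod IP are hypothetical.

```python
import json

ANNOTATION_KEY = "metrics.alpha.kubernetes.io/custom-endpoints"
DEFAULT_ANNOTATION = '[{"api":"prometheus","path":"/metrics","port":"19901","names":""}]'


def metrics_urls(annotation_value, pod_ip):
    """Build one scrape URL per Prometheus endpoint listed in the annotation."""
    endpoints = json.loads(annotation_value)
    return [
        f"http://{pod_ip}:{endpoint['port']}{endpoint['path']}"
        for endpoint in endpoints
        if endpoint.get("api") == "prometheus"
    ]


if __name__ == "__main__":
    # 10.0.0.12 is a placeholder pod IP for illustration.
    print(metrics_urls(DEFAULT_ANNOTATION, "10.0.0.12"))
```

In a real collector, the annotation value would be read from the pod's metadata under `ANNOTATION_KEY` rather than from the constant used here.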
Currently, the metric data includes problem_counter and problem_gauge, as shown below.
    # HELP problem_counter Number of times a specific type of problem have occurred.
    # TYPE problem_counter counter
    problem_counter{reason="DockerHung"} 0
    problem_counter{reason="DockerStart"} 0
    problem_counter{reason="EmptyDirVolumeGroupStatusError"} 0
    ...
    # HELP problem_gauge Whether a specific type of problem is affecting the node or not.
    # TYPE problem_gauge gauge
    problem_gauge{reason="CNIIsDown",type="CNIProblem"} 0
    problem_gauge{reason="CNIIsUp",type="CNIProblem"} 0
    problem_gauge{reason="CRIIsDown",type="CRIProblem"} 0
    problem_gauge{reason="CRIIsUp",type="CRIProblem"} 0
    ...
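Once the metrics text has been fetched, a collector needs to turn it into structured samples. The sketch below parses labelled lines such as those in the sample output and lists the problems whose gauge is non-zero. It is a simplified parser written for this output shape only (it ignores unlabelled samples and escaped quotes in label values), and the sample text sets one gauge to 1 purely for illustration.

```python
import re

SAMPLE_LINE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse labelled samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match.group("labels")))
            samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def active_problems(samples):
    """Return the reasons of all problem_gauge samples with a non-zero value."""
    return sorted(
        labels.get("reason", "")
        for name, labels, value in samples
        if name == "problem_gauge" and value > 0
    )


# Hypothetical scrape result: CRIIsDown is set to 1 for illustration.
SAMPLE_TEXT = """\
# TYPE problem_counter counter
problem_counter{reason="DockerHung"} 0
# TYPE problem_gauge gauge
problem_gauge{reason="CNIIsDown",type="CNIProblem"} 0
problem_gauge{reason="CRIIsDown",type="CRIProblem"} 1
"""

if __name__ == "__main__":
    print(active_problems(parse_metrics(SAMPLE_TEXT)))
```

For production use, an existing Prometheus client or server would normally handle parsing; the hand-written parser here only makes the data shape explicit.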
Change History
Add-on Version | Supported Cluster Version | New Feature | Community Version |
---|---|---|---|
1.19.1 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29 | Fixed some issues. | |
1.18.46 | v1.21, v1.23, v1.25, v1.27, v1.28 | CCE clusters 1.28 are supported. | |
1.18.22 | v1.19, v1.21, v1.23, v1.25, v1.27 | None | |
1.18.14 | v1.19, v1.21, v1.23, v1.25 | | |
1.18.10 | v1.19, v1.21, v1.23, v1.25 | | |
1.17.4 | v1.17, v1.19, v1.21, v1.23, v1.25 | Optimized the DiskHung check item. | |
1.17.3 | v1.17, v1.19, v1.21, v1.23, v1.25 | | |
1.17.2 | v1.17, v1.19, v1.21, v1.23, v1.25 | | |
1.16.4 | v1.17, v1.19, v1.21, v1.23 | | |
1.16.3 | v1.17, v1.19, v1.21, v1.23 | Added a check for the ResolvConf configuration file. | |
1.16.1 | v1.17, v1.19, v1.21, v1.23 | | |
1.15.0 | v1.17, v1.19, v1.21, v1.23 | | |
1.14.11 | v1.17, v1.19, v1.21 | CCE clusters 1.21 are supported. | |
1.14.5 | v1.17, v1.19 | Fixed the issue that monitoring metrics could not be obtained. | |
1.14.4 | v1.17, v1.19 | | |
1.14.2 | v1.17, v1.19 | | |
1.13.8 | v1.15.11, v1.17 | | |
1.13.6 | v1.15.11, v1.17 | Fixed the issue that zombie processes were not reclaimed. | |
1.13.5 | v1.15.11, v1.17 | Added taint toleration configuration. | |
1.13.2 | v1.15.11, v1.17 | Added resource limits and enhanced the detection capability of the CNI add-on. | |