System Check

Scenario

System Steward consists of system check and system hardening. This topic describes the system check function.

System check detects faults or exceptions on nodes in real time.

Prerequisites

  • Before using the system check function, you must install the npd add-on, which is used to detect node exceptions.
  • Before using the system check function, you must install the prometheus add-on, which is used to obtain abnormal metrics reported by the npd add-on.
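The console prompts you if either add-on is missing, but the same prerequisite can also be checked from a terminal. The following is a minimal sketch, not an official procedure. It assumes kubectl is configured for the cluster and that the add-on pod names contain "npd", "node-problem-detector", or "prometheus"; the actual pod names depend on the add-on version. The function name list_addon_pods and the KUBECTL variable are hypothetical helpers introduced only for this example.

```shell
# Sketch: list pods whose names suggest the npd or prometheus add-on.
# ASSUMPTION: the name patterns below are guesses, not documented pod names.
KUBECTL="${KUBECTL:-kubectl}"

list_addon_pods() {
  # --all-namespaces avoids guessing which namespace the add-ons run in.
  "$KUBECTL" get pods --all-namespaces --no-headers 2>/dev/null \
    | grep -E -i 'npd|node-problem-detector|prometheus' \
    || echo "no matching add-on pods found"
}
```

Run list_addon_pods after sourcing the snippet. An empty match only means that no pod name fits the assumed patterns, so confirm the result on the add-on page of the console.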

Procedure

  1. Log in to the CCE console. In the navigation pane on the left, choose System Steward > System Check.
  2. In the left pane of the System Check page, select the node that you want to check. The Indicator Check, Behavior Statistics, and Kubernetes Events tab pages are displayed.

    Required add-ons have not been installed:

    If the npd and prometheus add-ons are not installed, install them as prompted.

    After the add-ons are installed, choose System Steward > System Check again to view the check information.
    Figure 1 Installing add-ons required for system check

    Required add-ons have been installed:

    If the add-ons have been installed, you can click the Indicator Check, Behavior Statistics, and Kubernetes Events tabs to view the system check information.

    Figure 2 Viewing the system check information

  3. On the Indicator Check tab page, view system resources, system components, abnormal behaviors, and other information, and then perform operations as prompted.

    Table 1 System check items

    Check Item          Check Sub-item                              Description
    System resources    Disk                                        Node disk usage.
                        Memory                                      Node memory usage.
                        PID                                         Node PID usage.
    System components   CNI                                         CNI component running status.
                        Docker                                      Docker component running status.
                        kubelet                                     kubelet component running status.
                        kube-proxy                                  kube-proxy component running status.
                        NTP                                         NTP component running status.
    Abnormal behavior   Frequent containerd restart                 containerd restarts frequently.
                        Frequent Docker restart                     Docker restarts frequently.
                        Frequent kubelet restart                    kubelet restarts frequently.
                        Frequent deregistration of network devices  Network devices, such as network adapters, are frequently deregistered.
    Others              Ready                                       Whether the node status is Ready.

  4. Click the Behavior Statistics tab to view the behavior information and the number of behavior occurrences.
  5. Click the Kubernetes Events tab to view the event name, event type, number of occurrences, Kubernetes event details, first occurrence time, and last occurrence time of each event on the node.

    Event data will be retained for 1 hour and then automatically deleted.
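The node status and node events shown on the console can also be inspected with kubectl. The sketch below assumes that kubectl is configured for the cluster and that the npd add-on reports problems as node conditions and Kubernetes events, as the upstream node-problem-detector project does; whether the console shows exactly the same data is not confirmed here. The function names and the KUBECTL variable are hypothetical.

```shell
# Sketch: inspect a node's conditions and events from the command line.
# ASSUMPTION: problems detected on the node appear as conditions and events.
KUBECTL="${KUBECTL:-kubectl}"

# Print each condition of the node as TYPE=STATUS, one per line.
node_conditions() {
  "$KUBECTL" get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'
}

# List events whose involved object is the given node, sorted by last occurrence.
node_events() {
  "$KUBECTL" get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}
```

For example, node_conditions my-node prints lines such as Ready=True, and node_events my-node lists only the events that the cluster still retains.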

Recovery Suggestion

  • If system resources are insufficient, expand the resources of the node or raise the relevant kernel parameter limits. If the node still cannot be recovered, isolate it: add a taint so that no new pods are scheduled to the node, or evict the pods that are running on it.
  • A taint can also be added if a system component is abnormal or another exception occurs.
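The isolation described in these suggestions can be sketched with kubectl taint. This is an illustration under stated assumptions: the taint key node-problem and value isolated are hypothetical, as are the function names and the KUBECTL variable. Choose a key that fits your own conventions.

```shell
# Sketch: isolate a faulty node with a taint.
# ASSUMPTION: the taint key/value "node-problem=isolated" is hypothetical.
KUBECTL="${KUBECTL:-kubectl}"

# NoSchedule keeps new pods off the node; pods already running stay.
isolate_node() {
  "$KUBECTL" taint nodes "$1" node-problem=isolated:NoSchedule
}

# NoExecute also evicts running pods that do not tolerate the taint.
evict_from_node() {
  "$KUBECTL" taint nodes "$1" node-problem=isolated:NoExecute
}

# A trailing "-" after the key removes all taints with that key.
release_node() {
  "$KUBECTL" taint nodes "$1" node-problem-
}
```

A reasonable order is isolate_node first, evict_from_node only if the workloads must leave the node, and release_node once the node has recovered.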

Reference

  • Adding a taint to a node: Taints and Tolerations
  • Safe eviction: Safely Drain a Node while Respecting the PodDisruptionBudget
  • The following three commands can be used to smoothly migrate services from one node to another during node maintenance without affecting the services:
    Table 2 Marking a node as schedulable or unschedulable

    Command    Function                                                         Usage
    cordon     Mark the node as unschedulable.                                  kubectl cordon <node-name>
    uncordon   Mark the node as schedulable.                                    kubectl uncordon <node-name>
    drain      Mark the node as unschedulable and evict the pods on the node.   kubectl drain <node-name>
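The commands in the table can be combined into a simple maintenance flow. The sketch below is one possible sequence, not a prescribed procedure. The function names and the KUBECTL variable are hypothetical, and the extra drain flags are assumptions about what a typical node needs: --ignore-daemonsets lets the drain proceed when DaemonSet-managed pods are present, and --delete-emptydir-data (available in kubectl 1.20 and later) allows eviction of pods that use emptyDir volumes, whose data is lost. The 300-second timeout is an arbitrary example value.

```shell
# Sketch: drain a node before maintenance and restore it afterwards.
# ASSUMPTION: the flags and the 300s timeout are illustrative choices.
KUBECTL="${KUBECTL:-kubectl}"

# drain marks the node unschedulable and then evicts its pods,
# so a separate cordon is not required first.
drain_for_maintenance() {
  "$KUBECTL" drain "$1" --ignore-daemonsets --delete-emptydir-data --timeout=300s
}

# Make the node schedulable again once maintenance is complete.
finish_maintenance() {
  "$KUBECTL" uncordon "$1"
}
```

Call drain_for_maintenance with the node name, perform the maintenance, and then call finish_maintenance. If the drain is blocked by a PodDisruptionBudget, kubectl retries the eviction until the timeout expires.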