Updated on 2022-12-09 GMT+08:00

What Do I Do If an Edge Node Is Faulty?

Symptom

An edge node is in the Faulty state, and the fault cause is displayed when you hover the cursor over the status icon.

Figure 1 Node fault

Fault Locating

Locate the cause of the edge node fault as follows:

Table 1 Fault locating

Possible Cause | Solution
The edge node is shut down. | Edge Node Is Shut Down
A container engine fault occurs, for example, the container engine is not started or the container engine service is abnormal. | Local Container Engine of the Edge Node Is Abnormal
The node disk space is insufficient. | Container Disk Space of the Edge Node Is Insufficient, /opt/IEF Disk Space of the Edge Node Is Insufficient, or /var/IEF/sys/log Disk Space of the Edge Node Is Insufficient
The network connection of the edge node is abnormal. | Network Connection of the Edge Node Is Abnormal
The GPU driver is abnormal. | GPU Driver Is Abnormal
The NPU plug-in is abnormal. | NPU Plug-in Is Abnormal
The edgecore component installed on the edge node is abnormal. | edgecore Is Abnormal
The edge node enters the recovery mode after being forcibly powered off and then powered on. | System Enters the Recovery Mode

Edge Node Is Shut Down

When the edge node is shut down, it cannot report its status to IEF. In this case, IEF determines that the edge node is faulty. Therefore, keep the edge node running.

You are billed for the number of edge applications, not the number of edge nodes. If an edge node is faulty, the edge applications deployed on it still incur charges even if they are in the abnormal state. Therefore, if you temporarily do not need the services, delete the corresponding applications from IEF instead of stopping the edge node.

Local Container Engine of the Edge Node Is Abnormal

The startup and running of the IEF core component (edgecore) depend on the container engine. Therefore, if the container engine is abnormal, the edgecore component cannot be started.

Solution

  1. Run docker version to check whether the container engine is normal. If the container engine is abnormal, run systemctl restart docker to restart it.
  2. Run docker ps to check whether the container engine is available. If the container engine is not available, restart or reinstall it.
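The two checks above can be combined into one small script. The following is a minimal sketch, assuming Docker is the container engine (as in the commands above). The function name check_engine and its return codes are hypothetical and not part of IEF.

```shell
#!/bin/sh
# check_engine: hypothetical helper, not part of IEF.
# Returns 0 if both "docker version" and "docker ps" succeed,
# 1 if "docker version" fails, 2 if "docker ps" fails.
check_engine() {
    if ! docker version >/dev/null 2>&1; then
        echo "docker version failed: try 'systemctl restart docker'"
        return 1
    fi
    if ! docker ps >/dev/null 2>&1; then
        echo "docker ps failed: restart or reinstall the container engine"
        return 2
    fi
    echo "container engine is available"
}
```

After sourcing the script on the edge node, run check_engine and follow the printed hint.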

Do not forcibly power off the edge node. Otherwise, data files on the edge node may be lost or damaged, which can cause node faults.

Container Disk Space of the Edge Node Is Insufficient

Solution

  1. Log in to the edge node. Run the following command to check the usage of the disk mounted to the container running on the edge node:

    df -h

  2. Delete unnecessary files to release the disk space.

    rm {File name}

/opt/IEF Disk Space of the Edge Node Is Insufficient

Solution

  1. Log in to the edge node. Run the following command to check the usage of the disk space allocated to /opt/IEF:

    df -h

  2. Delete unnecessary files to release the disk space.

    rm {File name}

/var/IEF/sys/log Disk Space of the Edge Node Is Insufficient

Solution

  1. Log in to the edge node. Run the following command to check the usage of the disk space allocated to /var/IEF/sys/log:

    df -h

  2. Delete unnecessary files to release the disk space.

    rm {File name}
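To spot a nearly full disk without reading the df output by eye, the output can be filtered. The following is a minimal sketch: the function name over_threshold and the 90% default limit are assumptions for illustration, not IEF thresholds. It relies on the POSIX output format of df -P, where the fifth column is the use percentage and the sixth is the mount point.

```shell
#!/bin/sh
# over_threshold: hypothetical helper, not part of IEF.
# Reads "df -P" output on stdin and prints every mount point whose
# use percentage is at or above the limit given as $1 (default 90).
over_threshold() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage on an edge node (paths as named in the sections above):
#   df -P /opt/IEF /var/IEF/sys/log | over_threshold 85
```

An empty result means that none of the checked file systems has reached the limit.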

Network Connection of the Edge Node Is Abnormal

Identification Method

  1. Run the following command on the edge node to obtain the address (domain name) for accessing IEF:

    cat /opt/IEF/Edge-core/conf/edge.yaml | grep ws-url

    Information similar to the following is displayed:

    ws-url: wss://ief2-edgeaccess.cn-north-4.myhuaweicloud.com:443/

    In the preceding command output, ief2-edgeaccess.cn-north-4.myhuaweicloud.com is the required address. The address varies by region. For a platinum service instance, the address has a format similar to 1fc0704e-229c-4210-9802-75f66aeffe3d.cn-north-4.huaweiief.com. You can also view the address (Access Domain) on the IEF console.

    Figure 2 Viewing the cloud access domain name
  2. Run the curl command to check whether the edge node can connect to IEF.

    curl -i -v -k https://ief2-edgeaccess.cn-north-4.myhuaweicloud.com

    • If no command output is displayed, the network between the edge node and IEF is disconnected.
    • If the information similar to the following is displayed, the network connection is normal:
      * About to connect() to ief2-edgeaccess.cn-north-4.myhuaweicloud.com port 443 (#0)
      *   Trying 49.4.115.239...
      * Connected to ief2-edgeaccess.cn-north-4.myhuaweicloud.com (*.*.*.*) port 443 (#0)
      * Initializing NSS with certpath: sql:/etc/pki/nssdb
      * skipping SSL peer certificate verification
      * NSS: client certificate not found (nickname not specified)
      * SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
      * Server certificate:
      * subject: OID.1.1.1.4=42701fe87611496e80c824778c9857ca,OID.1.1.1.3=op_svc_ief_container1:88125631e95e4d3fbdfa7e6ced0f9dd4,OID.1.1.1.2=cn-north-4:42701fe8761
      1496e80c824778c9857ca:op_cfe_kubelet,OID.1.1.1.1=op_svc_ief_container1,CN=paas.placement.certs.secret OSS3.0 CA,OU=OSS & Service Tools Dept,O="Huawei Technologies 
      Co., Ltd",L=ShenZhen,ST=GuangDong,C=CN
      * start date: Apr 29 16:00:00 2019 GMT
      * expire date: Apr 29 16:00:00 2049 GMT
      * common name: paas.placement.certs.secret OSS3.0 CA
      > GET / HTTP/1.1
      .....
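The two identification steps can be chained so that the address does not have to be copied by hand. The following is a minimal sketch, assuming that the ws-url line in edge.yaml has the form shown in the sample output above. The function name ws_host is hypothetical and not part of IEF.

```shell
#!/bin/sh
# ws_host: hypothetical helper, not part of IEF.
# Reads edge.yaml content on stdin and prints the host part of the
# first ws-url entry (the text between "wss://" and the port or path).
ws_host() {
    sed -n 's|^[[:space:]]*ws-url:[[:space:]]*wss://\([^:/]*\).*|\1|p' | head -n 1
}

# Usage on an edge node:
#   host=$(ws_host < /opt/IEF/Edge-core/conf/edge.yaml)
#   curl -i -v -k "https://$host"
```

If ws_host prints nothing, the file does not contain a ws-url line in the assumed format, and the address should be taken from the IEF console instead.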

Possible Causes and Solutions

  1. The domain name resolution is abnormal.

    Run the following command to check whether the domain name can be resolved:

    ping ief2-edgeaccess.cn-north-4.myhuaweicloud.com

    If the domain name cannot be resolved into an IP address, run the following command to check whether the DNS server configuration was modified:

    cat /etc/resolv.conf

    Solution:

    • Configure a correct DNS server. The DNS server with IP address 114.114.114.114 is recommended.
    • Obtain the correct IP address that the domain name resolves to, and add it to the hosts file (/etc/hosts) as a temporary workaround.
  2. A proxy problem occurs.

    If the proxy mode is used, check whether the proxy is correctly configured.

    • Check whether a proxy is configured for the edge node.

      Run the following commands:

      env | grep proxy

      env | grep PROXY

    • Check whether a proxy is configured for edgecore.

      Run the following command:

      cat /opt/IEF/Cert/user_config | grep PROXY

    If the proxy mode is not used, run the preceding commands to check that no proxy is configured.

  3. The network connection is not stable.

    Check whether the network connection of the edge node is normal and stable. If the network connection is unstable, the edge node status switches between Faulty and Running.
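The DNS and proxy checks above can be scripted. The following is a minimal sketch with hypothetical function names. It swaps in getent hosts for ping as the resolution test, on the assumption that the edge node runs a glibc-based Linux distribution where getent is available. It inspects only environment variables, so the edgecore proxy setting in /opt/IEF/Cert/user_config still has to be checked separately.

```shell
#!/bin/sh
# Hypothetical helpers, not part of IEF.

# resolves: succeeds if the system resolver returns an address for $1.
# getent consults /etc/hosts as well as the servers in /etc/resolv.conf.
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# proxy_vars: prints every proxy-related environment variable,
# matching lowercase and uppercase names in one pass.
proxy_vars() {
    env | grep -i '^[a-z_]*proxy='
}

# Usage on an edge node:
#   resolves ief2-edgeaccess.cn-north-4.myhuaweicloud.com || cat /etc/resolv.conf
#   proxy_vars || echo "no proxy variables set"
```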

GPU Driver Is Abnormal

Solution

  1. Install a GPU driver.

    Currently, IEF supports only NVIDIA Tesla P4, P40, and T4 GPUs and the GPU drivers that match CUDA Toolkit 8.0 to 11.0.

    1. Download the GPU driver. The recommended driver link is as follows:

      https://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/tesla/440.33.01/NVIDIA-Linux-x86_64-440.33.01.run&lang=us&type=Tesla

    2. Run the following command to install the GPU driver:

      bash NVIDIA-Linux-x86_64-440.33.01.run

    3. Run the following command to check the GPU driver installation status:

      nvidia-smi

  2. Copy GPU driver files to specific directories.

    1. Log in to the edge node as user root.
    2. Run the following command:

      nvidia-modprobe -c0 -u

    3. Create directories.

      mkdir -p /var/IEF/nvidia/drivers /var/IEF/nvidia/bin /var/IEF/nvidia/lib64

    4. Copy GPU driver files to the directories.
      • For CentOS, run the following commands in sequence to copy the driver files:

        cp /lib/modules/{Kernel version of the current environment}/kernel/drivers/video/nvi* /var/IEF/nvidia/drivers/

        cp /usr/bin/nvidia-* /var/IEF/nvidia/bin/

        cp -rd /usr/lib64/libcuda* /var/IEF/nvidia/lib64/

        cp -rd /usr/lib64/libEG* /var/IEF/nvidia/lib64/

        cp -rd /usr/lib64/libGL* /var/IEF/nvidia/lib64/

        cp -rd /usr/lib64/libnv* /var/IEF/nvidia/lib64/

        cp -rd /usr/lib64/libOpen* /var/IEF/nvidia/lib64/

        cp -rd /usr/lib64/libvdpau_nvidia* /var/IEF/nvidia/lib64/

        cp -rd /usr/lib64/vdpau /var/IEF/nvidia/lib64/

      • For Ubuntu, run the following commands in sequence to copy the driver files:

        cp /lib/modules/{Kernel version of the current environment}/kernel/drivers/video/nvi* /var/IEF/nvidia/drivers/

        cp /usr/bin/nvidia-* /var/IEF/nvidia/bin/

        cp -rd /usr/lib/x86_64-linux-gnu/libcuda* /var/IEF/nvidia/lib64/

        cp -rd /usr/lib/x86_64-linux-gnu/libEG* /var/IEF/nvidia/lib64/

        cp -rd /usr/lib/x86_64-linux-gnu/libGL* /var/IEF/nvidia/lib64/

        cp -rd /usr/lib/x86_64-linux-gnu/libnv* /var/IEF/nvidia/lib64/

        cp -rd /usr/lib/x86_64-linux-gnu/libOpen* /var/IEF/nvidia/lib64/

        cp -rd /usr/lib/x86_64-linux-gnu/libvdpau_nvidia* /var/IEF/nvidia/lib64/

        cp -rd /usr/lib/x86_64-linux-gnu/vdpau /var/IEF/nvidia/lib64/

      You can run the uname -r command to view the kernel version of the current environment, for example, 3.10.0-514.el7.x86_64. Replace {Kernel version of the current environment} in the preceding commands with the actual value.

      # uname -r
      3.10.0-514.el7.x86_64
    5. Run the following command to change the directory permissions:

      chmod -R 755 /var/IEF
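Because the CentOS and Ubuntu command lists differ only in the library directory, they can be generated from the OS ID and the kernel version. The following is a minimal sketch with hypothetical function names. It only prints the copy commands for review and does not run them. It assumes that the ID field in /etc/os-release is centos or ubuntu.

```shell
#!/bin/sh
# Hypothetical helpers, not part of IEF. They only print commands.

# lib_dir: maps an OS ID (as in /etc/os-release) to the library
# directory used in the copy commands above.
lib_dir() {
    case "$1" in
        centos) echo /usr/lib64 ;;
        ubuntu) echo /usr/lib/x86_64-linux-gnu ;;
        *) echo "unsupported OS: $1" >&2; return 1 ;;
    esac
}

# print_copy_cmds OS_ID KERNEL_VERSION: prints the nine copy commands.
print_copy_cmds() {
    dir=$(lib_dir "$1") || return 1
    echo "cp /lib/modules/$2/kernel/drivers/video/nvi* /var/IEF/nvidia/drivers/"
    echo "cp /usr/bin/nvidia-* /var/IEF/nvidia/bin/"
    for pat in 'libcuda*' 'libEG*' 'libGL*' 'libnv*' 'libOpen*' 'libvdpau_nvidia*' vdpau; do
        echo "cp -rd $dir/$pat /var/IEF/nvidia/lib64/"
    done
}

# Usage on an edge node:
#   . /etc/os-release && print_copy_cmds "$ID" "$(uname -r)"
```

Printing first makes it easy to confirm that the kernel path exists before anything is copied.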

NPU Plug-in Is Abnormal

  1. Log in to the edge node.
  2. Run the following command to check whether the NPU driver container runs properly:

    docker ps -a | grep npu

  3. If the container is not in the Running state, restart it.

    docker restart {container_name}

    {container_name} indicates the container name.
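To list only the NPU containers that need a restart, the docker ps output can be filtered on the status column. The following is a minimal sketch. The function name not_running is hypothetical, and it assumes that a running container reports a status beginning with "Up", which is how docker ps prints it.

```shell
#!/bin/sh
# not_running: hypothetical helper, not part of IEF.
# Reads "NAME STATUS..." lines on stdin and prints the names of
# containers whose status does not begin with the word "Up".
not_running() {
    awk '$2 != "Up" { print $1 }'
}

# Usage on an edge node:
#   docker ps -a --filter name=npu --format '{{.Names}} {{.Status}}' | not_running
```

Each printed name can then be passed to docker restart as described in step 3.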

edgecore Is Abnormal

Check whether the edgecore status is normal.

systemctl status edgecore

If the edgecore component is faulty, the possible causes are as follows:

  • Port 8883 or 1883 is occupied.

    Check whether port 8883 or 1883 of your edge node is occupied. If port 8883 or 1883 is occupied, release the port and run the systemctl restart edgecore command to restore edgecore.

  • The container engine is abnormal.

    Run systemctl status docker to check whether the container engine is normal. If the container engine is abnormal, run systemctl restart docker to restart it.

  • The firewall blocks the port. For details, see Port 8883 Is Disabled by the Firewall.
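One way to check whether port 8883 or 1883 is occupied is to filter the list of listening TCP sockets. The following is a minimal sketch, assuming that the ss utility from iproute2 is installed on the edge node. The function name mqtt_ports_in_use is hypothetical and not part of IEF.

```shell
#!/bin/sh
# mqtt_ports_in_use: hypothetical helper, not part of IEF.
# Reads "ss -tln" (or "ss -tlnp") output on stdin and prints each
# listening socket whose local address ends in port 8883 or 1883.
mqtt_ports_in_use() {
    awk '$1 == "LISTEN" && $4 ~ /:(8883|1883)$/ { print }'
}

# Usage on an edge node (run as root so that -p can show the owning process):
#   ss -tlnp | mqtt_ports_in_use
```

An empty result means that nothing is listening on either port.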

System Enters the Recovery Mode

If an edge node is forcibly powered off and then powered on, there is a possibility that the system enters the recovery mode. Check whether the /opt/IEF directory is normal. If any file in this directory is lost, the edge node will be faulty.

The /opt/IEF directory is abnormal if any of the following errors occurs:

  • The systemctl status edgecore command output indicates that the edgecore status is abnormal, and the systemctl restart edgecore command output indicates that the edgecore service does not exist.
  • The systemctl status edgelogger command output indicates that the edgelogger status is abnormal, and the systemctl restart edgelogger command output indicates that the edgelogger service does not exist.
  • The systemctl status edgemonitor command output indicates that the edgemonitor status is abnormal, and the systemctl restart edgemonitor command output indicates that the edgemonitor service does not exist.
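The three service checks above can be run in one loop. The following is a minimal sketch with a hypothetical function name. It swaps in systemctl is-active for systemctl status because is-active returns a plain exit code. It does not distinguish a stopped service from a missing one, so confirm a missing service with systemctl restart as described above.

```shell
#!/bin/sh
# check_ief_services: hypothetical helper, not part of IEF.
# Prints one line per service and returns 1 if any is not active.
check_ief_services() {
    failed=0
    for svc in edgecore edgelogger edgemonitor; do
        if systemctl is-active --quiet "$svc"; then
            echo "$svc: active"
        else
            echo "$svc: not active"
            failed=1
        fi
    done
    return "$failed"
}
```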

Solution

Start your edge node in normal mode. If an edge node is powered off abnormally, files on the edge node may be damaged or lost. Therefore, do not forcibly power off the edge node. If this fault occurs, submit a service ticket.