Updated on 2024-11-29 GMT+08:00

ALM-12089 Network Connections Between Nodes Are Abnormal

Alarm Description

The alarm module checks the network health status of nodes in the cluster every 10 seconds. This alarm is generated when the network between two nodes is unreachable or the network status is unstable.

This alarm is cleared when the network recovers.

Alarm Attributes

| Alarm ID | Alarm Severity | Alarm Type | Service Type | Auto Cleared |
| --- | --- | --- | --- | --- |
| 12089 | Major | Communications | FusionInsight Manager | Yes |

Alarm Parameters

| Type | Parameter | Description |
| --- | --- | --- |
| Location Information | Source | Specifies the cluster or system for which the alarm was generated. |
| Location Information | ServiceName | Specifies the service for which the alarm was generated. |
| Location Information | RoleName | Specifies the role for which the alarm was generated. |
| Location Information | HostName | Specifies the host for which the alarm was generated. |
| Additional Information | Trigger condition | Specifies the trigger condition of the alarm. |

Impact on the System

  • Data transmission becomes slow or interrupted. Data may be lost or incomplete.
  • Task scheduling is affected. For example, Yarn tasks cannot be executed properly or fail to be executed due to timeout.
  • Data processing is affected. For example, HDFS data synchronization fails or the data is inaccurate.
  • System performance deteriorates. The efficiency and quality of data processing are reduced.

Possible Causes

  • A node breaks down.
  • The network is faulty.

Handling Procedure

Check the network health status.

  1. In the alarm list on FusionInsight Manager, locate the row that contains the alarm, expand the row, and view the description in the additional information. Record the source IP address and destination IP address reported in the alarm.
  2. Log in to the node for which the alarm is reported. On that node, ping the destination node to check whether the network between the two nodes is normal.

    • If yes, go to 6.
    • If no, go to 3.
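
The ping check in the step above can be scripted when several source and destination pairs need to be verified. The following is a minimal sketch, assuming a Linux node whose `ping` accepts `-c` (packet count) and `-W` (reply timeout in seconds) and prints the usual "% packet loss" summary line. The function names, the packet count, and the example address are illustrative assumptions and are not part of the product.

```python
import re
import subprocess
from typing import Optional


def parse_packet_loss(ping_output: str) -> Optional[float]:
    """Return the packet-loss percentage from a Linux ping summary, or None if absent."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    return float(match.group(1)) if match else None


def check_reachable(dest_ip: str, count: int = 5, timeout_s: int = 2) -> bool:
    """Ping dest_ip and return True only if every packet was answered."""
    # Assumption: Linux iputils ping, where -c is the packet count and
    # -W is the per-reply timeout in seconds.
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), dest_ip],
        capture_output=True,
        text=True,
        check=False,
    )
    loss = parse_packet_loss(result.stdout)
    return result.returncode == 0 and loss == 0.0


# Example call (192.0.2.10 is a placeholder for the recorded destination IP):
# check_reachable("192.0.2.10")
```

Treating any packet loss as a failure is a deliberate choice here, because the alarm is also generated when the network status is unstable, not only when the destination is completely unreachable.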

Check the node status.

  3. On FusionInsight Manager, click Host and check whether the faulty node is missing from the host list, which indicates that it has been removed from the cluster.

    • If yes, go to 5.
    • If no, go to 4.

  4. Check whether the faulty node is powered off.

    • If yes, start the node and go to 2.
    • If no, contact the engineer in charge to locate the fault. If the faulty node needs to be removed from the cluster, go to 5. Otherwise, go to 6.

  5. Remove the faulty node from the $NODE_AGENT_HOME/etc/agent/hosts.ini file on all nodes in the cluster, clear the /var/log/Bigdata/unreachable/unreachable_ip_info.log file, and clear the alarm.
  6. Wait 30 seconds and check whether the alarm is automatically cleared.

    • If yes, no further action is required.
    • If no, go to 7.
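
Before editing hosts.ini on each node, it can help to list the lines that mention the faulty node's IP address. The layout of hosts.ini is not described in this document, so the sketch below does not modify anything. It only reports matching lines for manual review. The function name and the sample content are hypothetical.

```python
import re
from typing import List, Tuple


def lines_mentioning_ip(text: str, ip: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs that contain ip as a complete IPv4 address."""
    # The lookarounds reject matches embedded in a longer address,
    # for example 10.0.0.1 inside 10.0.0.11 or inside 110.0.0.1.
    pattern = re.compile(
        r"(?<!\d)(?<!\d\.)" + re.escape(ip) + r"(?!\d)(?!\.\d)"
    )
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]


# Example use on a node (read-only; the file format is an assumption):
# import os
# path = os.path.expandvars("$NODE_AGENT_HOME/etc/agent/hosts.ini")
# with open(path, encoding="utf-8") as handle:
#     for number, line in lines_mentioning_ip(handle.read(), "192.0.2.10"):
#         print(number, line)
```

Matching on the complete address avoids flagging healthy nodes whose IP addresses merely share a prefix with the faulty node.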

Collect fault information.

  7. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
  8. Expand the Service drop-down list, select OmmAgent for the target cluster, and click OK.
  9. Click the edit icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
  10. Contact O&M engineers and provide the collected logs.
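
The log collection window described above is the alarm generation time minus and plus 10 minutes. As a small worked example, the sketch below computes the two values to enter as Start Date and End Date. The function name and the sample timestamp are illustrative assumptions.

```python
from datetime import datetime, timedelta
from typing import Tuple


def log_window(alarm_time: datetime, margin_minutes: int = 10) -> Tuple[datetime, datetime]:
    """Return (start, end) covering margin_minutes before and after alarm_time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Example: an alarm generated shortly after midnight.
start, end = log_window(datetime(2024, 11, 29, 0, 3, 0))
print(start.isoformat(sep=" "), "to", end.isoformat(sep=" "))
```

Using datetime arithmetic rather than adjusting the minutes by hand keeps the window correct when it crosses an hour or date boundary, as in the example, where the start falls on the previous day.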

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None.