ALM-12089 Inter-Node Network Is Abnormal

Description

The alarm module checks the network health status of nodes in the cluster every 10 seconds. This alarm is generated when the network between two nodes is unreachable or the network status is unstable.

Attribute

Alarm ID	Alarm Severity	Auto Clear
12089	Major	Yes

Parameters

Name	Meaning
Source	Specifies the cluster or system for which the alarm is generated.
ServiceName	Specifies the service for which the alarm is generated.
RoleName	Specifies the role for which the alarm is generated.
HostName	Specifies the host for which the alarm is generated.

Impact on the System

Data transmission becomes slow or interrupted. Data may be lost or incomplete.
Task scheduling is affected. For example, Yarn tasks cannot be executed properly or fail to be executed due to timeout.
Data processing is affected. For example, HDFS data synchronization fails or the data is inaccurate.
System performance deteriorates. The efficiency and quality of data processing decrease.

Possible Causes

The node breaks down.
The network is faulty.

Procedure

Check the network health status.

1. In the alarm list on FusionInsight Manager, click the drop-down button of the alarm and view Additional Information. Record the source IP address and destination IP address of the node for which the alarm is reported.
2. Log in to the node for which the alarm is reported. On the node, ping the target node to check whether the network between the two nodes is normal.
   - If yes, go to 6.
   - If no, go to 3.
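
The ping check above can also be scripted. The following is a minimal sketch, assuming a Linux node whose ping command accepts the iputils options -c (number of requests) and -W (reply timeout in seconds). The function names and the default count and timeout are illustrative and are not part of FusionInsight Manager.

```python
import subprocess


def build_ping_command(dest_ip, count=3, timeout_s=2):
    """Build the ping command line (assumes Linux iputils ping options)."""
    # -c: number of echo requests to send
    # -W: seconds to wait for each reply
    return ["ping", "-c", str(count), "-W", str(timeout_s), dest_ip]


def is_reachable(dest_ip, run=subprocess.run):
    """Return True if ping exits with status 0, that is, a reply was received.

    `run` is injectable so that the check can be exercised without a network.
    """
    result = run(build_ping_command(dest_ip), capture_output=True, text=True)
    return result.returncode == 0
```

For example, calling is_reachable with the destination IP address recorded from Additional Information returns True when the target node answers. A single successful reply does not rule out an unstable network, so repeating the check or reviewing the packet loss reported by ping may still be worthwhile.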

Check the node status.

3. On FusionInsight Manager, click Host and check whether the faulty node is in the host list to determine whether it has been removed from the cluster.
   - If the faulty node is not in the host list (it has been removed from the cluster), go to 5.
   - If the faulty node is still in the host list, go to 4.
4. Check whether the faulty node is powered off.
   - If yes, start the faulty node and go to 2.
   - If no, contact the related personnel to locate the root cause. If the faulty node needs to be removed from the cluster, remove it and go to 5. Otherwise, go to 6.
5. Delete the $NODE_AGENT_HOME/etc/agent/hosts.ini file on all nodes in the cluster, clean up the /var/log/Bigdata/unreachable/unreachable_ip_info.log file, and then manually clear the alarm.
6. Wait 30 seconds and check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 7.
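
The file cleanup in step 5 could be sketched as follows for a single node. This is an illustration under stated assumptions, not a supported tool: the function name and the dry_run parameter are hypothetical, and "clean up" is interpreted here as emptying the log file rather than deleting it. The sketch must be run on each node by a user who has permission to modify these files.

```python
from pathlib import Path

# Log path taken from the procedure above.
UNREACHABLE_LOG = Path("/var/log/Bigdata/unreachable/unreachable_ip_info.log")


def cleanup_stale_host_info(agent_home, unreachable_log=UNREACHABLE_LOG, dry_run=True):
    """Remove hosts.ini and empty the unreachable-IP log on the local node.

    agent_home is the value of $NODE_AGENT_HOME on this node.
    Returns the actions taken, or the actions that would be taken if dry_run is True.
    """
    actions = []

    hosts_ini = Path(agent_home) / "etc" / "agent" / "hosts.ini"
    if hosts_ini.exists():
        actions.append(f"remove {hosts_ini}")
        if not dry_run:
            hosts_ini.unlink()

    unreachable_log = Path(unreachable_log)
    if unreachable_log.exists():
        # Assumption: "clean up" means emptying the file, not deleting it.
        actions.append(f"truncate {unreachable_log}")
        if not dry_run:
            unreachable_log.write_text("")

    return actions
```

Running the function with the default dry_run=True only lists the planned actions, which allows the paths to be reviewed before anything is changed. The alarm still has to be cleared manually on FusionInsight Manager afterwards.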

Collect fault information.

7. On the FusionInsight Manager portal, choose O&M > Log > Download.
8. For Service, select OmmAgent and click OK.
9. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
10. Contact the O&M personnel and send the collected log information.
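
The log collection window described above is simple date arithmetic. The following minimal sketch computes the Start Date and End Date values from the alarm generation time. The function name is hypothetical and the timestamp in the usage note is only an example.

```python
from datetime import datetime, timedelta


def log_collection_window(alarm_time, margin_minutes=10):
    """Return (start, end): margin_minutes before and after the alarm time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin
```

For an alarm generated at 2024-01-15 14:22:00, log_collection_window(datetime(2024, 1, 15, 14, 22)) returns 14:12:00 and 14:32:00 on the same day as the start and end of the window.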