ALM-12089 Inter-Node Network Is Abnormal
Alarm Description
The alarm module checks the network health status of nodes in the cluster every 10 seconds. This alarm is generated when the network between two nodes is unreachable or the network status is unstable.
This alarm is cleared when the network recovers.
Alarm Attributes
Alarm ID | Alarm Severity | Auto Cleared |
---|---|---|
12089 | Major | Yes |
Alarm Parameters
Parameter | Description |
---|---|
Source | Specifies the cluster or system for which the alarm is generated. |
ServiceName | Specifies the service for which the alarm is generated. |
RoleName | Specifies the role for which the alarm is generated. |
HostName | Specifies the host for which the alarm is generated. |
Impact on the System
- Data transmission becomes slow or interrupted. Data may be lost or incomplete.
- Task scheduling is affected. For example, Yarn tasks cannot be executed properly or fail to be executed due to timeout.
- Data processing is affected. For example, HDFS data synchronization fails or the data is inaccurate.
- System performance deteriorates. The efficiency and quality of data processing decrease.
Possible Causes
- The node breaks down.
- The network is faulty.
Handling Procedure
Check the network health status.
- In the alarm list on FusionInsight Manager, locate the row that contains the alarm and click the icon in that row to view the description in the additional information. Record the source IP address and destination IP address of the node for which the alarm is reported.
- Log in to the node for which the alarm is generated. On that node, ping the destination node to check whether the network between the two nodes is normal.
For details about how to log in to a cluster node, see Logging In to an MRS Cluster Node.
ping IP address of the destination node
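The ping check above can be scripted when several node pairs need to be tested. The following is a minimal sketch, not part of the product: it assumes a Linux node with the iputils `ping` command (`-c` for packet count, `-W` for reply timeout in seconds), and the function names and the three-way classification are illustrative choices.

```python
import re
import subprocess
from typing import List, Optional


def build_ping_command(dest_ip: str, count: int = 4, timeout_s: int = 2) -> List[str]:
    """Build a Linux iputils-style ping command (assumed to be available on the node)."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), dest_ip]


def parse_packet_loss(ping_output: str) -> Optional[float]:
    """Extract the packet-loss percentage from the ping summary line, if present."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    return float(match.group(1)) if match else None


def classify_loss(loss: Optional[float]) -> str:
    """Map packet loss to a coarse status (illustrative thresholds, not product logic)."""
    if loss is None or loss >= 100.0:
        return "unreachable"
    return "normal" if loss == 0.0 else "unstable"


def check_link(dest_ip: str) -> str:
    """Ping dest_ip from the current node and classify the result."""
    result = subprocess.run(
        build_ping_command(dest_ip), capture_output=True, text=True, check=False
    )
    return classify_loss(parse_packet_loss(result.stdout))
```

Run `check_link()` on the source node with the destination IP address recorded from the alarm details. A result other than "normal" suggests that the network between the two nodes needs to be checked.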
Check the node status.
- On FusionInsight Manager, click Host and check whether the host list contains the faulty node to determine whether the faulty node has been removed from the cluster.
- Check whether the faulty node is powered off.
- Remove the faulty node from the $NODE_AGENT_HOME/etc/agent/hosts.ini file on all nodes in the cluster, clear the /var/log/Bigdata/unreachable/unreachable_ip_info.log file, and clear the alarm.
- Wait 30 seconds and check whether the alarm is automatically cleared.
- If yes, no further action is required.
- If no, go to Collect fault information.
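The cleanup of hosts.ini and unreachable_ip_info.log described above has to be repeated on every node, so a small helper may reduce manual editing. The sketch below is an illustration only: the layout of hosts.ini is not described here, so the code simply assumes that each entry for a node sits on a line containing that node's IP address. The function names are hypothetical, and a backup copy is written before anything is changed.

```python
import re
from pathlib import Path


def drop_host_entries(text: str, faulty_ip: str) -> str:
    """Remove every line that contains faulty_ip as a complete address.

    Assumption: one line per host entry. The lookarounds prevent
    192.0.2.1 from also matching 192.0.2.10.
    """
    pattern = re.compile(r"(?<![\d.])" + re.escape(faulty_ip) + r"(?![\d.])")
    kept = [line for line in text.splitlines(keepends=True) if not pattern.search(line)]
    return "".join(kept)


def clean_node(hosts_ini: Path, unreachable_log: Path, faulty_ip: str) -> None:
    """Remove the faulty node from hosts.ini and empty the unreachable-IP log."""
    original = hosts_ini.read_text()
    # Keep a backup next to the original file before modifying it.
    hosts_ini.with_name(hosts_ini.name + ".bak").write_text(original)
    hosts_ini.write_text(drop_host_entries(original, faulty_ip))
    # "Clear" is interpreted here as truncating the file to zero length.
    unreachable_log.write_text("")
```

On each node, the two paths would be the hosts.ini file under $NODE_AGENT_HOME/etc/agent and /var/log/Bigdata/unreachable/unreachable_ip_info.log. Review the backup and the rewritten file before relying on the result, because the line-based match is an assumption about the file format.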
Collect fault information.
- On the FusionInsight Manager portal, choose O&M > Log > Download.
- In Service, select OmmAgent and click OK.
- Click the icon in the upper right corner to set the log collection time range. Generally, the time range is 10 seconds before and after the alarm generation time. Click Download.
- Contact the O&M personnel and send the collected log information.
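The log collection time range mentioned in the steps above can be derived from the alarm generation time with simple date arithmetic. This is a minimal sketch: the function name is hypothetical and the default margin is the 10 seconds stated in the procedure.

```python
from datetime import datetime, timedelta
from typing import Tuple


def log_window(
    alarm_time: datetime, margin: timedelta = timedelta(seconds=10)
) -> Tuple[datetime, datetime]:
    """Return the (start, end) of the log collection range around alarm_time."""
    return alarm_time - margin, alarm_time + margin


if __name__ == "__main__":
    # Example alarm time; replace it with the time shown in the alarm list.
    start, end = log_window(datetime(2024, 1, 1, 12, 0, 0))
    print(start.isoformat(), end.isoformat())
```

Pass a larger `margin` if the O&M personnel ask for a wider range of logs.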
Alarm Clearance
After the fault is rectified, the system automatically clears this alarm.
Related Information
None