ALM-14012 JournalNode Is Out of Synchronization
Description
On the active NameNode, the system checks the data consistency of all JournalNodes in the cluster every 5 minutes. This alarm is generated when the data on a JournalNode is inconsistent with the data on the other JournalNodes.
This alarm is cleared 5 minutes after the data on the JournalNodes becomes consistent again.
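This page does not document how FusionInsight Manager performs the consistency check. As a rough illustration only, the Python sketch below assumes consistency can be approximated by comparing the last written transaction ID reported by each JournalNode, and it flags any node that trails the most advanced one. The function name, the input mapping, and the `tolerance` parameter are hypothetical.

```python
from typing import Dict, List


def find_out_of_sync(last_txids: Dict[str, int], tolerance: int = 0) -> List[str]:
    """Return the hosts whose last written transaction ID lags the newest one.

    Hypothetical helper: `last_txids` maps a JournalNode host to the last
    transaction ID it has written. `tolerance` is an assumed allowed lag,
    not a FusionInsight setting.
    """
    if not last_txids:
        return []
    newest = max(last_txids.values())
    # A node counts as out of sync if it trails the newest node by more
    # than the allowed tolerance.
    return sorted(host for host, txid in last_txids.items()
                  if newest - txid > tolerance)
```

For example, `find_out_of_sync({"jn1": 1050, "jn2": 1050, "jn3": 980})` returns `["jn3"]`.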
Attribute
Alarm ID | Alarm Severity | Automatically Cleared
---|---|---
14012 | Major | Yes
Parameters
Name | Meaning
---|---
Source | Specifies the cluster for which the alarm is generated.
ServiceName | Specifies the service for which the alarm is generated.
RoleName | Specifies the role for which the alarm is generated.
HostName | Specifies the host for which the alarm is generated.
NameServiceName | Specifies the NameService for which the alarm is generated.
Impact on the System
When a JournalNode is working incorrectly, its data becomes inconsistent with the data on the other JournalNodes. If the data on more than half of the JournalNodes is inconsistent, the NameNode cannot work correctly and the HDFS service becomes unavailable.
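The "more than half" threshold follows from the quorum rule that HDFS HA uses with the Quorum Journal Manager: an edit is accepted only after a majority of JournalNodes acknowledge it. The short Python sketch below shows that arithmetic. The function names are illustrative.

```python
def quorum_size(journal_nodes: int) -> int:
    """Smallest majority of `journal_nodes` JournalNodes."""
    if journal_nodes < 1:
        raise ValueError("at least one JournalNode is required")
    return journal_nodes // 2 + 1


def tolerable_failures(journal_nodes: int) -> int:
    """Number of JournalNodes that can be unhealthy while a majority remains."""
    return journal_nodes - quorum_size(journal_nodes)


for n in (3, 5, 7):
    # Prints: node count, majority size, failures tolerated.
    print(n, quorum_size(n), tolerable_failures(n))
```

This prints `3 2 1`, `5 3 2`, and `7 4 3`. It also shows why JournalNodes are normally deployed in odd numbers: going from 3 to 4 nodes raises the quorum from 2 to 3 without tolerating any additional failure.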
Possible Causes
- The JournalNode instance does not exist (deleted or migrated).
- The JournalNode instance has not been started or has been stopped.
- The JournalNode instance is working incorrectly.
- The network of the JournalNode is unreachable.
Procedure
Check whether the JournalNode instance has been started.
1. On the FusionInsight Manager portal, choose O&M > Alarm > Alarms. In the alarm list, click the alarm.
2. Check Location and obtain the IP address of the JournalNode for which the alarm is generated.
3. Choose Cluster > Name of the desired cluster > Services > HDFS > Instance. In the instance list, check whether the JournalNode instance exists on the node for which the alarm is generated.
   - If yes, go to 5.
   - If no, go to 4.
4. Choose O&M > Alarm > Alarms. In the alarm list, click Clear in the Operation column of the alarm. In the dialog box that is displayed, click OK. No further action is needed.
5. Click the JournalNode instance and check whether its Configuration Status is Synchronized.
   - If yes, go to 8.
   - If no, go to 6.
6. Select the JournalNode instance and choose Start Instance to start it.
7. After 5 minutes, check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 15.
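Several steps end with "After 5 minutes, check whether the alarm is cleared." If this check is scripted, it can be expressed as a bounded polling loop. The sketch below assumes a hypothetical `is_alarm_cleared` callback; this page does not specify how the alarm list would be queried programmatically, so no Manager API is implied. The clock and sleep functions are injectable so that the loop can be tested without waiting.

```python
import time
from typing import Callable


def wait_for_clear(is_alarm_cleared: Callable[[], bool],
                   timeout_s: float = 300.0,
                   interval_s: float = 30.0,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until the alarm is cleared or `timeout_s` elapses.

    `is_alarm_cleared` is a hypothetical callback (REST call, CLI wrapper,
    or manual confirmation). Returns True if the alarm clears in time.
    """
    deadline = clock() + timeout_s
    while True:
        if is_alarm_cleared():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The default `timeout_s` of 300 seconds matches the 5-minute wait in the procedure. The 30-second polling interval is an arbitrary choice.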
Check whether the JournalNode instance is working correctly.
8. Check whether Running Status of the JournalNode instance is Normal.
   - If yes, go to 11.
   - If no, go to 9.
9. Select the JournalNode instance and choose More > Restart Instance to restart it.
10. After 5 minutes, check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 15.
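As an optional cross-check alongside Running Status, Apache Hadoop JournalNodes publish metrics through the HTTP `/jmx` servlet in a bean named `Hadoop:service=JournalNode,name=Journal-<journal ID>`, with attributes such as `LastWrittenTxId` and `CurrentLagTxns`. Whether FusionInsight exposes the same endpoint, bean names, and port (the Apache default JournalNode HTTP port is 8480) is an assumption that you should verify. The sketch below only parses such a JSON response. The sample document and the journal ID `ns1` are made up for illustration.

```python
import json
from typing import Dict

# Assumed Apache Hadoop bean-name prefix for per-journal metrics.
PREFIX = "Hadoop:service=JournalNode,name=Journal-"


def journal_metrics(jmx_json: str) -> Dict[str, Dict[str, int]]:
    """Extract per-journal lag metrics from a Hadoop /jmx JSON document."""
    result = {}
    for bean in json.loads(jmx_json).get("beans", []):
        name = bean.get("name", "")
        if name.startswith(PREFIX):
            journal_id = name[len(PREFIX):]
            # -1 marks an attribute that was not present in the bean.
            result[journal_id] = {
                "LastWrittenTxId": int(bean.get("LastWrittenTxId", -1)),
                "CurrentLagTxns": int(bean.get("CurrentLagTxns", -1)),
            }
    return result


# Made-up sample response, for illustration only.
SAMPLE = """{"beans": [
  {"name": "Hadoop:service=JournalNode,name=Journal-ns1",
   "LastWrittenTxId": 1050, "CurrentLagTxns": 0},
  {"name": "Hadoop:service=JournalNode,name=JvmMetrics"}
]}"""

print(journal_metrics(SAMPLE))
```

This prints `{'ns1': {'LastWrittenTxId': 1050, 'CurrentLagTxns': 0}}`. Unrelated beans such as `JvmMetrics` are skipped because their names do not match the prefix.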
Check whether the network of the JournalNode is reachable.
11. On the FusionInsight Manager portal, choose Cluster > Name of the desired cluster > Services > HDFS > Instance and check the service IP address of the active NameNode.
12. Log in to the active NameNode as user root.
13. Run the ping command to check whether a timeout occurs or the network is unreachable between the active NameNode and the JournalNode:
    ping <service IP address of the JournalNode>
   - If yes, go to 14.
   - If no, go to 15.
14. Contact the network administrator to rectify the network fault, and then check whether the alarm is cleared 5 minutes later.
   - If yes, no further action is required.
   - If no, go to 15.
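`ping` tests ICMP reachability only. A JournalNode host can answer ping and still refuse connections on the JournalNode service port. As a complementary check from the active NameNode, the sketch below attempts a TCP connection. The Apache Hadoop default for `dfs.journalnode.rpc-address` uses port 8485. The port configured in FusionInsight is an assumption that you should confirm in the HDFS configuration.

```python
import socket


def tcp_reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False
```

A call such as `tcp_reachable("192.0.2.10", 8485)` (example address and assumed port) returns False when the connection is refused, times out, or the host name cannot be resolved.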
Collect fault information.
15. On the FusionInsight Manager portal, choose O&M > Log > Download.
16. In Service, select HDFS in the required cluster.
17. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 30 minutes before and 30 minutes after the alarm generation time, respectively. Then, click Download.
18. Contact the O&M personnel and send them the collected logs.
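The log collection window is simply the alarm generation time plus or minus 30 minutes. The sketch below computes it. The function name and the example timestamp are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Tuple


def log_window(alarm_time: datetime,
               margin: timedelta = timedelta(minutes=30)) -> Tuple[datetime, datetime]:
    """Return (Start Date, End Date) for log collection around the alarm time."""
    return alarm_time - margin, alarm_time + margin


start, end = log_window(datetime(2024, 5, 1, 10, 0))
print(start.strftime("%Y-%m-%d %H:%M"), end.strftime("%Y-%m-%d %H:%M"))
```

For an alarm generated at 2024-05-01 10:00, this prints `2024-05-01 09:30 2024-05-01 10:30`. The subtraction handles day boundaries correctly, so an alarm at 00:10 yields a start time of 23:40 on the previous day.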
Alarm Clearing
After the fault is rectified, the system automatically clears this alarm.
Related Information
None