Updated on 2024-01-17 GMT+08:00

ALM-14012 HDFS JournalNode Data Is Not Synchronized (For MRS 2.x or Earlier)

Description

On the active NameNode, the system checks data synchronization on all JournalNodes in the cluster every 5 minutes. This alarm is generated when data on a JournalNode is not synchronized with that on other JournalNodes.

This alarm is cleared within 5 minutes after data on the JournalNodes is synchronized.

Attribute

Alarm ID    Alarm Severity    Auto Clear
14012       Major             Yes

Parameters

Parameter      Description
ServiceName    Specifies the service for which the alarm is generated.
RoleName       Specifies the role for which the alarm is generated.
IP             Specifies the service IP address of the JournalNode instance for which the alarm is generated.

Impact on the System

When a JournalNode is not working properly, its data falls out of sync with the data on the other JournalNodes. If data on more than half of the JournalNodes is not synchronized, the NameNode cannot work properly and the HDFS service becomes unavailable.
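
The "more than half" threshold can be made concrete with a small calculation. The following is only an illustrative sketch of the rule as stated in this section, not MRS or HDFS code. The function name hdfs_available and the cluster size of 3 JournalNodes are assumptions chosen for the example.

```python
def hdfs_available(total_journalnodes: int, unsynced_journalnodes: int) -> bool:
    """Apply the rule stated in this section: HDFS becomes unavailable once
    more than half of the JournalNodes are not synchronized.

    Hypothetical helper for illustration only; it is not part of MRS or HDFS.
    """
    if total_journalnodes <= 0:
        raise ValueError("total_journalnodes must be positive")
    if not 0 <= unsynced_journalnodes <= total_journalnodes:
        raise ValueError("unsynced_journalnodes must be between 0 and the total")
    # "More than half" is tested with integer arithmetic to avoid rounding.
    return unsynced_journalnodes * 2 <= total_journalnodes


if __name__ == "__main__":
    # Assumed example: a cluster with 3 JournalNodes.
    for unsynced in range(4):
        print(f"3 JournalNodes, {unsynced} unsynced -> available: "
              f"{hdfs_available(3, unsynced)}")
```

Under this rule, an assumed 3-node deployment tolerates one unsynchronized JournalNode, which is why a single alarm is rated Major rather than causing an immediate outage.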

Possible Causes

  • The JournalNode instance has not been started or has been stopped.
  • The JournalNode instance is working incorrectly.
  • The network of the JournalNode is unreachable.

Procedure

  1. Check whether the JournalNode instance has been started.

    1. On the MRS cluster details page, click Alarms. In the alarm list, click the alarm.
    2. In the Alarm Details area, check Location and obtain the IP address of the JournalNode for which the alarm is generated.
    3. Choose Components > HDFS > Instances. In the instance list, click the JournalNode for which the alarm is generated and check whether Operating Status of the node is Started.
      • If yes, go to 2.1.
      • If no, go to 1.4.
    4. Select the JournalNode instance and choose More > Start Instance to start it.
    5. Wait 5 minutes and check whether the alarm is cleared.
      • If yes, no further action is required.
      • If no, go to 4.

  2. Check whether the JournalNode instance is working correctly.

    1. Check whether Health Status of the JournalNode instance is Good.
      • If yes, go to 3.1.
      • If no, go to 2.2.
    2. Select the JournalNode instance and choose More > Restart Instance to restart it.
    3. Wait 5 minutes and check whether the alarm is cleared.
      • If yes, no further action is required.
      • If no, go to 4.

  3. Check whether the network of the JournalNode is reachable.

    1. On the MRS cluster details page, choose Components > HDFS > Instances to check the service IP address of the active NameNode.
    2. Log in to the active NameNode.
    3. Run the ping command to check whether the request times out or the network between the active NameNode and the JournalNode is unreachable.

      ping <service IP address of the JournalNode>

      • If yes (the request times out or the network is unreachable), go to 3.4.
      • If no, go to 4.
    4. Contact O&M personnel to rectify the network fault. Wait 5 minutes and check whether the alarm is cleared.
      • If yes, no further action is required.
      • If no, go to 4.

  4. Collect fault information.

    1. On MRS Manager, choose System > Export Log.
    2. Contact the O&M engineers and send the collected logs.
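
The decision flow of the procedure, including the ping check in step 3, can be sketched as a small script to run on the active NameNode. This is a minimal sketch under stated assumptions, not an MRS tool. The function names (build_ping_command, journalnode_reachable, next_action) are hypothetical, the ping options -c and -W assume the Linux iputils version of ping, and the Operating Status and Health Status values must be read manually from the Instances page because no MRS API is used here.

```python
import ipaddress
import subprocess
from typing import Callable, List


def build_ping_command(journalnode_ip: str, count: int = 3,
                       timeout_s: int = 2) -> List[str]:
    """Build a ping command for the JournalNode service IP address.

    Assumes Linux iputils ping: -c is the packet count and -W is the
    per-reply timeout in seconds.
    """
    # Raises ValueError if the string is not a valid IP address.
    ipaddress.ip_address(journalnode_ip)
    return ["ping", "-c", str(count), "-W", str(timeout_s), journalnode_ip]


def journalnode_reachable(journalnode_ip: str,
                          runner: Callable = subprocess.run) -> bool:
    """Return True if ping exits with status 0, meaning a reply was received."""
    result = runner(build_ping_command(journalnode_ip),
                    capture_output=True, text=True)
    return result.returncode == 0


def next_action(operating_status: str, health_status: str,
                reachable: bool) -> str:
    """Map the three checks of the procedure to the action each step prescribes."""
    if operating_status != "Started":
        return "Start the JournalNode instance (step 1)"
    if health_status != "Good":
        return "Restart the JournalNode instance (step 2)"
    if not reachable:
        return "Contact O&M personnel to rectify the network fault (step 3)"
    return "Collect fault information (step 4)"
```

For example, with the statuses read from the Instances page, next_action("Started", "Good", journalnode_reachable(ip)) returns the step to follow, where ip is the address obtained from the Location field of the alarm. After any corrective action, wait 5 minutes and check whether the alarm is cleared, as described in the procedure.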

Reference

None