Updated on 2024-09-23 GMT+08:00

ALM-12010 Manager Heartbeat Interruption Between the Active and Standby Nodes

Description

This alarm is generated when the active Manager does not receive a heartbeat signal from the standby Manager within 7 seconds.

This alarm is cleared when the active Manager receives heartbeat signals from the standby Manager.

Attribute

Alarm ID    Alarm Severity    Auto Clear
12010       Major             Yes

Parameters

Name           Meaning
Source         Specifies the cluster or system for which the alarm is generated.
ServiceName    Specifies the service for which the alarm is generated.
RoleName       Specifies the role for which the alarm is generated.
HostName       Specifies the host for which the alarm is generated.

Impact on the System

If the active Manager process becomes abnormal while the heartbeat is interrupted, an active/standby switchover cannot be performed, which affects basic O&M functions.

Possible Causes

  • The link between the active and standby Managers is abnormal.
  • The node name configuration is incorrect.
  • The heartbeat port is disabled by the firewall.

Procedure

Check whether the network between the active and standby Manager servers is normal.

  1. On FusionInsight Manager, choose O&M > Alarm > Alarms. In the row containing the alarm, open the alarm details and view the IP address of the standby Manager (Peer Manager) server.
  2. Log in to the active OMS node as user root.
  3. Run the ping command with the heartbeat IP address of the standby Manager to check whether the standby Manager server is reachable.

    • If yes, go to 6.
    • If no, go to 4.

  4. Contact the network administrator to check whether the network is faulty.

    • If yes, go to 5.
    • If no, go to 6.

  5. Rectify the network fault and check whether the alarm is cleared from the alarm list.

    • If yes, no further action is required.
    • If no, go to 6.
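
The reachability check above can be wrapped in a small function so that the result is unambiguous. This is a minimal sketch, assuming a Linux host with the iputils ping command; the function name peer_reachable and its default count and timeout are illustrative and not part of MRS.

```shell
#!/usr/bin/env bash
# Sketch only: peer_reachable is a hypothetical helper, not an MRS tool.
# Usage: peer_reachable <heartbeat IP> [request count] [seconds to wait per reply]
peer_reachable() {
    local ip="$1" count="${2:-3}" timeout="${3:-2}"
    if [ -z "$ip" ]; then
        echo "usage: peer_reachable <ip> [count] [timeout]" >&2
        return 2
    fi
    # -c: number of echo requests; -W: seconds to wait for each reply.
    if ping -c "$count" -W "$timeout" "$ip" >/dev/null 2>&1; then
        echo "reachable: $ip"
        return 0
    fi
    echo "unreachable: $ip"
    return 1
}
```

Call it with the heartbeat IP address shown in the alarm details, for example peer_reachable 192.0.2.10 (a documentation-only address used here as a placeholder). A nonzero return value corresponds to the "If no" branch of the ping step.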

Check whether the node name is correctly configured.

  6. Run the following command to go to the software installation directory of the active OMS node:

    cd /opt

  7. Run the following command to find the directory that contains the configuration file of the active and standby nodes:

    find -name hacom_local.xml

  8. Run the following command to go to the workspace directory:

    cd ${BIGDATA_HOME}/om-server/OMS/workspace0/ha/local/hacom/conf/

  9. Run the vim command to open the hacom_local.xml file. Check whether the local and peer nodes are correctly configured: the local node must be configured as the active node, and the peer node as the standby node.

    • If yes, go to 12.
    • If no, go to 10.

  10. Modify the configuration of the active and standby nodes in the hacom_local.xml file and press Esc to return to the command mode. Run the :wq command to save the modification and exit.
  11. Check whether the alarm is cleared automatically.

    • If yes, no further action is required.
    • If no, go to 12.
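
Before hacom_local.xml is modified, it can be useful to keep a copy so that the change can be reviewed or reverted. The following is a minimal sketch of that precaution, not a step required by the procedure; backup_file is a hypothetical helper name and the .bak.<timestamp> suffix is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: backup_file is a hypothetical helper, not an MRS tool.
# Copies a file to <file>.bak.<YYYYmmddHHMMSS> and prints the backup path.
backup_file() {
    local src="$1" stamp backup
    if [ ! -f "$src" ]; then
        echo "not a regular file: $src" >&2
        return 1
    fi
    stamp="$(date +%Y%m%d%H%M%S)"
    backup="${src}.bak.${stamp}"
    # -p preserves mode, ownership, and timestamps of the original file.
    cp -p -- "$src" "$backup" || return 1
    echo "$backup"
}
```

For example, run bak="$(backup_file hacom_local.xml)" in the configuration directory before opening the file in vim, then run diff "$bak" hacom_local.xml afterwards to confirm that only the local and peer node entries changed.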

Check whether the port is disabled by the firewall.

  12. Run the lsof -i :20012 command to check whether the heartbeat ports of the active and standby nodes are enabled. If command output is displayed, the ports are enabled. Otherwise, the ports are disabled by the firewall.

    • If the ports are disabled by the firewall, go to 13.
    • If the ports are enabled, go to 16.

  13. Run the iptables -P INPUT ACCEPT command to set the default INPUT policy to ACCEPT, so that the connection to the server is not lost when the firewall rules are cleared.
  14. Run the following command to clear the firewall rules:

    iptables -F

  15. Check whether the alarm is cleared from the alarm list.

    • If yes, no further action is required.
    • If no, go to 16.
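
Because iptables -F removes every rule in the filter table, it may help to first see which rules actually mention the heartbeat port. The following is a rough sketch under stated assumptions: rules_for_port is a hypothetical helper that filters iptables -S output read from standard input, and it only matches simple "--dport <port>" rules, not port ranges or multiport lists.

```shell
#!/usr/bin/env bash
# Sketch only: rules_for_port is a hypothetical helper, not an MRS tool.
# Reads `iptables -S` output on stdin and prints rules with "--dport <port>".
# Returns nonzero if no rule matches. Ranges and multiport rules are not handled.
rules_for_port() {
    local port="$1"
    # The leading "--" stops grep from treating the pattern as an option.
    grep -E -- "--dport ${port}( |\$)"
}
```

A typical use is iptables -S | rules_for_port 20012. To keep a copy of the current rules before clearing them, iptables-save > /root/iptables.rules.bak writes them to a file (the path is an assumption) that iptables-restore can load again later.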

Collect fault information.

  16. On FusionInsight Manager, choose O&M > Log > Download.
  17. Select the following nodes from Service and click OK:

    • OmmServer
    • Controller
    • NodeAgent

  18. In the upper right corner, set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
  19. Contact the O&M personnel and send the collected log information.
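
The log collection window described above is simple arithmetic on the alarm generation time. As a convenience, it can be computed as follows. This is a minimal sketch assuming GNU date (the -d option); log_window is a hypothetical helper name, and the output labels merely mirror the field names on the download page.

```shell
#!/usr/bin/env bash
# Sketch only: log_window is a hypothetical helper, not an MRS tool.
# Usage: log_window "<YYYY-mm-dd HH:MM:SS>" [margin in seconds, default 600]
# Requires GNU date; times are interpreted in the shell's local time zone.
log_window() {
    local alarm_time="$1" margin="${2:-600}" epoch
    # Convert to epoch seconds so the offset is plain integer arithmetic.
    epoch="$(date -d "$alarm_time" +%s)" || return 1
    echo "Start Date: $(date -d "@$((epoch - margin))" '+%Y-%m-%d %H:%M:%S')"
    echo "End Date:   $(date -d "@$((epoch + margin))" '+%Y-%m-%d %H:%M:%S')"
}
```

For an alarm generated at 2024-09-23 10:05:00, log_window "2024-09-23 10:05:00" prints a start of 2024-09-23 09:55:00 and an end of 2024-09-23 10:15:00.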

Alarm Clearing

After the fault is rectified, the system automatically clears this alarm.

Related Information

None