Updated on 2024-11-29 GMT+08:00

ALM-45640 FlinkServer Heartbeat Interruption Between the Active and Standby Nodes

Alarm Description

This alarm is generated when the active or standby FlinkServer node receives no heartbeat message from its peer for 30 seconds (the heartbeat interruption duration configured in keepalive).

This alarm is cleared when the heartbeat recovers.
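
The trigger condition can be pictured as a plain timeout comparison. The sketch below is illustrative only and is not how FlinkServer implements its heartbeat: the function name heartbeat_interrupted and the epoch-second arguments are assumptions made for the example. Only the 30-second threshold comes from the description above.

```shell
# Illustrative only: not the FlinkServer HA implementation.
# HEARTBEAT_TIMEOUT mirrors the 30-second interruption duration described above.
HEARTBEAT_TIMEOUT=30

# heartbeat_interrupted LAST_SEEN NOW
# Both arguments are epoch seconds (hypothetical inputs).
# Succeeds (exit status 0) when no heartbeat has arrived for
# HEARTBEAT_TIMEOUT seconds or more.
heartbeat_interrupted() {
  last_seen=$1
  now=$2
  [ $(( now - last_seen )) -ge "$HEARTBEAT_TIMEOUT" ]
}
```

For example, heartbeat_interrupted "$last_seen" "$(date +%s)" succeeds once the peer has been silent for at least 30 seconds, which corresponds to the point at which the alarm would be raised.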

Alarm Attributes

  • Alarm ID: 45640
  • Alarm Severity: Minor
  • Alarm Type: Heartbeat
  • Service Type: Flink
  • Auto Cleared: Yes

Alarm Parameters

Type: Location Information

  • Source: Specifies the cluster for which the alarm was generated.
  • ServiceName: Specifies the service for which the alarm was generated.
  • RoleName: Specifies the role for which the alarm was generated.
  • HostName: Specifies the host for which the alarm was generated.

Impact on the System

The impact varies depending on the cause. If the heartbeat is interrupted by a network problem, for example, the standby node may switch to active, leaving two active nodes. Data synchronization between the active and standby nodes is abnormal, but FlinkServer can still provide services.

Possible Causes

  • The active or standby FlinkServer instance is in the stopped state.
  • The NIC of the floating IP address of the HA system used by the FlinkServer node is incorrectly configured. As a result, FlinkServer fails to start.
  • The link between the active and standby FlinkServer nodes is abnormal.

Handling Procedure

Check the status of the active and standby FlinkServer instances.

  1. Log in to FusionInsight Manager, choose Cluster > Services > Flink > Instance, and check whether the state of each FlinkServer instance is normal.

    • If yes, go to 3.
    • If no, go to 2.

  2. Select the abnormal FlinkServer instance and start the instance. After the instance is started, check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 3.

Check whether the link between the active and standby FlinkServer nodes is normal.

  3. Choose Cluster > Services > Flink > Instance, and view the service IP addresses of the two FlinkServer instances.
  4. Log in to the server where the abnormal FlinkServer instance is located as user root.
  5. Run the following command to check whether the server of the other FlinkServer instance is reachable:

    ping IP address of the other FlinkServer instance

    • If yes, go to 8.
    • If no, go to 6.

  6. Ask the network administrator to handle the network exception.
  7. Check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 8.
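
The reachability check above can be wrapped in a small helper. This is a minimal sketch, not part of the product: it assumes the Linux iputils ping, where -c sets the number of packets and -W the reply timeout in seconds. The function name check_peer_link and the PING_CMD override are hypothetical and exist only so the logic can be dry-run without network access.

```shell
# Sketch only. Assumes the Linux iputils ping (-c packet count,
# -W reply timeout in seconds).
# PING_CMD is a hypothetical override hook used for dry runs.

# check_peer_link PEER_IP
# Prints which step of this procedure to continue with.
check_peer_link() {
  peer_ip=$1
  if "${PING_CMD:-ping}" -c 3 -W 2 "$peer_ip" >/dev/null 2>&1; then
    echo "reachable: go to 8"
  else
    echo "unreachable: go to 6"
  fi
}
```

Call it as check_peer_link 192.0.2.10, replacing the placeholder address with the service IP address of the other FlinkServer instance.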

Check whether the logs of the node where the abnormal FlinkServer instance is located contain error information.

  8. Log in to the server where the abnormal FlinkServer instance is located as user root.
  9. Open the log file /var/log/Bigdata/flink/flinkserver/prestart.log (default directory) and check whether it contains the error message Float ip x.x.x.x is invalid.

    • If yes, go to 10.
    • If no, go to 12.

  10. On FusionInsight Manager, choose Cluster > Services > Flink > Configurations > All Configurations and search for flink.ha.floatip. Change the parameter value to the correct floating IP address, save the configuration, and restart the Flink service.

    Contact the network engineer to obtain the new floating IP address.

  11. Check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 12.
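
Instead of reading prestart.log by eye, the error message can be searched for with grep. The following is a sketch under stated assumptions: the function name find_invalid_float_ip is hypothetical, and the pattern assumes that x.x.x.x in the message is a dotted IPv4 address. The log path is the default one given above.

```shell
# Default log path taken from the procedure above.
PRESTART_LOG=/var/log/Bigdata/flink/flinkserver/prestart.log

# find_invalid_float_ip [LOG_FILE]
# Prints the matching lines and succeeds (exit status 0) if the log
# reports an invalid floating IP address; fails if there is no match.
# Assumes x.x.x.x in the message is a dotted IPv4 address.
find_invalid_float_ip() {
  log_file=${1:-$PRESTART_LOG}
  grep -E 'Float ip [0-9]{1,3}(\.[0-9]{1,3}){3} is invalid' "$log_file"
}
```

If find_invalid_float_ip prints one or more lines, continue with the flink.ha.floatip correction. If it prints nothing, proceed to fault information collection.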

Collect fault information.

  12. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
  13. For Service, select the Flink service in the required cluster.
  14. Expand the Hosts drop-down list. In the Select Host dialog box that is displayed, select the hosts to which the role belongs, and click OK.
  15. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
  16. Contact O&M engineers and provide the collected logs.
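
The Start Date and End Date values can be derived from the alarm generation time with a short helper. This is a sketch only: the function name log_window is hypothetical, and it assumes GNU date (the -d option) and that the timestamp is given in the shell's local time zone.

```shell
# log_window "YYYY-MM-DD HH:MM:SS"
# Prints the start and end of the log collection window, one per line:
# 10 minutes before and 10 minutes after the alarm generation time.
# Assumes GNU date (-d option); the timestamp is read in the shell's
# current time zone.
log_window() {
  alarm_epoch=$(date -d "$1" +%s) || return 1
  date -d "@$(( alarm_epoch - 600 ))" '+%Y-%m-%d %H:%M:%S'
  date -d "@$(( alarm_epoch + 600 ))" '+%Y-%m-%d %H:%M:%S'
}
```

For an alarm generated at 2024-11-29 10:00:00, log_window "2024-11-29 10:00:00" prints 2024-11-29 09:50:00 followed by 2024-11-29 10:10:00.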

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None.