Updated on 2024-04-18 GMT+08:00

ALM-45640 FlinkServer Heartbeat Interruption Between the Active and Standby Nodes

This section applies to MRS 3.2.0 or later.

Alarm Description

This alarm is generated when the active or standby FlinkServer node does not receive a heartbeat message from its peer for 30 seconds (the heartbeat interruption duration configured for keepalive).

This alarm is cleared when the heartbeat recovers.
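The 30-second rule above can be sketched as follows. This is a hypothetical illustration of the timeout check only, not the actual HA implementation; LAST_HEARTBEAT is a made-up example value.

```shell
#!/bin/sh
# Hypothetical sketch of the heartbeat-timeout rule described above.
# TIMEOUT mirrors the 30-second keepalive setting; LAST_HEARTBEAT is an
# example timestamp (epoch seconds) showing a peer last seen 45 s ago.
TIMEOUT=30
NOW=$(date +%s)
LAST_HEARTBEAT=$(( NOW - 45 ))
if [ $(( NOW - LAST_HEARTBEAT )) -ge "$TIMEOUT" ]; then
  MSG="ALM-45640: heartbeat interrupted"
else
  MSG="heartbeat OK"
fi
echo "$MSG"
```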

Alarm Attributes

Alarm ID: 45640
Alarm Severity: Minor
Auto Cleared: Yes

Alarm Parameters

Source: Specifies the cluster for which the alarm was generated.
ServiceName: Specifies the service for which the alarm was generated.
RoleName: Specifies the role for which the alarm was generated.
HostName: Specifies the host for which the alarm was generated.

Impact on the System

During the FlinkServer heartbeat interruption, only one node can provide the service. If this node is faulty, no standby node is available for failover and the service is unavailable.

Possible Causes

  • The active or standby FlinkServer instance is in the stopped state.
  • The NIC carrying the floating IP address of the HA system used by the FlinkServer nodes is incorrectly configured, so FlinkServer fails to start.
  • The link between the active and standby FlinkServer nodes is abnormal.

Handling Procedure

Check the status of the active and standby FlinkServer instances.

  1. Log in to FusionInsight Manager, choose Cluster > Services > Flink > Instance, and check whether the state of each FlinkServer instance is Normal.

    • If yes, go to 3.
    • If no, go to 2.

  2. Select the abnormal FlinkServer instance and start the instance. After the instance is started, check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 3.

    During the restart, the FlinkServer instance cannot provide services, but submitted jobs are not affected.

Check whether the link between the active and standby FlinkServer nodes is normal.

  3. Choose Cluster > Services > Flink > Instance, and check the two service IP addresses of the FlinkServer instances.
  4. Log in to the server where the abnormal FlinkServer instance is located as the root user.
  5. Run the following command to check whether the server of the other FlinkServer instance is reachable:

    ping IP address of the other FlinkServer instance

    • If yes, go to 8.
    • If no, go to 6.

  6. Contact the network administrator to rectify the network exception.
  7. Check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 8.
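The reachability check in step 5 can be scripted for repeated use. This is a minimal sketch; PEER_IP is a placeholder for the service IP address found in step 3.

```shell
#!/bin/sh
# Minimal reachability check for the peer FlinkServer node.
# PEER_IP is a placeholder; pass the peer's service IP as the first argument.
PEER_IP="${1:-127.0.0.1}"
# Send 3 probes with a 2-second per-reply timeout.
if ping -c 3 -W 2 "$PEER_IP" > /dev/null 2>&1; then
  STATUS="peer reachable"
else
  STATUS="peer unreachable"
fi
echo "$STATUS"
```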

Check whether the logs of the node where the abnormal FlinkServer instance is located contain error information.

  8. Log in to the server where the abnormal FlinkServer instance is located as the root user.
  9. Open the log file in the default directory /var/log/Bigdata/flink/flinkserver/prestart.log and check whether it contains the error message Float ip x.x.x.x is invalid.

    • If yes, go to 10.
    • If no, go to 12.

  10. On FusionInsight Manager, choose Cluster > Services > Flink > Configurations > All Configurations and search for flink.ha.floatip. Change the parameter value to the correct floating IP address, save the configuration, and restart the Flink service.

    • Contact the network engineer to obtain the new floating IP address.
    • During the service restart, FlinkServer cannot provide services, but submitted jobs are not affected.

  11. Check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 12.
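The log check in step 9 can be done non-interactively with grep. A minimal sketch, assuming the default log path from the procedure:

```shell
#!/bin/sh
# Scan the FlinkServer prestart log for the floating-IP error message.
# The path below is the default from the procedure; adjust if relocated.
LOG_FILE="${LOG_FILE:-/var/log/Bigdata/flink/flinkserver/prestart.log}"
if [ -r "$LOG_FILE" ] && grep -q "is invalid" "$LOG_FILE"; then
  RESULT="floating IP error found: check flink.ha.floatip"
else
  RESULT="no floating IP error found (or log not readable)"
fi
echo "$RESULT"
```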

Collect the fault information.

  12. On FusionInsight Manager, choose O&M > Log > Download.
  13. Select the Flink service in the required cluster for Service.
  14. Expand the Hosts drop-down list. In the Select Host dialog box that is displayed, select the hosts to which the role belongs, and click OK.
  15. Click in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and after the alarm generation time, respectively. Then, click Download.
  16. Contact O&M personnel and provide the collected logs.

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None