Updated on 2024-04-18 GMT+08:00

ALM-45640 FlinkServer Heartbeat Interruption Between the Active and Standby Nodes

This section applies to MRS 3.2.0 or later.

Alarm Description

This alarm is generated when the active or standby FlinkServer node does not receive a heartbeat message from its peer for 30 seconds (the heartbeat interruption duration configured for keepalive).

This alarm is cleared when the heartbeat recovers.
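The 30-second rule above can be sketched as follows. This is a hypothetical illustration of the timeout check only, not the actual HA implementation; LAST_HEARTBEAT is a made-up example value.

```shell
#!/bin/sh
# Hypothetical sketch of the heartbeat-timeout rule described above.
# TIMEOUT mirrors the 30-second keepalive setting; LAST_HEARTBEAT is an
# example timestamp (epoch seconds) showing a peer last seen 45 s ago.
TIMEOUT=30
NOW=$(date +%s)
LAST_HEARTBEAT=$(( NOW - 45 ))
if [ $(( NOW - LAST_HEARTBEAT )) -ge "$TIMEOUT" ]; then
  MSG="ALM-45640: heartbeat interrupted"
else
  MSG="heartbeat OK"
fi
echo "$MSG"
```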

Alarm Attributes

Alarm ID: 45640
Alarm Severity: Minor
Auto Cleared: Yes

Alarm Parameters

Source: Specifies the cluster for which the alarm was generated.
ServiceName: Specifies the service for which the alarm was generated.
RoleName: Specifies the role for which the alarm was generated.
HostName: Specifies the host for which the alarm was generated.

Impact on the System

During the FlinkServer heartbeat interruption, only one node can provide the service. If this node is faulty, no standby node is available for failover and the service is unavailable.

Possible Causes

  • The active or standby FlinkServer instance is in the stopped state.
  • The NIC carrying the floating IP address of the HA system used by the FlinkServer nodes is incorrectly configured, so FlinkServer fails to start.
  • The link between the active and standby FlinkServer nodes is abnormal.

Handling Procedure

Check the status of the active and standby FlinkServer instances.

  1. Log in to FusionInsight Manager, choose Cluster > Services > Flink > Instance, and check whether the state of each FlinkServer instance is Normal.

    • If yes, go to 3.
    • If no, go to 2.

  2. Select the abnormal FlinkServer instance and start the instance. After the instance is started, check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 3.

    During the restart, the FlinkServer instance cannot provide services, but submitted jobs are not affected.

Check whether the link between the active and standby FlinkServer nodes is normal.

  3. Choose Cluster > Services > Flink > Instance, and check the two service IP addresses of the FlinkServer instances.
  4. Log in to the server where the abnormal FlinkServer instance is located as the root user.
  5. Run the following command to check whether the server of the other FlinkServer instance is reachable:

    ping IP address of the other FlinkServer instance

    • If yes, go to 8.
    • If no, go to 6.

  6. Contact the network administrator to rectify the network exception.
  7. Check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 8.
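The reachability check in step 5 can be scripted for repeated use. This is a minimal sketch; PEER_IP is a placeholder for the service IP address found in step 3.

```shell
#!/bin/sh
# Minimal reachability check for the peer FlinkServer node.
# PEER_IP is a placeholder; pass the peer's service IP as the first argument.
PEER_IP="${1:-127.0.0.1}"
# Send 3 probes with a 2-second per-reply timeout.
if ping -c 3 -W 2 "$PEER_IP" > /dev/null 2>&1; then
  STATUS="peer reachable"
else
  STATUS="peer unreachable"
fi
echo "$STATUS"
```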

Check whether the logs of the node where the abnormal FlinkServer instance is located contain error information.

  8. Log in to the server where the abnormal FlinkServer instance is located as the root user.
  9. Open the log file in the default directory /var/log/Bigdata/flink/flinkserver/prestart.log and check whether it contains the error message Float ip x.x.x.x is invalid.

    • If yes, go to 10.
    • If no, go to 12.

  10. On FusionInsight Manager, choose Cluster > Services > Flink > Configurations > All Configurations and search for flink.ha.floatip. Change the parameter value to the correct floating IP address, save the configuration, and restart the Flink service.

    • Contact the network engineer to obtain the new floating IP address.
    • During the service restart, FlinkServer cannot provide services, but submitted jobs are not affected.

  11. Check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 12.
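The log check in step 9 can be done non-interactively with grep. A minimal sketch, assuming the default log path from the procedure:

```shell
#!/bin/sh
# Scan the FlinkServer prestart log for the floating-IP error message.
# The path below is the default from the procedure; adjust if relocated.
LOG_FILE="${LOG_FILE:-/var/log/Bigdata/flink/flinkserver/prestart.log}"
if [ -r "$LOG_FILE" ] && grep -q "is invalid" "$LOG_FILE"; then
  RESULT="floating IP error found: check flink.ha.floatip"
else
  RESULT="no floating IP error found (or log not readable)"
fi
echo "$RESULT"
```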

Collect the fault information.

  12. On FusionInsight Manager, choose O&M > Log > Download.
  13. Select the Flink service in the required cluster for Service.
  14. Expand the Hosts drop-down list. In the Select Host dialog box that is displayed, select the hosts to which the role belongs, and click OK.
  15. Click in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and after the alarm generation time, respectively. Then, click Download.
  16. Contact O&M personnel and provide the collected logs.

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None