ALM-45640 FlinkServer Heartbeat Interruption Between the Active and Standby Nodes
This section applies to MRS 3.2.0 or later.
Alarm Description
This alarm is generated when the FlinkServer active node or standby node does not receive heartbeat messages from the peer for 30 seconds (heartbeat interruption duration configured in keepalive).
This alarm is cleared when the heartbeat recovers.
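The trigger condition can be illustrated with a minimal sketch. The function name `heartbeat_interrupted` and the use of epoch timestamps are assumptions made for illustration only. They do not reflect how the HA component tracks heartbeats internally; only the 30-second threshold comes from the description above.

```shell
# Threshold taken from the alarm description (30 seconds).
HEARTBEAT_TIMEOUT_SECONDS=30

# heartbeat_interrupted LAST_HEARTBEAT_EPOCH NOW_EPOCH
# Hypothetical helper: succeeds (exit status 0) when no heartbeat has been
# received for at least HEARTBEAT_TIMEOUT_SECONDS.
heartbeat_interrupted() {
    hb_last_epoch=$1
    hb_now_epoch=$2
    hb_elapsed=$((hb_now_epoch - hb_last_epoch))
    [ "$hb_elapsed" -ge "$HEARTBEAT_TIMEOUT_SECONDS" ]
}
```

For example, `heartbeat_interrupted 1000 1030` succeeds, while `heartbeat_interrupted 1000 1029` does not.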
Alarm Attributes
Alarm ID | Alarm Severity | Auto Cleared |
---|---|---|
45640 | Minor | Yes |
Alarm Parameters
Parameter | Description |
---|---|
Source | Specifies the cluster for which the alarm was generated. |
ServiceName | Specifies the service for which the alarm was generated. |
RoleName | Specifies the role for which the alarm was generated. |
HostName | Specifies the host for which the alarm was generated. |
Impact on the System
The impact varies depending on the cause. If the heartbeat is interrupted by a network problem, for example, the standby node may switch to active, so that two active nodes exist at the same time. In this case, data synchronization between the active and standby nodes is abnormal, but FlinkServer can still provide services.
Possible Causes
- The active or standby FlinkServer instance is in the stopped state.
- The NIC of the floating IP address of the HA system used by the FlinkServer node is incorrectly configured, so FlinkServer fails to start.
- The link between the active and standby FlinkServer nodes is abnormal.
Handling Procedure
Check the status of the active and standby FlinkServer instances.
- Log in to FusionInsight Manager, choose Cluster > Services > Flink > Instance, and check whether the state of each FlinkServer instance is normal.
- Select the abnormal FlinkServer instance and start the instance. After the instance is started, check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 3 (the link check between the FlinkServer nodes).
During the restart, the FlinkServer instance cannot provide services, but submitted jobs are not affected.
Check whether the link between the active and standby FlinkServer nodes is normal.
- Choose Cluster > Services > Flink > Instance, and check the two service IP addresses of FlinkServer.
- Log in as the root user to the server where the abnormal FlinkServer instance is located.
- Run the following command to check whether the server of the other FlinkServer instance is reachable:
ping IP address of the other FlinkServer instance
- If the server is unreachable, ask the network administrator to handle the network exception.
- Check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 8 (the log check on the abnormal FlinkServer node).
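The reachability check in this part of the procedure can be wrapped in a small script. The following is a sketch only: the function name `check_peer_reachable` and the `PING_CMD` variable are hypothetical, and the `-c` (count) and `-W` (reply timeout in seconds) options assume the Linux iputils version of `ping`.

```shell
# PING_CMD is an assumption that allows the command to be swapped out,
# for example in a dry run; it defaults to the system ping.
PING_CMD=${PING_CMD:-ping}

# check_peer_reachable PEER_IP
# Sends 3 echo requests and waits up to 2 seconds for each reply.
# Prints the result and returns 0 if the peer answered, 1 otherwise.
check_peer_reachable() {
    peer_ip=$1
    if "$PING_CMD" -c 3 -W 2 "$peer_ip" >/dev/null 2>&1; then
        echo "reachable: $peer_ip"
        return 0
    fi
    echo "unreachable: $peer_ip"
    return 1
}

# Usage (192.0.2.10 is a documentation placeholder, not a real node):
# check_peer_reachable 192.0.2.10
```

Replace the placeholder address with the service IP address of the other FlinkServer instance obtained from the Instance page.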
Check whether the logs of the node where the abnormal FlinkServer instance is located contain error information.
- Log in as the root user to the server where the abnormal FlinkServer instance is located.
- Open the log file /var/log/Bigdata/flink/flinkserver/prestart.log (default path) and check whether it contains the error message "Float ip x.x.x.x is invalid".
- On FusionInsight Manager, choose Cluster > Services > Flink > Configurations > All Configurations and search for flink.ha.floatip. Change the parameter value to the correct floating IP address, save the configuration, and restart the Flink service.
- Contact the network engineer to obtain the new floating IP address.
- During the service restart, FlinkServer cannot provide services, but submitted jobs are not affected.
- Check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 12 (fault information collection).
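Instead of reading the log manually, the search for the floating IP error can be scripted. This is a minimal sketch: the function name `find_invalid_float_ip` and the `PRESTART_LOG` variable are hypothetical, and the pattern assumes that `x.x.x.x` in the message is a dotted IPv4 address.

```shell
# Default log path from the procedure above; override it if the
# installation uses a different log directory.
PRESTART_LOG=${PRESTART_LOG:-/var/log/Bigdata/flink/flinkserver/prestart.log}

# find_invalid_float_ip [LOG_FILE]
# Prints every line that reports an invalid floating IP address.
# Exit status 0 means at least one such line was found.
find_invalid_float_ip() {
    float_log_file=${1:-$PRESTART_LOG}
    grep -E 'Float ip [0-9]+(\.[0-9]+){3} is invalid' "$float_log_file"
}

# Usage:
# find_invalid_float_ip || echo "No floating IP error found"
```

If the function prints a match, correct the flink.ha.floatip parameter as described in the procedure.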
Collect the fault information.
- On FusionInsight Manager, choose .
- Select the Flink service in the required cluster for Service.
- Expand the Hosts drop-down list. In the Select Host dialog box that is displayed, select the hosts to which the role belongs, and click OK.
- Click in the upper right corner, and set Start Date and End Date for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click Download.
- Contact O&M personnel and provide the collected logs.
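The Start Date and End Date values for log collection can be derived from the alarm generation time. The sketch below assumes GNU `date` (for the `-d` option); the function name `log_collection_window` and the timestamp format are illustrative choices, not something the Manager UI requires.

```shell
# 10 minutes before and after the alarm, as stated in the procedure.
WINDOW_SECONDS=600

# log_collection_window "YYYY-MM-DD HH:MM:SS"
# Prints the start and end of the log collection window in local time.
log_collection_window() {
    alarm_epoch=$(date -d "$1" +%s) || return 1
    window_fmt='+%Y-%m-%d %H:%M:%S'
    echo "Start Date: $(date -d "@$((alarm_epoch - WINDOW_SECONDS))" "$window_fmt")"
    echo "End Date: $(date -d "@$((alarm_epoch + WINDOW_SECONDS))" "$window_fmt")"
}

# Usage (the timestamp is an example value):
# log_collection_window "2024-05-01 12:00:00"
```

Working in epoch seconds avoids ambiguity in how relative expressions such as "10 minutes ago" are parsed when combined with an absolute timestamp.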
Alarm Clearance
This alarm is automatically cleared after the fault is rectified.
Related Information
None