ALM-45638 Number of Restarts After Flink Job Failures Exceeds the Threshold

Description

The system checks the number of FlinkServer job restarts at intervals defined by the parameter metrics.reporter.alarm.interval (default value: 30s). This alarm is generated when the number exceeds the configured threshold. This alarm is cleared when the job is restarted.

The alarm threshold is calculated as follows: Threshold = Percentage of failed restarts that trigger an alarm x Maximum number of allowed job restarts. The default threshold in MRS 3.2.1 or later is 3 (calculated by rounding up 80% x 3).

Percentage of failed restarts that trigger an alarm: specified by the parameter metrics.reporter.alarm.job.alarm.failure.restart.rate. The default value is 80.
Maximum number of allowed job restarts: The value varies depending on the restart strategy configured for the current Flink job. The default restart strategy is fixed-delay in MRS 3.2.1 or later.
- If the restart strategy is none, this alarm is not triggered. (The default restart strategy is none in versions earlier than MRS 3.2.1).
- If the restart strategy is failure-rate, the default maximum number of allowed restarts is 1 (restart-strategy.failure-rate.max-failures-per-interval).
  In this case, the alarm threshold is 1 (calculated by rounding up 80% × 1). This means that if the job restarts even once, the system will generate an alarm.
- If the restart strategy is fixed-delay, the default maximum number of allowed restarts is 3 (restart-strategy.fixed-delay.attempts).
  In this case, the alarm threshold is 3 (calculated by rounding up 80% × 3). This means that if the job restarts three times, the system will generate an alarm.

This section applies to MRS 3.2.0-LTS.1 or later.

Attribute

Alarm ID	Alarm Severity	Auto Clear
45638	Major	Yes

Parameters

Name	Meaning
Source	Specifies the cluster for which the alarm is generated.
ServiceName	Specifies the service for which the alarm is generated.
JobName	Specifies the job for which the alarm is generated.
Username	Specifies the username of the job for which the alarm is generated.

Impact on the System

If the number of Flink job restarts exceeds the configured threshold, the job fails repeatedly and restarts frequently. In such cases, the job developer should check the job logs to identify the root cause of the failure. This alarm is a Flink job alarm and does not affect FlinkServer.

Possible Causes

You can view the causes in the specific logs.

Procedure

Log in to Manager as a user who has the FlinkServer management permission.
Choose Cluster > Services > Yarn and click the link next to ResourceManager WebUI to go to the Yarn page.
Locate the failed job based on its name displayed in Location, search for and record the application ID of the failed job, and check whether the job logs are available on the Yarn page.

Figure 1 Application ID of a job

If yes, go to Step 4.

If no, go to Step 6.
Click the application ID of the failed job to go to the job page.
1. Click Logs in the Logs column to view JobManager logs.
  Figure 2 Clicking Logs
2. Click the ID in the Attempt ID column and click Logs in the Logs column to view TaskManager logs.
  Figure 3 Clicking the ID in the Attempt ID column
  
  Figure 4 Clicking Logs
  
  You can also log in to Manager as a user who has the FlinkServer management permission, choose Cluster > Services > Flink, click the link next to Flink WebUI. On the displayed Flink web UI, click Job Management and choose More > Job Monitoring in the Operation column to view the TaskManager logs.
View the logs of the failed job to rectify the fault, or contact the O&M personnel personnel and send the collected fault logs. No further action is required.

If logs are unavailable on the Yarn page, download logs from HDFS.

On Manager, choose Cluster > Services > HDFS, click the link next to NameNode WebUI to go to the HDFS page, select Utilities > Browse the file system, and download logs in the /tmp/logs/User name/logs/Application ID of the failed job directory.
View the logs of the failed job to rectify the fault, or contact the O&M personnel personnel and send the collected fault logs.

Alarm Clearing

This alarm is cleared when the Flink job is successfully restarted.

Related Information

None

Parent Topic: MRS Cluster Alarm Handling Reference

Previous topic: ALM-45638 Number of Restarts After FlinkServer Job Failures Exceeds the Threshold

Next topic: ALM-45639 Checkpointing of a Flink Job Times Out

Feedback

Was this page helpful?

Helpful Not helpful

Provide feedback

Thank you very much for your feedback. We will continue working to improve the documentation.

The system is busy. Please try again later.