ALM-45638 Number of Restarts After Flink Job Failures Exceeds the Threshold
Alarm Description
The system checks the number of times a FlinkServer job restarts at each alarm checking interval. This alarm is generated when the number of restarts exceeds the configured threshold. It is cleared when the job is successfully restarted.
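The check described above can be sketched roughly as follows. This is a minimal illustration only, not the FlinkServer implementation: the function names, the use of restart timestamps, and the window boundaries are assumptions made for the example. The only rule taken from this page is that the alarm is generated when the restart count exceeds the threshold.

```python
from datetime import datetime, timedelta


def count_restarts_in_window(restart_times, window_end, interval):
    """Count restarts that fall inside one alarm checking interval.

    restart_times: iterable of datetime objects (hypothetical input).
    window_end:    end of the checking interval.
    interval:      length of the checking interval as a timedelta.
    """
    window_start = window_end - interval
    return sum(1 for t in restart_times if window_start < t <= window_end)


def exceeds_threshold(restart_count, threshold):
    """Return True when the restart count exceeds the threshold.

    The comparison is strict because the alarm is generated when the
    number exceeds the configured threshold.
    """
    return restart_count > threshold
```

In this sketch, restart_count corresponds to the CurrentValue alarm parameter and threshold to ThreshHoldValue.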
Alarm Attributes
Alarm ID | Alarm Severity | Alarm Type | Service Type | Auto Cleared
---|---|---|---|---
45638 | Major | Quality of service | Flink | Yes
Alarm Parameters
Type | Parameter | Description
---|---|---
Location Information | Source | Specifies the cluster for which the alarm was generated.
Location Information | ServiceName | Specifies the service for which the alarm was generated.
Location Information | ApplicationName | Specifies the name of the application for which the alarm was generated.
Location Information | JobName | Specifies the job for which the alarm was generated.
Location Information | UserName | Specifies the username for which the alarm was generated.
Additional Information | ThreshHoldValue | Specifies the threshold value for triggering the alarm.
Additional Information | CurrentValue | Specifies the value that triggered the alarm.
Impact on the System
Flink jobs restart frequently because of failures, and the cause of the failures needs to be located. This is a job-level alarm and has no impact on FlinkServer.
Possible Causes
The cause depends on the job. View the JobManager and TaskManager logs of the failed job to identify it.
Handling Procedure
- Log in to Manager as a user who has the FlinkServer management permission.
- Choose Cluster > Services > Yarn and click the link next to ResourceManager WebUI to go to the native Yarn page.
- Locate the failed job based on the name displayed in the Location information of the alarm, search for and record the application ID of the job, and check whether the job logs are available on the native Yarn page.
Figure 1 Application ID of a job
- Click the application ID of the failed job to go to the job page.
- Click Logs in the Logs column to view JobManager logs.
Figure 2 Clicking Logs
- Click the ID in the Attempt ID column and click Logs in the Logs column to view TaskManager logs.
Figure 3 Clicking the ID in the Attempt ID column
Figure 4 Clicking Logs
You can also log in to Manager as a user who has the FlinkServer management permission. Choose Cluster > Services > Flink, and click the link next to Flink WebUI. On the displayed Flink web UI, click Job Management, click More in the Operation column, and select Job Monitoring to view TaskManager logs.
- View the logs of the failed job to rectify the fault. If the fault cannot be rectified, contact the O&M engineers and send them the collected fault logs. No further action is required.
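When the JobManager or TaskManager logs are long, a small script can narrow down the lines worth reading first. The following is a minimal sketch that assumes the log text has already been copied or saved locally. The function name and the search pattern are illustrative assumptions, not part of FlinkServer or Yarn, and the pattern may need adjusting for a specific job.

```python
import re

# Assumed pattern: log lines at ERROR level, lines mentioning an exception,
# and the "Caused by:" lines of Java stack traces.
SUSPECT_PATTERN = re.compile(r"\bERROR\b|Exception|Caused by:")


def find_suspect_lines(log_text, pattern=SUSPECT_PATTERN):
    """Return (line_number, line) pairs for lines that match the pattern.

    Line numbers start at 1 so they match what a text editor shows.
    """
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if pattern.search(line)
    ]
```

The result is only a starting point for reading the log. The lines around each match usually carry the context needed to rectify the fault.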
If logs are unavailable on the Yarn page, download logs from HDFS.
- On Manager, choose Cluster > Services > HDFS, click the link next to NameNode WebUI to go to the HDFS page, choose Utilities > Browse the file system, and download the logs from the /tmp/logs/Username/logs/Application ID directory, where Username is the username of the job and Application ID is the application ID of the failed job.
- View the logs of the failed job to rectify the fault, or contact the O&M engineers and send the collected fault logs.
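The log directory in the step above can be assembled from the UserName alarm parameter and the recorded application ID. The following is a minimal sketch with hypothetical helper names. It assumes that application IDs follow the usual Yarn form application_<timestamp>_<sequence>, and that, as an alternative to the web UI, a node with an HDFS client and sufficient permissions is available for the hdfs dfs -get command. The sketch only builds the command and does not run it.

```python
import posixpath
import re

# Usual Yarn application ID form, for example application_<digits>_<digits>.
APP_ID_PATTERN = re.compile(r"^application_\d+_\d+$")


def aggregated_log_dir(username, application_id, root="/tmp/logs"):
    """Build /tmp/logs/Username/logs/Application ID for a failed job."""
    if not username or "/" in username:
        raise ValueError("username must be a single non-empty path component")
    if not APP_ID_PATTERN.match(application_id):
        raise ValueError("unexpected application ID format: %r" % application_id)
    return posixpath.join(root, username, "logs", application_id)


def hdfs_download_command(username, application_id, local_dir):
    """Return the argument list for downloading the log directory.

    Assumes an HDFS client is installed; the command is not executed here.
    """
    source = aggregated_log_dir(username, application_id)
    return ["hdfs", "dfs", "-get", source, local_dir]
```

Validating the application ID before building the path avoids browsing to or downloading from a mistyped directory.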
Alarm Clearance
This alarm is cleared when the FlinkServer job is successfully restarted.
Related Information
None.