ALM-45648 RocksDB Frequently Encounters Write-Stopped
This section applies to MRS 3.3.0 or later.
Alarm Description
The system checks the RocksDB monitoring data of jobs at the user-specified alarm reporting interval (metrics.reporter.alarm.job.alarm.rocksdb.metrics.duration, 180 seconds by default). This alarm is generated when RocksDB for a job is continuously in the is-write-stopped state. This alarm is cleared when, within an alarm reporting interval, RocksDB for the job is no longer in the is-write-stopped state or is not continuously in it.
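The trigger and clear conditions can be sketched as follows. This is a minimal illustration, not the product's implementation: the source does not define how "continuously" is evaluated, so the sketch assumes one is-write-stopped sample per check (1 = writes stopped, 0 = writes allowed) and treats an interval as continuously stopped only when every sample in it is 1. The function name and the sample format are hypothetical.

```python
from typing import Sequence


def evaluate_write_stopped_alarm(samples: Sequence[int], alarm_active: bool) -> bool:
    """Return the alarm state after one alarm reporting interval.

    samples: is-write-stopped values seen during the interval
             (1 = writes stopped, 0 = writes allowed); hypothetical format.
    alarm_active: alarm state before this interval.
    """
    if not samples:
        # Assumption: with no monitoring data, keep the previous state.
        return alarm_active
    # Assumption: "continuously" means every sample in the interval is 1.
    return all(value == 1 for value in samples)


if __name__ == "__main__":
    print(evaluate_write_stopped_alarm([1, 1, 1], alarm_active=False))  # True
    print(evaluate_write_stopped_alarm([1, 0, 1], alarm_active=True))   # False
```

Under these assumptions, a single sample of 0 within the interval is enough to clear an active alarm.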
Alarm Attributes
Alarm ID | Alarm Severity | Auto Cleared |
---|---|---|
45648 | Minor | Yes |
Alarm Parameters
Parameter | Description |
---|---|
Source | Specifies the cluster for which the alarm is generated. |
ServiceName | Specifies the service for which the alarm is generated. |
ApplicationName | Specifies the name of the application for which the alarm is generated. |
RoleName | Specifies the role for which the alarm is generated. |
JobName | Specifies the job for which the alarm is generated. |
Impact on the System
This alarm has no adverse impact on the system.
Possible Causes
The possible causes are as follows:
- There are too many MemTables, and ALM-45643 MemTable Size of RocksDB Continuously Exceeds the Threshold is generated.
- There are too many SST files at level 0, and ALM-45644 Number of SST Files at Level 0 of RocksDB Continuously Exceeds the Threshold is generated.
- The estimated pending compaction size exceeds the threshold, and ALM-45647 Estimated Pending Compaction Size of RocksDB Continuously Exceeds the Threshold is generated.
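The three causes line up with the conditions under which RocksDB itself stops writes. The sketch below is illustrative only. The limit names (max_write_buffer_number, level0_stop_writes_trigger, hard_pending_compaction_bytes_limit) are RocksDB options, but the data classes, the function, and the numeric values in the usage note are assumptions, and the thresholds behind the related MRS alarms may be configured differently from these stop limits. The alarm IDs are returned only as pointers to the alarm to check first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RocksDbState:
    """Snapshot of the values behind the three causes (hypothetical structure)."""
    unflushed_memtables: int
    level0_files: int
    pending_compaction_bytes: int


@dataclass
class StopLimits:
    """Limits named after the corresponding RocksDB options."""
    max_write_buffer_number: int
    level0_stop_writes_trigger: int
    hard_pending_compaction_bytes_limit: int


def write_stop_causes(state: RocksDbState, limits: StopLimits) -> List[str]:
    """Return the related alarm ID for each stop condition that currently holds."""
    causes = []
    if state.unflushed_memtables >= limits.max_write_buffer_number:
        causes.append("ALM-45643")  # too many MemTables
    if state.level0_files >= limits.level0_stop_writes_trigger:
        causes.append("ALM-45644")  # too many SST files at level 0
    if state.pending_compaction_bytes >= limits.hard_pending_compaction_bytes_limit:
        causes.append("ALM-45647")  # estimated pending compaction size too large
    return causes
```

For example, with illustrative limits `StopLimits(2, 36, 256 * 1024**3)`, a state of `RocksDbState(2, 10, 0)` yields `["ALM-45643"]`, which points at the MemTable check as the first one to perform.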
Handling Procedure
Check whether there are too many MemTables.
- On FusionInsight Manager, choose O&M > Alarm > Alarms.
- In the alarm list, check whether ALM-45643 MemTable Size of RocksDB Continuously Exceeds the Threshold exists.
- Handle the alarm by following the instructions provided in section ALM-45643 MemTable Size of RocksDB Continuously Exceeds the Threshold.
- After ALM-45643 is cleared, wait a few minutes and check whether this alarm is cleared.
- If yes, no further action is required.
- If no, go to "Check whether the number of SST files at level 0 is too large."
Check whether the number of SST files at level 0 is too large.
- On FusionInsight Manager, choose O&M > Alarm > Alarms.
- In the alarm list, check whether ALM-45644 Number of SST Files at Level 0 of RocksDB Continuously Exceeds the Threshold exists.
- Handle the alarm by following the instructions provided in section ALM-45644 Number of SST Files at Level 0 of RocksDB Continuously Exceeds the Threshold.
- After ALM-45644 is cleared, wait a few minutes and check whether this alarm is cleared.
- If yes, no further action is required.
- If no, go to "Check whether the estimated compaction size exceeds the threshold."
Check whether the estimated compaction size exceeds the threshold.
- In the alarm list, check whether ALM-45647 Estimated Pending Compaction Size of RocksDB Continuously Exceeds the Threshold exists.
- Handle the alarm by following the instructions provided in section ALM-45647 Estimated Pending Compaction Size of RocksDB Continuously Exceeds the Threshold.
- After ALM-45647 is cleared, wait a few minutes and check whether this alarm is cleared.
- If yes, no further action is required.
- If no, go to "Collect fault information."
Collect fault information.
- Log in to Manager as a user who has the management permission for the current Flink job.
- Choose O&M > Alarm > Alarms > ALM-45648 RocksDB Frequently Encounters Write-Stopped, view Location, and obtain the name of the task for which the alarm is generated.
- Choose Cluster > Services > Yarn and click the link next to ResourceManager WebUI to go to the native Yarn page.
- Locate the abnormal task based on its name displayed in Location, search for and record the application ID of the job, and check whether the job logs are available on the Yarn page.
Figure 1 Application ID of a job
- Click the application ID of the failed job to go to the job page.
- Click Logs in the Logs column to view JobManager logs.
Figure 2 Clicking Logs
- Click the ID in the Attempt ID column and click Logs in the Logs column to view and save TaskManager logs.
Figure 3 Clicking the ID in the Attempt ID column
Figure 4 Clicking Logs
You can also log in to Manager as a user who has the management permission for the current Flink job. Choose Cluster > Services > Flink, and click the link next to Flink WebUI. On the displayed Flink web UI, click Job Management, click More in the Operation column, and select Job Monitoring to view TaskManager logs.
- View the job logs to rectify the fault, or contact the O&M personnel and send the collected fault logs. No further action is required.
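When reviewing saved logs, it can help to count how often each write-stop reason appears. The sketch below is a hedged example, not part of the product. It assumes that the log text contains RocksDB's "Stopping writes because ..." warnings. RocksDB normally writes these to its own info log, so whether they appear in the collected TaskManager logs depends on how RocksDB logging is configured for the job. The patterns, the function name, and the file name in the usage note are assumptions.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed message fragments from RocksDB's info log; verify them against your logs.
STALL_PATTERNS = {
    "memtables": re.compile(r"Stopping writes because we have \d+ immutable memtables"),
    "level0_files": re.compile(r"Stopping writes because we have \d+ level-0 files"),
    "pending_compaction": re.compile(
        r"Stopping writes because of estimated pending compaction bytes"
    ),
}


def count_write_stops(lines: Iterable[str]) -> Counter:
    """Count write-stop messages per cause in an iterable of log lines."""
    counts: Counter = Counter()
    for line in lines:
        for cause, pattern in STALL_PATTERNS.items():
            if pattern.search(line):
                counts[cause] += 1
                break  # one cause per line is enough
    return counts
```

A possible usage is `count_write_stops(open("taskmanager.log", encoding="utf-8", errors="replace"))`, where `taskmanager.log` stands for a log file saved in the preceding steps. The cause with the highest count indicates which related alarm to revisit.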
If logs are unavailable on the Yarn page, download logs from HDFS.
- On Manager, choose Cluster > Services > HDFS and click the link next to NameNode WebUI to go to the HDFS page. Choose Utilities > Browse the file system, and download the logs from the following directory: /tmp/logs/<Username>/bucket-logs-tfile/<Last four digits of the task application ID>/<Application ID of the task>
- View the logs of the failed job to rectify the fault, or contact the O&M personnel and send the collected fault logs.
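As a worked example of the directory layout described above, the following sketch builds the HDFS path from a username and a Yarn application ID. It assumes the usual application_<cluster timestamp>_<sequence number> ID format. The function name and the sample values are hypothetical.

```python
import re


def aggregated_log_dir(username: str, application_id: str) -> str:
    """Build the HDFS directory that holds the logs of one application."""
    # Assumed ID format: application_<cluster timestamp>_<sequence number>.
    if re.fullmatch(r"application_\d+_\d{4,}", application_id) is None:
        raise ValueError(f"unexpected application ID: {application_id!r}")
    last_four = application_id[-4:]  # last four digits of the application ID
    return f"/tmp/logs/{username}/bucket-logs-tfile/{last_four}/{application_id}"


if __name__ == "__main__":
    print(aggregated_log_dir("flinkuser", "application_1700000000000_0042"))
    # /tmp/logs/flinkuser/bucket-logs-tfile/0042/application_1700000000000_0042
```

The resulting path can be pasted into the Browse the file system page of the NameNode web UI.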
Alarm Clearance
This alarm is automatically cleared after the fault is rectified.
Related Information
None.