ALM-45644 Number of SST Files at Level 0 of RocksDB Continuously Exceeds the Threshold

This section applies to MRS 3.3.0 or later.

Alarm Description

The system checks the RocksDB monitoring data of jobs at the user-specified alarm reporting interval (metrics.reporter.alarm.job.alarm.rocksdb.metrics.duration, 180s by default). This alarm is generated when the number of SST files at level 0 of RocksDB for a job continuously exceeds the threshold (state.backend.rocksdb.level0_slowdown_writes_trigger, 20 by default). This alarm is cleared when the number of SST files at level 0 of RocksDB for the job is less than or equal to the threshold.
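The trigger and clear conditions described above can be sketched as follows. This is only an illustration: the names `Sample`, `should_raise`, and `should_clear` are hypothetical, and reading "continuously" as "every sample inside the reporting window is above the threshold" is an assumption, not a statement of how MRS evaluates the alarm internally.

```python
from dataclasses import dataclass
from typing import Sequence

# Defaults quoted in the alarm description.
ALARM_DURATION_S = 180   # metrics.reporter.alarm.job.alarm.rocksdb.metrics.duration
SLOWDOWN_TRIGGER = 20    # state.backend.rocksdb.level0_slowdown_writes_trigger


@dataclass(frozen=True)
class Sample:
    """One hypothetical monitoring sample for a job."""
    timestamp_s: float   # time at which the metric was sampled, in seconds
    level0_files: int    # number of SST files at level 0 at that time


def should_raise(samples: Sequence[Sample], now_s: float,
                 threshold: int = SLOWDOWN_TRIGGER,
                 duration_s: float = ALARM_DURATION_S) -> bool:
    """ASSUMED rule: every sample in the last `duration_s` seconds exceeds the threshold."""
    window = [s for s in samples if now_s - duration_s <= s.timestamp_s <= now_s]
    # An empty window never raises the alarm.
    return bool(window) and all(s.level0_files > threshold for s in window)


def should_clear(latest: Sample, threshold: int = SLOWDOWN_TRIGGER) -> bool:
    """The alarm clears once the count is less than or equal to the threshold."""
    return latest.level0_files <= threshold
```

With the default values, a job that reports 21 or more level-0 files for the whole 180-second window would raise the alarm in this sketch, and a later sample of 20 or fewer would clear it.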

Alarm Attributes

Alarm ID	Alarm Severity	Auto Cleared
45644	Minor	Yes

Alarm Parameters

Parameter	Description
Source	Specifies the cluster for which the alarm is generated.
ServiceName	Specifies the service for which the alarm is generated.
ApplicationName	Specifies the name of the application for which the alarm is generated.
RoleName	Specifies the role for which the alarm is generated.
JobName	Specifies the job for which the alarm is generated.

Impact on the System

The checkpoint performance of Flink jobs deteriorates. FlinkServer is not affected.

Possible Causes

Possible causes are as follows:

The compaction pressure of RocksDB is too high, and ALM-45646 Pending Compaction Size of RocksDB Continuously Exceeds the Threshold or ALM-45647 Estimated Pending Compaction Size of RocksDB Continuously Exceeds the Threshold is generated.
There are too many SST files at level 0.

Handling Procedure

Check whether the compaction pressure of RocksDB is too high and ALM-45646 is generated.

On FusionInsight Manager, choose O&M > Alarm > Alarms.
In the alarm list, check whether ALM-45646 Pending Compaction Size of RocksDB Continuously Exceeds the Threshold exists.
- If yes, go to Step 3.
- If no, go to Step 5.
Handle the alarm by following the instructions provided in section ALM-45646 Pending Compaction Size of RocksDB Continuously Exceeds the Threshold.
After ALM-45646 is cleared, wait a few minutes and check whether this alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 5.

Check whether the compaction pressure of RocksDB is too high and ALM-45647 is generated.

In the alarm list, check whether ALM-45647 Estimated Pending Compaction Size of RocksDB Continuously Exceeds the Threshold exists.
- If yes, go to Step 6.
- If no, go to Step 8.
Handle the alarm by following the instructions provided in section ALM-45647 Estimated Pending Compaction Size of RocksDB Continuously Exceeds the Threshold.
After ALM-45647 is cleared, wait a few minutes and check whether this alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 8.

Check whether there are too many SST files at level 0.

Log in to FusionInsight Manager as a user who has the FlinkServer management permission.

For details about how to create a user with the FlinkServer permissions, see Creating a FlinkServer Role.
Choose O&M > Alarm > Alarms > ALM-45644 Number of SST Files at Level 0 of RocksDB Continuously Exceeds the Threshold, view Location, and obtain the name of the task for which the alarm is generated.
Choose Cluster > Services > Yarn and click the link next to ResourceManager WebUI to go to the native Yarn page.

Locate the abnormal task based on its name displayed in Location, search for and record the application ID of the job, and check whether the job logs are available on the Yarn page.

Figure 1 Application ID of a job
- If yes, go to Step 12.
- If no, go to Step 13.
Click the application ID of the job to go to the job page.
1. Click Logs in the Logs column to view JobManager logs.
  Figure 2 Clicking Logs
2. Click the ID in the Attempt ID column and click Logs in the Logs column to view and save TaskManager logs. Then go to Step 14.
  Figure 3 Clicking the ID in the Attempt ID column
  
  Figure 4 Clicking Logs
  
  You can also log in to Manager as a user who has the management permission for the current Flink job. Choose Cluster > Services > Flink, and click the link next to Flink WebUI. On the displayed Flink web UI, click Job Management, click More in the Operation column, and select Job Monitoring to view TaskManager logs.

If logs are unavailable on the Yarn page, download logs from HDFS.

On Manager, choose Cluster > Services > HDFS, click the link next to NameNode WebUI to go to the HDFS page, choose Utilities > Browse the file system, and download logs from the directory /tmp/logs/<Username>/bucket-logs-tfile/<Last four digits of the task application ID>/<Application ID of the task>.
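The directory layout above can be assembled programmatically once the user name and application ID are known. The following is a minimal sketch; the function name `hdfs_log_dir`, the user `flinkuser`, and the application ID in the usage note are hypothetical, and the code assumes the usual Yarn ID form `application_<cluster timestamp>_<sequence number>`.

```python
import re


def hdfs_log_dir(username: str, application_id: str) -> str:
    """Build the HDFS log directory for a job from its user and application ID."""
    # ASSUMPTION: IDs look like application_<digits>_<at least four digits>.
    match = re.fullmatch(r"application_\d+_(\d{4,})", application_id)
    if match is None:
        raise ValueError(f"unexpected application ID format: {application_id!r}")
    last_four = match.group(1)[-4:]   # last four digits of the application ID
    return f"/tmp/logs/{username}/bucket-logs-tfile/{last_four}/{application_id}"
```

For example, `hdfs_log_dir("flinkuser", "application_1700000000000_0012")` yields `/tmp/logs/flinkuser/bucket-logs-tfile/0012/application_1700000000000_0012`, which can then be located in Browse the file system.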

Check whether the number of SST files at level 0 is too large.

Check whether the value of rocksdb.num-files-at-level0 in TaskManager monitoring logs (keyword RocksDBMetricPrint) is greater than or equal to the value of state.backend.rocksdb.level0_slowdown_writes_trigger or state.backend.rocksdb.level0_stop_writes_trigger.

If yes, modify the values of the following custom parameters on the job development page of the Flink WebUI, save the modification, and go to Step 15.

**Table 1** Custom parameters
Parameter	Default Value	Description
state.backend.rocksdb.level0_slowdown_writes_trigger	20	Number of files at level 0 that triggers a write slowdown. Recommended value: 20 to 30
state.backend.rocksdb.level0_stop_writes_trigger	36	Maximum number of files at level 0; reaching it triggers a write stop. Recommended value: 36 to 46

If no, go to Step 16.
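The comparison in this step can be automated for large TaskManager logs. The sketch below is illustrative only: the exact layout of the RocksDBMetricPrint log lines is not documented here, so the code assumes the metric name is followed by a short separator (such as `=` or `: `) and an integer. The function names `max_level0_files` and `classify` are hypothetical, and the default thresholds are the defaults listed in Table 1.

```python
import re
from typing import Iterable, Optional

KEYWORD = "RocksDBMetricPrint"
METRIC = "rocksdb.num-files-at-level0"
# ASSUMPTION: metric name, then 1 to 8 non-digit separator characters, then the value.
_VALUE_RE = re.compile(re.escape(METRIC) + r"\D{1,8}(\d+)")


def max_level0_files(lines: Iterable[str]) -> Optional[int]:
    """Return the largest level-0 file count found in monitoring lines, or None."""
    values = []
    for line in lines:
        if KEYWORD not in line:
            continue  # only lines carrying the monitoring keyword are considered
        values.extend(int(v) for v in _VALUE_RE.findall(line))
    return max(values) if values else None


def classify(level0_files: int, slowdown_trigger: int = 20,
             stop_trigger: int = 36) -> str:
    """Compare a count against the two triggers using 'greater than or equal to'."""
    if level0_files >= stop_trigger:
        return "stop"
    if level0_files >= slowdown_trigger:
        return "slowdown"
    return "ok"
```

A possible use is to open a saved TaskManager log (for example a hypothetical file `taskmanager.log`), pass the file object to `max_level0_files`, and call `classify` on the result with the job's configured trigger values. A result of `slowdown` or `stop` corresponds to the "If yes" branch of this step.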

Restart the job and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 16.
Contact O&M personnel and send the collected logs.