ALM-45649 P95 Latency of RocksDB Get Requests Continuously Exceeds the Threshold
Alarm Description
The system checks the RocksDB monitoring data of jobs at the user-specified alarm reporting interval (metrics.reporter.alarm.job.alarm.rocksdb.metrics.duration, 180s by default). This alarm is generated when the P95 latency of RocksDB Get requests exceeds the threshold (metrics.reporter.alarm.job.alarm.rocksdb.get.micros.threshold, 50000 microseconds by default). This alarm is cleared when the P95 latency of RocksDB Get requests is less than or equal to the threshold.
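For illustration only, the trigger condition amounts to a threshold comparison performed once per reporting interval. The sketch below uses hypothetical names that are not part of the product; it merely restates the two parameters and the comparison:

```java
// Illustrative sketch only (hypothetical class and method names, not product code).
// The alarm fires when the reported P95 latency of RocksDB Get requests exceeds the
// configured threshold and clears once the value is back at or below the threshold.
public class RocksDbGetP95Check {
    // metrics.reporter.alarm.job.alarm.rocksdb.metrics.duration, 180s by default
    static final long REPORTING_INTERVAL_SECONDS = 180;
    // metrics.reporter.alarm.job.alarm.rocksdb.get.micros.threshold, 50000 microseconds by default
    static final long GET_P95_THRESHOLD_MICROS = 50_000;

    /** Evaluated once per reporting interval for each job. */
    static boolean alarmRaised(long reportedGetP95Micros) {
        return reportedGetP95Micros > GET_P95_THRESHOLD_MICROS;
    }
}
```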
Alarm Attributes
Alarm ID | Alarm Severity | Alarm Type | Service Type | Auto Cleared
---|---|---|---|---
45649 | Minor | Quality of service | Flink | Yes
Alarm Parameters
Type | Parameter | Description
---|---|---
Location Information | Source | Specifies the cluster for which the alarm was generated.
Location Information | ServiceName | Specifies the service for which the alarm was generated.
Location Information | ApplicationName | Specifies the name of the application for which the alarm was generated.
Location Information | JobName | Specifies the job for which the alarm was generated.
Location Information | UserName | Specifies the username for which the alarm was generated.
Additional Information | ThreshHoldValue | Specifies the threshold value for triggering the alarm.
Additional Information | CurrentValue | Specifies the value that triggered the alarm.
Impact on the System
The checkpoint performance of Flink jobs is affected. There is no impact on FlinkServer.
Possible Causes
The possible causes are as follows:
- There are too many SST files at level 0, causing slow queries. In addition, ALM-45644 Number of SST Files at Level 0 of RocksDB Continuously Exceeds the Threshold is generated.
- The cache hit ratio is lower than 60%, causing frequent swap-ins and swap-outs of the block cache.
Handling Procedure
Check whether the number of SST files at level 0 is too large.
1. On FusionInsight Manager, choose O&M > Alarm > Alarms.
2. In the alarm list, check whether ALM-45644 Number of SST Files at Level 0 of RocksDB Continuously Exceeds the Threshold exists.
3. Handle the alarm by following the instructions provided in section ALM-45644 Number of SST Files at Level 0 of RocksDB Continuously Exceeds the Threshold.
4. After ALM-45644 is cleared, wait a few minutes and check whether this alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 5.
Check the cache hit ratio in TaskManager logs and collect logs.
5. Log in to FusionInsight Manager as a user who has the FlinkServer management permission.
6. Choose O&M > Alarm > Alarms > ALM-45649 P95 Latency of RocksDB Get Requests Continuously Exceeds the Threshold, view Location, and obtain the name of the task for which the alarm is generated.
7. Choose Cluster > Services > Yarn and click the link next to ResourceManager WebUI to go to the native Yarn page.
8. Locate the abnormal task based on its name displayed in Location, search for and record the application ID of the job, and check whether the job logs are available on the Yarn page.
Figure 1 Application ID of a job
9. Click the application ID of the failed job to go to the job page.
   - Click Logs in the Logs column to view JobManager logs.
Figure 2 Clicking Logs
   - Click the ID in the Attempt ID column and click Logs in the Logs column to view and save TaskManager logs. Then go to 11.
Figure 3 Clicking the ID in the Attempt ID column
Figure 4 Clicking Logs
You can also log in to Manager as a user who has the management permission for the current Flink job. Choose Cluster > Services > Flink, and click the link next to Flink WebUI. On the displayed Flink web UI, click Job Management, click More in the Operation column, and select Job Monitoring to view TaskManager logs.
10. If logs are unavailable on the Yarn page, download logs from HDFS.
    - On Manager, choose Cluster > Services > HDFS, click the link next to NameNode WebUI to go to the HDFS page, choose Utilities > Browse the file system, and download the logs in the /tmp/logs/Username/bucket-logs-tfile/Last four digits of the task application ID/Application ID of the task directory.
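If the NameNode web UI is inconvenient, the same directory can be copied with the Hadoop FileSystem API. The following is a minimal sketch, not part of the official procedure; it assumes the client has the cluster's core-site.xml and hdfs-site.xml on its classpath and that the placeholder path segments are replaced with the actual username and application ID:

```java
// Minimal sketch (assumptions: Hadoop client configuration on the classpath, placeholders
// replaced with real values). Copies the aggregated job logs from HDFS to the local disk.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DownloadYarnLogs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads core-site.xml and hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path remote = new Path(
                "/tmp/logs/<username>/bucket-logs-tfile/<last four digits of application ID>/<application ID>");
            Path local = new Path("/tmp/taskmanager-logs");  // local destination directory
            fs.copyToLocalFile(remote, local);               // recursively copies the log directory
        }
    }
}
```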
Check whether the cache hit ratio is too low.
11. Check the values of rocksdb.block.cache.hit (cache hits) and rocksdb.block.cache.miss (cache misses) in the TaskManager monitoring logs (keyword RocksDBMetricPrint). Calculate the hit ratio using the following formula (a scripted version is sketched after this step) and check whether it is less than 60%:
    rocksdb.block.cache.hit / (rocksdb.block.cache.hit + rocksdb.block.cache.miss)
    - If yes, adjust the values of the custom parameters in Table 1 (see also the configuration sketch after this step) on the job development page of the Flink web UI, save the settings, and go to 12.
Table 1 Custom parameters

Parameter | Default Value | Description
---|---|---
state.backend.rocksdb.block.cache-size | 8MB (256MB when SPINNING_DISK_OPTIMIZED_HIGH_MEM is enabled) | Cache size. 8MB to 1GB is recommended.
state.backend.rocksdb.block.blocksize | 4KB (128KB when SPINNING_DISK_OPTIMIZED_HIGH_MEM is enabled) | Block size. 4KB to 256KB is recommended.
state.backend.rocksdb.use-bloom-filter | false | Whether to speed up indexing. If set to true, each new SST file contains a Bloom filter. true is recommended.
    - If no, go to 13.
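The hit ratio can also be computed from a saved TaskManager log with a small program. The sketch below is illustrative only: it assumes each counter is printed as its metric name followed by a number, so the regular expressions may need to be adapted to the actual log format.

```java
// Illustrative sketch: extracts the latest rocksdb.block.cache.hit and rocksdb.block.cache.miss
// values from a TaskManager log file and applies the hit-ratio formula from step 11.
// The log line format is an assumption; adjust the patterns if the counters are printed differently.
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlockCacheHitRatio {
    public static void main(String[] args) throws Exception {
        String log = new String(Files.readAllBytes(Paths.get(args[0])));  // path to the TaskManager log
        long hit = lastValue(log, "rocksdb\\.block\\.cache\\.hit\\D*(\\d+)");
        long miss = lastValue(log, "rocksdb\\.block\\.cache\\.miss\\D*(\\d+)");
        double ratio = (hit + miss) == 0 ? 0.0 : (double) hit / (hit + miss);
        System.out.printf("hit=%d miss=%d hit ratio=%.2f%%%n", hit, miss, ratio * 100);
        System.out.println(ratio < 0.60
                ? "Hit ratio is below 60%: adjust the parameters in Table 1."
                : "Hit ratio is not below 60%.");
    }

    /** Returns the value captured by the last match of the pattern, or 0 if there is no match. */
    private static long lastValue(String log, String regex) {
        Matcher m = Pattern.compile(regex).matcher(log);
        long value = 0;
        while (m.find()) {
            value = Long.parseLong(m.group(1));
        }
        return value;
    }
}
```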
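On FlinkServer, set these keys as custom parameters on the job development page, as described above. If a job is instead submitted directly with the Flink DataStream API, the same keys (they are standard Flink RocksDB state backend options) can be passed through the job configuration. The following is a minimal sketch assuming Flink 1.15 or later and the larger values suggested in Table 1; whether job-level settings override cluster-level defaults depends on the deployment.

```java
// Minimal sketch (assumes Flink 1.15+ and a job submitted via the DataStream API).
// The keys below are the same ones described in Table 1.
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setString("state.backend.rocksdb.block.cache-size", "256mb");  // larger block cache
        conf.setString("state.backend.rocksdb.block.blocksize", "128kb");   // larger block size
        conf.setString("state.backend.rocksdb.use-bloom-filter", "true");   // Bloom filter in each new SST file

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        // ... define sources, transformations, and sinks as usual, then env.execute() ...
    }
}
```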
12. Restart the job and check whether the alarm is cleared.
    - If yes, no further action is required.
    - If no, go to 13.
13. Contact O&M personnel and send the collected logs.
Alarm Clearance
This alarm is automatically cleared after the fault is rectified.
Related Information
None.