Help Center/ MapReduce Service/ Developer Guide (LTS)/ Flink Development Guide (Normal Mode)/ More Information/ FAQ/ What If Checkpoint Is Executed Slowly in RocksDBStateBackend Mode When the Data Amount Is Large
Updated on 2022-11-18 GMT+08:00

What If Checkpoint Is Executed Slowly in RocksDBStateBackend Mode When the Data Amount Is Large

Question

What to do if checkpoint is executed slowly in RocksDBStateBackend mode when the data amount is large?

Cause Analysis

Customized windows are used and the window state is ListState. There are many values under the same key. In the case of a new value, the merge operation of RocksDB is used. When calculation is triggered, all values under the key are read.

  • The RocksDB mode is merge() > merge() ... > merge() > read(). Data reading in this mode consumes much time, as shown in Figure 1.
  • The source operator sends a large amount of data in a short period of time and the data keys are the same. The window operator fails to process data fast enough and barriers accumulate in the cache. The time consumed for snapshot preparation is too long and the window operator cannot report snapshot completion to CheckpointCoordinator in the specified time. Therefore, CheckpointCoordinator determines snapshot preparation failure, as shown in Figure 2.
Figure 1 Time monitoring
Figure 2 Relationship

Answer

Flink introduces the third-party software package RocksDB, whose defect causes the problem. You are advised to set checkpoint to FsStateBackend mode.

Set checkpoint to FsStateBackend mode in the application code as follows:

env.setStateBackend(new FsStateBackend("hdfs://hacluster/flink/checkpoint/"));