Updated on 2022-09-14 GMT+08:00

What Should I Do If Running a Checkpoint Is Slow When RocksDBStateBackend is Set for the Checkpoint and a Large Amount of Data Exists?

Issue

What should I do if running a checkpoint is slow when RocksDBStateBackend is set for the checkpoint and a large amount of data exists?

Possible Causes

A custom window is used, and its window state is a ListState that holds many values under the same key. Each time a new value is added, the merge() operation of RocksDB is invoked. When the window calculation is triggered, all values under that key are read.

  • The RocksDB access pattern is merge()->merge()....->merge()->read(). Reading data in this pattern is time-consuming, as shown in Figure 1.
  • When a source operator sends a large amount of data in an instant and all of that data has the same key, the window operator slows down. Barriers then accumulate in the buffer, which delays the completion of snapshot creation. The window operator cannot report a successful snapshot to CheckpointCoordinator in time, so CheckpointCoordinator considers that snapshot creation has failed. Figure 2 shows the data flow.
    Figure 1 Time monitoring information
Figure 2 Data flow
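
The merge()->merge()....->merge()->read() pattern described above can be illustrated with a toy model. The following is only a minimal sketch in plain Java, not RocksDB or Flink code: the class MergeReadCostSketch and its methods are hypothetical names introduced for illustration. It assumes that each merge() simply appends one operand under a key and that read() must visit every operand stored under that key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only (hypothetical class, not a RocksDB API).
public class MergeReadCostSketch {
    // Operands appended per key, in arrival order.
    private final Map<String, List<String>> operands = new HashMap<>();
    // Counts how many operands read() has had to visit so far.
    private long operandsVisited = 0;

    // Appending a value is cheap: one operand is added under the key.
    public void merge(String key, String value) {
        operands.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Reading a key visits every operand that was merged under it.
    public List<String> read(String key) {
        List<String> result = new ArrayList<>();
        for (String value : operands.getOrDefault(key, Collections.emptyList())) {
            result.add(value);
            operandsVisited++;
        }
        return result;
    }

    public long operandsVisited() {
        return operandsVisited;
    }

    public static void main(String[] args) {
        int records = 100_000;

        // Case 1: every record has the same key, as in the burst described above.
        MergeReadCostSketch hotKey = new MergeReadCostSketch();
        for (int i = 0; i < records; i++) {
            hotKey.merge("same-key", "v" + i);
        }
        hotKey.read("same-key");
        System.out.println("same key, operands visited by one read: " + hotKey.operandsVisited());

        // Case 2: records are spread over distinct keys.
        MergeReadCostSketch spreadKeys = new MergeReadCostSketch();
        for (int i = 0; i < records; i++) {
            spreadKeys.merge("key-" + i, "v" + i);
        }
        spreadKeys.read("key-0");
        System.out.println("distinct keys, operands visited by one read: " + spreadKeys.operandsVisited());
    }
}
```

In this model the cost of a single read grows with the number of values merged under the key, which is why a burst of records that share one key makes the window calculation slow.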

Answer

The problem is caused by a defect in RocksDB, a third-party software package that Flink uses. You are advised to set the checkpoint state backend to FsStateBackend.

The following example shows how to set the checkpoint state backend to FsStateBackend in the application code:

 env.setStateBackend(new FsStateBackend("hdfs://hacluster/flink-checkpoint/checkpoint/"));
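
For context, the line above might sit in an application as follows. This is a minimal sketch, not the product's reference implementation: the class name FsStateBackendCheckpointSketch and the 60-second checkpoint interval are assumptions made for illustration, and the import path assumes a Flink 1.x version in which FsStateBackend is in the org.apache.flink.runtime.state.filesystem package.

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical class name, used only for this sketch.
public class FsStateBackendCheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Assumption: a 60 s checkpoint interval, chosen only for illustration.
        env.enableCheckpointing(60_000L);

        // Write checkpoints to the HDFS path used in the example above.
        env.setStateBackend(new FsStateBackend("hdfs://hacluster/flink-checkpoint/checkpoint/"));

        // Define the sources, window operators, and sinks here, then call env.execute(...).
    }
}
```

FsStateBackend keeps working state in TaskManager memory and writes only the checkpoint data to the file system, so the TaskManager heap must be large enough to hold the window state.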