What If Checkpoint Is Executed Slowly in RocksDBStateBackend Mode When the Data Amount Is Large
Question
What to do if checkpoint is executed slowly in RocksDBStateBackend mode when the data amount is large?
Cause Analysis
Customized windows are used and the window state is ListState. There are many values under the same key. In the case of a new value, the merge operation of RocksDB is used. When calculation is triggered, all values under the key are read.
- The RocksDB mode is merge() > merge() ... > merge() > read(). Data reading in this mode consumes much time, as shown in Figure 1.
- The source operator sends a large amount of data in a short period of time and the data keys are the same. The window operator fails to process data fast enough and barriers accumulate in the cache. The time consumed for snapshot preparation is too long and the window operator cannot report snapshot completion to CheckpointCoordinator in the specified time. Therefore, CheckpointCoordinator determines snapshot preparation failure, as shown in Figure 2.
Answer
Flink introduces the third-party software package RocksDB, whose defect causes the problem. You are advised to set checkpoint to FsStateBackend mode.
Set checkpoint to FsStateBackend mode in the application code as follows:
env.setStateBackend(new FsStateBackend("hdfs://hacluster/flink/checkpoint/"));
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.