What Should I Do If Running a Checkpoint Is Slow When RocksDBStateBackend Is Set for the Checkpoint and a Large Amount of Data Exists?
Issue
What should I do if running a checkpoint is slow when RocksDBStateBackend is set for the checkpoint and a large amount of data exists?
Possible Causes
A custom window is used with ListState as the window state, and many values accumulate under the same key. Each time a new value is added, RocksDB performs a merge() operation; when the window calculation is triggered, all values under that key must be read back.
- The RocksDB access pattern is merge()->merge()...->merge()->read(), which makes reading the data time-consuming, as shown in Figure 1.
- When a source operator emits a burst of data whose records all share the same key, processing in the window operator slows down. As a result, barriers accumulate in the input buffer and snapshot completion is delayed. The window operator cannot report snapshot success to the CheckpointCoordinator in time, so the CheckpointCoordinator considers the snapshot to have failed. Figure 2 shows the data flow.
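The cost pattern described above can be illustrated with a minimal plain-Java sketch. The class below is hypothetical (it is not a Flink or RocksDB API): it mimics how a RocksDB-backed ListState records each add() as a separate merge operand and only combines them when the state is read, which is why a hot key with many values makes reads slow.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; names are hypothetical, not Flink/RocksDB APIs.
class MergeListState {
    // Each add() stores a separate merge operand, as RocksDB's merge() does.
    private final List<List<String>> deltas = new ArrayList<>();

    // Cheap: only appends a merge operand; nothing is combined yet.
    void add(String value) {
        deltas.add(Collections.singletonList(value));
    }

    // Expensive: the merge()->merge()->...->read() step. Every operand
    // recorded for this key must be replayed to produce the full list.
    List<String> get() {
        List<String> merged = new ArrayList<>();
        for (List<String> delta : deltas) {
            merged.addAll(delta);
        }
        return merged;
    }

    // Number of un-merged operands pending for this key.
    int pendingOperands() {
        return deltas.size();
    }
}
```

With many values under one key, add() stays fast while get() degrades linearly in the number of recorded operands, which matches the slow-read behavior described above.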
Answer
This problem is caused by a defect in RocksDB, a third-party package that Flink depends on. You are advised to use FsStateBackend for checkpoints instead.
The following example shows how to set the checkpoint state backend to FsStateBackend in the application code:
env.setStateBackend(new FsStateBackend("hdfs://hacluster/flink-checkpoint/checkpoint/"));
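As an alternative to setting the backend in code, the same choice can be made cluster-wide in flink-conf.yaml. A minimal sketch, assuming the same HDFS checkpoint path as above:

```yaml
# Use the filesystem (FsStateBackend) state backend instead of RocksDB.
state.backend: filesystem
# Directory where checkpoint data is written.
state.checkpoints.dir: hdfs://hacluster/flink-checkpoint/checkpoint/
```

A setting made in application code via env.setStateBackend(...) takes precedence over the cluster-wide configuration.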