How Can I Check if a Flink Job Can Be Restored From a Checkpoint After Restarting It?
What Is Restoration from a Checkpoint?
Checkpointing is Flink's fault tolerance and recovery mechanism. It periodically snapshots the state of a running job so that a real-time program can recover on its own after an exception or a machine failure.
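The following is a minimal sketch of the idea, not Flink code. It assumes a toy job that sums a list of records, snapshots its state every few records, and resumes from the last snapshot after a simulated failure. The names `CountingJob`, `run_with_recovery`, `checkpoint_every`, and `fail_at` are hypothetical and exist only for this illustration.

```python
import copy


class CountingJob:
    """Toy stand-in for a stateful streaming job (not the Flink API)."""

    def __init__(self):
        self.state = {"offset": 0, "total": 0}  # operator state
        self.checkpoint = None                   # last completed snapshot

    def take_checkpoint(self):
        # Snapshot the state so that it can be reloaded after a failure.
        self.checkpoint = copy.deepcopy(self.state)

    def restore(self):
        # Roll back to the last snapshot, or to the initial state if none exists.
        if self.checkpoint is None:
            self.state = {"offset": 0, "total": 0}
        else:
            self.state = copy.deepcopy(self.checkpoint)

    def process(self, records, checkpoint_every=3, fail_at=None):
        # Resume from the stored offset so that no record is counted twice.
        while self.state["offset"] < len(records):
            i = self.state["offset"]
            if fail_at is not None and i == fail_at:
                raise RuntimeError("simulated machine failure")
            self.state["total"] += records[i]
            self.state["offset"] = i + 1
            if self.state["offset"] % checkpoint_every == 0:
                self.take_checkpoint()
        return self.state["total"]


def run_with_recovery(records, fail_at):
    """Run the job, and on failure restore from the last checkpoint and retry."""
    job = CountingJob()
    try:
        return job.process(records, fail_at=fail_at)
    except RuntimeError:
        job.restore()
        return job.process(records)  # the simulated failure does not repeat
```

In this sketch, records processed after the last snapshot are replayed on restart, which is why the final sum matches a run without any failure.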
Principles for Restoration from Checkpoints
- If a job fails or a resource restarts because of an exception that was not caused by a manual operation, the job can be restored from a checkpoint.
- However, if the calculation logic of a job is modified, the job cannot be restored from a checkpoint.
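One way to check the second principle before a restart is to compare a fingerprint of the job definition recorded earlier with a fingerprint of the current definition. The sketch below is an assumption-based illustration, not a DLI or Flink feature: the function names are hypothetical, and it treats the SQL text and the parallelism as the parts of the definition that affect the calculation logic.

```python
import hashlib
import json


def job_fingerprint(sql_text, parallelism):
    """Hash the parts of a job definition assumed to affect its logic."""
    # Collapse whitespace so that pure reformatting does not change the hash.
    # Caveat: this also collapses whitespace inside SQL string literals.
    normalized_sql = " ".join(sql_text.split())
    payload = json.dumps(
        {"sql": normalized_sql, "parallelism": parallelism}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def logic_unchanged(before, after):
    """Return True if both job definitions produce the same fingerprint."""
    return job_fingerprint(**before) == job_fingerprint(**after)
```

For example, `logic_unchanged({"sql_text": "SELECT a + b FROM t", "parallelism": 2}, {"sql_text": "SELECT a * b FROM t", "parallelism": 2})` returns `False`, which suggests that the job should not be expected to restore from its checkpoint.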
Application Scenarios
Table 1 lists some common scenarios of restoring data from a checkpoint for your reference.
For more scenarios, refer to Principles for Restoration from Checkpoints and assess whether data can be restored from a checkpoint based on the actual situation.
Scenario | Restoration from a Checkpoint | Description |
---|---|---|
Adjust the number of concurrent tasks. | Not supported | This operation alters the parallelism of the job, thereby changing its execution logic. |
Modify Flink SQL statements or Flink Jar jobs. | Not supported | This operation changes the computation logic of the job. For example, if the original logic performs addition and subtraction but the modified logic requires multiplication, division, and modulo operations, the job cannot be restored directly from the checkpoint. |
Modify the static stream graph. | Not supported | This operation changes the computation logic of the job. |
Modify the CU(s) per TM parameter. | Supported | Changing compute resources does not affect the logic of the job's algorithm or operators. |
A job runs abnormally or there is a physical power outage. | Supported | The job parameters are not modified. |
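The table above can also be encoded as data and used for a quick pre-restart check. The sketch below is illustrative only: the scenario labels and the `can_restore` function are hypothetical names chosen for this example. Any change that is not listed in Table 1 is flagged for manual assessment.

```python
# Table 1 encoded as data: scenario label -> whether restoration is supported.
# The labels are hypothetical and chosen for this sketch only.
TABLE_1 = {
    "change_parallelism": False,
    "modify_sql_or_jar": False,
    "modify_static_stream_graph": False,
    "modify_cu_per_tm": True,
    "abnormal_exit_or_power_outage": True,
}


def can_restore(changes):
    """Return (verdict, details) for a collection of scenario labels.

    verdict is "supported", "not supported", or "assess manually".
    """
    blocking = sorted(c for c in changes if TABLE_1.get(c) is False)
    unknown = sorted(c for c in changes if c not in TABLE_1)
    if blocking:
        # One unsupported change is enough to rule out restoration.
        return "not supported", blocking
    if unknown:
        # Not covered by Table 1: fall back to the principles above.
        return "assess manually", unknown
    return "supported", []
```

For example, `can_restore(["modify_cu_per_tm", "change_parallelism"])` returns `("not supported", ["change_parallelism"])`, because a single unsupported change is enough to prevent restoration.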