How Can I Check if a Flink Job Can Be Restored From a Checkpoint After Restarting It?
What Is Restoration from a Checkpoint?
Checkpointing is Flink's fault tolerance and recovery mechanism. It periodically saves job state so that a real-time job can recover automatically after an exception or machine failure at runtime.
Principles for Restoration from Checkpoints
- If a job fails or a resource restarts because of an exception rather than a manual operation, the job can be restored from a checkpoint.
- However, if the calculation logic of a job is modified, the job cannot be restored from a checkpoint.
Application Scenarios
Table 1 lists some common scenarios of restoring data from a checkpoint for your reference.
For more scenarios, refer to Principles for Restoration from Checkpoints and assess whether data can be restored from a checkpoint based on the actual situation.
Scenario | Restoration from a Checkpoint | Description |
---|---|---|
Adjust or increase the number of concurrent tasks. | Not supported | This operation changes the job's parallelism and therefore its execution logic. |
Modify Flink SQL statements or Flink Jar jobs. | Not supported | This operation changes the job's algorithmic logic. For example, if the original algorithm performs addition and subtraction but the modified job performs multiplication, division, and modulo operations, the job cannot be restored directly from the checkpoint. |
Modify the static stream graph. | Not supported | This operation changes the job's algorithmic logic. |
Modify the CU(s) per TM parameter. | Supported | Changing compute resources does not affect the logic of the job's algorithm or operators. |
A job runs abnormally or a physical power outage occurs. | Supported | Neither the job parameters nor the algorithm logic is modified. |
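The scenarios in Table 1 can be captured as a small lookup for use in a pre-restart check. The following is a minimal sketch, not part of DLI: the `Change` enum, the `RESTORABLE` mapping, and the `can_restore` function are hypothetical names introduced here for illustration, and the mapping only mirrors the rows of Table 1.

```python
from enum import Enum


class Change(Enum):
    """Hypothetical labels for the scenarios listed in Table 1."""
    PARALLELISM = "adjust or increase the number of concurrent tasks"
    JOB_LOGIC = "modify Flink SQL statements or Flink Jar jobs"
    STREAM_GRAPH = "modify the static stream graph"
    CU_PER_TM = "modify the CU(s) per TM parameter"
    FAILURE_ONLY = "job ran abnormally or power outage, nothing modified"


# True means Table 1 marks the scenario as "Supported".
RESTORABLE = {
    Change.PARALLELISM: False,
    Change.JOB_LOGIC: False,
    Change.STREAM_GRAPH: False,
    Change.CU_PER_TM: True,
    Change.FAILURE_ONLY: True,
}


def can_restore(changes):
    """Return True only if every listed change is marked as supported.

    An empty list means nothing was modified, which is treated as restorable.
    """
    return all(RESTORABLE[change] for change in changes)


if __name__ == "__main__":
    print(can_restore([Change.CU_PER_TM]))                    # True
    print(can_restore([Change.CU_PER_TM, Change.JOB_LOGIC]))  # False
```

A single unsupported change is enough to rule out restoration, which is why the check uses `all`. Scenarios outside Table 1 still need to be assessed against the principles above.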
Related Operation: How Do I Restore a Job from a Checkpoint?
Because Flink checkpoints and savepoints are generated by the same mechanism and share the same format, you can restore a Flink job from the latest successful checkpoint stored in OBS. To do so, locate the Flink job in the Flink job list, click More in the Operation column, and select Import Savepoint to import the checkpoint. The detailed steps are as follows:
- Log in to the DLI console. In the navigation pane on the left, choose Job Management > Flink Jobs.
- Locate the row that contains the target Flink job and choose More > Import Savepoint in the Operation column.
- In the displayed dialog box, select the OBS bucket path where the checkpoint is stored. Checkpoints are saved under Bucket name/jobs/checkpoint/, in the directory whose name starts with the job ID. Click OK.
- Restart the Flink job. The job will be restored from the checkpoint path.
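When a bucket holds checkpoints for many jobs, it can help to narrow down the directory to select in the dialog box. The following is a minimal sketch that works on a plain list of directory names. It does not call OBS or DLI. The function name `checkpoint_candidates`, the bucket name, and the directory names in the example are hypothetical. Only the `Bucket name/jobs/checkpoint/` layout and the "starts with the job ID" rule come from the steps above.

```python
def checkpoint_candidates(bucket, job_id, directory_names):
    """Return candidate checkpoint paths for a job.

    bucket: OBS bucket name (assumed to be known by the caller).
    job_id: ID of the Flink job, as a number or string.
    directory_names: names of the directories found under
        <bucket>/jobs/checkpoint/, obtained by any means.
    """
    prefix = str(job_id)
    return [
        f"{bucket}/jobs/checkpoint/{name}"
        for name in sorted(directory_names)
        if name.startswith(prefix)
    ]


if __name__ == "__main__":
    # Hypothetical directory names, for illustration only.
    names = ["2048_def", "1024_abc"]
    print(checkpoint_candidates("my-bucket", 1024, names))
    # ['my-bucket/jobs/checkpoint/1024_abc']
```

A prefix match can return more than one directory, for example when one job ID is the leading part of another, so confirm the result against the job ID on the console before importing.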