How Can I Check if a Flink Job Can Be Restored From a Checkpoint After Restarting It?

What Is Restoration from a Checkpoint?

Flink's checkpointing is a fault tolerance and recovery mechanism. It allows real-time programs to recover automatically from exceptions or machine failures that occur at runtime.

Principles for Restoration from Checkpoints

When a job fails or a resource restarts because of an exception rather than a manual operation, the job can be restored from a checkpoint.
However, if the calculation logic of a job is modified, the job cannot be restored from a checkpoint.

Application Scenarios

Table 1 lists some common scenarios of restoring data from a checkpoint for your reference.

For more scenarios, refer to Principles for Restoration from Checkpoints and assess whether data can be restored from a checkpoint based on the actual situation.

**Table 1** Common scenarios of restoring data from a checkpoint
| Scenario | Restoration from a Checkpoint | Description |
| --- | --- | --- |
| Change (for example, increase) the number of concurrent tasks. | Not supported | This operation alters the parallelism of the job, thereby changing its execution logic. |
| Modify the Flink SQL statements or the Flink Jar job. | Not supported | This operation changes the computation logic of the job. For example, if the original logic performs addition and subtraction but the modified logic performs multiplication, division, and modulo operations, the job cannot be restored directly from the checkpoint. |
| Modify the static stream graph. | Not supported | This operation changes the computation logic of the job. |
| Modify the CU(s) per TM parameter. | Supported | Changing compute resources does not affect the logic of the job's algorithm or operators. |
| A job runs abnormally or there is a physical power outage. | Supported | The job parameters and algorithm logic are not modified. |
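
The rule behind Table 1 can be written as a small decision helper. The following is only an illustrative Python sketch, not part of DLI or Flink: the `JobChange` class, its field names, and `can_restore_from_checkpoint` are hypothetical names introduced here to mirror the rows of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobChange:
    """Hypothetical summary of what changed between two runs of a job."""
    parallelism_changed: bool = False    # number of concurrent tasks changed
    sql_or_jar_changed: bool = False     # Flink SQL statements or Jar job modified
    stream_graph_changed: bool = False   # static stream graph modified
    cu_per_tm_changed: bool = False      # "CU(s) per TM" parameter modified


def can_restore_from_checkpoint(change: JobChange) -> bool:
    """Return True if, following Table 1, restoration is expected to be supported."""
    # Per Table 1, each of these changes alters the job's logic,
    # so restoration from a checkpoint is not supported.
    logic_changed = (
        change.parallelism_changed
        or change.sql_or_jar_changed
        or change.stream_graph_changed
    )
    # cu_per_tm_changed is intentionally ignored: Table 1 lists a
    # resource-only change as supported.
    return not logic_changed
```

A restart after a failure with nothing modified corresponds to `JobChange()`, for which the helper returns `True`. Scenarios not listed in Table 1 still need to be assessed against the principles above.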

Related Operation: How Do I Restore a Job from a Checkpoint?

Because Flink checkpoints and savepoints are generated by the same mechanism and use the same format, you can restore a Flink job from the latest successful checkpoint stored in OBS by importing that checkpoint as a savepoint. In the Flink job list, locate the desired Flink job, click More in the Operation column, and select Import Savepoint to import the checkpoint. The detailed steps are as follows:

1. Log in to the DLI console. In the navigation pane on the left, choose Job Management > Flink Jobs.
2. Locate the row that contains the target Flink job and choose More > Import Savepoint in the Operation column.
3. In the displayed dialog box, select the OBS bucket path storing the checkpoint. The checkpoint save path is Bucket name/jobs/checkpoint/Directory starting with the job ID. Click OK.
4. Restart the Flink job. The job will be restored from the checkpoint path.
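
If you already have a listing of the object keys in the bucket, the checkpoint path to select can be narrowed down programmatically. The following Python sketch is a hedged illustration only. It assumes the standard open-source Flink layout, in which each completed checkpoint is a `chk-<n>` directory containing a `_metadata` file. Whether DLI uses exactly this layout under the job directory is an assumption, and `latest_checkpoint_path` is a hypothetical helper that works on a plain list of keys rather than calling any OBS API.

```python
import re
from typing import Iterable, Optional

# Path fragment taken from the procedure above; keys are relative to the bucket.
CHECKPOINT_ROOT = "jobs/checkpoint/"

# Assumption: completed checkpoints look like ".../chk-<n>/_metadata"
# (standard Flink layout; not confirmed for DLI by this document).
_COMPLETED = re.compile(r"^(?P<dir>.+/chk-(?P<n>\d+))/_metadata$")


def latest_checkpoint_path(object_keys: Iterable[str], job_id: str) -> Optional[str]:
    """Return the chk-<n> directory with the highest n for the given job, or None."""
    # The job directory "starts with the job ID", so a prefix match is used.
    prefix = f"{CHECKPOINT_ROOT}{job_id}"
    best_n, best_dir = -1, None
    for key in object_keys:
        if not key.startswith(prefix):
            continue
        match = _COMPLETED.match(key)
        # Directories without a _metadata file are skipped as incomplete.
        if match and int(match.group("n")) > best_n:
            best_n, best_dir = int(match.group("n")), match.group("dir")
    return best_dir
```

Because the match is a prefix match, a job ID that is itself a prefix of another job ID would match both directories, so verify the returned path before selecting it in the Import Savepoint dialog box.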
