Updated on 2025-08-22 GMT+08:00

Configuring the Flink Job Restart Policy

Scenario

Flink provides several restart policies that control whether and how a job is restarted after a fault. Selecting a policy that suits the application scenario improves the stability and reliability of Flink jobs, helps them recover promptly when a fault occurs, and avoids unnecessary resource consumption.

The restart policies include No restart, fixed-delay, and failure-rate.

  • No restart: If a fault occurs, the job fails and does not attempt to restart.
  • fixed-delay: If a fault occurs, the job is restarted until the maximum number of restart attempts is reached, after which the job fails. The policy waits for a fixed period of time between two consecutive restart attempts.
  • failure-rate: If a fault occurs, the job is restarted until the configured failure rate (a maximum number of failures within a time interval) is exceeded, after which the job fails. The policy waits for a fixed period of time between two consecutive restart attempts.
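The decision each policy makes on a failure can be sketched in a few lines of Python. This is an illustrative model only, not Flink's implementation: the class names and the on_failure method are hypothetical, and the exact boundary behavior (for example, whether the Nth failure in an interval is still tolerated) is an assumption that may differ from the Flink version in use.

```python
from collections import deque


class NoRestart:
    """Never restarts: the first failure is final."""

    def on_failure(self, now_s: float) -> bool:
        return False


class FixedDelay:
    """Allows at most `attempts` restarts (assumed to be counted over the job's lifetime)."""

    def __init__(self, attempts: int, delay_s: float):
        self.attempts = attempts
        self.delay_s = delay_s  # fixed wait before each restart
        self.used = 0

    def on_failure(self, now_s: float) -> bool:
        self.used += 1
        return self.used <= self.attempts


class FailureRate:
    """Allows restarts while the failures inside the interval stay within the limit."""

    def __init__(self, max_failures: int, interval_s: float, delay_s: float):
        self.max_failures = max_failures
        self.interval_s = interval_s
        self.delay_s = delay_s  # fixed wait before each restart
        self.recent = deque()

    def on_failure(self, now_s: float) -> bool:
        self.recent.append(now_s)
        # Forget failures that are older than the measurement interval.
        while now_s - self.recent[0] > self.interval_s:
            self.recent.popleft()
        return len(self.recent) <= self.max_failures
```

In this model, on_failure returns True when the job may be restarted (after waiting delay_s) and False when the job fails for good.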

A restart policy configured on FlinkServer takes precedence over the one on the Flink configuration page. If no restart policy is configured on FlinkServer, the policy on the Flink configuration page takes effect.
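The precedence rule can be expressed as a small lookup. The following Python sketch assumes that a FlinkServer policy replaces the cluster-level policy as a whole rather than being merged key by key; the function name and the dictionary layout are hypothetical.

```python
from typing import Optional


def effective_policy(flinkserver_policy: Optional[dict], cluster_policy: dict) -> dict:
    """Return the FlinkServer policy if one is configured, else the cluster-level one."""
    if flinkserver_policy:
        return flinkserver_policy
    return cluster_policy


# Cluster-level policy from the Flink configuration page (values are examples).
cluster = {"restart-strategy": "fixed-delay"}
# Job-level policy set on FlinkServer (values are examples).
job = {"restart-strategy": "failure-rate"}

print(effective_policy(job, cluster)["restart-strategy"])   # failure-rate
print(effective_policy(None, cluster)["restart-strategy"])  # fixed-delay
```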

Selecting a Restart Policy

  • If you do not want to retry a failed job, select the No restart policy.
  • If you want to retry a failed job, the failure-rate policy is recommended. With the fixed-delay policy, failures caused by hardware faults (such as network or memory faults) can accumulate until the maximum number of retries is reached, at which point the job fails.
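A short worked example illustrates the difference. The Python sketch below assumes one transient fault per day for a week, and assumes that fixed-delay counts every failure against the retry limit regardless of how far apart the failures are. The function names and the daily-failure scenario are hypothetical and serve only as an illustration.

```python
DAY_S = 24 * 60 * 60


def survives_fixed_delay(failure_times_s, attempts):
    # Every failure consumes one attempt, however far apart the failures are.
    return len(failure_times_s) <= attempts


def survives_failure_rate(failure_times_s, max_failures, interval_s):
    # Only failures that cluster within one interval count against the job.
    times = sorted(failure_times_s)
    for i, start in enumerate(times):
        in_window = [t for t in times[i:] if t - start <= interval_s]
        if len(in_window) > max_failures:
            return False
    return True


# One transient fault per day for a week (timestamps in seconds).
daily_failures = [day * DAY_S for day in range(1, 8)]

print(survives_fixed_delay(daily_failures, attempts=3))  # False
print(survives_failure_rate(daily_failures, max_failures=3, interval_s=10 * 60))  # True
```

Under these assumptions the fixed-delay job stops on the fourth fault, while the failure-rate job keeps running because no 10-minute interval ever contains more than three failures.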

Configuring the Restart Policy on FlinkServer

  1. Log in to FusionInsight Manager as a user with FlinkServer administrator permissions. Choose Cluster > Services > Flink. On the right of Flink Web UI, click the link to access the Flink web UI.
  2. Click Job Management. The job management page is displayed.
  3. On the FlinkServer job management page, select an existing job or create a job by referring to Creating a Job, and go to the job development page.
  4. On the job development page, click Basic Parameter. In the failure recovery Policy area, select a policy from the drop-down list, and set the related parameters.

    • If you select fixed-delay, you need to configure Retry Times and Retry Interval (s).
    • If you select failure-rate, you need to configure Max Retry Times, Interval (min), and Retry Interval (s).
    • If you select none, no other parameters need to be set.

Configuring the Restart Policy on the Flink Configuration Page

  1. Log in to FusionInsight Manager and choose Cluster > Services > Flink > Configurations > All Configurations.
  2. Enter restart-strategy in the search box to view the restart policy configuration.

    1. No restart Policy

      When a fault occurs, the job fails and does not attempt to restart.

      Configure the parameter as follows:

      restart-strategy: none
    2. fixed-delay Policy

      If a fault occurs, the job will keep restarting until the maximum number of restart attempts is reached. Then, the job fails. The restart policy waits for a fixed period of time between two consecutive restart attempts.

      In the following example, the job is restarted at most three times, with a wait of 10 seconds between attempts. If the job still fails after the third restart attempt, it fails permanently:

      restart-strategy: fixed-delay
      restart-strategy.fixed-delay.attempts: 3
      restart-strategy.fixed-delay.delay: 10 s
    3. failure-rate Policy

      When a job fails, the job keeps restarting until the configured failure rate is exceeded. Then, the job fails. The restart policy waits for a fixed period of time between two consecutive restart attempts.

      In the following example, the job fails permanently once the failure rate of three failures within 10 minutes is exceeded. The wait between restart attempts is 10 seconds:

      restart-strategy: failure-rate 
      restart-strategy.failure-rate.max-failures-per-interval: 3 
      restart-strategy.failure-rate.failure-rate-interval: 10 min 
      restart-strategy.failure-rate.delay: 10 s

  3. If no restart policy is specified, the cluster uses the default restart policy. The default restart policy (fixed-delay policy) is as follows:

    restart-strategy: fixed-delay 
    restart-strategy.fixed-delay.attempts: 3 
    restart-strategy.fixed-delay.delay: 10 s

  4. To change the restart policy, select the required restart policy and click Save. Then choose Instances > Target instance name > More > Restart Instance. The new restart policy takes effect after the instance is restarted.
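Before saving, the configuration fragments shown in step 2 can be sanity-checked offline. The following Python sketch parses "key: value" lines and converts the duration values used in the examples ("10 s", "10 min") to seconds. It is a hypothetical helper, not part of Flink or FusionInsight Manager: it handles only the two units that appear in this section and only reports which policy-specific keys are not set explicitly.

```python
import re

# Only the units that appear in the examples in this section.
UNIT_SECONDS = {"s": 1, "min": 60}

# Keys used by each policy in the examples in this section.
POLICY_KEYS = {
    "none": [],
    "fixed-delay": [
        "restart-strategy.fixed-delay.attempts",
        "restart-strategy.fixed-delay.delay",
    ],
    "failure-rate": [
        "restart-strategy.failure-rate.max-failures-per-interval",
        "restart-strategy.failure-rate.failure-rate-interval",
        "restart-strategy.failure-rate.delay",
    ],
}


def parse_duration_s(text):
    """Convert a value such as '10 s' or '10 min' to seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]+)\s*", text)
    if not match or match.group(2) not in UNIT_SECONDS:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def parse_config(text):
    """Parse 'key: value' lines into a dictionary, ignoring blank lines."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


def unset_keys(config):
    """List the keys of the selected policy that are not set explicitly."""
    policy = config.get("restart-strategy")
    if policy not in POLICY_KEYS:
        raise ValueError(f"unknown restart policy: {policy!r}")
    return [key for key in POLICY_KEYS[policy] if key not in config]


example = """
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 10 min
restart-strategy.failure-rate.delay: 10 s
"""
config = parse_config(example)
print(unset_keys(config))  # []
print(parse_duration_s(config["restart-strategy.failure-rate.failure-rate-interval"]))  # 600
```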

Helpful Links

For details about how to create a FlinkServer job, see Creating a Job.