Updated on 2024-12-11 GMT+08:00

Configuring the FlinkServer Job Restart Policy

Overview

Flink supports different restart policies to control whether and how to restart a job when a fault occurs. If no restart policy is specified, the cluster uses the default restart policy. You can also specify a restart policy when submitting a job. For details about how to configure such a policy on the job development page of MRS 3.1.0 or later, see Creating a FlinkServer Job.

The restart policy can be specified by configuring the restart-strategy parameter in the Flink configuration file Client installation directory/Flink/flink/conf/flink-conf.yaml or can be dynamically specified in the application code. The configuration takes effect globally. Restart policies include failure-rate and the following two default policies:

  • No restart: If CheckPoint is not enabled, this policy is used by default.
  • Fixed-delay: If CheckPoint is enabled but no restart policy is configured, this policy is used by default.

No restart Policy

When a fault occurs, the job fails and does not attempt to restart.

Configure the parameter as follows:

restart-strategy: none

fixed-delay Policy

When a fault occurs, the job attempts to restart for a fixed number of times. If the number of attempts exceeds the times you specified, the job fails. The restart policy waits for a fixed period of time between two consecutive restart attempts.

In the following example, a job fails if the job attempts to restart for three times at an interval of 10 seconds. Configure the parameters as follows:

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

failure-rate Policy

When a job fails, the job restarts directly. If the failure rate exceeds the value you configured, the job is considered as failed. The restart policy waits for a fixed period of time between two consecutive restart attempts.

In the following example, a job is considered as failed if the job attempts to restart for three times at an interval of 10 minutes. Configure the parameters as follows:

restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 10 min
restart-strategy.failure-rate.delay: 10 s

Choosing a Restart Policy

  • If you do not want to retry a failed job, select the No restart policy.
  • To retry a failed job, select the failure-rate policy. If the fixed-delay policy is used, the number of job failures may reach the maximum number of retries due to hardware faults such as network and memory faults. As a result, the job fails.

    To prevent repeated restarts when the failure-rate policy is used, configure parameters as follows:

    restart-strategy: failure-rate
    restart-strategy.failure-rate.max-failures-per-interval: 3
    restart-strategy.failure-rate.failure-rate-interval: 10 min
    restart-strategy.failure-rate.delay: 10 s