Configuring the FlinkServer Job Restart Policy

Overview

Flink supports different restart policies to control whether and how to restart a job when a fault occurs. If no restart policy is specified, the cluster uses the default restart policy. You can also specify a restart policy when submitting a job. For details about how to configure such a policy on the job development page of MRS 3.1.0 or later, see Creating a FlinkServer Job.

The restart policy can be specified by configuring the restart-strategy parameter in the Flink configuration file Client installation directory/Flink/flink/conf/flink-conf.yaml or can be dynamically specified in the application code. The configuration takes effect globally. Restart policies include failure-rate and the following two default policies:

No restart: If CheckPoint is not enabled, this policy is used by default.
Fixed-delay: If CheckPoint is enabled but no restart policy is configured, this policy is used by default.

No restart Policy

When a fault occurs, the job fails and does not attempt to restart.

Configure the parameter as follows:

restart-strategy: none

fixed-delay Policy

When a fault occurs, the job attempts to restart for a fixed number of times. If the number of attempts exceeds the times you specified, the job fails. The restart policy waits for a fixed period of time between two consecutive restart attempts.

In the following example, a job fails if the job attempts to restart for three times at an interval of 10 seconds. Configure the parameters as follows:

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

failure-rate Policy

When a job fails, the job restarts directly. If the failure rate exceeds the value you configured, the job is considered as failed. The restart policy waits for a fixed period of time between two consecutive restart attempts.

In the following example, a job is considered as failed if the job attempts to restart for three times at an interval of 10 minutes. Configure the parameters as follows:

restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 10 min
restart-strategy.failure-rate.delay: 10 s

Choosing a Restart Policy

If you do not want to retry a failed job, select the No restart policy.
To retry a failed job, select the failure-rate policy. If the fixed-delay policy is used, the number of job failures may reach the maximum number of retries due to hardware faults such as network and memory faults. As a result, the job fails.
To prevent repeated restarts when the failure-rate policy is used, configure parameters as follows:
```
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 10 min
restart-strategy.failure-rate.delay: 10 s
```

Parent topic: Managing FlinkServer Jobs

Previous topic: Managing FlinkServer Jobs

Next topic: Using UDFs in FlinkServer Jobs

Feedback

Was this page helpful?

Helpful Not helpful

Provide feedback

Thank you very much for your feedback. We will continue working to improve the documentation.

The system is busy. Please try again later.