Configuring the FlinkServer Job Restart Policy

FlinkServer Job Restart Policies

Flink supports different restart policies to control whether and how to restart a job when a fault occurs. If no restart policy is specified, the cluster uses the default restart policy. You can also specify a restart policy when submitting a job. For details about how to configure such a policy on the job development page of MRS 3.1.0 or later, see Creating a FlinkServer Job.

The restart policy can be specified by setting the restart-strategy parameter in the Flink configuration file Client installation directory/Flink/flink/conf/flink-conf.yaml, or it can be specified dynamically in the application code. A policy set in the configuration file takes effect globally. Restart policies include failure-rate and the following two default policies:

- No restart: If CheckPoint is not enabled, this policy is used by default.
- Fixed-delay: If CheckPoint is enabled but no restart policy is configured, this policy is used by default.
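
The default rule above can be sketched in a few lines of Python. This is only an illustration of the selection logic described in this section, not FlinkServer code; the function name effective_restart_policy and its arguments are hypothetical.

```python
from typing import Optional

# Policy names as used by the restart-strategy parameter in this section.
VALID_POLICIES = {"none", "fixed-delay", "failure-rate"}


def effective_restart_policy(checkpoint_enabled: bool,
                             configured: Optional[str] = None) -> str:
    """Return the policy that applies (hypothetical helper, for illustration).

    An explicitly configured policy wins. Otherwise the default depends on
    whether CheckPoint is enabled, as described in the text above.
    """
    if configured is not None:
        if configured not in VALID_POLICIES:
            raise ValueError(f"unknown restart policy: {configured}")
        return configured
    return "fixed-delay" if checkpoint_enabled else "none"
```

For example, effective_restart_policy(True) selects fixed-delay, while effective_restart_policy(True, "failure-rate") keeps the explicitly configured policy.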

No restart Policy

When a fault occurs, the job fails and does not attempt to restart.

Configure the parameter as follows:

restart-strategy: none

fixed-delay Policy

When a fault occurs, the job attempts to restart a fixed number of times. If the number of restart attempts exceeds the number you specified, the job fails. The restart policy waits for a fixed period of time between two consecutive restart attempts.

In the following example, the job attempts to restart up to three times, waiting 10 seconds between attempts, and fails if it cannot recover within those attempts. Configure the parameters as follows:

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
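
To make the behavior of these values concrete, the following Python sketch models the fixed-delay policy as described above. It is a simplified model and not Flink's implementation; the function name fixed_delay_outcome and the state labels are assumptions made for this illustration.

```python
from typing import Tuple


def fixed_delay_outcome(failures: int, attempts: int = 3,
                        delay_s: float = 10.0) -> Tuple[str, int, float]:
    """Model the fixed-delay policy (simplified, hypothetical helper).

    failures: how many times the job fails in total.
    attempts: value of restart-strategy.fixed-delay.attempts.
    delay_s:  value of restart-strategy.fixed-delay.delay, in seconds.

    Returns (final_state, restarts_used, total_wait_seconds).
    """
    # Each failure consumes one restart attempt until none are left.
    restarts_used = min(failures, attempts)
    # One more failure than the allowed attempts ends the job.
    final_state = "FAILED" if failures > attempts else "RUNNING"
    return final_state, restarts_used, restarts_used * delay_s
```

With the example values, three failures leave the job running after 30 seconds of accumulated waiting, and a fourth failure makes the job fail.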

failure-rate Policy

When a fault occurs, the job is restarted. If the failure rate (the number of failures within a time interval) exceeds the value you configured, the job is considered failed. The restart policy waits for a fixed period of time between two consecutive restart attempts.

In the following example, the job is allowed at most three failures within a 10-minute interval. If that limit is exceeded, the job is considered failed. The job waits 10 seconds between two consecutive restart attempts. Configure the parameters as follows:

restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 10 min
restart-strategy.failure-rate.delay: 10 s
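
The difference from fixed-delay is that failures are counted within a time window rather than over the whole lifetime of the job. The following Python sketch models this with a sliding window. It is a simplified model under stated assumptions, not Flink's implementation: the function name failure_rate_outcome is hypothetical, and treating the job as failed only when the count exceeds the limit follows the wording of this section, so verify the exact boundary behavior on your Flink version.

```python
from collections import deque
from typing import Iterable


def failure_rate_outcome(failure_times_s: Iterable[float],
                         max_failures: int = 3,
                         interval_s: float = 600.0) -> str:
    """Model the failure-rate policy (simplified, hypothetical helper).

    failure_times_s: failure timestamps in seconds, in ascending order.
    max_failures:    restart-strategy.failure-rate.max-failures-per-interval.
    interval_s:      restart-strategy.failure-rate.failure-rate-interval.
    """
    recent = deque()  # failures that still fall inside the window
    for t in failure_times_s:
        recent.append(t)
        # Forget failures that are older than the measuring interval.
        while t - recent[0] > interval_s:
            recent.popleft()
        # Assumption: the job fails once the limit is exceeded.
        if len(recent) > max_failures:
            return "FAILED"
    return "RUNNING"
```

With the example values, four failures one minute apart make the job fail, whereas failures spaced five minutes apart never accumulate more than three in any 10-minute window, so the job keeps restarting.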

Selecting a Restart Policy

- If you do not want to retry a failed job, select the No restart policy.
- To retry a failed job, select the failure-rate policy. With the fixed-delay policy, transient hardware faults such as network and memory faults can accumulate until the number of failures reaches the maximum number of retries, at which point the job fails permanently.
- To prevent endless restarts when the failure-rate policy is used, configure the parameters as follows:
```
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 10 min
restart-strategy.failure-rate.delay: 10 s
```

Creating a FlinkServer Job

The statements of a SQL job submitted on FlinkServer are saved to DBServer. In MRS 3.5.0 and later versions, FlinkServer encrypts the stored SQL statements by default to protect sensitive information. When a Flink SQL statement is displayed again on the FlinkServer web UI, the password field in the statement is left blank, so you must enter the password again before you submit the job. For a custom connector, the name of the password field must contain the keyword password to prevent the password from being displayed on the page.

Disabling encrypted SQL storage may lead to password leaks. You are advised to retain the default setting. If you still need to disable the function, perform the following operations:

(Optional) Back up jobs and then delete all jobs. For details about how to back up and import jobs, see Importing and Exporting FlinkServer Job Information.
Change the value of ENABLE_DB_ENCRYPT to false: Log in to the active and standby FlinkServer nodes, set ENABLE_DB_ENCRYPT in the $BIGDATA_HOME/FusionInsight_Flink_x.x.x/x_x_FlinkServer/etc/flinkserver_service.properties file to false, save the file, and exit.
Restart the affected FlinkServer instances: On FusionInsight Manager, choose Cluster > Services > Flink > Instances, select all FlinkServer instances, click More, and select Restart Instance.

Access the Flink web UI. For details, see Accessing the FlinkServer Web UI.
Click Job Management. The job management page is displayed.
Click Create Job. Create a Flink SQL job or Flink Jar job, enter job information, and click OK. The job is created and the job development page is displayed.

(Optional) To develop a job immediately, configure the job on the job development page.

The system allows you to add a lock to a job. The user who locks a job has all permissions on the job. Other users cannot develop, start, or delete the locked job, but they can forcibly acquire the lock to obtain all permissions. When this function is enabled, you can click Lock or Unlock to lock or unlock a job, or click Acquire Lock to obtain job permissions.

Job locks are enabled by default. You can view the status of this function on FusionInsight Manager. This function is available only in MRS 3.3.0 or later.

Log in to FusionInsight Manager, choose Cluster > Services > Flink, click Configurations and then All Configurations, and search for the job.edit.lock.enable parameter. If the parameter value is true, the function is enabled. If the parameter value is false, the function is disabled.

Creating a Flink SQL job

Develop the job on the job development page.
Figure 1 FlinkServer job development page
Click Check Semantic to check the input content and click Format SQL to format SQL statements.

Set basic and customized parameters as required by referring to Table 1 and click Save.

**Table 1** Basic parameters
| Parameter | Description |
| --- | --- |
| Parallelism | Number of concurrent tasks of each job operator. |
| Maximum Operator Parallelism | Maximum degree of parallelism of operators. |
| JobManager Memory (MB) | Memory of JobManager. The minimum value is 4096. |
| Submit Queue | Queue to which a job is submitted. If this parameter is not set, the default queue is used. |
| taskManager | taskManager running parameters. Slots: The default value is 1. You are advised to set this parameter to the number of CPU cores. Memory (MB): The minimum value is 4096. |
| Enable CheckPoint | Whether to enable CheckPoint. After CheckPoint is enabled, configure the following information. Time Interval (ms): This parameter is mandatory. Mode: This parameter is mandatory. The options are EXACTLY_ONCE and AT_LEAST_ONCE. Minimum Interval (ms): The minimum value is 10. Timeout Duration: The minimum value is 10. Maximum Parallelism: The value must be a positive integer containing a maximum of 64 characters. Whether to clean up: The value can be Yes or No. Whether to enable incremental checkpoints: The value can be Yes or No. |
| Failure Recovery Policy | Failure recovery policy of a job. The options are fixed-delay (configure Retry Times and Retry Interval (s)), failure-rate (configure Max Retry Times, Interval (min), and Retry Interval (s)), and none. For details, see Configuring the FlinkServer Job Restart Policy. |
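
The constraints listed in Table 1 can be checked before a job is saved. The following Python sketch shows one way to do this. It is not part of FlinkServer: the function name validate_sql_job_params and the dictionary keys are hypothetical and exist only for this illustration, and only the limits stated in the table are encoded.

```python
from typing import Dict, List


def validate_sql_job_params(params: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means all checks passed.

    The keys used here are hypothetical names for the Table 1 fields.
    """
    problems = []
    if params.get("jobmanager_memory_mb", 4096) < 4096:
        problems.append("JobManager Memory (MB) must be at least 4096")
    if params.get("taskmanager_memory_mb", 4096) < 4096:
        problems.append("taskManager Memory (MB) must be at least 4096")
    if params.get("checkpoint_enabled", False):
        # Time Interval (ms) and Mode are mandatory once CheckPoint is on.
        if "checkpoint_interval_ms" not in params:
            problems.append("Time Interval (ms) is mandatory")
        if params.get("checkpoint_mode") not in ("EXACTLY_ONCE", "AT_LEAST_ONCE"):
            problems.append("Mode must be EXACTLY_ONCE or AT_LEAST_ONCE")
        if params.get("checkpoint_min_interval_ms", 10) < 10:
            problems.append("Minimum Interval (ms) must be at least 10")
        if params.get("checkpoint_timeout", 10) < 10:
            problems.append("Timeout Duration must be at least 10")
    return problems
```

Returning every problem at once, instead of stopping at the first, lets a user correct the whole form in one pass.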

Click Submit in the upper left corner to submit the job.

Creating a Flink JAR job

Click Select to upload a local JAR file and set parameters by referring to Table 2 or add customized parameters.

**Table 2** Parameter configuration
| Parameter | Description |
| --- | --- |
| Local .jar File | Upload a local JAR file that is smaller than the threshold specified by flinkserver.upload.jar.max.size. The default threshold is 500 MB. To change it, log in to FusionInsight Manager, choose Cluster > Services > Flink > Configurations > All Configurations, search for flinkserver.upload.jar.max.size, and set the JAR file threshold. The value ranges from 100 MB to 5,120 MB. |
| Main Class | Main-Class type. Default: The class name is specified based on the Manifest file in the JAR file. Specify: Manually specify the class name. |
| Class Name | Class name. This parameter is available when Main Class is set to Specify. |
| Class Parameter | Class parameters of Main-Class (parameters are separated by spaces). |
| Parallelism | Number of concurrent tasks of each job operator. Appropriately increasing the value improves the overall computing performance of a job. Because additional threads increase switchover overheads, the maximum value is four times the number of SPUs used by the computing unit. One to two times the number of SPUs of the computing unit is optimal. |
| JobManager Memory (MB) | Memory of JobManager. The minimum value is 4096. |
| Submit Queue | Queue to which a job is submitted. If this parameter is not set, the default queue is used. |
| taskManager | taskManager running parameters. Slots: The default value is 1. You are advised to set this parameter to the number of CPU cores. Memory (MB): The minimum value is 4096. |
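
Two rules in Table 2 are numeric: the JAR file size threshold and the parallelism guidance. The following Python sketch expresses both. The function names jar_upload_allowed and parallelism_bounds are hypothetical helpers written for this illustration and are not FlinkServer APIs.

```python
from typing import Dict


def jar_upload_allowed(jar_size_mb: float, threshold_mb: int = 500) -> bool:
    """Check a JAR file size against flinkserver.upload.jar.max.size.

    The default threshold is 500 MB and the configurable range is
    100 MB to 5,120 MB, as stated in Table 2.
    """
    if not 100 <= threshold_mb <= 5120:
        raise ValueError("threshold must be between 100 and 5120 MB")
    # The file must be smaller than the threshold.
    return jar_size_mb < threshold_mb


def parallelism_bounds(spus: int) -> Dict[str, int]:
    """Derive the parallelism guidance from the number of SPUs."""
    if spus < 1:
        raise ValueError("spus must be a positive integer")
    return {
        "recommended_min": spus,      # one times the number of SPUs
        "recommended_max": 2 * spus,  # two times the number of SPUs
        "maximum": 4 * spus,          # four times the number of SPUs
    }
```

For a computing unit that uses 4 SPUs, this gives a recommended parallelism of 4 to 8 and a maximum of 16.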

Click Save to save the configuration and click Submit to submit the job.

Return to the job management page. You can view information about the created job, including job name, type, status, kind, and description.

After a job is created, you can start, develop, stop, edit, and delete the job, view job details, and rectify checkpoint faults in the Operation column of the job.
- To read files related to the submitted job on the node as another user, ensure that this user belongs to the same user group as the user who submitted the job and has been assigned a role with FlinkServer application management permissions, for example, a role for which application view is selected as described in Creating a FlinkServer Role.
- You can view details about jobs in the Running state.
- You can rectify checkpoint faults for jobs in the Running failed, Running succeeded, or Stop state.
- To specify whether the checkpoints of failed or canceled jobs are retained, log in to FusionInsight Manager, choose Cluster > Services > Flink, click Configurations and then All Configurations, and search for and set the execution.checkpointing.externalized-checkpoint-retention parameter of FlinkServer. The options are as follows:
  - DELETE_ON_CANCELLATION: Only checkpoints of failed jobs will be retained.
  - RETAIN_ON_CANCELLATION (default value in MRS 3.5.0 or later): Checkpoints of failed or canceled jobs will be retained.
  - NO_EXTERNALIZED_CHECKPOINTS (default value in MRS versions earlier than 3.5.0): Checkpoints of failed or canceled jobs will not be retained.
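
The three retention options can be summarized as a lookup table. The following Python sketch restates the descriptions above and the version-dependent default. The names RETENTION, checkpoint_retained, and default_retention are hypothetical and serve only as an illustration.

```python
from typing import Tuple

# For each option: is the checkpoint kept when the job ends in that state?
RETENTION = {
    "DELETE_ON_CANCELLATION": {"failed": True, "canceled": False},
    "RETAIN_ON_CANCELLATION": {"failed": True, "canceled": True},
    "NO_EXTERNALIZED_CHECKPOINTS": {"failed": False, "canceled": False},
}


def checkpoint_retained(mode: str, final_state: str) -> bool:
    """Return True if checkpoints are kept for a failed or canceled job."""
    return RETENTION[mode][final_state]


def default_retention(mrs_version: Tuple[int, int, int]) -> str:
    """Return the default option for an MRS version such as (3, 5, 0)."""
    if mrs_version >= (3, 5, 0):
        return "RETAIN_ON_CANCELLATION"
    return "NO_EXTERNALIZED_CHECKPOINTS"
```

For example, checkpoint_retained("DELETE_ON_CANCELLATION", "canceled") returns False, so a job canceled under that option cannot later be restored from its checkpoint.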