Updated on 2024-10-09 GMT+08:00

Configuring the FlinkServer Job Restart Policy

FlinkServer Job Restart Policies

Flink supports different restart policies to control whether and how to restart a job when a fault occurs. If no restart policy is specified, the cluster uses the default restart policy. You can also specify a restart policy when submitting a job. For details about how to configure such a policy on the job development page of MRS 3.1.0 or later, see Creating a FlinkServer Job.

You can specify the restart policy by configuring the restart-strategy parameter in the Flink configuration file Client installation directory/Flink/flink/conf/flink-conf.yaml, or specify it dynamically in the application code. A policy set in the configuration file takes effect globally. Restart policies include failure-rate and the following two default policies:

  • No restart: If CheckPoint is not enabled, this policy is used by default.
  • Fixed-delay: If CheckPoint is enabled but no restart policy is configured, this policy is used by default.
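
The default selection described above can be sketched as a small decision function. This is only an illustrative model, not FlinkServer code; the function name effective_restart_strategy and its arguments are hypothetical.

```python
def effective_restart_strategy(checkpoint_enabled, configured=None):
    """Return the name of the restart policy that applies to a job.

    Hypothetical helper that mirrors the defaults listed above:
    an explicitly configured policy wins; otherwise the default
    depends on whether CheckPoint is enabled.
    """
    if configured is not None:
        return configured
    return "fixed-delay" if checkpoint_enabled else "none"


if __name__ == "__main__":
    print(effective_restart_strategy(checkpoint_enabled=False))
    print(effective_restart_strategy(checkpoint_enabled=True))
    print(effective_restart_strategy(True, configured="failure-rate"))
```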

No restart Policy

When a fault occurs, the job fails and does not attempt to restart.

Configure the parameter as follows:

restart-strategy: none

fixed-delay Policy

When a fault occurs, the job attempts to restart a fixed number of times. If the number of attempts exceeds the value you specified, the job fails. The restart policy waits for a fixed period of time between two consecutive restart attempts.

In the following example, the job attempts to restart up to three times, waiting 10 seconds between attempts, and fails if it still cannot recover. Configure the parameters as follows:

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
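
To make the effect of these two parameters concrete, the following sketch models the fixed-delay behavior in plain Python. It is a simplified model under the assumption that every fault counts against the same attempt budget; simulate_fixed_delay is a hypothetical name, and the code does not call any Flink API.

```python
def simulate_fixed_delay(faults, attempts=3, delay_s=10):
    """Model the fixed-delay policy for a given number of faults.

    faults   -- number of faults the job hits, one after another
    attempts -- value of restart-strategy.fixed-delay.attempts
    delay_s  -- value of restart-strategy.fixed-delay.delay, in seconds

    Returns (state, restarts_used, seconds_spent_waiting).
    """
    restarts = 0
    waited = 0
    for _ in range(faults):
        if restarts >= attempts:
            # The attempt budget is used up, so this fault fails the job.
            return ("FAILED", restarts, waited)
        restarts += 1
        waited += delay_s  # fixed wait before each restart attempt
    return ("RUNNING", restarts, waited)


if __name__ == "__main__":
    print(simulate_fixed_delay(3))  # ('RUNNING', 3, 30)
    print(simulate_fixed_delay(4))  # ('FAILED', 3, 30)
```

With the example values, three faults are absorbed at a cost of 30 seconds of waiting, and a fourth fault fails the job.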

failure-rate Policy

When a job fails, it restarts directly. If the failure rate (the number of failures within a time interval) exceeds the value you configured, the job is considered failed. The restart policy waits for a fixed period of time between two consecutive restart attempts.

In the following example, at most three failures are tolerated within any 10-minute interval; once that limit is exceeded, the job is considered failed. The job waits 10 seconds between two consecutive restart attempts. Configure the parameters as follows:

restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 10 min
restart-strategy.failure-rate.delay: 10 s
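
The difference from fixed-delay is that only recent failures count. The sketch below models this with a sliding time window. It is a simplified illustration, not Flink's implementation: simulate_failure_rate is a hypothetical name, and treating the job as failed only when the count exceeds the limit is an assumption taken from the wording above, so the exact boundary in a real cluster may differ.

```python
from collections import deque


def simulate_failure_rate(failure_times_s, max_failures=3, interval_s=600):
    """Model the failure-rate policy over a list of failure timestamps.

    failure_times_s -- ascending times (in seconds) at which faults occur
    max_failures    -- restart-strategy.failure-rate.max-failures-per-interval
    interval_s      -- restart-strategy.failure-rate.failure-rate-interval, in seconds

    Returns the time at which the job is considered failed, or None if
    every fault can be handled by a restart.
    """
    window = deque()
    for t in failure_times_s:
        window.append(t)
        # Forget failures that are older than the measuring interval.
        while t - window[0] >= interval_s:
            window.popleft()
        if len(window) > max_failures:
            return t
    return None


if __name__ == "__main__":
    # Four faults within three minutes exceed the limit of three.
    print(simulate_failure_rate([0, 60, 120, 180]))
    # Five faults spread over more than 20 minutes never exceed it.
    print(simulate_failure_rate([0, 300, 700, 1000, 1400]))
```

Under these assumptions, a long-running job that fails occasionally keeps restarting, while a burst of failures in a short period stops it.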

Selecting a Restart Policy

  • If you do not want to retry a failed job, select the No restart policy.
  • To retry a failed job, select the failure-rate policy. With the fixed-delay policy, hardware faults such as network and memory faults can push the number of job failures up to the maximum number of retries, at which point the job fails.

    To prevent repeated restarts when the failure-rate policy is used, configure parameters as follows:

    restart-strategy: failure-rate
    restart-strategy.failure-rate.max-failures-per-interval: 3
    restart-strategy.failure-rate.failure-rate-interval: 10 min
    restart-strategy.failure-rate.delay: 10 s
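
Before applying such a configuration, it can help to check that the keys belonging to the selected policy are all present. The following sketch parses simple "key: value" lines in the style of flink-conf.yaml and reports which of the keys shown in this section are missing. The helper names and the EXPECTED_KEYS mapping are hypothetical; Flink itself may fall back to built-in defaults for omitted keys.

```python
# Keys shown in this section for each policy (illustrative mapping).
EXPECTED_KEYS = {
    "none": [],
    "fixed-delay": [
        "restart-strategy.fixed-delay.attempts",
        "restart-strategy.fixed-delay.delay",
    ],
    "failure-rate": [
        "restart-strategy.failure-rate.max-failures-per-interval",
        "restart-strategy.failure-rate.failure-rate-interval",
        "restart-strategy.failure-rate.delay",
    ],
}


def parse_conf(text):
    """Parse flat 'key: value' lines; skip blanks and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def missing_restart_keys(conf):
    """Return the keys of the selected policy that are not set in conf."""
    strategy = conf.get("restart-strategy")
    if strategy not in EXPECTED_KEYS:
        raise ValueError(f"unknown restart strategy: {strategy!r}")
    return [key for key in EXPECTED_KEYS[strategy] if key not in conf]


if __name__ == "__main__":
    sample = """
    restart-strategy: failure-rate
    restart-strategy.failure-rate.max-failures-per-interval: 3
    restart-strategy.failure-rate.failure-rate-interval: 10 min
    """
    print(missing_restart_keys(parse_conf(sample)))
```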

Creating a FlinkServer Job

  1. Access the Flink web UI. For details, see Accessing the FlinkServer Web UI.
  2. Click Job Management. The job management page is displayed.
  3. Click Create Job. Create a Flink SQL job or Flink Jar job, enter job information, and click OK. The job is created and the job development page is displayed.
  4. (Optional) To develop a job immediately, configure the job on the job development page.

    The system allows you to lock a job. The user who locks a job has all permissions on it. Other users cannot develop, start, or delete the locked job unless they forcibly acquire the lock, which grants them all permissions. When this function is enabled, you can click Lock or Unlock for a job, or click Acquire Lock to obtain job permissions.

    Job locks are enabled by default. You can view the status of this function on FusionInsight Manager. This topic is available for MRS 3.3.0 or later only.

    Log in to FusionInsight Manager, choose Cluster > Service > Flink, click Configuration and then All Configurations, and search for the job.edit.lock.enable parameter. If the parameter value is true, the function is enabled. If the parameter value is false, the function is disabled.

    • Creating a Flink SQL job
      1. Develop the job on the job development page.
        Figure 1 FlinkServer job development page

      2. Click Check Semantic to check the input content and click Format SQL to format SQL statements.
      3. Set basic and customized parameters as required by referring to Table 1 and click Save.
        Table 1 Basic parameters

        | Parameter | Description |
        | --- | --- |
        | Parallelism | Number of parallel jobs. |
        | Maximum Operator Parallelism | Maximum degree of parallelism of operators. |
        | JobManager Memory (MB) | Memory of JobManager. The minimum value is 4096. |
        | Submit Queue | Queue to which a job is submitted. If this parameter is not set, the default queue is used. |
        | taskManager | taskManager running parameters. Slots: The default value is 1; you are advised to set this parameter to the number of CPU cores. Memory (MB): The minimum value is 4096. |
        | Enable CheckPoint | Whether to enable CheckPoint. After CheckPoint is enabled, you need to configure the following information: Time Interval (ms) (mandatory); Mode (mandatory; the options are EXACTLY_ONCE and AT_LEAST_ONCE); Minimum Interval (ms) (the minimum value is 10); Timeout Duration (the minimum value is 10); Maximum Parallelism (a positive integer containing a maximum of 64 characters); Whether to clean up (Yes or No); Whether to enable incremental checkpoints (Yes or No). |
        | Failure Recovery Policy | Failure recovery policy of a job. For details, see Configuring the FlinkServer Job Restart Policy. The options are fixed-delay (you need to configure Retry Times and Retry Interval (s)), failure-rate (you need to configure Max Retry Times, Interval (min), and Retry Interval (s)), and none. |
      4. Click Submit in the upper left corner to submit the job.
    • Creating a Flink JAR job
      1. Click Select to upload a local JAR file and set parameters by referring to Table 2 or add customized parameters.
        Table 2 Parameter configuration

        | Parameter | Description |
        | --- | --- |
        | Local .jar File | Upload a local JAR file smaller than the threshold specified by flinkserver.upload.jar.max.size. The default value is 500 MB. To set the JAR file threshold, log in to FusionInsight Manager, choose Cluster > Services > Flink > Configurations > All Configurations, and search for flinkserver.upload.jar.max.size. The value ranges from 100 MB to 5,120 MB. |
        | Main Class | Main-Class type. Default: The class name is specified based on the Manifest file in the JAR file. Specify: Manually specify the class name. |
        | Class Name | Class name. This parameter is available when Main Class is set to Specify. |
        | Class Parameter | Class parameters of Main-Class (parameters are separated by spaces). |
        | Parallelism | Number of parallel jobs (concurrent tasks of each job operator). Appropriately increasing the value improves the overall computing performance of a job. Because additional threads add switchover overhead, the maximum value is four times the number of SPUs used by the computing unit. One to two times the number of SPUs of the computing unit is optimal. |
        | JobManager Memory (MB) | Memory of JobManager. The minimum value is 4096. |
        | Submit Queue | Queue to which a job is submitted. If this parameter is not set, the default queue is used. |
        | taskManager | taskManager running parameters. Slots: The default value is 1; you are advised to set this parameter to the number of CPU cores. Memory (MB): The minimum value is 4096. |
      2. Click Save to save the configuration and click Submit to submit the job.

  5. Return to the job management page. You can view information about the created job, including job name, type, status, kind, and description.

    After a job is created, you can start, develop, stop, edit, and delete the job, view job details, and rectify checkpoint faults in the Operation column of the job.

    • To read files related to the submitted job on the node as another user, ensure that this user belongs to the same user group as the user who submitted the job and has been assigned a FlinkServer application management role. For example, select application view when creating the role by referring to Creating a FlinkServer Role.
    • You can view details about jobs in the Running state.
    • You can rectify checkpoint faults for jobs in the Running failed, Running succeeded, or Stop state.
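
The minimum values stated for the job parameters in Table 1 and Table 2 can be checked before a job is submitted. The sketch below is a hypothetical pre-check, not part of FlinkServer; the dictionary keys are invented names for the JobManager memory, taskManager memory, checkpoint minimum interval, and checkpoint timeout fields.

```python
# Minimum values taken from Table 1 and Table 2; key names are illustrative.
MINIMUMS = {
    "jobmanager_memory_mb": 4096,
    "taskmanager_memory_mb": 4096,
    "checkpoint_min_interval_ms": 10,
    "checkpoint_timeout": 10,
}


def check_job_params(params):
    """Return a list of messages for values below their documented minimum.

    Parameters that are absent from params are not checked.
    """
    problems = []
    for name, minimum in MINIMUMS.items():
        value = params.get(name)
        if value is not None and value < minimum:
            problems.append(f"{name}={value} is below the minimum {minimum}")
    return problems


if __name__ == "__main__":
    print(check_job_params({
        "jobmanager_memory_mb": 2048,
        "taskmanager_memory_mb": 4096,
    }))
```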