Configuring the FlinkServer Job Restart Policy

Updated on 2024-12-13 GMT+08:00

FlinkServer Job Restart Policies

Flink supports different restart policies to control whether and how to restart a job when a fault occurs. If no restart policy is specified, the cluster uses the default restart policy. You can also specify a restart policy when submitting a job. For details about how to configure such a policy on the job development page of MRS 3.1.0 or later, see Creating a FlinkServer Job.

You can specify the restart policy by setting the restart-strategy parameter in the Flink configuration file Client installation directory/Flink/flink/conf/flink-conf.yaml, or you can specify it dynamically in the application code. The setting in the configuration file takes effect globally. The available policies are failure-rate and the following two policies, each of which is the default under the stated condition:

  • No restart: If CheckPoint is not enabled, this policy is used by default.
  • Fixed-delay: If CheckPoint is enabled but no restart policy is configured, this policy is used by default.
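
The default selection described above can be summarized as a small decision function. The following is a minimal Python sketch for illustration only, not FlinkServer code; the function name and its arguments are hypothetical.

```python
from typing import Optional


def effective_restart_strategy(configured: Optional[str], checkpoint_enabled: bool) -> str:
    """Return the restart policy that applies, following the defaults described above."""
    if configured:
        # An explicitly configured policy is used as-is.
        return configured
    # No policy configured: the default depends on whether CheckPoint is enabled.
    return "fixed-delay" if checkpoint_enabled else "none"


print(effective_restart_strategy(None, checkpoint_enabled=False))           # none
print(effective_restart_strategy(None, checkpoint_enabled=True))            # fixed-delay
print(effective_restart_strategy("failure-rate", checkpoint_enabled=True))  # failure-rate
```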

No restart Policy

When a fault occurs, the job fails and does not attempt to restart.

Configure the parameter as follows:

restart-strategy: none

fixed-delay Policy

When a fault occurs, the job attempts to restart a fixed number of times. If the number of attempts exceeds the value you specified, the job fails. The restart policy waits for a fixed period of time between two consecutive restart attempts.

In the following example, the job is restarted up to three times, with a 10-second wait between attempts, and fails if it still cannot recover. Configure the parameters as follows:

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
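
To make the counting concrete, the following minimal Python sketch models the fixed-delay behavior for the example values (3 attempts, 10 seconds). It is a simplified model, not Flink code. The function name is hypothetical, and the sketch assumes that the attempt counter is never reset while the job runs.

```python
def simulate_fixed_delay(failure_count: int, attempts: int = 3, delay_s: int = 10):
    """Return (restarts_performed, final_state, total_wait_s) for a job that fails failure_count times."""
    restarts = 0
    for _ in range(failure_count):
        if restarts >= attempts:
            # The restart budget is used up, so this failure is final.
            return restarts, "FAILED", restarts * delay_s
        restarts += 1  # wait delay_s seconds, then restart the job
    return restarts, "RUNNING", restarts * delay_s


print(simulate_fixed_delay(3))  # (3, 'RUNNING', 30)
print(simulate_fixed_delay(4))  # (3, 'FAILED', 30)
```

In this model the fourth failure ends the job, because the three configured attempts have already been spent.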

failure-rate Policy

When a fault occurs, the job is restarted. If the failure rate (the number of failures within a time interval) exceeds the value you configured, the job is considered failed. The restart policy waits for a fixed period of time between two consecutive restart attempts.

In the following example, the job is considered failed once its failures within a 10-minute interval exceed the configured maximum of three. The policy waits 10 seconds between restart attempts. Configure the parameters as follows:

restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 10 min
restart-strategy.failure-rate.delay: 10 s
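
The failure-rate check can be pictured as a sliding time window over failure timestamps. The following minimal Python sketch is a simplified model, not Flink code, and the function name is hypothetical. It assumes the job is given up when more than max_failures failures fall inside the window, following the wording above. The exact boundary in a given Flink version may differ by one, so verify it on your cluster.

```python
from collections import deque


def simulate_failure_rate(failure_times_s, max_failures=3, interval_s=600):
    """Return the time (s) at which the job fails for good, or None if it keeps running.

    failure_times_s: ascending timestamps, in seconds, at which the job fails.
    """
    window = deque()  # failure timestamps that are still inside the interval
    for t in failure_times_s:
        window.append(t)
        # Drop failures that are older than the interval.
        while t - window[0] >= interval_s:
            window.popleft()
        if len(window) > max_failures:
            return t  # failure rate exceeded: no further restart
    return None


print(simulate_failure_rate([0, 60, 120, 180]))           # 180
print(simulate_failure_rate([0, 3600, 7200, 10800, 14400]))  # None
```

The second call shows why occasional faults spread over hours do not end the job under this policy: each failure has left the window before the next one arrives.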

Selecting a Restart Policy

  • If you do not want to retry a failed job, select the No restart policy.
  • To retry a failed job, select the failure-rate policy. With the fixed-delay policy, hardware faults such as network and memory faults can cause the number of job failures to reach the maximum number of retries, and the job then fails.

    To prevent repeated restarts when the failure-rate policy is used, configure parameters as follows:

    restart-strategy: failure-rate
    restart-strategy.failure-rate.max-failures-per-interval: 3
    restart-strategy.failure-rate.failure-rate-interval: 10 min
    restart-strategy.failure-rate.delay: 10 s
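
Before applying such settings, it can help to read them back in plain words. The following minimal Python sketch parses key: value lines like the ones above and summarizes the failure-rate settings. The helper names are hypothetical, and the duration parser accepts only the few units listed in the code, which is an assumption of this sketch rather than the full set that Flink accepts.

```python
import re

# Units understood by this sketch only.
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "min": 60, "h": 3600, "d": 86400}


def parse_duration_s(text: str) -> float:
    """Convert a duration such as '10 s' or '10 min' to seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]+)\s*", text)
    if not match or match.group(2) not in _UNIT_SECONDS:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def summarize_failure_rate(conf_text: str) -> str:
    """Read 'key: value' lines and describe the failure-rate settings."""
    conf = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition(":")
            conf[key.strip()] = value.strip()
    if conf.get("restart-strategy") != "failure-rate":
        raise ValueError("restart-strategy is not failure-rate")
    prefix = "restart-strategy.failure-rate."
    max_failures = int(conf[prefix + "max-failures-per-interval"])
    interval = parse_duration_s(conf[prefix + "failure-rate-interval"])
    delay = parse_duration_s(conf[prefix + "delay"])
    return f"up to {max_failures} failures per {interval:g} s, {delay:g} s between restarts"


example = """
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 10 min
restart-strategy.failure-rate.delay: 10 s
"""
print(summarize_failure_rate(example))  # up to 3 failures per 600 s, 10 s between restarts
```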

Creating a FlinkServer Job

The statements of a SQL job submitted on FlinkServer are saved to DBServer. In MRS 3.5.0 and later versions, FlinkServer encrypts the stored SQL by default to protect sensitive information. When the FlinkSQL statements are displayed on the FlinkServer web UI, the password field in the SQL statement is left blank, so you must enter the password before you submit a job. For a custom connector, the password field name must contain the keyword password to prevent the password from being displayed on the page.

NOTE:

Disabling encrypted SQL storage may lead to password leaks. You are advised to retain the default setting. If you still need to disable the function, perform the following operations:

  1. (Optional) Back up jobs and then delete all jobs. For details about how to back up and import jobs, see Importing and Exporting FlinkServer Job Information.
  2. Change the value of ENABLE_DB_ENCRYPT to false.

    Log in to the active and standby FlinkServer nodes, set ENABLE_DB_ENCRYPT in the $BIGDATA_HOME//FusionInsight_Flink_x.x.x/x_x_FlinkServer/etc/flinkserver_service.properties file to false, save the file, and exit.

  3. Restart the affected FlinkServer instance.

    On FusionInsight Manager, choose Cluster > Services > Flink > Instances, select all FlinkServer instances, click More, and select Restart Instance to restart the instances.
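
Step 2 above amounts to changing one key in a properties file. The following minimal Python sketch shows that edit on an in-memory string. It assumes the file uses the usual key=value layout of .properties files. The function name, the sample content, and OTHER_KEY are hypothetical. Back up the real file before changing it.

```python
def set_property(properties_text: str, key: str, value: str) -> str:
    """Return properties_text with key set to value, appending the key if it is missing."""
    lines = properties_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue  # leave comments and blank lines untouched
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            found = True
    if not found:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Hypothetical excerpt, not the real file content.
sample = "# sample\nENABLE_DB_ENCRYPT=true\nOTHER_KEY=1\n"
updated = set_property(sample, "ENABLE_DB_ENCRYPT", "false")
print(updated)
```

The change must be made on both the active and standby FlinkServer nodes, and it takes effect only after the instances are restarted as described in step 3.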

  1. Access the Flink web UI. For details, see Accessing the FlinkServer Web UI.
  2. Click Job Management. The job management page is displayed.
  3. Click Create Job. Create a Flink SQL job or Flink Jar job, enter job information, and click OK. The job is created and the job development page is displayed.
  4. (Optional) To develop a job immediately, configure the job on the job development page.

    The system allows you to add a lock to a job. The user who locks the job has all permissions of the job. Other users do not have the permissions to develop, start, or delete the locked job. However, they can forcibly acquire the lock to obtain all permissions. After this function is enabled, you can Lock and Unlock a job, or click Acquire Lock to obtain job permissions.

    NOTE:

    Job locks are enabled by default. You can view the status of this function on FusionInsight Manager. This function is available for MRS 3.3.0 or later only.

    Log in to FusionInsight Manager, choose Cluster > Services > Flink, click Configurations and then All Configurations, and search for the job.edit.lock.enable parameter. If the parameter value is true, the function is enabled. If the parameter value is false, the function is disabled.

    • Creating a Flink SQL job
      1. Develop the job on the job development page.
        Figure 1 FlinkServer job development page

      2. Click Check Semantic to check the input content and click Format SQL to format SQL statements.
      3. Set basic and customized parameters as required by referring to Table 1 and click Save.
        Table 1 Basic parameters

        Parameter | Description
        --- | ---
        Parallelism | Number of parallel jobs.
        Maximum Operator Parallelism | Maximum degree of parallelism of operators.
        JobManager Memory (MB) | Memory of JobManager. The minimum value is 4096.
        Submit Queue | Queue to which a job is submitted. If this parameter is not set, the default queue is used.
        taskManager | taskManager running parameters. Slots: The default value is 1. You are advised to set this parameter to the number of CPU cores. Memory (MB): The minimum value is 4096.
        Enable CheckPoint | Whether to enable CheckPoint. After CheckPoint is enabled, you need to configure the following information. Time Interval (ms): This parameter is mandatory. Mode: This parameter is mandatory. The options are EXACTLY_ONCE and AT_LEAST_ONCE. Minimum Interval (ms): The minimum value is 10. Timeout Duration: The minimum value is 10. Maximum Parallelism: The value must be a positive integer containing a maximum of 64 characters. Whether to clean up: This parameter can be set to Yes or No. Whether to enable incremental checkpoints: This parameter can be set to Yes or No.
        Failure Recovery Policy | Failure recovery policy of a job. For details, see Configuring the FlinkServer Job Restart Policy. The options are as follows. fixed-delay: You need to configure Retry Times and Retry Interval (s). failure-rate: You need to configure Max Retry Times, Interval (min), and Retry Interval (s). none.
      4. Click Submit in the upper left corner to submit the job.
    • Creating a Flink JAR job
      1. Click Select to upload a local JAR file and set parameters by referring to Table 2 or add customized parameters.
        Table 2 Parameter configuration

        Parameter | Description
        --- | ---
        Local .jar File | Upload a local JAR file smaller than the threshold specified by flinkserver.upload.jar.max.size. The default value is 500 MB. To change the threshold, log in to FusionInsight Manager, choose Cluster > Services > Flink > Configurations > All Configurations, search for flinkserver.upload.jar.max.size, and set the JAR file threshold. The value ranges from 100 MB to 5,120 MB.
        Main Class | Main-Class type. Default: By default, the class name is specified based on the Manifest file in the JAR file. Specify: Manually specify the class name.
        Class Name | Class name. This parameter is available when Main Class is set to Specify.
        Class Parameter | Class parameters of Main-Class (parameters are separated by spaces).
        Parallelism | Number of parallel jobs, that is, the number of concurrent tasks of each job operator. Appropriately increasing the value improves the overall computing performance of a job. Because additional threads add switchover overhead, the maximum value is four times the number of SPUs used by the computing unit. One to two times the number of SPUs of the computing unit is optimal.
        JobManager Memory (MB) | Memory of JobManager. The minimum value is 4096.
        Submit Queue | Queue to which a job is submitted. If this parameter is not set, the default queue is used.
        taskManager | taskManager running parameters. Slots: The default value is 1. You are advised to set this parameter to the number of CPU cores. Memory (MB): The minimum value is 4096.
      2. Click Save to save the configuration and click Submit to submit the job.

  5. Return to the job management page. You can view information about the created job, including job name, type, status, kind, and description.

    After a job is created, you can start, develop, stop, edit, and delete the job, view job details, and rectify checkpoint faults in the Operation column of the job.

    NOTE:
    • To read files related to a submitted job on the node as a different user, ensure that this user belongs to the same user group as the user who submitted the job and has been assigned a FlinkServer application management role, for example, a role for which application view is selected as described in Creating a FlinkServer Role.
    • You can view details about jobs in the Running state.
    • You can rectify checkpoint faults for jobs in the Running failed, Running succeeded, or Stop state.
    • To set whether the checkpoints of failed or canceled jobs can be retained, log in to FusionInsight Manager and choose Cluster > Services > Flink, click Configurations and then All Configurations, search for and set the execution.checkpointing.externalized-checkpoint-retention parameter of FlinkServer.
      • DELETE_ON_CANCELLATION: Only checkpoints of failed jobs will be retained.
      • RETAIN_ON_CANCELLATION (default value in MRS 3.5.0 or later): Checkpoints of failed or canceled jobs will be retained.
      • NO_EXTERNALIZED_CHECKPOINTS (default value in MRS versions earlier than 3.5.0): Checkpoints of failed or canceled jobs will not be saved.
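
The three retention options above can be read as a lookup from option and job outcome to whether the checkpoint is kept. The following minimal Python sketch encodes only what the option descriptions state. The function name and the outcome labels "failed" and "canceled" are hypothetical.

```python
def checkpoint_retained(retention: str, job_outcome: str) -> bool:
    """Return True if the checkpoint is kept, per the option descriptions above.

    job_outcome must be "failed" or "canceled".
    """
    if job_outcome not in ("failed", "canceled"):
        raise ValueError(f"unknown job outcome: {job_outcome!r}")
    if retention == "RETAIN_ON_CANCELLATION":
        return True  # kept for both failed and canceled jobs
    if retention == "DELETE_ON_CANCELLATION":
        return job_outcome == "failed"  # kept only when the job failed
    if retention == "NO_EXTERNALIZED_CHECKPOINTS":
        return False  # never saved
    raise ValueError(f"unknown retention option: {retention!r}")


for option in ("DELETE_ON_CANCELLATION", "RETAIN_ON_CANCELLATION", "NO_EXTERNALIZED_CHECKPOINTS"):
    print(option, checkpoint_retained(option, "failed"), checkpoint_retained(option, "canceled"))
```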
