Updated on 2023-03-17 GMT+08:00

What Is the Recommended Configuration for a Flink Job?

When you create a Flink job, you can perform the following operations to ensure high reliability of stream applications:

Create an SMN topic and add an email address or mobile number to subscribe to the topic. You will receive a subscription notification by email or text message. Click the link in the notification to confirm the subscription.
Figure 1 Creating a topic

Figure 2 Adding a subscription
Log in to the DLI console, create a Flink SQL job, write SQL statements for the job, and configure running parameters.

The reliability configuration of a Flink Jar job is the same as that of a SQL job.
1. Set CUs, Job Manager CUs, and Max Concurrent Jobs based on the following formula:
  Total number of CUs = Number of manager CUs + (Total number of concurrent operators / Number of slots of a TaskManager) x Number of TaskManager CUs
  
  For example, if the total number of CUs is 9, the number of manager CUs is 1, and the maximum number of concurrent jobs is 16, the number of compute-specific CUs is 8 (9 - 1). This matches 16 / 2 x 1 when each TaskManager has 2 slots and occupies 1 CU.
  
  If you do not configure the TaskManager specifications, a TaskManager occupies 1 CU by default and has no slots. To ensure high reliability, set the number of slots of each TaskManager to 2, in line with the preceding formula.
  
  Set the maximum number of concurrent jobs to twice the number of CUs.
2. Select Save Job Log and select an OBS bucket. If the bucket is not authorized, click Authorize. This allows job logs to be saved to your OBS bucket after a job fails, so that you can use them to locate the fault.
  Figure 3 Specifying a bucket
3. Select Alarm Generation upon Job Exception and select the SMN topic you created earlier. This allows DLI to send a notification to your email address or phone when a job exception occurs, so that you learn about exceptions promptly.
  Figure 4 Alarm generation upon job exception
4. Select Enable Checkpointing and set the checkpoint interval and mode as needed. This function ensures that a failed Flink task can be restored from the latest checkpoint.
  Figure 5 Checkpoint parameters
  - Checkpoint interval is the time between two consecutive checkpoint triggers. Checkpointing reduces real-time computing performance, so weigh that overhead against the recovery duration when you set the interval: a longer interval lowers the overhead but leaves more data to reprocess after a failure. Set the interval to a value greater than the time a checkpoint takes to complete. The recommended value is 5 minutes.
  - The Exactly once mode ensures that each piece of data is consumed only once, and the At least once mode ensures that each piece of data is consumed at least once. Select a mode as you need.
5. Select Auto Restart upon Exception and Restore Job from Checkpoint, and set the number of retry attempts as needed.
6. Configure Dirty Data Policy. You can select Ignore, Trigger a job exception, or Save based on your service requirements.
7. Select a queue, then submit and run the job.
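The sizing formula in step 1 and the checkpoint interval guidance in step 4 can be checked with a short script before you fill in the console fields. The following is a minimal sketch, not part of DLI: the function and parameter names are hypothetical, `parallelism` stands for the total number of concurrent operators, and rounding the TaskManager count up to a whole number is an assumption that the formula itself does not state.

```python
import math


def total_cus(manager_cus, parallelism, slots_per_taskmanager, taskmanager_cus=1):
    """Total CUs = manager CUs + (parallelism / slots per TaskManager) x TaskManager CUs.

    Assumption: a partially filled TaskManager still occupies its full CUs,
    so the TaskManager count is rounded up.
    """
    if slots_per_taskmanager < 1:
        raise ValueError("each TaskManager needs at least one slot")
    taskmanagers = math.ceil(parallelism / slots_per_taskmanager)
    return manager_cus + taskmanagers * taskmanager_cus


def checkpoint_interval_ok(interval_seconds, checkpoint_duration_seconds):
    """Return True if the interval is greater than the time one checkpoint takes."""
    return interval_seconds > checkpoint_duration_seconds


if __name__ == "__main__":
    # Example from step 1: 1 manager CU, 16 concurrent operators, 2 slots, 1 CU per TaskManager.
    cus = total_cus(manager_cus=1, parallelism=16, slots_per_taskmanager=2)
    print("Total CUs:", cus)
    print("Compute-specific CUs:", cus - 1)

    # Recommended interval of 5 minutes against a hypothetical 45-second checkpoint.
    print("Interval OK:", checkpoint_interval_ok(5 * 60, 45))
```

The checkpoint duration passed to `checkpoint_interval_ok` is a value you observe for your own job; the 45 seconds used here is only a placeholder.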
Log in to the Cloud Eye console. In the navigation pane on the left, choose Cloud Service Monitoring > Data Lake Insight. Locate the target Flink job and click Create Alarm Rule.
Figure 6 Cloud service monitoring

Figure 7 Creating an alarm rule

DLI provides various monitoring metrics for Flink jobs. You can define alarm rules as required using different monitoring metrics for fine-grained job monitoring.