What Is the Recommended Configuration for a Flink Job?

When you create a Flink job, you can perform the following operations to ensure high reliability of stream applications:

Create an SMN topic and add an email address or mobile number to subscribe to the topic. You will receive a subscription confirmation by email or SMS message. Click the link in it to confirm the subscription.
Figure 1 Creating a topic

Figure 2 Adding a subscription
Log in to the DLI console, create a Flink SQL job, write SQL statements for the job, and configure running parameters.

The reliability configuration of a Flink Jar job is the same as that of a SQL job.
1. Set CUs, Job Manager CUs, and Max Concurrent Jobs based on the following formulas:
  Total number of CUs = Number of manager CUs + (Total number of concurrent operators / Number of slots of a TaskManager) × Number of TaskManager CUs
  
  For example, if the total number of CUs is 9, the number of manager CUs is 1, and the maximum number of concurrent jobs is 16, the number of compute-specific CUs is 8.
  
  If you do not configure the TaskManager specifications, a TaskManager occupies 1 CU by default and has no slot. To ensure high reliability, set the number of slots of the TaskManager to 2, according to the preceding formula.
  
  Set the maximum number of concurrent jobs to twice the number of CUs.
2. Select Save Job Log and select an OBS bucket. If the bucket is not authorized, click Authorize. This allows job logs to be saved to your OBS bucket after a job fails, so that you can use them to locate the fault.
  Figure 3 Specifying a bucket
3. Select Alarm Generation upon Job Exception and select the SMN topic you created earlier. This allows DLI to send a notification to your email address or mobile phone when a job exception occurs, so that you learn of exceptions promptly.
  Figure 4 Alarm generation upon job exception
4. Select Enable Checkpointing and set the checkpoint interval and mode as needed. This function ensures that a failed Flink task can be restored from the latest checkpoint.
  Figure 5 Checkpoint parameters
  - Checkpoint interval refers to the time between two consecutive checkpoint triggers. Checkpointing affects real-time computing performance, so weigh the impact on business performance against the recovery duration when you configure the interval. You are advised to set the interval to a value greater than the time a checkpoint takes to complete, preferably 5 minutes.
  - The Exactly once mode ensures that each piece of data is consumed only once, and the At least once mode ensures that each piece of data is consumed at least once. Select a mode as needed.
5. Select Auto Restart upon Exception and Restore Job from Checkpoint, and set the number of retry attempts as needed.
6. Configure Dirty Data Policy. You can select Ignore, Trigger a job exception, or Save based on your service requirements.
7. Select a queue, then submit and run the job.
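The CU formula in the first parameter step can be checked with a few lines of arithmetic. The following is a minimal sketch, not a DLI API: the function name total_cus and its parameter names are hypothetical, the total number of concurrent operators is taken to be the value entered for Max Concurrent Jobs, and rounding up to whole TaskManagers is an assumption made here for the case where the parallelism is not a multiple of the slot count.

```python
import math


def total_cus(manager_cus: int, total_parallelism: int,
              slots_per_tm: int, cus_per_tm: int = 1) -> int:
    """Estimate the total CUs for a Flink job from the formula in this section.

    All names are illustrative. Rounding up to whole TaskManagers is an
    assumption; the formula itself only states the division.
    """
    if slots_per_tm < 1:
        raise ValueError("slots_per_tm must be at least 1")
    # Number of TaskManagers needed to host every parallel operator instance.
    task_managers = math.ceil(total_parallelism / slots_per_tm)
    return manager_cus + task_managers * cus_per_tm


# Example from this section: 1 manager CU, parallelism 16, 2 slots and 1 CU per TaskManager.
total = total_cus(manager_cus=1, total_parallelism=16, slots_per_tm=2)
print(total)      # 9 CUs in total
print(total - 1)  # 8 CUs left for computing
```

With these inputs the sketch reproduces the example above: 9 CUs in total, of which 8 are used for computing.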
Log in to the Cloud Eye console. In the navigation pane on the left, choose Cloud Service Monitoring > Data Lake Insight. Locate the target Flink job and click Create Alarm Rule.
Figure 6 Cloud service monitoring

Figure 7 Creating an alarm rule

DLI provides various monitoring metrics for Flink jobs. You can define alarm rules as required using different monitoring metrics for fine-grained job monitoring.

For details about the monitoring metrics, see DLI Monitoring Metrics in the Data Lake Insight User Guide.