Updated on 2024-08-30 GMT+08:00

Performance Tuning Rules

Run Compaction on Hudi Tables to Prevent Long Checkpointing of the Hudi Source Operator

If checkpointing of the Hudi Source operator takes a long time, check whether compaction is running normally on the Hudi table. If the table has not been compacted for a long time, the performance of list operations on the table deteriorates, which slows down checkpointing.
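
As an illustration, the following Flink SQL sketch shows the writer-side options that control compaction on a MERGE_ON_READ Hudi table. The table name, columns, and path are hypothetical, and the value of 5 delta commits is only an example; check the options supported by your Hudi version before using them.

```sql
-- Hypothetical table; only the compaction-related options matter here.
CREATE TABLE hudi_orders (
  order_id STRING,
  amount   DOUBLE,
  ts       TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi/hudi_orders',       -- assumed path
  'table.type' = 'MERGE_ON_READ',
  'compaction.schedule.enabled' = 'true',        -- generate compaction plans
  'compaction.async.enabled' = 'true',           -- run compaction inside the write job
  'compaction.trigger.strategy' = 'num_commits', -- trigger by number of delta commits
  'compaction.delta_commits' = '5'               -- example threshold
);
```

If compaction is scheduled but never completes, the table keeps accumulating log files, so it is worth confirming on the table timeline that compaction instants actually finish.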

Set Table TTL to Reduce the Backend Data Volume When Joining a Fact Table and a Dimension Table

For details, see Optimize State Backends Through Table-Level TTL.

Set Proper Degree of Parallelism

The processing speed of tasks depends on parallelism. Generally, increasing parallelism effectively improves read speed. However, if parallelism is too high, some node resources may be wasted, and if it is too low, tasks on some nodes may run slowly. In a SQL job, you cannot set parallelism for a specific task; you can only set a single parallelism that applies to all tasks.
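
A minimal sketch of setting one parallelism for the whole job, assuming the job is submitted through the Flink SQL client. The value 4 is illustrative only; on a managed platform the same setting may instead be exposed as a job parameter.

```sql
-- Applies to every operator in the job; a SQL job cannot override it per task.
SET 'parallelism.default' = '4';
```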

Set the source parallelism based on the upstream component. For a streaming system, set the parallelism to the number of upstream partitions (for example, the number of Kafka topic partitions). For a batch system, set the parallelism to the number of upstream slices (for example, the number of HDFS blocks).

When adjusting the parallelism of a Flink job, consider the Source, Sink, and intermediate computing operators. If the job flow diagram shows that an intermediate computing operator (for example, a join operator) is busy, adjust the parallelism of the job, which in turn changes the parallelism of that operator.