Updated on 2024-12-10 GMT+08:00

Development Suggestions

Traffic Limiting Must Be Set When Kafka Is the Source

This rule is available for MRS 3.3.0 or later.

To prevent job exceptions caused by traffic bursts, set a traffic limit on the Kafka source. Use the peak throughput measured in the pressure test before service rollout as the limit.

[Example]

# Limit in records per second. It applies to each parallel subtask, whatever the job parallelism:
'scan.records-per-second.limit' = '1000'
# The effective traffic limit of the job is:
min(parallelism * scan.records-per-second.limit, number of partitions * scan.records-per-second.limit)
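As a sketch of where the option goes, the limit can be placed in the WITH clause of a Kafka source table. The table name kafka_source and its columns are hypothetical; the topic, broker address, and group ID are placeholders borrowed from the other example on this page, and the parallelism and partition count in the closing comment are assumed values.

```sql
-- Hypothetical Kafka source table; replace the topic and broker address with your own.
CREATE TABLE kafka_source (
  f_sequence INT,
  f_sequence1 INT
) WITH (
  'connector' = 'kafka',
  'topic' = 'yxtest123',
  'properties.bootstrap.servers' = '192.168.0.104:9092',
  'properties.group.id' = 'testGroup1',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json',
  -- Traffic limit in records per second (the option described in this rule).
  'scan.records-per-second.limit' = '1000'
);

-- Assuming a job parallelism of 4 and a topic with 3 partitions:
-- min(4 * 1000, 3 * 1000) = 3000 records per second for the whole job.
```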

Write Data with the Same Key to the Same Kafka Partition for Data Accuracy

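Configure Flink to write to Kafka with the fixed partitioner, and hash-distribute records by key before they reach the sink. With the fixed partitioner, each sink subtask writes to a single Kafka partition, so records with the same key, once routed to the same subtask, land in the same partition.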

[Example]

CREATE TABLE kafka (
  f_sequence INT,
  f_sequence1 INT,
  f_sequence2 INT,
  f_sequence3 INT
) WITH (
  'connector' = 'kafka',
  'topic' = 'yxtest123',
  'properties.bootstrap.servers' = '192.168.0.104:9092',
  'properties.group.id' = 'testGroup1',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json',
  -- Each sink subtask writes to one fixed Kafka partition.
  'sink.partitioner' = 'fixed'
);

-- Distribute rows by f_sequence and f_sequence1 before they are written to Kafka.
INSERT INTO kafka SELECT /*+ DISTRIBUTEBY('f_sequence','f_sequence1') */ * FROM datagen;
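
The INSERT statement above reads from a table named datagen that the example does not define. A minimal sketch of such a table is shown here, assuming Flink's built-in datagen connector and an arbitrary rate of 10 rows per second. It would need to be created before the INSERT statement runs.

```sql
-- Hypothetical test source that generates random INT values for the four columns.
CREATE TABLE datagen (
  f_sequence INT,
  f_sequence1 INT,
  f_sequence2 INT,
  f_sequence3 INT
) WITH (
  'connector' = 'datagen',
  -- Assumed generation rate; tune it for your test.
  'rows-per-second' = '10'
);
```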

Set the Kafka Source Parallelism to the Number of Topic Partitions for Faster Kafka Consumption

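Each topic partition is consumed by only one source subtask. When the parallelism of the Kafka source is greater than the number of topic partitions, the extra subtasks are assigned no partition and stay idle, so consumption does not get any faster.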
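
[Example]

A minimal sketch, assuming a topic with 3 partitions and an environment that accepts SET statements, as the open-source Flink SQL client does. Depending on how the job is submitted, the parallelism may have to be set in the job configuration instead. Note that parallelism.default applies to the whole job, not only to the Kafka source.

```sql
-- Assumed topic layout: 3 partitions, so 3 source subtasks can each read one partition.
SET 'parallelism.default' = '3';

-- With a parallelism of 5 on the same topic, 2 of the 5 source subtasks
-- would be assigned no partition and would stay idle.
```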