更新时间:2024-05-28 GMT+08:00

Flink SQL逻辑开发规则

维表lookup join场景维度表个数不超过五个

Hudi维度表都在TM heap中,当维表过多时heap中保存的维表数据过多,TM会不断GC,导致作业性能下降。

【示例】lookup join维表数5个:

CREATE TABLE table1(id  int, param1 string) with(...);
CREATE TABLE table2(id  int, param2 string) with(...);
CREATE TABLE table3(id  int, param3 string) with(...);
CREATE TABLE table4(id  int, param4 string) with(...);
CREATE TABLE table5(id  int, param5 string) with(...);
CREATE TABLE orders (
     order_id    STRING,
     price       DECIMAL(32,2),
     currency    STRING,
     order_time  TIMESTAMP(3),
     WATERMARK FOR order_time AS order_time
) WITH (/* ... */);

select 
    o.*, t1.param1, t2.param2, t3.param3, t4.param4, t5.param5
from 
    orders AS o
    JOIN table1 FOR SYSTEM_TIME AS OF o.proc_time AS t1 ON o.order_id = t1.id
    JOIN table2 FOR SYSTEM_TIME AS OF o.proc_time AS t2 ON o.order_id = t2.id
    JOIN table3 FOR SYSTEM_TIME AS OF o.proc_time AS t3 ON o.order_id = t3.id
    JOIN table4 FOR SYSTEM_TIME AS OF o.proc_time AS t4 ON o.order_id = t4.id
    JOIN table5 FOR SYSTEM_TIME AS OF o.proc_time AS t5 ON o.order_id = t5.id;

多流Join场景事实流表个数不超过三个

当Join表过多时,状态后端压力太大会导致端到端时延增加。

【示例】实时Join维表数3个:

CREATE TABLE table1(id  int, param1 string) with(...);
CREATE TABLE table2(id  int, param2 string) with(...);
CREATE TABLE table3(id  int, param3 string) with(...);
CREATE TABLE orders (
     order_id    STRING,
     price       DECIMAL(32,2),
     currency    STRING,
     order_time  TIMESTAMP(3),
     WATERMARK FOR order_time AS order_time
) WITH (/* ... */);

select 
    o.*, t1.param1, t2.param2, t3.param3
from 
    orders AS o
    JOIN table1 AS t1 ON o.order_id = t1.id
    JOIN table2 AS t2 ON o.order_id = t2.id
    JOIN table3 AS t3 ON o.order_id = t3.id;

关联嵌套层级不超过三层

嵌套层级越多,回撤流的的数据量越大。

【示例】关联嵌套3层:

SELECT *
    FROM table1 WHERE column1 IN 
( 
   SELECT column1 
   FROM table2 WHERE column2 IN (
      SELECT column2
      FROM table3 WHERE column3 = 'value'
   )
)

基于Hudi表的lookup join单表数据量不超过1GB

Hudi维度表都在TM heap中,当维表过大时heap中保存的维表数据过多,TM会不断GC导致作业性能下降。

流流关联中不能加入批Source算子

流流关联中不能加入批Source算子,根据业务情况将该Source算子调整为维表算子。