Updated on 2024-08-30 GMT+08:00

Development Rules

Do Not Add Over Five Dimension Tables In Lookup Join

Hudi dimension tables are stored in the TM heap memory. When there are too many dimension tables, heap memory stores too much dimension table data, and the TM will consistently trigger GC. As a result, the job performance deteriorates.

[Example] Set the number of dimension tables in a lookup join to 5:

CREATE TABLE table1(id  int, param1 string) with(...);
CREATE TABLE table2(id  int, param2 string) with(...);
CREATE TABLE table3(id  int, param3 string) with(...);
CREATE TABLE table4(id  int, param4 string) with(...);
CREATE TABLE table5(id  int, param5 string) with(...);
CREATE TABLE orders (
     order_id    STRING,
     price       DECIMAL(32,2),
     currency    STRING,
     order_time  TIMESTAMP(3),
     WATERMARK FOR order_time AS order_time
) WITH (/* ... */);

select 
    o.*, t1.param1, t2.param2, t3.param3, t4.param4, t5.param5
from 
    orders AS o
    JOIN table1 FOR SYSTEM_TIME AS OF o.proc_time AS t1 ON o.order_id = t1.id
    JOIN table2 FOR SYSTEM_TIME AS OF o.proc_time AS t2 ON o.order_id = t2.id
    JOIN table3 FOR SYSTEM_TIME AS OF o.proc_time AS t3 ON o.order_id = t3.id
    JOIN table4 FOR SYSTEM_TIME AS OF o.proc_time AS t4 ON o.order_id = t4.id
    JOIN table5 FOR SYSTEM_TIME AS OF o.proc_time AS t5 ON o.order_id = t5.id;

In a multi-stream join, the number of fact stream tables cannot exceed three.

When there are too many tables, the back-end pressure is too high, increasing the latency.

[Example] Join three dimension tables in real time.

CREATE TABLE table1(id  int, param1 string) with(...);
CREATE TABLE table2(id  int, param2 string) with(...);
CREATE TABLE table3(id  int, param3 string) with(...);
CREATE TABLE orders (
     order_id    STRING,
     price       DECIMAL(32,2),
     currency    STRING,
     order_time  TIMESTAMP(3),
     WATERMARK FOR order_time AS order_time
) WITH (/* ... */);

select 
    o.*, t1.param1, t2.param2, t3.param3
from 
    orders AS o
    JOIN table1 AS t1 ON o.order_id = t1.id
    JOIN table2 AS t2 ON o.order_id = t2.id
    JOIN table3 AS t3 ON o.order_id = t3.id;

The number of nested joins cannot exceed three.

A larger number of nesting levels indicates a larger amount of data to be withdrawn.

[Example] Nest three joins.

SELECT *
    FROM table1 WHERE column1 IN 
( 
   SELECT column1 
   FROM table2 WHERE column2 IN (
      SELECT column2
      FROM table3 WHERE column3 = 'value'
   )
)

A Single Lookup Join Table Should Not Be Larger than 1 GB in Hudi

Hudi dimension tables are stored in the TM heap memory. When a dimension table is too large, heap memory stores too much dimension table data, and the TM will consistently trigger GC. As a result, the job performance deteriorates.

Do Not Add Batch Source Operator to Stream Joins

Changed the Source operator to a dimension table operator based on service requirements.