Development Rules
Do Not Add Over Five Dimension Tables In Lookup Join
Hudi dimension tables are stored in the TM heap memory. When there are too many dimension tables, heap memory stores too much dimension table data, and the TM will consistently trigger GC. As a result, the job performance deteriorates.
[Example] Set the number of dimension tables in a lookup join to 5:
CREATE TABLE table1(id int, param1 string) with(...); CREATE TABLE table2(id int, param2 string) with(...); CREATE TABLE table3(id int, param3 string) with(...); CREATE TABLE table4(id int, param4 string) with(...); CREATE TABLE table5(id int, param5 string) with(...); CREATE TABLE orders ( order_id STRING, price DECIMAL(32,2), currency STRING, order_time TIMESTAMP(3), WATERMARK FOR order_time AS order_time ) WITH (/* ... */); select o.*, t1.param1, t2.param2, t3.param3, t4.param4, t5.param5 from orders AS o JOIN table1 FOR SYSTEM_TIME AS OF o.proc_time AS t1 ON o.order_id = t1.id JOIN table2 FOR SYSTEM_TIME AS OF o.proc_time AS t2 ON o.order_id = t2.id JOIN table3 FOR SYSTEM_TIME AS OF o.proc_time AS t3 ON o.order_id = t3.id JOIN table4 FOR SYSTEM_TIME AS OF o.proc_time AS t4 ON o.order_id = t4.id JOIN table5 FOR SYSTEM_TIME AS OF o.proc_time AS t5 ON o.order_id = t5.id;
In a multi-stream join, the number of fact stream tables cannot exceed three.
When there are too many tables, the back-end pressure is too high, increasing the latency.
[Example] Join three dimension tables in real time.
CREATE TABLE table1(id int, param1 string) with(...); CREATE TABLE table2(id int, param2 string) with(...); CREATE TABLE table3(id int, param3 string) with(...); CREATE TABLE orders ( order_id STRING, price DECIMAL(32,2), currency STRING, order_time TIMESTAMP(3), WATERMARK FOR order_time AS order_time ) WITH (/* ... */); select o.*, t1.param1, t2.param2, t3.param3 from orders AS o JOIN table1 AS t1 ON o.order_id = t1.id JOIN table2 AS t2 ON o.order_id = t2.id JOIN table3 AS t3 ON o.order_id = t3.id;
The number of nested joins cannot exceed three.
A larger number of nesting levels indicates a larger amount of data to be withdrawn.
[Example] Nest three joins.
SELECT * FROM table1 WHERE column1 IN ( SELECT column1 FROM table2 WHERE column2 IN ( SELECT column2 FROM table3 WHERE column3 = 'value' ) )
A Single Lookup Join Table Should Not Be Larger than 1 GB in Hudi
Hudi dimension tables are stored in the TM heap memory. When a dimension table is too large, heap memory stores too much dimension table data, and the TM will consistently trigger GC. As a result, the job performance deteriorates.
Do Not Add Batch Source Operator to Stream Joins
Changed the Source operator to a dimension table operator based on service requirements.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot