Help Center/
Data Lake Insight/
SQL Syntax Reference (To Be Offline)/
Flink Opensource SQL 1.10 Syntax Reference/
Data Manipulation Language (DML)/
Deduplication
Updated on 2023-03-06 GMT+08:00
Deduplication
Function
Deduplication removes rows that duplicate over a set of columns, keeping only the first one or the last one.
Syntax
SELECT [column_list] FROM ( SELECT [column_list], ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]] ORDER BY time_attr [asc|desc]) AS rownum FROM table_name) WHERE rownum = 1
Description
- ROW_NUMBER(): Assigns a unique, sequential number to each row, starting with one.
- PARTITION BY col1[, col2...]: Specifies the partition columns, for example, the deduplicate key.
- ORDER BY time_attr [asc|desc]: Specifies the ordering column, it must be a time attribute. Currently Flink supports proctime only. Ordering by ASC means to keep the first row, ordering by DESC means to keep the last row.
- WHERE rownum = 1: The rownum = 1 is required for Flink to recognize this query is deduplication.
Precautions
None
Example
The following examples show how to remove duplicate rows on order_id. The proctime is an event time attribute.
SELECT order_id, user, product, number FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY proctime ASC) as row_num FROM Orders) WHERE row_num = 1;
Parent topic: Data Manipulation Language (DML)
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
The system is busy. Please try again later.
For any further questions, feel free to contact us through the chatbot.
Chatbot