Deduplication

Function

Deduplication removes rows that duplicate over a set of columns, keeping only the first one or the last one.

Syntax Format

     SELECT [column_list]
FROM (
   SELECT [column_list],
     ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]]
       ORDER BY time_attr [asc|desc]) AS rownum
   FROM table_name)
WHERE rownum = 1
 
 
  

Syntax Description

ROW_NUMBER(): Assigns a unique, sequential number to each row, starting with one.
PARTITION BY col1[, col2...]: specifies the column of the partition, for example, the key for deduplication.
ORDER BY time_attr [asc|desc]: Specifies the ordering column, which must be a time attribute. Currently Flink supports proctime only. Ordering by ASC means to keep the first row, ordering by DESC means to keep the last row.
WHERE rownum = 1: The rownum = 1 is required for Flink to recognize this query is deduplication.

Precautions

None

Example

The following examples show how to remove duplicate rows on order_id. The proctime is an event time attribute.

     SELECT order_id, user, product, number
  FROM (
     SELECT *,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY proctime ASC) as row_num
     FROM Orders)
  WHERE row_num = 1;
 
 
  

Parent topic: Data Query Language (DQL) Syntax

Previous topic: Top-N

Next topic: Reserved Keywords

Feedback

Was this page helpful?

Helpful Not helpful

Provide feedback

Thank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.

The system is busy. Please try again later.

For any further questions, feel free to contact us through the chatbot.

Chatbot