Deduplication

Function

Deduplication removes rows that are duplicated over a set of columns, keeping only the first or the last row.

Syntax

SELECT [column_list]
FROM (
   SELECT [column_list],
     ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]]
       ORDER BY time_attr [asc|desc]) AS rownum
   FROM table_name)
WHERE rownum = 1

Description

ROW_NUMBER(): Assigns a unique, sequential number to each row within its partition, starting from 1.
PARTITION BY col1[, col2...]: Specifies the partition columns, that is, the deduplication key.
ORDER BY time_attr [asc|desc]: Specifies the ordering column, which must be a time attribute. Currently, only proctime is supported. Ascending (ASC) order keeps only the first row, and descending (DESC) order keeps only the last row.
WHERE rownum = 1: The rownum = 1 condition is required for Flink to recognize the query as deduplication.
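
As a sketch of the DESC case described above, the following query keeps the most recently processed row for each key and uses two partition columns as the deduplication key. The table name Orders and the columns order_id, product, and proctime are assumed for illustration only.

```sql
-- Keep the last row seen for each (order_id, product) pair.
-- Orders, order_id, product, and proctime are illustrative names.
SELECT order_id, product
FROM (
   SELECT order_id, product,
     ROW_NUMBER() OVER (PARTITION BY order_id, product
       ORDER BY proctime DESC) AS rownum
   FROM Orders)
WHERE rownum = 1;
```

Because a later row can replace one that has already been emitted, a keep-last query of this kind typically produces an updating result, so the sink is expected to accept updates.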

Precautions

None

Example

The following example shows how to remove duplicate rows on order_id, keeping the first row for each order. The proctime column is a processing-time attribute.

SELECT order_id, `user`, product, `number`
FROM (
   SELECT *,
     ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY proctime ASC) AS row_num
   FROM Orders)
WHERE row_num = 1;
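
The query above assumes that Orders already exposes a processing-time attribute named proctime. One possible way to declare such a table is sketched below. The column types and the datagen connector are assumptions chosen for illustration (open-source Flink 1.11 or later syntax), not the definition used by this document, and the connector options in your environment may differ.

```sql
-- Hypothetical source table for the Orders example.
-- Column types and the connector are assumptions, not taken from this topic.
CREATE TABLE Orders (
  order_id STRING,
  `user` STRING,            -- quoted in case the name is a reserved keyword
  product STRING,
  `number` BIGINT,          -- quoted in case the name is a reserved keyword
  proctime AS PROCTIME()    -- computed column that acts as the processing-time attribute
) WITH (
  'connector' = 'datagen'   -- placeholder source that generates random rows
);
```

Declaring proctime as a computed column means that the attribute is generated when each row is processed rather than read from the source data.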

