How Do I Optimize the INSERT OVERWRITE for Reading and Writing in Same Table?

Updated on 2024-10-09 GMT+08:00

View PDF

Scenario

If data needs to be inserted to the destination table using dynamic partitioning (update using historical partitions) and the destination table is the same as the data source table, running INSERT OVERWRITE on the source table may cause data loss or data inconsistency. To avoid this problem, you are advised to use a temporary table to process data, and then perform the INSERT OVERWRITE.

Procedure

The following table is taken as an example:

user_data(user_group int, user_name string, update_time timestamp);

In this table, user_group is the partitioning column. You need to sort the existing data by update time and update the user group information. To do so, perform the following steps:

On the Hive Beeline CLI, enable Hive dynamic partitioning.

set hive.exec.dynamic.partition=true;

set hive.exec.dynamic.partition.mode=nonstrict;
Create a temporary table for storing deduplicated data.

CREATE TABLE temp_user_data AS

SELECT * FROM (

SELECT *,

ROW_NUMBER() OVER(PARTITION BY user_group ORDER BY update_time DESC) as rank

FROM user_data

) tmp

WHERE rank = 1;
Use temporary table as the data source and insert data to the destination table.

INSERT OVERWRITE TABLE user_data

SELECT user_group, user_name, update_time

FROM temp_user_data;
Clear the temporary table.

DROP TABLE IF EXISTS temp_user_data;