How Do I Optimize an INSERT OVERWRITE That Reads From and Writes To the Same Table?
Scenario
If data needs to be inserted into the destination table using dynamic partitioning (that is, historical partitions are updated) and the destination table is also the data source, running INSERT OVERWRITE directly on the source table may cause data loss or inconsistency, because the partitions being read are overwritten while the query is still running. To avoid this problem, you are advised to stage the data in a temporary table first and then run INSERT OVERWRITE from that table.
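As a minimal sketch of the pattern to avoid (using the user_data table from the procedure below), the following statement reads from and overwrites partitions of the same table in a single step, so source partitions can be cleared before the read completes:

```sql
-- Risky: source and destination are the same partitioned table.
-- The partitions being overwritten are also being scanned.
INSERT OVERWRITE TABLE user_data PARTITION (user_group)
SELECT user_name, update_time, user_group
FROM user_data;
```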
Procedure
Take the following table as an example:
CREATE TABLE user_data (user_name string, update_time timestamp)
PARTITIONED BY (user_group int);
In this table, user_group is the partition column. You need to deduplicate the existing data by update_time, keeping only the latest record in each user_group partition. To do so, perform the following steps:
- On the Hive Beeline CLI, enable Hive dynamic partitioning.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
- Create a temporary table for storing deduplicated data.
CREATE TABLE temp_user_data AS
SELECT * FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY user_group ORDER BY update_time DESC) AS row_num
  FROM user_data
) tmp
WHERE row_num = 1;
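Before overwriting the destination table, you can optionally inspect the staged table. The DESCRIBE statement below is an illustrative check, not a required step; note that the CTAS result also contains the extra ranking column produced by the window function:

```sql
-- Optional: confirm the schema of the staged table.
DESCRIBE temp_user_data;
```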
- Use the temporary table as the data source and insert the data into the destination table.
-- Dynamic partition insert: partition columns must come last in the SELECT list.
INSERT OVERWRITE TABLE user_data PARTITION (user_group)
SELECT user_name, update_time, user_group
FROM temp_user_data;
- Delete the temporary table.
DROP TABLE IF EXISTS temp_user_data;
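After the overwrite, an optional way to confirm the result is to count rows per partition in the destination table; given the deduplication above, each user_group should contain exactly one row:

```sql
-- Optional verification on the destination table.
SELECT user_group, COUNT(*) AS cnt
FROM user_data
GROUP BY user_group;
```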