Optimizing Memory When Data Is Inserted into Spark Dynamic Partitioned Tables

Scenarios

When Spark SQL inserts data into dynamic partitioned tables, the more partitions there are, the more HDFS files a single task generates and the more memory metadata occupies. In this case, GC is severe and OOM errors may occur.

Based on the test results, there are 10,240 tasks and 2,000 partitions. Before renaming an HDFS file from the temporary to the target directory, the FileStatus metadata is approximately 29 GB in size. To avoid the preceding problem, you can modify the SQL statement to repartition data to reduce the number of HDFS files.

Procedure

Insert distribute by followed by partition fields into dynamic partition statements.

Example:

insert into table store_returns partition (sr_returned_date_sk) select sr_return_time_sk,sr_item_sk,sr_customer_sk,sr_cdemo_sk,sr_hdemo_sk,sr_addr_sk,sr_store_sk,sr_reason_sk,sr_ticket_number,sr_return_quantity,sr_return_amt,sr_return_tax,sr_return_amt_inc_tax,sr_fee,sr_return_ship_cost,sr_refunded_cash,sr_reversed_charge,sr_store_credit,sr_net_loss,sr_returned_date_sk from ${SOURCE}.store_returns distribute by sr_returned_date_sk;

Parent topic: Spark SQL Performance Tuning

Previous topic: Configuring the Default Number of Data Blocks Divided by SparkSQL

Next topic: Optimizing Small Files