Optimizing the Spark INSERT SELECT Statement

Scenario

The INSERT...SELECT operation can be optimized in the following scenarios:

  • Data in a large number of small files is queried.
  • Data in large files is queried.
  • A non-Spark user is used in Beeline/JDBCServer mode.

Procedure

The INSERT...SELECT operation can be optimized as follows:

  • When creating the Hive table, set its storage format to Parquet to accelerate execution of the INSERT...SELECT statement (see the first sketch after this list).
  • Run INSERT...SELECT operations with spark-sql, or with the Spark user in Beeline/JDBCServer mode. This avoids having to change the owner of the output files, which speeds up INSERT...SELECT execution.

    In Beeline/JDBCServer mode, the executor and the driver run as the same user. Because the driver is part of the JDBCServer process and JDBCServer runs as the Spark user, the driver also runs as the Spark user. Currently, the identity of the Beeline client user cannot be transparently passed to the executor at run time. Therefore, if a non-Spark user submits the job, the owner of each output file must be changed to the Beeline client user, that is, the actual user, which slows down execution.

  • Querying a large number of small files triggers many map tasks, which in turn produce numerous small output files that are time-consuming to rename. To address this, set spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes to control the number of bytes read by each partition. Multiple small files are then consolidated into one partition, which reduces the number of output files and the time spent renaming them, and ultimately shortens the execution time of the INSERT...SELECT statement (see the second sketch after this list).
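
The first item above can be illustrated with a short Scala sketch for spark-shell with Hive support enabled. The table and column names are placeholders rather than names from this product, and the snippet is a sketch of the general approach, not a definitive procedure.

    import org.apache.spark.sql.SparkSession

    // Minimal sketch, assuming a session with Hive support (as in spark-shell).
    val spark = SparkSession.builder()
      .appName("InsertSelectParquet")
      .enableHiveSupport()
      .getOrCreate()

    // Create the target Hive table with Parquet as the storage format
    // (target_table and its columns are placeholder names).
    spark.sql(
      """CREATE TABLE IF NOT EXISTS target_table (id INT, name STRING)
        |STORED AS PARQUET""".stripMargin)

    // INSERT...SELECT into the Parquet-backed table.
    spark.sql("INSERT INTO TABLE target_table SELECT id, name FROM source_table")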
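
The small-file consolidation described in the last item can be sketched as follows, continuing the session above. The property values and the source table name are examples only and should be tuned for the actual data volume.

    // Pack multiple small files into each read partition so that fewer map tasks,
    // and therefore fewer output files, are produced. Values are illustrative only.
    spark.conf.set("spark.sql.files.maxPartitionBytes", "268435456") // 256 MB per partition
    spark.conf.set("spark.sql.files.openCostInBytes", "4194304")     // estimated cost of opening one file

    // The scan over many small files now yields fewer tasks, so the INSERT...SELECT
    // statement has fewer output files to rename.
    spark.sql("INSERT INTO TABLE target_table SELECT id, name FROM small_files_table")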

These optimizations are ineffective in the following scenario:

A dynamic partition table has a large number of partitions.