Optimizing the Spark INSERT SELECT Statement

The INSERT...SELECT operation can be optimized in the following scenarios:

The INSERT...SELECT operation can be optimized as follows:

When creating a Hive table, set the storage type to Parquet to accelerate execution of the INSERT...SELECT statement.
Use spark-sql or a Spark user in Beeline/JDBCServer mode to execute INSERT...SELECT operations. This eliminates the need for changing the file owner, which quickens INSERT...SELECT statement execution.

In Beeline/JDBCServer mode, an executor and a driver are run by the same user. Because a driver is a part of JDBCServer and JDBCServer is run by a Spark user, the driver is also run by the Spark user. At present, the user of the Beeline client cannot be transparently transmitted to the executor during operation. If a non-Spark user is used, the owner of a file must be changed to the user of the Beeline client, that is, the actual user.
Querying a large number of small files triggers multiple map operations, which generates numerous small output files that are time-consuming to rename. To tackle this problem, you can set spark.sql.files.maxPartitionBytes and spark.files.openCostInBytes to limit the number of bytes read by a partition. This will consolidate multiple small files in a partition, reducing the number of output files and the time needed to rename them. Ultimately, this will shorten the execution time of the INSERT...SELECT statement.