Updated on 2022-09-14 GMT+08:00

Optimizing INSERT...SELECT Operation

Scenario

The INSERT...SELECT operation can be optimized in the following scenarios:

  • Data in a large number of small files is queried.
  • Data in large files is queried.
  • A non-Spark user is used in beeline/thriftserver mode.

Procedure

The INSERT...SELECT operation can be optimized as follows:

  • When creating a Hive table, set the storage type to Parquet to accelerate execution of the INSERT...SELECT statement.
  • Use spark-sql or a Spark user in beeline/thriftserver mode to execute INSERT...SELECT operations. This eliminates the need for changing the file owner, which quickens INSERT...SELECT statement execution.

    In beeline/thriftserver mode, an executor and a driver are run by the same user. Because a driver is a part of ThriftServer and ThriftServer is run by a Spark user, the driver is also run by the Spark user. At present, the user of the beeline client cannot be transparently transmitted to the executor during operation. If a non-Spark user is used, the owner of a file must be changed to the user of the beeline client, that is, the actual user.