Optimizing INSERT...SELECT Operation
Scenario
The INSERT...SELECT operation can be optimized in the following scenarios:
- Data in a large number of small files is queried.
- Data in large files is queried.
- A non-Spark user is used in beeline/thriftserver mode.
Procedure
The INSERT...SELECT operation can be optimized as follows:
- When creating a Hive table, set the storage type to Parquet to accelerate execution of the INSERT...SELECT statement.
- Use spark-sql or a Spark user in beeline/thriftserver mode to execute INSERT...SELECT operations. This eliminates the need for changing the file owner, which quickens INSERT...SELECT statement execution.
In beeline/thriftserver mode, an executor and a driver are run by the same user. Because a driver is a part of ThriftServer and ThriftServer is run by a Spark user, the driver is also run by the Spark user. At present, the user of the beeline client cannot be transparently transmitted to the executor during operation. If a non-Spark user is used, the owner of a file must be changed to the user of the beeline client, that is, the actual user.
Last Article: Optimizing the Spark SQL Join Operation
Next Article: Spark Streaming Tuning
Did this article solve your problem?
Thank you for your score!Your feedback would help us improve the website.