Optimizing INSERT...SELECT Operation
Scenario
The INSERT...SELECT operation can be optimized in the following scenarios:
- Data in a large number of small files is queried.
- Data in large files is queried.
- A non-Spark user is used in beeline/thriftserver mode.
Procedure
The INSERT...SELECT operation can be optimized as follows:
- When creating a Hive table, set the storage type to Parquet to accelerate execution of the INSERT...SELECT statement.
- Use spark-sql or a Spark user in beeline/thriftserver mode to execute INSERT...SELECT operations. This eliminates the need for changing the file owner, which quickens INSERT...SELECT statement execution.
In beeline/thriftserver mode, an executor and a driver are run by the same user. Because a driver is a part of ThriftServer and ThriftServer is run by a Spark user, the driver is also run by the Spark user. At present, the user of the beeline client cannot be transparently transmitted to the executor during operation. If a non-Spark user is used, the owner of a file must be changed to the user of the beeline client, that is, the actual user.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.