Esta página aún no está disponible en su idioma local. Estamos trabajando arduamente para agregar más versiones de idiomas. Gracias por tu apoyo.

On this page

Optimizing INSERT...SELECT Operation

Updated on 2022-09-14 GMT+08:00

Scenario

The INSERT...SELECT operation can be optimized in the following scenarios:

  • Data in a large number of small files is queried.
  • Data in large files is queried.
  • A non-Spark user is used in beeline/thriftserver mode.

Procedure

The INSERT...SELECT operation can be optimized as follows:

  • When creating a Hive table, set the storage type to Parquet to accelerate execution of the INSERT...SELECT statement.
  • Use spark-sql or a Spark user in beeline/thriftserver mode to execute INSERT...SELECT operations. This eliminates the need for changing the file owner, which quickens INSERT...SELECT statement execution.
    NOTE:

    In beeline/thriftserver mode, an executor and a driver are run by the same user. Because a driver is a part of ThriftServer and ThriftServer is run by a Spark user, the driver is also run by the Spark user. At present, the user of the beeline client cannot be transparently transmitted to the executor during operation. If a non-Spark user is used, the owner of a file must be changed to the user of the beeline client, that is, the actual user.

Feedback

Feedback

Feedback

0/500

Selected Content

Submit selected content with the feedback