What Should I Do If It Takes a Long Time for Spark SQL to Access Hive Partitioned Tables Before a Job Starts?

Symptom

When Spark SQL is used to access Hive partitioned tables stored in OBS, the access speed is slow and a large number of OBS query APIs are called.

Example SQL:

select a,b,c from test where b=xxx

Fault Locating

According to the configuration, the task should scan only the partition whose b is xxx. However, the task logs show that the task scans all partitions and then calculates the data whose b is xxx. As a result, the task calculation is slow. In addition, a large number of OBS requests are sent because all files need to be scanned.

By default, the execution plan optimization based on partition statistics is enabled on MRS, which is equivalent to automatic execution of Analyze Table. (The default configuration method is to set spark.sql.statistics.fallBackToHdfs to true. You can set this parameter to false.) After this function is enabled, table partition statistics are scanned during SQL execution and used as cost estimation in the execution plan. For example, small tables identified during cost evaluation are broadcast to each node in the memory for join operations, significantly reducing shuffle time. This function greatly optimizes performance in join scenarios, but increases the number of OBS calls.

Procedure

Set the following parameter in Spark SQL and then run the SQL statement:

set spark.sql.statistics.fallBackToHdfs=false;

Alternatively, run the --conf command to set this parameter to false before startup.

--conf spark.sql.statistics.fallBackToHdfs=false

Parent topic: Job Management

Feedback

Was this page helpful?

Helpful Not helpful

Provide feedback

Thank you very much for your feedback. We will continue working to improve the documentation.

The system is busy. Please try again later.