Updated on 2025-08-22 GMT+08:00

Optimizing Datasource Tables

Scenarios

Spark can save the partition information of datasource tables to the Metastore and process that partition information in the Metastore.

  • Datasource tables support partition-based syntax for adding, deleting, and modifying partitions, which improves compatibility with Hive.
  • Partition pruning is supported. Pruning predicates are pushed down to the Metastore, which filters out the partitions that do not match.
    Example:
    SELECT count(*) FROM table WHERE partCol = 1;    -- partCol is the partition column

    When the TableScan operation in the physical plan runs, only the data in the partCol=1 partition needs to be processed.
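
The partition operations described above can be sketched in Spark SQL as follows. This is a minimal illustration, not a prescribed procedure: the table name sales_ds and the columns id and amount are hypothetical, and the statements assume that spark.sql.hive.manageFilesourcePartitions is set to true so that partitions of the datasource table are tracked in the Metastore.

```sql
-- Hypothetical datasource table; table and column names are illustrative only.
CREATE TABLE sales_ds (id INT, amount DOUBLE, partCol INT)
USING parquet
PARTITIONED BY (partCol);

-- Add a partition; it is registered in the Metastore.
ALTER TABLE sales_ds ADD IF NOT EXISTS PARTITION (partCol = 1);

-- Write one row into that partition (the partition column comes last).
INSERT INTO sales_ds VALUES (1, 9.5, 1);

-- Inspect the plan of a query that filters on the partition column.
EXPLAIN SELECT count(*) FROM sales_ds WHERE partCol = 1;

-- Modify a partition, then delete it.
ALTER TABLE sales_ds PARTITION (partCol = 1) RENAME TO PARTITION (partCol = 2);
ALTER TABLE sales_ds DROP IF EXISTS PARTITION (partCol = 2);

-- List the partitions that remain.
SHOW PARTITIONS sales_ds;
```

In the EXPLAIN output, the scan node would normally show the partCol predicate as a partition filter, which indicates that only the matching partition is read. The exact plan text depends on the Spark version.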

Procedure

  1. Install the Spark client.

    For details, see Installing a Client.

  2. Log in to the Spark client node as the client installation user.

    Modify the following parameters in the {Client installation directory}/Spark/spark/conf/spark-defaults.conf file on the Spark client.
    Table 1 Parameter description

    Parameter

    Description

    Example Value

    spark.sql.hive.manageFilesourcePartitions

    Specifies whether to enable Metastore partition management for datasource tables and converted Hive tables.

    • true: Metastore partition management is enabled. Datasource tables store their partitions in the Hive Metastore, and the Metastore is used to prune partitions in query statements.
    • false: Metastore partition management is disabled.

    true

    spark.sql.hive.metastorePartitionPruning

    Specifies whether predicates can be pushed down to the Hive Metastore.

    • true: Predicates are pushed down to the Hive Metastore. Only predicates on Hive tables are supported.
    • false: Predicates are not pushed down to the Hive Metastore.

    true

    spark.sql.hive.filesourcePartitionFileCacheSize

    Size, in bytes, of the in-memory cache for partition file metadata.

    All tables share one cache, which can use up to the specified number of bytes for file metadata.

    This parameter is valid only when spark.sql.hive.manageFilesourcePartitions is set to true.

    262144000 (250 * 1024 * 1024)

    spark.sql.hive.convertMetastoreOrc

    Specifies how Spark SQL processes ORC tables.

    • false: Spark SQL uses Hive SerDe to process ORC tables.
    • true: Spark SQL uses the built-in Spark mechanism to process ORC tables.

    true
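
After the file is modified, the effective values can be checked from a Spark SQL session started on the client. The following is a sketch that assumes a new session is opened after the change, so that spark-defaults.conf is read again. Running SET with only a key displays the current value of that parameter and does not change it.

```sql
-- Display the current value of each parameter from Table 1.
SET spark.sql.hive.manageFilesourcePartitions;
SET spark.sql.hive.metastorePartitionPruning;
SET spark.sql.hive.filesourcePartitionFileCacheSize;
SET spark.sql.hive.convertMetastoreOrc;
```

If a displayed value differs from the one in the file, the session was probably started with a different configuration directory or with the value overridden on the command line.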