Updated on 2024-05-31 GMT+08:00

Common Configuration Items of Batch SQL Jobs

This section describes the common configuration items of the SQL syntax for DLI batch jobs.

Table 1 Common configuration items

Item

Default Value

Description

spark.sql.files.maxRecordsPerFile

0

Maximum number of records to be written into a single file. If the value is zero or negative, there is no limit.

spark.sql.shuffle.partitions

200

Default number of partitions used to filter data for join or aggregation.

spark.sql.dynamicPartitionOverwrite.enabled

false

Whether DLI overwrites the partitions where data will be written into during runtime. If you set this parameter to false, all partitions that meet the specified condition will be deleted before data overwrite starts. For example, if you set false and use INSERT OVERWRITE to write partition 2021-02 to a partitioned table that has the 2021-01 partition, this partition will be deleted.

If you set this parameter to true, DLI does not delete partitions before overwrite starts.

spark.sql.files.maxPartitionBytes

134217728

Maximum number of bytes to be packed into a single partition when a file is read.

spark.sql.badRecordsPath

-

Path of bad records.

spark.sql.legacy.correlated.scalar.query.enabled

false

  • If set to true:
    • When there is no duplicate data in a subquery, executing a correlated subquery does not require deduplication from the subquery's result.
    • If there is duplicate data in a subquery, executing a correlated subquery will result in an error. To resolve this, the subquery's result must be deduplicated using functions such as max() or min().
  • If set to false:

    Regardless of whether there is duplicate data in a subquery, executing a correlated subquery requires deduplicating the subquery's result using functions such as max() or min(). Otherwise, an error will occur.

spark.sql.keep.distinct.expandThreshold

-

  • Parameter description:

    When running queries with multidimensional analysis that include the count(distinct) function using the cube structure in Spark, the typical execution plan involves using the expand operator. However, this operation can cause query inflation. To avoid this issue, you are advised to configure the following settings:

    • spark.sql.keep.distinct.expandThreshold:

      Default value: -1, indicating that Spark's default expand operator is used.

      Setting the parameter to a specific value, such as 512, defines the threshold for query inflation. If the threshold is exceeded, the count(distinct) function will use the distinct aggregation operator to execute the query instead of the expand operator.

    • spark.sql.distinct.aggregator.enabled: whether to forcibly use the distinct aggregation operator. If set to true, spark.sql.keep.distinct.expandThreshold is not used.
  • Use case: Queries with multidimensional analysis that use the cube structure and may include multiple count(distinct) functions, as well as the cube or rollup operator.
  • Example of a typical use case:
    SELECT a1, a2, count(distinct b), count(distinct c) FROM test_distinct group by a1, a2 with cube

spark.sql.distinct.aggregator.enabled

false