Updated on 2022-06-01 GMT+08:00

Optimizing the Spark SQL Join Operation

Scenario

When two tables are joined in Spark SQL, the broadcast function (see section Using Broadcast Variables) can be used to broadcast small tables to each node. This minimizes shuffle operations and improves task execution efficiency.

The join operation refers to the inner join operation only.

Procedure

The following describes how to optimize the join operation in Spark SQL. Assume that both tables A and B have the name column. Join tables A and B as follows:

  1. Estimate the table sizes.

    Estimate the table size based on the size of data loaded each time.

    You can also check the table size in the directory of the Hive database. In the hive-site.xml configuration file of Spark, view the Hive database directory, which is /user/hive/warehouse by default.

    <property>
       <name>hive.metastore.warehouse.dir</name>
      <value>/user/hive/warehouse</value>  
    </property>

    Run the hadoop command to check the size of the table. For example, run the following command to view the size of table A:

    hadoop fs -du -s -h ${test.warehouse.dir}/a

    The tables must meet the following requirements for broadcasting:

    1. At least one table is not empty.
    2. Tables must not be external tables.
    3. The storage mode of the tables must be textfile (default value), for example,
      create table A( name string ) stored as textfile;
      or
      create table A( name string );
  2. Configure a threshold for automatic broadcast.

    The threshold for triggering broadcast for a table is 10485760 (that is, 10 MB) in Spark. If either of the table sizes is smaller than 10 MB, skip this step.

    Table 1 describes the parameter for configuring the automatic broadcast threshold.

    Table 1 Parameter

    Parameter

    Default Value

    Description

    spark.sql.autoBroadcastJoinThreshold

    10485760

    Specifies the maximum value for broadcast configuration when two tables are joined. If the table size is smaller than the parameter value, broadcast is performed. If the value is set to -1, broadcast is not performed.

    For details, visit https://spark.apache.org/docs/latest/sql-programming-guide.html.

    Configure the threshold for automatic broadcast as follows:

    • Set spark.sql.autoBroadcastJoinThreshold in the spark-defaults.conf configuration file of Spark. The value of <size> varies with scenarios and must be greater than the size of at least one table.
      spark.sql.autoBroadcastJoinThreshold = <size>
    • Run the Hive command to set the threshold. Before joining the tables, run the following command:
      SET spark.sql.autoBroadcastJoinThreshold=<size>

      The value of <size> varies with scenarios and must be greater than the size of at least one table.

  3. Join the tables.

    In this example, the size of at least one table is smaller than the threshold.

    If the sizes of both tables A and B are smaller than the threshold and the size of table A is smaller than that of table B, run the following command:

    SELECT A.name FROM B JOIN A ON A.name = B.name;

    If the size of table B is smaller than that of table A, run the following command:

    SELECT A.name FROM A JOIN B ON A.name = B.name;