Optimizing the Spark SQL Join Operation
Scenario
When two tables are joined in Spark SQL, the broadcast function (see section Using Broadcast Variables) can be used to broadcast small tables to each node. This minimizes shuffle operations and improves task execution efficiency.
The join operation refers to the inner join operation only.
Procedure
The following describes how to optimize the join operation in Spark SQL. Assume that both tables A and B have the name column. Join tables A and B as follows:
- Estimate the table sizes.
Estimate the table size based on the size of data loaded each time.
You can also check the table size in the directory of the Hive database. In the hive-site.xml configuration file of Spark, view the Hive database directory, which is /user/hive/warehouse by default.
<property> <name>hive.metastore.warehouse.dir</name> <value>/user/hive/warehouse</value> </property>
Run the hadoop command to check the size of the table. For example, run the following command to view the size of table A:
hadoop fs -du -s -h ${test.warehouse.dir}/a
The tables must meet the following requirements for broadcasting:
- At least one table is not empty.
- Tables must not be external tables.
- The storage mode of the tables must be textfile (default value), for example,
create table A( name string ) stored as textfile; or create table A( name string );
- Configure a threshold for automatic broadcast.
The threshold for triggering broadcast for a table is 10485760 (that is, 10 MB) in Spark. If either of the table sizes is smaller than 10 MB, skip this step.
Table 1 describes the parameter for configuring the automatic broadcast threshold.
Table 1 Parameter Parameter
Default Value
Description
spark.sql.autoBroadcastJoinThreshold
10485760
Specifies the maximum value for broadcast configuration when two tables are joined. If the table size is smaller than the parameter value, broadcast is performed. If the value is set to -1, broadcast is not performed.
For details, visit https://spark.apache.org/docs/latest/sql-programming-guide.html.
Configure the threshold for automatic broadcast as follows:
- Set spark.sql.autoBroadcastJoinThreshold in the spark-defaults.conf configuration file of Spark. The value of <size> varies with scenarios and must be greater than the size of at least one table.
spark.sql.autoBroadcastJoinThreshold = <size>
- Run the Hive command to set the threshold. Before joining the tables, run the following command:
SET spark.sql.autoBroadcastJoinThreshold=<size>
The value of <size> varies with scenarios and must be greater than the size of at least one table.
- Set spark.sql.autoBroadcastJoinThreshold in the spark-defaults.conf configuration file of Spark. The value of <size> varies with scenarios and must be greater than the size of at least one table.
- Join the tables.
In this example, the size of at least one table is smaller than the threshold.
If the sizes of both tables A and B are smaller than the threshold and the size of table A is smaller than that of table B, run the following command:
SELECT A.name FROM B JOIN A ON A.name = B.name;
If the size of table B is smaller than that of table A, run the following command:
SELECT A.name FROM A JOIN B ON A.name = B.name;
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot