Using Broadcast Variables
Scenario
Broadcast distributes data sets to each node. It allows data to be obtained locally when a data set is needed during a Spark task. If broadcast is not used, data serialization will be scheduled to tasks each time when a task requires data sets. It is time-consuming and makes the task get bigger.
- If a data set will be used by each slice of a task, broadcast the data set to each node.
- When small and big tables need to be joined, broadcast small tables to each node. This eliminates the shuffle operation, changing the join operation into a common operation.
Procedure
When developing an application, add the following code to broadcast the testArr data to each node:
def main(args: Array[String]) {
...
val testArr: Array[Long] = new Array[Long](200)
val testBroadcast: Broadcast[Array[Long]] = sc.broadcast(testArr)
val resultRdd: RDD[Long] = inpputRdd.map(input => handleData(testBroadcast, input))
...
}
def handleData(broadcast: Broadcast[Array[Long]], input: String) {
val value = broadcast.value
...
} Last Article: Setting a Degree of Parallelism
Next Article: Using the External Shuffle Service to Improve Performance
Did this article solve your problem?
Thank you for your score!Your feedback would help us improve the website.