Configuring Spark Core Broadcasting Variables

Scenario

Broadcast distributes a dataset to each node so that Spark tasks can read it locally whenever they need it. Without broadcast, the dataset is serialized and shipped with every task that uses it, which is time-consuming and increases the size of each task.
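The difference can be illustrated with a minimal sketch. It assumes a local SparkContext; the names ClosureVsBroadcastSketch, run, lookup, and lookupBc are hypothetical and chosen only for illustration. Both variants compute the same result; they differ in how the array reaches the tasks. In local mode everything runs in one JVM, so the saving only becomes visible on a real cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch only: object and value names are assumptions, not part of the product.
object ClosureVsBroadcastSketch {
  def run(sc: SparkContext): (Double, Double) = {
    val lookup: Array[Long] = Array.tabulate(200)(i => i.toLong * 10L)
    val indices = sc.parallelize(0 until 200, numSlices = 8)

    // Without broadcast: `lookup` is captured by the closure and
    // serialized together with each of the 8 tasks.
    val viaClosure = indices.map(i => lookup(i)).sum()

    // With broadcast: each task carries only a small handle; the array
    // itself is fetched and cached locally where the tasks run.
    val lookupBc = sc.broadcast(lookup)
    val viaBroadcast = indices.map(i => lookupBc.value(i)).sum()

    (viaClosure, viaBroadcast)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ClosureVsBroadcastSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (viaClosure, viaBroadcast) = run(sc)
      println(s"closure=$viaClosure broadcast=$viaBroadcast")
    } finally sc.stop()
  }
}
```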

If every slice (partition) of a task uses the same dataset, broadcast that dataset to each node.
When joining a small table with a large table, broadcast the small table to each node. This avoids the shuffle operation and simplifies the join.
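The small-table join can be sketched as a map-side lookup. This is a minimal example under stated assumptions, not a definitive implementation: the object BroadcastJoinSketch, the helper joinRow, and the sample data are all hypothetical. The small table is held as an in-memory Map and broadcast; each row of the large RDD is then matched locally, so no shuffle is needed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Illustrative sketch only: names and sample data are assumptions.
object BroadcastJoinSketch {
  // Pure lookup: pair an order with its country name.
  // Rows without a match are dropped (inner-join semantics).
  def joinRow(countries: Map[String, String],
              order: (String, Double)): Option[(String, String, Double)] =
    countries.get(order._1).map(name => (order._1, name, order._2))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BroadcastJoinSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Small table: fits comfortably in memory on every node.
      val smallTable: Map[String, String] = Map("CN" -> "China", "DE" -> "Germany")
      // Large table: stays distributed as an RDD.
      val largeTable: RDD[(String, Double)] =
        sc.parallelize(Seq(("CN", 10.0), ("DE", 7.5), ("XX", 1.0)))

      val smallBc: Broadcast[Map[String, String]] = sc.broadcast(smallTable)

      // Each task reads the local copy of the small table; no shuffle occurs.
      val joined = largeTable.flatMap(order => joinRow(smallBc.value, order).iterator)
      joined.collect().foreach(println)
    } finally sc.stop()
  }
}
```

This approach only suits a small table that fits in the memory of each executor; if both tables are large, a regular join is the safer choice.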

Procedure

When developing an application, add the following code to broadcast the testArr data to each node:

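import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

def main(args: Array[String]): Unit = {
  ...
  val testArr: Array[Long] = new Array[Long](200)
  // Send testArr to each node once instead of shipping it with every task.
  val testBroadcast: Broadcast[Array[Long]] = sc.broadcast(testArr)
  val resultRdd: RDD[Long] = inputRdd.map(input => handleData(testBroadcast, input))
  ...
}

// Must return Long so that the map above produces an RDD[Long].
def handleData(broadcast: Broadcast[Array[Long]], input: String): Long = {
  // Read the node-local copy of the broadcast array.
  val value = broadcast.value
  ...
}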
