Configuring Spark Core Broadcast Variables
Scenario
Broadcasting a dataset to each node makes it available locally during Spark tasks, so the data does not have to be serialized and shipped with every task that needs it. This saves scheduling time and keeps task sizes small.
- If a dataset needs to be available to every task, broadcast it to each node once rather than serializing it into each task.
- When joining a small table with a large one, broadcast the small table to each node so the join can be performed locally, avoiding a shuffle and simplifying the join (see the sketch after this list).
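For reference, the following is a minimal sketch of such a map-side broadcast join. The object name BroadcastJoinSketch, the sample data, and the key/value layout are illustrative assumptions, not part of the procedure below.

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    // setMaster("local[2]") is only for local testing of this sketch.
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastJoinSketch").setMaster("local[2]"))

    // Small table: collected on the driver and broadcast to every node.
    val smallTable: Map[Long, String] = Map(1L -> "a", 2L -> "b")
    val smallBroadcast: Broadcast[Map[Long, String]] = sc.broadcast(smallTable)

    // Large table: stays distributed. Each task joins against its local copy
    // of the small table, so no shuffle is required.
    val largeRdd: RDD[(Long, Double)] = sc.parallelize(Seq((1L, 0.5), (2L, 1.5), (3L, 2.5)))
    val joined: RDD[(Long, (Double, String))] = largeRdd.flatMap { case (key, value) =>
      smallBroadcast.value.get(key).map(name => (key, (value, name)))
    }

    joined.collect().foreach(println)
    sc.stop()
  }
}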
Procedure
When developing an application, add the following code to broadcast the testArr data to each node:
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

def main(args: Array[String]) {
  ...
  val testArr: Array[Long] = new Array[Long](200)
  // Broadcast testArr so every executor holds one local copy.
  val testBroadcast: Broadcast[Array[Long]] = sc.broadcast(testArr)
  val resultRdd: RDD[Long] = inputRdd.map(input => handleData(testBroadcast, input))
  ...
}

// Reads the broadcast value locally on the executor; returns a Long so that
// the results can populate resultRdd.
def handleData(broadcast: Broadcast[Array[Long]], input: String): Long = {
  val value = broadcast.value
  ...
}
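When the broadcast data is no longer needed, it can be released through the standard Broadcast API. This optional cleanup step is not part of the procedure above:

// Remove cached copies of the broadcast data from the executors; the value
// is re-sent automatically if the broadcast variable is used again.
testBroadcast.unpersist()
// Or release all resources permanently; the broadcast variable must not be
// used after this call.
testBroadcast.destroy()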