
What Should I Do If the Structured Streaming Task Submission Method Is Changed?

Question

When submitting a structured streaming task, users need to use the --jars option to specify the Kafka JAR package path, for example, --jars /kafkadir/kafka-clients-x.x.x.jar,/kafkadir/kafka_2.11-x.x.x.jar. In the current version, however, additional configuration items are required; otherwise, an error is reported indicating that the class is not found.
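For reference, a submission in this style might look as follows; the class name com.example.StructuredStreamingApp and the application JAR name are placeholders, not part of the original example:

    # Earlier submission style: only --jars points to the Kafka packages
    spark-submit --master yarn \
      --jars /kafkadir/kafka-clients-x.x.x.jar,/kafkadir/kafka_2.11-x.x.x.jar \
      --class com.example.StructuredStreamingApp \
      StructuredStreamingApp.jar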

Answer

In the current version, the Spark kernel depends on the Kafka JAR package that structured streaming uses. Therefore, when you submit a structured streaming task, you need to add the Kafka JAR package path to the library directory of the task's driver to ensure that the driver can properly load the Kafka classes.
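Before modifying the submission command, you can check the classpath currently configured on the client by inspecting the configuration file, for example (assuming the client is installed under /opt/client):

    # Show the existing extraClassPath values on the client
    grep extraClassPath /opt/client/Spark2x/spark/conf/spark-defaults.conf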

Solution

  1. When submitting a structured streaming task in Yarn-client mode, perform the following additional operations:

    Copy the value of spark.driver.extraClassPath from the spark-defaults.conf file in the Spark client directory and append the Kafka JAR package path to it. When submitting the structured streaming task, pass the combined value using the --conf option. For example, if the Kafka JAR package path is /kafkadir, add --conf spark.driver.extraClassPath=/opt/client/Spark2x/spark/conf/:/opt/client/Spark2x/spark/jars/*:/opt/client/Spark2x/spark/x86/*:/kafkadir/* when submitting the task, as shown in the example below.
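    A complete submission command might look as follows. This is a sketch: the client installation path /opt/client, the class name com.example.StructuredStreamingApp, and the application JAR name are placeholders for your own values.

        # Yarn-client mode; class and JAR names are placeholders
        spark-submit --master yarn --deploy-mode client \
          --jars /kafkadir/kafka-clients-x.x.x.jar,/kafkadir/kafka_2.11-x.x.x.jar \
          --conf spark.driver.extraClassPath=/opt/client/Spark2x/spark/conf/:/opt/client/Spark2x/spark/jars/*:/opt/client/Spark2x/spark/x86/*:/kafkadir/* \
          --class com.example.StructuredStreamingApp \
          StructuredStreamingApp.jar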

  2. When submitting a structured streaming task in Yarn-cluster mode, perform the following additional operations:

    Copy the value of spark.yarn.cluster.driver.extraClassPath from the spark-defaults.conf file in the Spark client directory and append the relative paths of the Kafka JAR packages to it. When submitting the structured streaming task, pass the combined value using the --conf option. For example, if the Kafka JAR packages are kafka-clients-x.x.x.jar and kafka_2.11-x.x.x.jar, add --conf spark.yarn.cluster.driver.extraClassPath=/home/huawei/Bigdata/common/runtime/security:./kafka-clients-x.x.x.jar:./kafka_2.11-x.x.x.jar when submitting the task, as shown in the example below.
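    A complete submission command might look as follows. This is a sketch with placeholder class and JAR names; the --jars option ships the Kafka packages into the container working directory, which is why the extraClassPath entries use relative (./) paths.

        # Yarn-cluster mode; class and JAR names are placeholders
        spark-submit --master yarn --deploy-mode cluster \
          --jars /kafkadir/kafka-clients-x.x.x.jar,/kafkadir/kafka_2.11-x.x.x.jar \
          --conf spark.yarn.cluster.driver.extraClassPath=/home/huawei/Bigdata/common/runtime/security:./kafka-clients-x.x.x.jar:./kafka_2.11-x.x.x.jar \
          --class com.example.StructuredStreamingApp \
          StructuredStreamingApp.jar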

  3. In the current version, Spark structured streaming does not support Kafka versions earlier than Kafka 2.x. In upgrade scenarios, use the client of the earlier version.