What Should I Do If the Method of Submitting Structured Streaming Tasks Is Changed?
Question
Previously, when submitting a Structured Streaming task, you only needed to specify the Kafka JAR package paths with the --jars option, for example, --jars /kafkadir/kafka-clients-x.x.x.jar,/kafkadir/kafka_2.11-x.x.x.jar. In the current version, however, additional configuration items are required; otherwise, an error is reported indicating that the class cannot be found.
Answer
To ensure proper loading of the Kafka package, the driver of a structured streaming task must have the Kafka JAR package path added to its library directory. This is because the Spark kernel of the current version relies on the Kafka JAR package for structured streaming.
Solution
- When submitting a Structured Streaming task in yarn-client mode, perform the following additional operation:
Copy the value of spark.driver.extraClassPath from the spark-defaults.conf file in the Spark client directory and append the Kafka JAR package path to it. Then pass the combined value with --conf when submitting the Structured Streaming task. For example, if the Kafka JAR package path is /kafkadir, add --conf spark.driver.extraClassPath=/opt/client/Spark2x/spark/conf/:/opt/client/Spark2x/spark/jars/*:/opt/client/Spark2x/spark/x86/*:/kafkadir/* to the submission command.
- When submitting a Structured Streaming task in yarn-cluster mode, perform the following additional operation:
Copy the value of spark.yarn.cluster.driver.extraClassPath from the spark-defaults.conf file in the Spark client directory and append the relative paths of the Kafka JAR packages to it. Then pass the combined value with --conf when submitting the Structured Streaming task. For example, if the Kafka JAR packages are kafka-clients-x.x.x.jar and kafka_2.11-x.x.x.jar, add --conf spark.yarn.cluster.driver.extraClassPath=/home/huawei/Bigdata/common/runtime/security:./kafka-clients-x.x.x.jar:./kafka_2.11-x.x.x.jar to the submission command.
- In the current version, Spark Structured Streaming does not support Kafka versions earlier than 2.x. In upgrade scenarios, use the client of the earlier version.
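The steps above can be sketched as the following submission commands. This is a sketch only: the client installation path /opt/client, the Kafka JAR directory /kafkadir, the application JAR name your_streaming_app.jar, and the x.x.x version placeholders are assumptions that depend on your deployment; copy the actual base classpath values from your own spark-defaults.conf.

```shell
#!/bin/sh
# Assumed locations -- adjust to your deployment.
KAFKA_DIR=/kafkadir
KAFKA_JARS=${KAFKA_DIR}/kafka-clients-x.x.x.jar,${KAFKA_DIR}/kafka_2.11-x.x.x.jar

# yarn-client mode: append the Kafka JAR directory to the value copied
# from spark.driver.extraClassPath in spark-defaults.conf.
spark-submit --master yarn --deploy-mode client \
  --jars "${KAFKA_JARS}" \
  --conf "spark.driver.extraClassPath=/opt/client/Spark2x/spark/conf/:/opt/client/Spark2x/spark/jars/*:/opt/client/Spark2x/spark/x86/*:${KAFKA_DIR}/*" \
  your_streaming_app.jar

# yarn-cluster mode: append the relative Kafka JAR names to the value
# copied from spark.yarn.cluster.driver.extraClassPath. The JARs listed
# in --jars are shipped to the driver container, so relative paths work.
spark-submit --master yarn --deploy-mode cluster \
  --jars "${KAFKA_JARS}" \
  --conf "spark.yarn.cluster.driver.extraClassPath=/home/huawei/Bigdata/common/runtime/security:./kafka-clients-x.x.x.jar:./kafka_2.11-x.x.x.jar" \
  your_streaming_app.jar
```

The --conf values are quoted so that the local shell does not expand the * wildcards before they reach Spark.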