Updated on 2024-08-10 GMT+08:00

Why the "Class Does not Exist" Error Is Reported While the SparkStreamingKafka Project Is Running?

Question

When the KafkaWordCount task (org.apache.spark.examples.streaming.KafkaWordCount) is submitted by running the spark-submit script, the log file shows that a Kafka-related class does not exist. The KafkaWordCount sample is provided by the Spark open-source community.

Answer

When Spark is deployed, the following jar packages are saved in the ${SPARK_HOME}/jars/streamingClient010 directory on the client and the ${BIGDATA_HOME}/FusionInsight_Spark2x_8.1.0.1/install/FusionInsight-Spark2x-3.1.1/spark/jars/streamingClient010 directory on the server.

  • kafka-clients-xxx.jar
  • kafka_2.12-xxx.jar
  • spark-streaming-kafka-0-10_2.12-3.1.1-hw-ei-311001-SNAPSHOT.jar
  • spark-token-provider-kafka-0-10_2.12-3.1.1-hw-ei-311001-SNAPSHOT.jar

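As a quick sanity check, you can list the client-side directory mentioned above to confirm that these dependencies are actually present (the path depends on your client installation, so SPARK_HOME must point to the Spark client directory):

```shell
# List the Kafka streaming dependency JARs shipped with the Spark client.
# SPARK_HOME is assumed to point to the Spark client installation directory.
ls ${SPARK_HOME}/jars/streamingClient010/
```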
Because $SPARK_HOME/jars/streamingClient010/* is not added to the classpath by default, you need to add it manually.

When submitting and running the application, add the following parameters to the command. For details, see Commissioning a Spark Application in a Linux Environment.

--jars $SPARK_CLIENT_HOME/jars/streamingClient010/kafka-clients-2.4.0.jar,$SPARK_CLIENT_HOME/jars/streamingClient010/kafka_2.12-*.jar,$SPARK_CLIENT_HOME/jars/streamingClient010/spark-streaming-kafka-0-10_2.12-3.1.1-hw-ei-311001-SNAPSHOT.jar

You can run the preceding command to submit both self-developed applications and sample projects.
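For illustration, a complete submission command might look like the following sketch. The master/deploy-mode options, the examples JAR path, and the application arguments are placeholders that depend on your environment and the sample's actual argument list; only the --jars value comes from this document:

```shell
# Hypothetical example: submit the KafkaWordCount sample with the Kafka
# streaming dependencies attached via --jars. Paths, versions, and the
# application arguments are environment-specific placeholders.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class org.apache.spark.examples.streaming.KafkaWordCount \
  --jars $SPARK_CLIENT_HOME/jars/streamingClient010/kafka-clients-2.4.0.jar,$SPARK_CLIENT_HOME/jars/streamingClient010/kafka_2.12-*.jar,$SPARK_CLIENT_HOME/jars/streamingClient010/spark-streaming-kafka-0-10_2.12-3.1.1-hw-ei-311001-SNAPSHOT.jar \
  /path/to/spark-examples.jar \
  <application arguments>
```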

To submit sample projects such as KafkaWordCount provided by the Spark open-source community, you need to add other parameters in addition to --jars. Otherwise, a ClassNotFoundException error occurs. The configurations in yarn-client and yarn-cluster modes are as follows:

  • yarn-client mode

    In the spark-defaults.conf configuration file on the client, add the path of the client dependency package, for example $SPARK_HOME/jars/streamingClient010/*, to the spark.driver.extraClassPath parameter (in addition to specifying --jars).

  • yarn-cluster mode

    Perform any one of the following configurations in addition to --jars:

    • In the configuration file spark-defaults.conf on the client, add the path of the server dependency package, for example, ${BIGDATA_HOME}/FusionInsight_Spark2x_8.1.0.1/install/FusionInsight-Spark2x-3.1.1/spark/jars/streamingClient010/*, to the spark.yarn.cluster.driver.extraClassPath parameter.
    • Delete the original-spark-examples_2.12-3.1.1-xxx.jar packages from all the server nodes.
    • In the spark-defaults.conf configuration file on the client, set the spark.driver.userClassPathFirst parameter to true (add the parameter if it does not exist).
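As a sketch, the corresponding spark-defaults.conf entries might look like the following. The example paths must be replaced with the actual absolute paths of your client and server installations (environment variables such as $SPARK_HOME are not expanded in this file), and if a parameter already has a value, append the new path with a colon separator instead of replacing it:

```
# yarn-client mode: the driver runs on the client, so use the client-side path.
spark.driver.extraClassPath  /opt/client/Spark2x/spark/jars/streamingClient010/*

# yarn-cluster mode (first option): the driver runs on a server node, so use
# the server-side path.
spark.yarn.cluster.driver.extraClassPath  /opt/Bigdata/FusionInsight_Spark2x_8.1.0.1/install/FusionInsight-Spark2x-3.1.1/spark/jars/streamingClient010/*

# yarn-cluster mode (third option): prefer user-supplied classes over the
# ones bundled with Spark.
spark.driver.userClassPathFirst  true
```

Note that /opt/client and /opt/Bigdata are assumed installation roots used here only for illustration.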