Configuring Reliability of Interconnection Between Spark Streaming and Kafka

Scenarios

When a Spark Streaming application that is connected to Kafka is restarted, it resumes reading data from Kafka based on the last topic offset it read and the current latest offset of the topic.

If the leader of a Kafka topic fails and its offset differs greatly from that of the Kafka follower, the follower and leader are switched over after the Kafka service is restarted. As a result, the offset of the topic decreases after the restart.

If the Spark Streaming application keeps running, the start position for reading Kafka data becomes greater than the end position because the offset of the topic in Kafka has decreased. As a result, the application cannot read data from Kafka and reports an error.
If the Spark Streaming application is stopped before the Kafka service is restarted and is started again afterwards, the application is restored from the checkpoint. It uses the offset position it recorded before it was stopped as the reference for reading subsequent data. If the Kafka offset has decreased (for example, from 100,000 to 10,000), Spark Streaming consumes data only after the offset of the Kafka leader increases to 100,000 again. As a result, newly sent data whose offset is between 10,000 and 100,000 is lost.
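
The data loss in the checkpoint-restore scenario can be illustrated with a minimal Python sketch. It is not part of Spark or Kafka; the function name `skipped_offsets` is hypothetical, and the sketch assumes that the lost range is half-open (it starts at the decreased offset and ends just before the checkpointed offset).

```python
def skipped_offsets(checkpoint_offset: int, offset_after_restart: int) -> range:
    """Offsets written after the Kafka restart that a checkpoint-restored
    reader never consumes (assumed half-open range, for illustration only)."""
    if offset_after_restart >= checkpoint_offset:
        # The topic offset did not decrease, so nothing is skipped.
        return range(0)
    # The reader waits at checkpoint_offset, so everything written
    # below that position after the restart is never read.
    return range(offset_after_restart, checkpoint_offset)


# Example from the text: the offset drops from 100,000 to 10,000.
lost = skipped_offsets(checkpoint_offset=100_000, offset_after_restart=10_000)
print(lost.start, lost.stop, len(lost))  # prints: 10000 100000 90000
```

Under this assumption, about 90,000 newly sent records would be skipped before the application starts consuming again.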

To resolve the preceding problems, you can enable the reliability function for Kafka connected to Spark Streaming. After the function is enabled, the application behaves as follows:

When the Spark Streaming application is running and the offset of a topic in Kafka decreases, the start position for reading Kafka data will be set to the latest offset of the topic in Kafka, and the application will continue to read subsequent data.
If a task has been generated but not yet scheduled, and the read Kafka offset is higher than the latest offset of the topic in Kafka, the task will fail to execute.

When a large number of tasks fail on a specific executor, Spark adds that executor to a blacklist to prevent further tasks from being scheduled and executed on that node. To avoid this behavior, you can set spark.blacklist.enabled to false to disable the blacklist function, which is enabled by default.
If the offset of a topic in Kafka decreases and the Spark Streaming application is restarted to restore unfinished tasks, any task whose Kafka offset range is greater than the latest offset of the topic in Kafka is discarded.
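
The three behaviors above can be summarized in a short Python sketch. This is only a model of the rules as described here, not the actual implementation: the names `OffsetRange`, `next_start`, `pending_task_outcome`, and `restored_task_outcome` are hypothetical, and the sketch assumes that a task counts as being beyond the topic when the end of its offset range exceeds the latest offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRange:
    """Offsets a task intends to read: from start up to (not including) until."""
    start: int
    until: int


def next_start(last_read: int, latest: int) -> int:
    """Running application: if the topic offset fell below the last read
    position, restart reading from the latest offset of the topic."""
    return latest if latest < last_read else last_read


def pending_task_outcome(task: OffsetRange, latest: int) -> str:
    """Task generated but not yet scheduled: it fails if it points past the topic."""
    return "fail" if task.until > latest else "run"


def restored_task_outcome(task: OffsetRange, latest: int) -> str:
    """Unfinished task restored after a restart: it is discarded if it points past the topic."""
    return "discard" if task.until > latest else "run"


# The topic offset decreased from 100,000 to 10,000.
print(next_start(last_read=100_000, latest=10_000))                   # prints: 10000
print(pending_task_outcome(OffsetRange(99_000, 100_000), 10_000))     # prints: fail
print(restored_task_outcome(OffsetRange(99_000, 100_000), 10_000))    # prints: discard
```

Separating the three cases makes it easier to see why failed pending tasks can trigger executor blacklisting, while restored tasks are dropped without being executed.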

Notes and Constraints

If the state function is used in the Spark Streaming application, do not enable Kafka's reliability function.

Configuration

Install the Spark client.

For details, see Installing a Client.

Modify the following parameters in the {Client installation directory}/Spark/spark/conf/spark-defaults.conf file on the Spark client.

**Table 1** Parameter description
| Parameter | Description | Example Value |
| --- | --- | --- |
| spark.streaming.Kafka.reliability | Indicates whether to enable the reliability function for Kafka connected to Spark Streaming. true: The reliability function is enabled. false: The reliability function is disabled. | true |
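
As an illustration, the relevant entries in spark-defaults.conf might look like the following fragment. The first entry enables the reliability function described in the table. The second entry is optional and is shown only on the assumption that you also want to disable executor blacklisting, as discussed in the scenarios above.

```properties
# Enable the reliability function for Kafka connected to Spark Streaming.
spark.streaming.Kafka.reliability true

# Optional (assumption: you want to avoid executors being blacklisted
# when tasks fail because of decreased Kafka offsets).
spark.blacklist.enabled false
```

Properties in this file are applied when the application is submitted, so a running Spark Streaming application typically needs to be resubmitted for the change to take effect.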