Configuring Structured Streaming to Use RocksDB for State Store

Scenarios

State information is stored in the default HDFS-backed State Store. As the volume of state data grows, the JVM garbage collector (GC) must manage an increasing number of objects. This leads to longer GC cycles and higher memory overhead, which can significantly degrade performance. You can configure the spark.sql.streaming.stateStore.providerClass parameter to use RocksDB as the state backend.

RocksDB is an embedded key-value storage engine that stores data on local disks and is optimized for fast read and write operations RocksDB offers highly customizable memory management and compression algorithms that can be tuned for different application scenarios. It is widely adopted for storing large volumes of structured and semi-structured data and can efficiently process large-scale state data.

Notes and Constraints

This section applies only to MRS 3.3.0-LTS or later.

Parameters

Install the Spark client.

For details, see Installing a Client.

Modify the following parameters in the {Client installation directory}/Spark/spark/conf/spark-defaults.conf file on the Spark client.

Parameter	Description	Example Value
spark.sql.streaming.stateStore.providerClass	Class that manages state data for stateful stream queries. This class must be a subclass of StateStoreProvider and must have a zero argument constructor. Set this parameter to org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider to select RocksDB as the state backend.	org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider

Parameter

Description

Example Value

spark.sql.streaming.stateStore.providerClass

Class that manages state data for stateful stream queries. This class must be a subclass of StateStoreProvider and must have a zero argument constructor.

Set this parameter to org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider to select RocksDB as the state backend.

org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider

Parent topic: Spark Streaming Enterprise-Class Enhancements

Previous topic: Configuring Reliability of Interconnection Between Spark Streaming and Kafka

Next topic: Spark Core Performance Tuning