Configuring Structured Streaming to Use RocksDB for State Store
Scenarios
State information is stored in the default HDFS-backed State Store. As the volume of state data grows, the JVM garbage collector (GC) must manage an increasing number of objects. This leads to longer GC cycles and higher memory overhead, which can significantly degrade performance. You can configure the spark.sql.streaming.stateStore.providerClass parameter to use RocksDB as the state backend.
RocksDB is an embedded key-value storage engine that stores data on local disks and is optimized for fast read and write operations RocksDB offers highly customizable memory management and compression algorithms that can be tuned for different application scenarios. It is widely adopted for storing large volumes of structured and semi-structured data and can efficiently process large-scale state data.
Notes and Constraints
This section applies only to MRS 3.3.0-LTS or later.
Parameters
- Install the Spark client.
For details, see Installing a Client.
- Log in to the Spark client node as the client installation user.
Modify the following parameters in the {Client installation directory}/Spark/spark/conf/spark-defaults.conf file on the Spark client.
Parameter
Description
Example Value
spark.sql.streaming.stateStore.providerClass
Class that manages state data for stateful stream queries. This class must be a subclass of StateStoreProvider and must have a zero argument constructor.
Set this parameter to org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider to select RocksDB as the state backend.
org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot