Updated on 2025-08-22 GMT+08:00

Configuring Structured Streaming to Use RocksDB for State Store

Scenarios

State information is stored in the default HDFS-backed State Store. As the volume of state data grows, the JVM garbage collector (GC) must manage an increasing number of objects. This leads to longer GC cycles and higher memory overhead, which can significantly degrade performance. You can configure the spark.sql.streaming.stateStore.providerClass parameter to use RocksDB as the state backend.

RocksDB is an embedded key-value storage engine that stores data on local disks and is optimized for fast read and write operations RocksDB offers highly customizable memory management and compression algorithms that can be tuned for different application scenarios. It is widely adopted for storing large volumes of structured and semi-structured data and can efficiently process large-scale state data.

Notes and Constraints

This section applies only to MRS 3.3.0-LTS or later.

Parameters

  1. Install the Spark client.

    For details, see Installing a Client.

  2. Log in to the Spark client node as the client installation user.

    Modify the following parameters in the {Client installation directory}/Spark/spark/conf/spark-defaults.conf file on the Spark client.

    Parameter

    Description

    Example Value

    spark.sql.streaming.stateStore.providerClass

    Class that manages state data for stateful stream queries. This class must be a subclass of StateStoreProvider and must have a zero argument constructor.

    Set this parameter to org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider to select RocksDB as the state backend.

    org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider