
Restarting an MRS Cluster Component

While an MRS cluster is running, restart its components if you have modified component settings, encountered infrastructure resource faults, or detected service process errors.

Components in an MRS cluster support both standard restart and rolling restart.

  • Standard restart: restarts all components or instances in the cluster concurrently, which may interrupt services.
  • Rolling restart: restarts components or instances in batches to minimize or avoid service interruption. Compared with a standard restart, a rolling restart takes longer and may affect service throughput and performance. For instances in active/standby mode, the standby instance is restarted first, followed by the active instance.

    Table 4 describes the impact on services when a rolling restart is performed on components.

Restarting a cluster will stop the cluster components from providing services, which adversely affects the running of upper-layer applications or jobs. You are advised to perform rolling restarts during off-peak hours.

For details about whether services and instances in the current MRS cluster support rolling restart and the rolling restart parameters, see Component Restart Reference Information.

Notes and Constraints

  • Perform a rolling restart during off-peak hours.
    • If the Kafka service throughput exceeds 100 MB/s during a rolling restart, the restart will fail.
    • Before an HBase rolling restart, check the current request load. If each RegionServer handles more than 10,000 requests per second on the native interface, increase the number of handles to prevent restart failures caused by overload. (A small pre-check sketch follows this list.)
  • Preferentially perform a rolling instance or service restart and select Only restart instances whose configurations have expired.
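
The following minimal Python sketch illustrates these pre-checks. The 100 MB/s and 10,000 requests/s thresholds come from the constraints above; how you collect the metrics (for example, from Manager monitoring) is not prescribed here, and the sample figures are illustrative only.

    # Minimal pre-check sketch for the constraints above. The thresholds come
    # from this section; the metric values passed in are illustrative.
    KAFKA_MAX_THROUGHPUT_MB_S = 100           # rolling restart fails above this rate
    REGIONSERVER_MAX_REQUESTS_PER_S = 10_000  # increase handles above this rate

    def rolling_restart_precheck(kafka_throughput_mb_s, regionserver_request_rates):
        # Returns a list of warnings; an empty list means both checks passed.
        warnings = []
        if kafka_throughput_mb_s > KAFKA_MAX_THROUGHPUT_MB_S:
            warnings.append("Kafka throughput %.1f MB/s exceeds 100 MB/s; "
                            "the rolling restart will fail." % kafka_throughput_mb_s)
        for host, rate in regionserver_request_rates.items():
            if rate > REGIONSERVER_MAX_REQUESTS_PER_S:
                warnings.append("RegionServer %s handles %d requests/s; "
                                "increase the handle count first." % (host, rate))
        return warnings

    # Example with made-up metrics:
    for warning in rolling_restart_precheck(120.0, {"node-1": 12500, "node-2": 8000}):
        print(warning)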

Impact on the System

  • If the number of Core nodes in a cluster is less than six, services may be affected for a short period of time.
  • Table 4 describes the impact of a component rolling restart.

Prerequisites

  • The IAM users have been synchronized in advance. You can do this by clicking Synchronize next to IAM User Sync on the Dashboard page of the cluster details.
  • You have logged in to MRS Manager. For how to log in, see Accessing MRS Manager.

Restarting an MRS Cluster Component

  1. Access the MRS cluster component management page.

    • Log in to the MRS console and click the cluster name to go to the cluster details page. Click Components.
    • If you are using the Manager of MRS 3.x and later versions, log in to Manager and choose Cluster > Services.
    • If you are using the Manager of MRS 2.x and earlier versions, log in to Manager and click Services.

  2. Click the name of the target component to go to the details page.
  3. On the service details page, expand the More drop-down list and select Restart Service or Service Rolling Restart.
  4. Enter the user password (required when you perform the operation on Manager), confirm the operation impact, and click OK to restart the component.

    If you select a rolling restart, set the parameters listed in Table 1. (Required parameters may vary by version; set them based on the actual GUI.)

    Figure 1 Performing a rolling restart on Manager
    Table 1 Rolling restart configuration parameters

    Parameter: Restart only instances with expired configurations
    Example Value: -
    Description: Whether to restart only the modified instances in a cluster. The name of this parameter may be different in other versions.

    Parameter: Enable rack strategy
    Example Value: -
    Description: Whether to enable the concurrent rack rolling restart strategy. This parameter takes effect only for roles that meet the rack rolling restart strategy (the roles support rack awareness, and instances of the roles belong to two or more racks). It can be set only when a rolling restart is performed on HDFS or YARN.

    Parameter: Data Nodes to Be Batch Restarted
    Example Value: 1
    Description: Number of instances that are restarted in each batch when the batch rolling restart strategy is used. The default value is 1.
      • This parameter is valid only when the batch rolling restart strategy is used and the instance type is DataNode.
      • This parameter is invalid when the rack strategy is enabled. In this case, the cluster uses the maximum number of instances (20 by default) configured in the rack strategy as the maximum number of instances that are concurrently restarted in a rack.
      • This parameter can be set only when a rolling restart is performed on some components, such as HDFS, HBase, YARN, Kafka, and Flume. The actual value displayed on the GUI prevails.
      • The number of HBase RegionServer nodes that can be concurrently restarted during a rolling restart cannot be manually configured. It is automatically adjusted based on the total number of RegionServer nodes: if there are fewer than 30 nodes, one node is restarted at a time; if there are fewer than 300 nodes, two nodes are restarted at a time; if there are 300 or more nodes, 1% of the total node count is restarted at a time. (A sketch that applies these rules to estimate the restart duration follows the procedure.)

    Parameter: Batch Interval
    Example Value: 10
    Description: Interval (in seconds) between restarting two batches of instances during a rolling restart. The default value is 0. Setting a batch interval improves the stability of big data component processes during the rolling restart. You are advised to set this parameter to a non-default value, for example, 10.

    Parameter: Decommissioning Timeout Interval
    Example Value: 1800
    Description: How long (in seconds) a role instance waits to be terminated after being marked for decommissioning during a rolling restart. This parameter can be set only when a rolling restart is performed on Hive or Spark. Some roles (such as HiveServer and JDBCServer) stop providing services before the rolling restart. Stopped instances cannot accept new client connections, and existing connections are completed after a period of time. An appropriate timeout interval helps ensure service continuity.

    Parameter: Batch Fault Tolerance Threshold
    Example Value: 0
    Description: Number of batches that are allowed to fail during a rolling restart. The default value is 0, which means the rolling restart task ends as soon as any batch of instances fails to restart.

  5. After the restart or rolling restart of a component is successful, Running Status of the component is Normal and Configure Status is Synchronized.

    Figure 2 MRS cluster components
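
For planning purposes, the following Python sketch applies the batch rules from Table 1: it implements the documented per-batch concurrency rule for HBase RegionServers and gives a rough lower bound for the duration of a batch rolling restart based on the instances per batch, the batch interval, and an assumed per-instance restart time (Table 3 gives reference values). The numbers in the example are illustrative only.

    import math

    def regionservers_per_batch(total_regionservers):
        # Per-batch concurrency rule for HBase RegionServers from Table 1:
        # fewer than 30 nodes -> 1, fewer than 300 -> 2, otherwise 1% of the total.
        if total_regionservers < 30:
            return 1
        if total_regionservers < 300:
            return 2
        return max(1, total_regionservers // 100)

    def estimate_rolling_restart_minutes(instance_count, instances_per_batch,
                                         batch_interval_s, per_instance_restart_min):
        # Rough lower bound: batches run one after another, instances within a
        # batch restart in parallel, and the batch interval separates batches.
        batches = math.ceil(instance_count / instances_per_batch)
        return batches * per_instance_restart_min + (batches - 1) * batch_interval_s / 60

    # Example: 100 DataNodes, 1 instance per batch, 10 s batch interval,
    # about 2 min per DataNode restart (see Table 3).
    print(estimate_rolling_restart_minutes(100, 1, 10, 2))  # 216.5 minutes
    print(regionservers_per_batch(350))                     # 3 RegionServers per batch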

Component Restart Reference Information

Table 2 provides services and instances that support or do not support rolling restart in the MRS cluster.

Table 2 Services and instances that support or do not support rolling restart

Service       Instances                                          Rolling Restart
Alluxio       AlluxioJobMaster, AlluxioMaster                    Yes
ClickHouse    ClickHouseServer, ClickHouseBalancer               Yes
CDL           CDLConnector, CDLService                           Yes
Flink         FlinkResource, FlinkServer                         No
Flume         Flume, MonitorServer                               Yes
Guardian      TokenServer                                        Yes
HBase         HMaster, RegionServer, ThriftServer, RESTServer    Yes
HetuEngine    HSBroker, HSConsole, HSFabric, QAS                 Yes
HDFS          NameNode, Zkfc, JournalNode, HttpFS, DataNode      Yes
Hive          MetaStore, WebHCat, HiveServer                     Yes
Hue           Hue                                                No
Impala        Impalad, StateStore, Catalog                       No
IoTDB         IoTDBServer                                        Yes
Kafka         Broker                                             Yes
              KafkaUI                                            No
Kudu          KuduTserver, KuduMaster                            Yes
Loader        Sqoop                                              No
MapReduce     JobHistoryServer                                   Yes
Oozie         Oozie                                              No
Presto        Coordinator, Worker                                Yes
Ranger        RangerAdmin, UserSync, TagSync                     Yes
Spark         JobHistory, JDBCServer, SparkResource              Yes
Storm         Nimbus, UI, Supervisor, Logviewer                  Yes
Tez           TezUI                                              No
YARN          ResourceManager, NodeManager                       Yes
ZooKeeper     Quorumpeer                                         Yes
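
If you script restart planning, the support matrix in Table 2 can be encoded as a simple lookup. The Python sketch below is only an illustrative encoding of the table; Kafka is the one service whose support differs by instance.

    # Illustrative Python encoding of Table 2 for restart automation scripts.
    ROLLING_RESTART_SUPPORT = {
        "Alluxio": True, "ClickHouse": True, "CDL": True, "Flink": False,
        "Flume": True, "Guardian": True, "HBase": True, "HetuEngine": True,
        "HDFS": True, "Hive": True, "Hue": False, "Impala": False,
        "IoTDB": True, "Kafka": {"Broker": True, "KafkaUI": False},
        "Kudu": True, "Loader": False, "MapReduce": True, "Oozie": False,
        "Presto": True, "Ranger": True, "Spark": True, "Storm": True,
        "Tez": False, "YARN": True, "ZooKeeper": True,
    }

    def supports_rolling_restart(service, instance=None):
        # Returns True if the service (or the specific instance, for Kafka)
        # supports rolling restart according to Table 2.
        entry = ROLLING_RESTART_SUPPORT.get(service, False)
        if isinstance(entry, dict):
            return entry.get(instance, False)
        return bool(entry)

    print(supports_rolling_restart("Kafka", "KafkaUI"))  # False
    print(supports_rolling_restart("HDFS"))              # True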

Table 3 lists the reference restart and startup durations of each service.

Table 3 Restart duration for reference

Service: IoTDB
Restart Duration: 3 min
Startup Duration: IoTDBServer: 3 min

Service: CDL
Restart Duration: 2 min
Startup Duration: CDLConnector: 1 min; CDLService: 1 min

Service: ClickHouse
Restart Duration: 4 min
Startup Duration: ClickHouseServer: 2 min; ClickHouseBalancer: 2 min

Service: HDFS
Restart Duration: 10 min + x
Startup Duration: NameNode: 4 min + x; DataNode: 2 min; JournalNode: 2 min; Zkfc: 2 min
Remarks: x indicates the NameNode metadata loading duration. It takes about 2 minutes to load 10,000,000 files. For example, x is 10 minutes for 50 million files. The startup duration fluctuates with the reporting of DataNode data blocks.

Service: YARN
Restart Duration: 5 min + x
Startup Duration: ResourceManager: 3 min + x; NodeManager: 2 min
Remarks: x indicates the time required for restoring ResourceManager reserved tasks. It takes about 1 minute to restore 10,000 reserved tasks.

Service: MapReduce
Restart Duration: 2 min + x
Startup Duration: JobHistoryServer: 2 min + x
Remarks: x indicates the scanning duration of historical tasks. It takes about 2.5 minutes to scan 100,000 tasks.

Service: ZooKeeper
Restart Duration: 2 min + x
Startup Duration: quorumpeer: 2 min + x
Remarks: x indicates the duration for loading znodes. It takes about 1 minute to load 1 million znodes.

Service: Hive
Restart Duration: 3.5 min
Startup Duration: HiveServer: 3 min; MetaStore: 1 min 30s; WebHCat: 1 min; Hive service: 3 min

Service: Spark2x
Restart Duration: 5 min
Startup Duration: JobHistory2x: 5 min; SparkResource2x: 5 min; JDBCServer2x: 5 min

Service: Flink
Restart Duration: 4 min
Startup Duration: FlinkResource: 1 min; FlinkServer: 3 min

Service: Kafka
Restart Duration: 2 min + x
Startup Duration: Broker: 1 min + x; Kafka UI: 5 min
Remarks: x indicates the data restoration duration. It takes about 2 minutes to start 20,000 partitions for a single instance.

Service: Storm
Restart Duration: 6 min
Startup Duration: Nimbus: 3 min; UI: 1 min; Supervisor: 1 min; Logviewer: 1 min

Service: Flume
Restart Duration: 3 min
Startup Duration: Flume: 2 min; MonitorServer: 1 min

Service: Doris
Restart Duration: 2 min
Startup Duration: FE: 1 min; BE: 1 min; DBroker: 1 min
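
The x terms in the Remarks column scale linearly with cluster metadata. The following Python sketch applies those formulas to estimate startup durations; the file, task, znode, and partition counts in the example are illustrative, not measured values.

    # Applying the "x" formulas from the Remarks column of Table 3. The scaling
    # factors come from the table; the workload figures below are illustrative.
    def namenode_startup_min(file_count):
        # NameNode: 4 min base + about 2 min per 10,000,000 files.
        return 4 + 2 * file_count / 10_000_000

    def resourcemanager_startup_min(reserved_tasks):
        # ResourceManager: 3 min base + about 1 min per 10,000 reserved tasks.
        return 3 + reserved_tasks / 10_000

    def zookeeper_startup_min(znode_count):
        # quorumpeer: 2 min base + about 1 min per 1,000,000 znodes.
        return 2 + znode_count / 1_000_000

    def kafka_broker_startup_min(partitions_on_broker):
        # Broker: 1 min base + about 2 min per 20,000 partitions on one instance.
        return 1 + 2 * partitions_on_broker / 20_000

    print(namenode_startup_min(50_000_000))  # 14.0 (4 min base + 10 min metadata load)
    print(zookeeper_startup_min(3_000_000))  # 5.0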

Table 4 describes the impact on the system during the rolling restart of components and instances.

Table 4 Impact on the system

Component: ClickHouse
Service Interruption: If the submitted workloads can be completed within the timeout period (30 minutes by default), there is no impact during the rolling restart.
Impact on System: Nodes undergoing a rolling restart reject all new requests, which affects single-replica services, ON CLUSTER operations, and workloads that depend on the instances being restarted. If a request that is being executed does not complete within the timeout period (30 minutes by default), the request fails.

Component: DBService
Service Interruption: All services are normal during the rolling restart.
Impact on System: During the rolling restart, alarms indicating a heartbeat interruption between the active and standby DBService nodes may be reported.

Component: Doris
Service Interruption: Doris services are not interrupted during the rolling restart only when the following conditions are met:
  • ELB or DBalancer is used to connect to Doris for job submission.
  • Tables being read or written have multiple replicas (three replicas are recommended).
  • The submitted workloads are completed within the timeout period (30 minutes by default); otherwise, they fail.
Impact on System: During the rolling restart, the total resources decrease, reducing the maximum memory and CPU resources that can be used by jobs. In extreme cases, jobs may fail due to insufficient resources. If a job times out (30 minutes by default), retry the job.

Component: Flink
Service Interruption: All services are normal during the rolling restart.
Impact on System: The FlinkServer UI cannot be accessed during the rolling restart.

Component: Flume
Service Interruption: To prevent service interruptions and data loss, the following conditions must be met:
  • A persistent cache, for example File Channel, is used.
  • Flume cascading is configured.
  • The Flume client sink supports failover or load balancing.
Impact on System:
  • Data loss may occur if channels do not have a persistent cache.
  • Performance deteriorates for a short period of time during failover.

Component: Guardian
Service Interruption: All services are normal during the rolling restart.
Impact on System: None

Component: HBase
Service Interruption: HBase read and write services are normal during the rolling restart.
Impact on System:
  • Real-time read and write performance may deteriorate during the RegionServer rolling restart.
  • During the HMaster rolling restart, services other than the real-time read and write services (excluding BulkLoad) are affected, including:
    Creating a table (create)
    Creating a namespace (create_namespace)
    Disabling a table (disable, disable_all)
    Re-creating a table (truncate and truncate_preserve)
    Moving a region (move)
    Taking a region offline (unassign)
    Merging regions (merge_region)
    Splitting a region (split)
    Enabling balance (balance_switch)
    DR operations (add_peer, remove_peer, enable_table_replication, disable_peer, show_peer_tableCFs, set_peer_tableCFs, enable_peer, disable_table_replication, set_clusterState_active, and set_clusterState_standby)
    Querying the cluster status (status)

Component: HDFS
Service Interruption:
  • An active/standby switchover is triggered for NameNodes. During the switchover, there is temporarily no active NameNode. As a result, the HDFS unavailable alarm may be reported, and running read/write tasks may encounter errors. However, HDFS services are not interrupted.
  • During the rolling restart of DataNodes, errors may be reported for some data that is being read or written. The read/write speed is affected while the client retries.
  • During the rolling restart of ZKFC, an active/standby NameNode switchover occurs.
  • During the rolling restart of HDFS, enabling the rack policy may affect service read and write operations.
Impact on System: If a third-party client is used, its reliability during the rolling restart cannot be guaranteed.

Component: HetuEngine
Service Interruption:
  • When there are at least two HSFabric instances and at least two of them are used for interconnection, cross-domain services are not interrupted during the rolling restart.
  • HetuEngine services are not interrupted during the rolling restart of HSBroker, HSConsole, and QAS.
  • A rolling restart of HetuEngine compute instances can be performed only when there are at least two Coordinator nodes and at least two Worker nodes. In this case, the rolling restart does not interrupt services.
Impact on System:
  • HSFabric nodes undergoing the rolling restart reject all new requests. SQL requests that are being executed fail if they do not complete within the timeout period (30 minutes by default).
  • During the rolling restart, you cannot perform O&M operations on the HSConsole page.
  • During the rolling restart of HetuEngine compute instances, performance decreases by approximately 20%. If the memory consumed by SQL execution exceeds 80% of query_max_total_memory, SQL tasks fail.

Component: Hive
Service Interruption: During the rolling restart, services whose execution time is longer than the decommissioning timeout period may fail.
Impact on System:
  • If the execution time of an existing task exceeds the timeout interval of the rolling restart, the task may fail during the restart. You can retry the task if it fails.
  • Nodes undergoing the rolling restart do not receive new requests. The number of requests processed by the other instance nodes increases, and those nodes consume more resources.

Component: Kafka
Service Interruption: During the rolling restart, reads and writes of Kafka topics with multiple replicas are normal, but operations on Kafka topics with only a single replica are interrupted.
Impact on System:
  • Topics and partitions cannot be added, deleted, or modified.
  • When acks is set to 1 or 0 in the producer, the next Broker is forcibly restarted if the replica data is not synchronized within 30 minutes during the rolling restart. For a dual-replica partition whose replicas are on those two Brokers, data may be lost if unclean.leader.election.enable is set to true in the server configuration; if it is set to false, the partition may have no leader for a period of time until the Broker starts. (A producer configuration sketch that reduces this risk follows the table.)

Component: KrbServer
Service Interruption: All services are normal during the rolling restart.
Impact on System: During the rolling restart, Kerberos authentication of a cluster may take a longer time.

Component: LdapServer
Service Interruption: All services are normal during the rolling restart.
Impact on System: During the rolling restart, Kerberos authentication of a cluster may take a longer time.

Component: MapReduce
Service Interruption: None
Impact on System:
  • You cannot view task logs during the rolling restart.
  • If only one JobHistoryServer instance is deployed, stopping it causes all upper-layer components that depend on it to become faulty.

Component: Ranger
Service Interruption: All services are normal during the rolling restart.
Impact on System: The RangerAdmin, RangerKMS, and PolicySync instances of Ranger are deployed in active-active mode and take turns providing services during the rolling restart. UserSync, however, supports only a single-instance deployment, so users cannot be synchronized while it restarts. Because the user synchronization period is 5 minutes and UserSync restarts quickly, the restart has little impact on user synchronization.

Component: Spark
Service Interruption: Except for the items listed here, other services are not affected.
Impact on System:
  • During a rolling restart of HBase, you cannot create or delete Spark on HBase tables in Spark.
  • During a rolling restart of HBase, an active/standby switchover is triggered for HMaster. During the switchover, the Spark on HBase function is unavailable.
  • If you have used the advanced Kafka APIs, interruptions and data loss may occur when Spark reads data from or writes data to Kafka during the rolling restart.
  • During the rolling restart of ZooKeeper, spark-beeline fails to start.
  • During the rolling restart of YARN, Spark jobs may trigger retries at the task, stage, and application levels.
  • During the rolling restart of JDBCServer of Spark, long-running tasks will be terminated.

Component: YARN
Service Interruption:
  • An active/standby switchover is triggered for ResourceManager nodes during the rolling restart. Running tasks will encounter errors, but services are not interrupted.
  • During the rolling restart of NodeManager, containers submitted to the node may be retried on other nodes.
Impact on System: During the rolling restart of YARN, tasks running on YARN may experience exceptions due to excessive retries.

Component: ZooKeeper
Service Interruption: ZooKeeper read and write operations are normal during the rolling restart.
Impact on System:
  • The rolling restart affects the ClickHouse service. During the restart of each quorumpeer instance, ClickHouse tables become read-only for approximately 10 seconds.
  • The rolling restart causes ZooKeeper disconnection in Loader HA. Loader automatically retries three times at 10-second intervals. If ZooKeeper still cannot be connected, the active LoaderServer becomes the standby one and a new active LoaderServer is elected.
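
As noted in the Kafka row above, producers using acks=0 or acks=1 risk data loss while Brokers are restarted. The following is a minimal sketch of safer producer settings, assuming the kafka-python client; the broker addresses and topic name are placeholders.

    # Producer settings that narrow the data-loss window described in the Kafka row:
    # acks="all" waits for the in-sync replicas instead of only the leader, and
    # retries let the client ride out the leader switch while a Broker restarts.
    # Assumes the kafka-python client; broker addresses and topic are placeholders.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["broker-1:9092", "broker-2:9092"],
        acks="all",             # wait for in-sync replicas, not just the leader
        retries=5,              # retry sends that fail during the Broker restart
        retry_backoff_ms=1000,  # wait 1 s between retries
    )
    producer.send("example-topic", b"record written during a rolling restart")
    producer.flush()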

Helpful Links