Restarting an MRS Cluster Component

To apply configuration changes to a big data component, you must restart it. However, a common restart restarts all services or instances at once, which can interrupt services.

To ensure that services are not affected during service restart, you can restart services or instances in batches by rolling restart. For instances in active/standby mode, a standby instance is restarted first and then an active instance is restarted.

A rolling restart takes longer than a common restart and may affect service throughput and performance while it is in progress.
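
The batching and standby-first ordering described above can be pictured with a short sketch. It is only an illustration of the idea, not the logic Manager actually runs: the `Instance` type, the `role` labels, and the `plan_batches` function are hypothetical names introduced here.

```python
from typing import List, NamedTuple


class Instance(NamedTuple):
    name: str
    role: str  # hypothetical labels: "active", "standby", or "" if not applicable


def plan_batches(instances: List[Instance], batch_size: int = 1) -> List[List[str]]:
    """Order instances so that standbys restart before actives, then split into batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # sorted() is stable, so instances keep their relative order within each group.
    ordered = sorted(instances, key=lambda inst: inst.role == "active")
    names = [inst.name for inst in ordered]
    return [names[i:i + batch_size] for i in range(0, len(names), batch_size)]


plan = plan_batches([
    Instance("NameNode-1", "active"),
    Instance("NameNode-2", "standby"),
])
print(plan)  # [['NameNode-2'], ['NameNode-1']]
```

With a batch size of 1, only one instance is down at a time, which is why the service stays available but the whole operation takes longer.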

For details about whether services and instances in the current MRS cluster support rolling restart and the rolling restart parameters, see Component Restart Reference Information.

Restrictions

Perform a rolling restart during off-peak hours.
- If the throughput of the Kafka service is high (over 100 MB/s) during a rolling restart, the restart will fail.
- Before an HBase rolling restart, check the current number of requests in HBase. If each RegionServer receives more than 10,000 requests per second on the native interface, increase the number of handlers to prevent RegionServer restart failures caused by overload.
If the cluster has fewer than six Core nodes, services may be affected for a short period of time.
Prefer a rolling restart of instances or services, and select Only restart instances whose configurations have expired.
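
The numeric limits above can be collected into a simple pre-restart checklist. The following is a minimal sketch under stated assumptions: the function name and its parameters are hypothetical, and the values must be read from your own monitoring, since the sketch does not query the cluster.

```python
from typing import List


def rolling_restart_warnings(kafka_throughput_mb_s: float,
                             hbase_requests_per_regionserver: float,
                             core_node_count: int) -> List[str]:
    """Return a warning for each restriction the supplied metrics violate.

    All three inputs are assumed to be supplied by the operator.
    """
    warnings = []
    if kafka_throughput_mb_s > 100:
        warnings.append("Kafka throughput is over 100 MB/s: the rolling restart will fail")
    if hbase_requests_per_regionserver > 10_000:
        warnings.append("HBase requests per RegionServer exceed 10,000/s: "
                        "increase the number of handlers first")
    if core_node_count < 6:
        warnings.append("Fewer than six Core nodes: services may be affected briefly")
    return warnings


for message in rolling_restart_warnings(120.0, 8_000, 4):
    print(message)
```

An empty result means none of the listed thresholds is exceeded. It does not replace the advice to restart during off-peak hours.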

Prerequisites

IAM users have been synchronized. To synchronize them, click Synchronize next to IAM User Sync on the Dashboard page of the cluster details.
You have logged in to MRS Manager. For how to log in, see Accessing MRS Manager.

Restarting Cluster Components

Access the MRS cluster component management page.
- Log in to the MRS console and click the cluster name to go to the cluster details page. Click Components.
- If you are using Manager of MRS 3.x or later, log in to Manager and choose Cluster > Services.
- If you are using Manager of MRS 2.x or earlier, log in to Manager and click Services.
Click the name of the target component to go to the details page.
On the service details page, expand the More drop-down list and select Restart Service or Service Rolling Restart.

Enter the user password (required when you perform operations on Manager), confirm the operation impact, and click OK to restart the service.

If you select rolling restart, set the parameters listed in Table 1. (Required parameters may vary by version. Set them based on the actual GUI.)

Figure 1 Performing a rolling restart on Manager

**Table 1** Rolling restart configuration parameters
Parameter	Description
Restart only instances with expired configurations	Whether to restart only the modified instances in a cluster. The name of this parameter may be different in other versions.
Enable rack strategy	Whether to enable the concurrent rack rolling restart strategy. This parameter takes effect only for roles that meet the rack rolling restart strategy. (The roles support rack awareness, and instances of the roles belong to two or more racks.) This parameter can be set only when a rolling restart is performed on HDFS or YARN.
Data Nodes to Be Batch Restarted	Number of instances restarted in each batch when the batch rolling restart strategy is used. The default value is 1. NOTE: This parameter is valid only when the batch rolling restart strategy is used and the instance type is DataNode. It is invalid when the rack strategy is enabled. In that case, the cluster uses the maximum number of instances configured in the rack strategy (20 by default) as the maximum number of instances restarted concurrently in a rack. This parameter can be set only for a rolling restart of certain components, such as HDFS, HBase, YARN, Kafka, Storm, and Flume. The actual value displayed on the GUI prevails. The number of RegionServers restarted concurrently during an HBase rolling restart cannot be configured manually. It is adjusted automatically based on the number of RegionServer nodes: with fewer than 30 nodes, one node is restarted in each batch; with 30 to 299 nodes, two nodes are restarted in each batch; with 300 or more nodes, 1% of the total nodes (rounded down) are restarted in each batch.
Batch Interval	Interval between two consecutive batches of instances during a rolling restart. The default value is 0. Setting a batch interval can improve the stability of the big data component processes during the rolling restart. You are advised to set this parameter to a non-default value, for example, 10.
Decommissioning Timeout Interval	Time to wait for a role instance to be decommissioned during a rolling restart. This parameter can be set only when a rolling restart is performed on Hive or Spark. Some roles (such as HiveServer and JDBCServer) stop providing services before the rolling restart. A stopped instance does not accept new client connections, and existing connections need some time to complete. An appropriate timeout interval helps ensure service continuity.
Batch Fault Tolerance Threshold	Number of batches that are allowed to fail during a rolling restart. The default value is 0, which indicates that the rolling restart task ends as soon as any batch of instances fails to restart.
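
The automatic HBase RegionServer batch size described in Table 1 can be written as a small function. This is a minimal sketch of the stated rule, not the code Manager runs; the function names are hypothetical, and the boundaries are assumed to be "fewer than 30", "30 to 299", and "300 or more".

```python
import math


def hbase_rolling_batch_size(regionserver_count: int) -> int:
    """RegionServers restarted per batch, following the rule in Table 1."""
    if regionserver_count < 1:
        raise ValueError("regionserver_count must be at least 1")
    if regionserver_count < 30:
        return 1
    if regionserver_count < 300:
        return 2
    # 1% of the total nodes, rounded down (at least 3 for 300 or more nodes).
    return regionserver_count // 100


def hbase_rolling_batch_count(regionserver_count: int) -> int:
    """Number of batches needed to restart every RegionServer once."""
    return math.ceil(regionserver_count / hbase_rolling_batch_size(regionserver_count))


for nodes in (10, 30, 299, 300, 450):
    print(nodes, hbase_rolling_batch_size(nodes), hbase_rolling_batch_count(nodes))
```

Multiplying the batch count by the per-batch restart time plus the batch interval gives a rough lower bound on how long an HBase rolling restart will take.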

Component Restart Reference Information

Table 2 lists the services and instances in an MRS cluster that support rolling restart and those that do not.

**Table 2** Services and instances that support or do not support rolling restart
Service	Instance	Rolling Restart
Alluxio	AlluxioJobMaster	Yes
Alluxio	AlluxioMaster	Yes
ClickHouse	ClickHouseServer	Yes
ClickHouse	ClickHouseBalancer	Yes
CDL	CDLConnector	Yes
CDL	CDLService	Yes
Flink	FlinkResource	No
Flink	FlinkServer	No
Flume	Flume	Yes
Flume	MonitorServer	Yes
Guardian	TokenServer	Yes
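HBase	HMaster	Yes
HBase	RegionServer	Yes
HBase	ThriftServer	Yes
HBase	RESTServer	Yes
HetuEngine	HSBroker	Yes
HetuEngine	HSConsole	Yes
HetuEngine	HSFabric	Yes
HetuEngine	QAS	Yes
HDFS	NameNode	Yes
HDFS	Zkfc	Yes
HDFS	JournalNode	Yes
HDFS	HttpFS	Yes
HDFS	DataNode	Yes
Hive	MetaStore	Yes
Hive	WebHCat	Yes
Hive	HiveServer	Yes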
Hue	Hue	No
Impala	Impalad	No
Impala	StateStore	No
Impala	Catalog	No
IoTDB	IoTDBServer	Yes
Kafka	Broker	Yes
Kafka	KafkaUI	No
Kudu	KuduTserver	Yes
Kudu	KuduMaster	Yes
Loader	Sqoop	No
MapReduce	JobHistoryServer	Yes
Oozie	oozie	No
Presto	Coordinator	Yes
Presto	Worker	Yes
Ranger	RangerAdmin	Yes
Ranger	UserSync	Yes
Ranger	TagSync	Yes
Spark	JobHistory	Yes
Spark	JDBCServer	Yes
Spark	SparkResource	Yes
Storm	Nimbus	Yes
Storm	UI	Yes
Storm	Supervisor	Yes
Storm	Logviewer	Yes
Tez	TezUI	No
Yarn	ResourceManager	Yes
Yarn	NodeManager	Yes
ZooKeeper	Quorumpeer	Yes

Table 3 lists, for reference, the restart duration of each service and the startup duration of its instances.

**Table 3** Restart duration for reference
Service	Restart Duration	Startup Duration	Remarks
IoTDB	3 min	IoTDBServer: 3 min	-
CDL	2 min	CDLConnector: 1 min; CDLService: 1 min	-
ClickHouse	4 min	ClickHouseServer: 2 min; ClickHouseBalancer: 2 min	-
HDFS	10 min + x	NameNode: 4 min + x; DataNode: 2 min; JournalNode: 2 min; Zkfc: 2 min	x indicates the NameNode metadata loading duration. It takes about 2 minutes to load 10 million files. For example, x is 10 minutes for 50 million files. The startup duration fluctuates with the reporting of DataNode data blocks.
Yarn	5 min + x	ResourceManager: 3 min + x; NodeManager: 2 min	x indicates the time required for restoring ResourceManager reserved tasks. It takes about 1 minute to restore 10,000 reserved tasks.
MapReduce	2 min + x	JobHistoryServer: 2 min + x	x indicates the scanning duration of historical tasks. It takes about 2.5 minutes to scan 100,000 tasks.
ZooKeeper	2 min + x	Quorumpeer: 2 min + x	x indicates the duration for loading znodes. It takes about 1 minute to load 1 million znodes.
Hive	3.5 min	HiveServer: 3 min; MetaStore: 1 min 30 s; WebHCat: 1 min; Hive service: 3 min	-
Spark2x	5 min	JobHistory2x: 5 min; SparkResource2x: 5 min; JDBCServer2x: 5 min	-
Flink	4 min	FlinkResource: 1 min; FlinkServer: 3 min	-
Kafka	2 min + x	Broker: 1 min + x; KafkaUI: 5 min	x indicates the data restoration duration. It takes about 2 minutes to start 20,000 partitions for a single instance.
Storm	6 min	Nimbus: 3 min; UI: 1 min; Supervisor: 1 min; Logviewer: 1 min	-
Flume	3 min	Flume: 2 min; MonitorServer: 1 min	-
Doris	2 min	FE: 1 min; BE: 1 min; DBroker: 1 min	-
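
For services whose restart duration includes the variable part x, the figures in Table 3 can be turned into a rough estimate. The sketch below assumes that x scales linearly with the load, which the table implies for HDFS but does not state for the other services. The `RESTART_MODEL` dictionary and the function name are hypothetical.

```python
# service: (fixed minutes, minutes of x per unit of load, unit size), from Table 3.
RESTART_MODEL = {
    "HDFS": (10, 2.0, 10_000_000),     # x: about 2 min per 10 million files
    "Yarn": (5, 1.0, 10_000),          # x: about 1 min per 10,000 reserved tasks
    "MapReduce": (2, 2.5, 100_000),    # x: about 2.5 min per 100,000 historical tasks
    "ZooKeeper": (2, 1.0, 1_000_000),  # x: about 1 min per 1 million znodes
    "Kafka": (2, 2.0, 20_000),         # x: about 2 min per 20,000 partitions per instance
}


def estimate_restart_minutes(service: str, load: float) -> float:
    """Estimate the restart duration in minutes, assuming x grows linearly with load."""
    fixed, minutes_per_unit, unit = RESTART_MODEL[service]
    return fixed + minutes_per_unit * load / unit


# 50 million files: x is 10 minutes, so the HDFS restart takes about 20 minutes.
print(estimate_restart_minutes("HDFS", 50_000_000))  # 20.0
```

These values are reference estimates only. As the HDFS remark notes, the actual duration also fluctuates with factors such as DataNode block reporting.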