Performing a Rolling Restart of a Cluster

Scenario

A rolling restart restarts all services in a cluster in batches after they are modified or upgraded, without interrupting workloads.

You can perform a rolling restart of a cluster as needed.

Some services in a cluster do not support rolling restart. During a rolling restart of the cluster, these services are restarted in normal mode, which may interrupt workloads. Decide whether to proceed based on the prompts displayed.
If a configuration must take effect immediately (for example, a server port configuration), restart the affected services in normal mode instead.

Impact on the System

A rolling restart takes longer than a normal restart and may affect service throughput and performance while it is in progress.

Procedure

Log in to FusionInsight Manager.
Choose Cluster > Dashboard. In the upper right corner, click More > Service Rolling Restart.
In the dialog box that is displayed, enter the password of the current login user and click OK.

Configure the parameters based on site requirements.

Figure 1 Rolling-restart Cluster

**Table 1** Rolling restart parameters
Parameter	Description
Restart only instances with expired configurations in the cluster	Whether to restart only the instances whose configurations have been modified (and are therefore expired) in the cluster.
Enable rack strategy	Whether to enable the concurrent rack rolling restart strategy. This parameter takes effect only for roles that meet the rack rolling restart strategy. (The roles support rack awareness, and instances of the roles belong to two or more racks.) NOTE: This parameter is configurable only when a rolling restart is performed on HDFS or YARN.
Data Nodes to Be Batch Restarted	Number of instances restarted in each batch when the batch rolling restart strategy is used. The default value is 1. NOTE: This parameter is valid only when the batch rolling restart strategy is used and the instance type is DataNode. This parameter is invalid when the rack strategy is enabled. In that case, the cluster uses the maximum number of instances configured in the rack strategy (20 by default) as the maximum number of instances restarted concurrently in a rack. This parameter is configurable only when a rolling restart is performed on HDFS, HBase, YARN, Kafka, Storm, or Flume. For the HBase RegionServer, this parameter cannot be set manually. It is adjusted automatically based on the number of RegionServer nodes: if the number is less than 30, the value is 1; if the number is greater than or equal to 30 and less than 300, the value is 2; if the number is greater than or equal to 300, the value is 1% of the number, rounded down.
Batch Interval	Interval between two batches of instances to be roll-restarted. The default value is 0.
Decommissioning Timeout Interval	Maximum time to wait for role instances to be decommissioned during a rolling restart. The default value is 1800s. Some roles (such as HiveServer and JDBCServer) stop providing services before the rolling restart. Stopped instances do not accept new client connections, and existing connections are given a period of time to complete. Setting an appropriate timeout interval helps ensure service continuity. NOTE: This parameter is configurable only when a rolling restart is performed on Hive or Spark2x.
Batch Fault Tolerance Threshold	Number of times a batch of instances is allowed to fail to restart during the rolling restart. The default value is 0, which indicates that the rolling restart task ends as soon as any batch of instances fails to restart.
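
The automatic RegionServer batch size described in Table 1 can be sketched as follows. This only restates the arithmetic given in the table. The function name `regionserver_batch_size` is hypothetical and is not part of FusionInsight Manager.

```python
def regionserver_batch_size(node_count: int) -> int:
    """Illustrative only: batch size for HBase RegionServer per Table 1."""
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    if node_count < 30:
        return 1                # fewer than 30 nodes: one instance per batch
    if node_count < 300:
        return 2                # 30 to 299 nodes: two instances per batch
    return node_count // 100    # 300 or more nodes: 1% of the count, rounded down


for n in (10, 30, 299, 300, 1050):
    print(n, regionserver_batch_size(n))
```

For example, a cluster with 1050 RegionServer nodes would restart 10 instances per batch under this rule.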

Configure advanced parameters, such as Data Nodes to Be Batch Restarted, Batch Interval, and Batch Fault Tolerance Threshold, carefully and based on site requirements. Improper values may interrupt services or severely degrade cluster performance.

Example:

If Data Nodes to Be Batch Restarted is set to an unnecessarily large value, many instances are restarted concurrently. Too few instances remain working, so services are interrupted or cluster performance is severely affected.
If Batch Fault Tolerance Threshold is set too high, the next batch of instances is still restarted after a batch fails to restart, so a growing number of instances may be unavailable and services may be interrupted.
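
The interaction between batch size and fault tolerance can be illustrated with a small simulation. This is a minimal sketch, not the FusionInsight Manager implementation. The names `plan_batches`, `simulate_rolling_restart`, and `restart_ok` are hypothetical, and the sketch assumes that the threshold counts failed batches and that the task ends once the count exceeds it.

```python
from typing import Callable, List, Sequence, Tuple


def plan_batches(instances: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split instances into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(instances[i:i + batch_size])
            for i in range(0, len(instances), batch_size)]


def simulate_rolling_restart(
    instances: Sequence[str],
    batch_size: int,
    fault_tolerance: int,
    restart_ok: Callable[[str], bool],
) -> Tuple[int, bool]:
    """Return (batches attempted, whether the task ran to the last batch).

    Assumption: a batch fails if any instance in it fails to restart, and
    the task ends once failed batches exceed fault_tolerance.
    """
    failed_batches = 0
    attempted = 0
    for batch in plan_batches(instances, batch_size):
        attempted += 1
        results = [restart_ok(name) for name in batch]  # whole batch restarts together
        if not all(results):
            failed_batches += 1
            if failed_batches > fault_tolerance:
                return attempted, False  # task ends early
    return attempted, True


nodes = ["dn1", "dn2", "dn3", "dn4", "dn5", "dn6"]
ok = lambda name: name != "dn1"  # hypothetical: dn1 always fails to restart

print(simulate_rolling_restart(nodes, 2, 0, ok))  # (1, False): stops after the first failed batch
print(simulate_rolling_restart(nodes, 2, 1, ok))  # (3, True): continues despite one failed batch
```

With a threshold of 0, the task stops while only the first batch is affected. With a higher threshold, later batches are restarted even though the earlier failure has not been resolved, which is the risk described above.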