Performing a Rolling Restart of a Cluster

Scenarios

A rolling restart restarts a cluster without interrupting services. It is typically performed after a service role is updated or a cluster configuration is modified.

If you need to restart all services in the cluster in batches without interrupting services, you can perform a rolling restart.

Some services do not support rolling restarts. These services undergo a common restart during the rolling restart and may be interrupted. Perform operations as prompted.
For configurations that must take effect immediately, such as a server port configuration, a rolling restart is not recommended. Perform a common restart instead.

Impact on the System

Unlike a common restart, a rolling restart does not interrupt services. However, it takes longer than a common restart and may affect the throughput and performance of the service being restarted.

Procedure

Log in to FusionInsight Manager.
Choose Cluster > Name of the desired cluster > Dashboard > More > Rolling-restart Service.
In the displayed dialog box, enter the password of the current login user and click OK.

Set the parameters as required, as shown in Table 1.

**Table 1** Rolling restart parameters
Parameter	Description
Restart only instances with expired configurations in the cluster	Specifies whether to restart only the instances in the cluster whose configurations have been modified.
Enable rack strategy	Specifies whether to enable concurrent rolling restart based on the rack strategy. This option takes effect only for roles that meet the rack strategy requirements for rolling restart: the role supports the rack-aware function, and its instances belong to two or more racks. NOTE: This parameter can be set only when a rolling restart is performed on HDFS or YARN.
Data Nodes to Be Batch Restarted	Specifies the number of instances restarted in each batch when the batch rolling restart strategy is used. The default value is 1. NOTE: This parameter is valid only when the batch rolling restart strategy is used and the instance is a DataNode. When the rack strategy is enabled, this parameter does not take effect. Instead, the cluster uses the default maximum number of instances (20) configured in the rack strategy as the maximum number of instances that are concurrently restarted in a rack. This parameter can be set only when a rolling restart is performed on HDFS, YARN, Kafka, Storm, or Flume. For the HBase RegionServer, this parameter cannot be configured manually. It is adjusted automatically based on the number of RegionServer nodes: the value is 1 if there are fewer than 30 nodes, 2 if there are at least 30 but fewer than 300 nodes, and 1% of the number of nodes (rounded down) if there are 300 or more nodes.
Batch Interval	Specifies the interval between two consecutive batches of instances during a rolling restart. The default value is 0.
Decommissioning Timeout Interval	Specifies the decommissioning timeout interval for role instances during a rolling restart. The default value is 1800 seconds. Some roles (such as HiveServer and JDBCServer) stop providing services before the rolling restart. Stopped instances cannot establish new connections, but existing connections are allowed to complete within a period of time. Setting this timeout properly minimizes the risk of service interruption. NOTE: This parameter can be set only when a rolling restart is performed on Hive or Spark2x.
Batch Fault Tolerance Threshold	Specifies the number of batches that are allowed to fail during the rolling restart. The default value is 0, which indicates that the rolling restart task ends as soon as any batch of instances fails to restart.
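
The automatic batch size rule for the HBase RegionServer described in Table 1 can be expressed as a short function. This is a minimal sketch for illustration only: the function name `regionserver_batch_size` is hypothetical and is not part of FusionInsight Manager; only the thresholds (30 and 300) and the 1% rounded-down rule come from the table.

```python
def regionserver_batch_size(node_count: int) -> int:
    """Return the number of RegionServer instances restarted per batch.

    Hypothetical helper that mirrors the rule in Table 1:
      - fewer than 30 nodes        -> 1
      - 30 to 299 nodes            -> 2
      - 300 or more nodes          -> 1% of the node count, rounded down
    """
    if node_count < 0:
        raise ValueError("node_count must not be negative")
    if node_count < 30:
        return 1
    if node_count < 300:
        return 2
    # Integer division by 100 yields 1% of the count, rounded down.
    return node_count // 100


if __name__ == "__main__":
    for nodes in (10, 30, 299, 300, 1050):
        print(nodes, regionserver_batch_size(nodes))
```

For example, a cluster with 1050 RegionServer nodes would restart 10 instances per batch under this rule.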

Exercise caution when setting advanced parameters such as Data Nodes to Be Batch Restarted, Batch Interval, and Batch Fault Tolerance Threshold. Set them based on site requirements. Otherwise, services may be interrupted or performance may be severely affected.

For example:

If Data Nodes to Be Batch Restarted is too large, a large number of instances are restarted at the same time. Services may be interrupted or performance may be severely affected because too few instances remain to handle the load.
If Batch Fault Tolerance Threshold is too large, the task continues to restart new batches even after a previous batch has failed to restart, which can interrupt services.
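
To make the interaction between these parameters concrete, the following sketch models a batch rolling restart loop. It is not FusionInsight Manager's implementation. The function `rolling_restart`, its arguments, and the `restart_batch` callback are all hypothetical, and the sketch assumes that the task ends once the number of failed batches exceeds the fault tolerance threshold.

```python
import time
from typing import Callable, Sequence


def rolling_restart(
    instances: Sequence[str],
    batch_size: int,
    restart_batch: Callable[[Sequence[str]], bool],
    batch_interval_s: float = 0,
    fault_tolerance: int = 0,
) -> bool:
    """Restart instances batch by batch; return True if the task completes.

    Hypothetical model: restart_batch returns False when a batch fails.
    The task ends once failed batches exceed fault_tolerance (assumption).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # Split the instance list into consecutive batches of batch_size.
    batches = [
        instances[i:i + batch_size]
        for i in range(0, len(instances), batch_size)
    ]
    failed_batches = 0
    for index, batch in enumerate(batches):
        if not restart_batch(batch):
            failed_batches += 1
            if failed_batches > fault_tolerance:
                return False  # Stop: too many batches have failed.
        # Wait between batches, but not after the last one.
        if batch_interval_s > 0 and index < len(batches) - 1:
            time.sleep(batch_interval_s)
    return True
```

In this model, a larger `batch_size` leaves fewer instances running while a batch restarts, and a larger `fault_tolerance` lets the loop keep taking instances down even though earlier batches did not come back up. Both correspond to the risks described above.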