Updated on 2024-11-29 GMT+08:00

Scaling In a Cluster

You can reduce the number of core or task nodes to scale in a cluster based on service requirements so that MRS delivers better storage and computing capabilities at lower O&M costs.

The scale-in operation is not allowed for a cluster that is performing active/standby synchronization.

Background

A cluster can have three types of nodes: master, core, and task nodes. Currently, only core and task nodes can be removed. To scale in a cluster, you only need to adjust the number of nodes on the MRS console. MRS then automatically selects the nodes to be removed.

The policies for MRS to automatically select nodes are as follows:

  • MRS does not select the nodes with basic components installed, such as ZooKeeper, DBService, KrbServer, and LdapServer, because these basic components are the basis for the cluster to run.
  • Core nodes store cluster service data. When scaling in a cluster, ensure that all data on the core nodes to be removed has been migrated to other nodes. Follow-up scale-in operations, such as removing nodes from Manager and deleting ECSs, can be performed only after all component services are decommissioned. When selecting core nodes, MRS preferentially selects nodes that store a small amount of data and whose instances to be decommissioned are healthy, which reduces the risk of decommissioning failures. For example, if DataNodes are installed on core nodes in an analysis cluster, MRS preferentially selects the nodes with small data volume and good health status during scale-in.

    When core nodes are removed, their data is migrated to other nodes. If your services have cached the data storage path, the client automatically updates the path, which may temporarily increase service processing latency. Cluster scale-in may also slow the response of the first access to some HBase on HDFS data. To resolve this issue, restart HBase or disable and then enable the related tables.

  • Task nodes are computing nodes and do not store cluster data. Data migration is not involved in removing task nodes. Therefore, when selecting task nodes, MRS preferentially selects nodes whose health status is faulty, unknown, or subhealthy. On the Components tab of the MRS console, click a service and then the Instances tab to view the health status of the node instances.
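
The selection policies above can be sketched in a few lines of Python. This is only an illustration of the stated preferences, not the logic MRS actually runs: the `Node` class, its fields, the status strings, and the `select_nodes` function are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass

# Hypothetical model for illustration only; not MRS's internal code.
BASIC_COMPONENTS = {"ZooKeeper", "DBService", "KrbServer", "LdapServer"}
UNHEALTHY = {"faulty", "unknown", "subhealthy"}


@dataclass
class Node:
    name: str
    node_type: str          # "core" or "task"
    components: frozenset   # names of the components installed on the node
    data_gb: float = 0.0    # data stored on the node (relevant to core nodes)
    health: str = "good"    # "good", "faulty", "unknown", or "subhealthy"


def select_nodes(nodes, node_type, count):
    """Return up to `count` nodes of `node_type`, in order of preference."""
    # Nodes that host basic components are never candidates.
    candidates = [
        n for n in nodes
        if n.node_type == node_type and not (n.components & BASIC_COMPONENTS)
    ]
    if node_type == "core":
        # Prefer healthy instances first, then the smallest data volume.
        candidates.sort(key=lambda n: (n.health != "good", n.data_gb))
    else:
        # Task nodes store no data, so unhealthy nodes are preferred.
        candidates.sort(key=lambda n: n.health not in UNHEALTHY)
    return candidates[:count]
```

For example, given two eligible core nodes that store 800 GB and 120 GB, `select_nodes(nodes, "core", 1)` returns the 120 GB node, and a faulty task node is returned ahead of a healthy one.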

Scale-In Verification Policy

To prevent component decommissioning failures, components provide different decommissioning constraints. Scale-in is allowed only when the constraints of all installed components are met. Table 1 describes the scale-in verification policies.

Table 1 Decommissioning constraints

| Component | Constraint |
|---|---|
| HDFS/DataNode | The number of available nodes after the scale-in is greater than or equal to the number of HDFS copies, and the total HDFS data volume does not exceed 80% of the total HDFS cluster capacity. This ensures that the remaining space is sufficient for storing existing data after the scale-in and reserves some space for future use. NOTE: To ensure data reliability, one backup is automatically generated for each file saved in HDFS, that is, two copies are generated in total. |
| HBase/RegionServer | The total available memory of RegionServers on all nodes except the nodes to be removed is greater than 1.2 times the memory currently used by RegionServers on these nodes. This ensures that the node to which a region on a decommissioned node is migrated has sufficient memory to host that region. |
| Storm/Supervisor | After the scale-in, ensure that the number of slots in the cluster is sufficient for running the submitted tasks. This prevents a shortage of resources for running stream processing tasks after the scale-in. |
| Flume/FlumeServer | If FlumeServer is installed on a node and Flume tasks have been configured for the node, the node cannot be deleted. This prevents the deployed service program from being deleted by mistake. |
| ClickHouse/ClickHouseServer | For details, see Constraints on ClickHouseServer Scale-in. This ensures that data on the decommissioned nodes is migrated to in-use nodes. |
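
As a worked illustration of the HDFS/DataNode constraint in Table 1, the following Python sketch evaluates both conditions. The function and parameter names are hypothetical, and the sketch assumes that the "total HDFS cluster capacity" is measured on the nodes that remain after the scale-in, which the table does not state explicitly.

```python
def hdfs_scale_in_allowed(remaining_datanodes, replication,
                          used_gb, remaining_capacity_gb, max_usage=0.8):
    """Check the two HDFS/DataNode conditions described in Table 1.

    Assumption: remaining_capacity_gb is the HDFS capacity left after the
    selected nodes are removed. All names here are hypothetical.
    """
    # Condition 1: enough nodes remain to hold every copy of a block.
    enough_nodes = remaining_datanodes >= replication
    # Condition 2: existing data fits within 80% of the remaining capacity.
    enough_space = used_gb <= max_usage * remaining_capacity_gb
    return enough_nodes and enough_space


print(hdfs_scale_in_allowed(3, 2, used_gb=700, remaining_capacity_gb=1000))
print(hdfs_scale_in_allowed(3, 2, used_gb=900, remaining_capacity_gb=1000))
```

The first call prints `True` because 700 GB is within the 800 GB limit. The second prints `False` because 900 GB exceeds it, so that scale-in would be rejected.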

Scaling In a Cluster by Specifying the Node Quantity

  1. Log in to the MRS console.
  2. Choose Clusters > Active Clusters, select a running cluster, and click its name to switch to the cluster details page.
  3. Click the Nodes tab. In the Operation column of the node group, click Scale In to go to the Scale In page.

    This operation can be performed only when the cluster and all nodes in it are running.

  4. Set Scale-In Type to Node quantity.
  5. Set Scale-In Nodes and click OK.

    • Before scaling in the cluster, check whether its security group configuration is correct. Ensure that the inbound rules of the security group include a rule in which Protocol & Port is set to All and Source is set to a trusted IP address range.
    • If damaged data blocks exist in HDFS, the cluster may fail to be scaled in. Contact technical support.

  6. A dialog box is displayed in the upper right corner of the page, indicating that the node removal task has been submitted successfully.

    The cluster scale-in process is explained as follows:
    • During scale-in: The cluster status is Scaling In. The submitted jobs will be executed, and you can submit new jobs. You are not allowed to continue to scale in or delete the cluster. You are advised not to restart the cluster or modify the cluster configuration.
    • Successful scale-in: The cluster status is Running.
    • Failed scale-in: The cluster status is Running. You can execute jobs or scale in the cluster again.

    After the cluster is scaled in, you can view the node information of the cluster on the Nodes page.
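
The status rules in this procedure can be summarized as a small lookup. The sketch below is illustrative only: the operation names, the `"Running"` node status string, and both functions are hypothetical, and operations that the procedure does not mention are reported as `"unspecified"`.

```python
# Hypothetical operation names; the outcomes mirror the procedure above.
SCALING_IN_RULES = {
    "submit_job": "allowed",
    "scale_in": "blocked",
    "delete_cluster": "blocked",
    "restart_cluster": "discouraged",
    "modify_config": "discouraged",
}
RUNNING_RULES = {"submit_job": "allowed", "scale_in": "allowed"}


def can_start_scale_in(cluster_status, node_statuses):
    """Scale-in can start only when the cluster and all of its nodes are running."""
    return cluster_status == "Running" and all(s == "Running" for s in node_statuses)


def check_operation(cluster_status, operation):
    """Return "allowed", "blocked", "discouraged", or "unspecified"."""
    if cluster_status == "Scaling In":
        return SCALING_IN_RULES.get(operation, "unspecified")
    if cluster_status == "Running":
        return RUNNING_RULES.get(operation, "unspecified")
    return "unspecified"
```

For example, `check_operation("Scaling In", "delete_cluster")` returns `"blocked"`, while `check_operation("Scaling In", "submit_job")` returns `"allowed"`.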

Scaling In a Cluster by Removing Nodes that Are No Longer Needed

If a faulty node is no longer needed, you can use this function to remove it. When the node is removed, the instance of the component role will not be decommissioned. Before deleting the node, ensure that the data on the node has been backed up. For details about how to remove ClickHouseServer nodes, see Removing ClickHouseServer Instance Nodes.

  1. Log in to MRS Manager and choose Hosts.
  2. Select the host to be removed, choose More, and select Isolate to isolate the host.

    The time required for isolating a host depends on the data volume on the host. A larger data volume requires a longer time.

    After the node is isolated, the node status changes to Isolated.

    • If the host isolation fails, log in to MRS Manager, find the failed host isolation task in the task list, and rectify the fault as prompted.
    • Isolating a host helps you decommission a node. If data on the node has been backed up, you can skip host isolation, stop the host on the ECS console, and then scale in the host.
    • If a host is faulty, forcibly remove the node.

  3. Log in to the MRS console.
  4. Click the name of the cluster to go to its details page.
  5. Click the Nodes tab.
  6. Locate the row that contains the target node group and click Scale In in the Operation column to go to the Scale In page.
  7. Set Scale-In Type to Specific node and select the node to be removed.

    Nodes in the Stopped, Lost, Unknown, Isolated, or Faulty status can be specified for scale-in. If the node cannot be selected, click Stop ECS to go to the ECS console and stop the node. Then, on the cluster details page of the MRS console, click the Alarms tab and check whether any service fault alarms are generated after the node is stopped. If no such alarm is generated, go back to the Scale In page and select the node for scale-in. If such an alarm is generated, clear it before the scale-in.

  8. Select I understand the consequences of performing the scale-in operation, and click OK.
  9. Click the Components tab and check whether each component is normal. If any component is abnormal, wait for 5 to 10 minutes and check the component status again. If the fault persists, contact technical support.
  10. Click the Alarms tab and check whether there are exception alarms. If there are exception alarms, clear them before performing other operations.
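
The node eligibility rule in step 7 can be expressed as a short decision function. This is a minimal sketch with hypothetical names. It also assumes that the alarm check applies to any selectable node, whereas the procedure describes it only for a node that has just been stopped.

```python
# Statuses that step 7 lists as selectable for specific-node scale-in.
SELECTABLE_STATUSES = {"Stopped", "Lost", "Unknown", "Isolated", "Faulty"}


def next_step(node_status, service_fault_alarm=False):
    """Suggest the next action for a node that is to be removed (hypothetical helper)."""
    if node_status not in SELECTABLE_STATUSES:
        # The node cannot be selected yet, so it must be stopped first.
        return "stop the ECS, then check the Alarms tab"
    if service_fault_alarm:
        return "clear the alarm before the scale-in"
    return "select the node for scale-in"


print(next_step("Running"))
print(next_step("Stopped", service_fault_alarm=True))
print(next_step("Isolated"))
```

The three calls print the stop, clear-alarm, and select actions in that order.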