Replacing Specified Nodes for an OpenSearch Cluster

If a node in an OpenSearch cluster is faulty, you can replace it to restore services.

The node replacement process is as follows:

Migrate data from the node to be replaced to other available nodes.
Build a new node that reuses the replaced node's ID, IP address, specifications, and AZ.
Add the new node to the cluster. The system then automatically triggers shard reallocation, which moves some of the shards to the new node.

This process does not interrupt services because data is migrated from the replaced node to other available nodes.

Constraints

Only one node can be replaced at a time. Each new node is rebuilt using the ID, IP address, specifications, and AZ of the node it is replacing.
Manually modified configurations are not retained after node replacement. For example, if you manually added a return route for the original node, you need to add it again for the new node after the replacement is complete.
If the node you want to replace is a data node or cold data node, pay attention to the following precautions:
- When a data node or cold data node is replaced, its data is first migrated to other data nodes. This means the total number of data nodes and cold data nodes must be greater than the maximum number of index replicas plus 1.
- The AZ that contains the data node or cold data node to be replaced must contain at least one other data node or cold data node.
- If the cluster has no master nodes, the total number of data nodes and cold data nodes must be at least three.
- The precautions above do not apply if you are replacing a faulty node, regardless of its type. This is because faulty nodes are not included in _cat/nodes.
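The precautions for data nodes and cold data nodes can be checked before submitting a task. The following is a minimal sketch, not part of the CSS console or API: the function name and its parameters are hypothetical, and it reads "greater than the maximum number of index replicas plus 1" as `total > max_index_replicas + 1`. It does not cover the faulty-node exception.

```python
from collections import Counter


def check_replace_constraints(node_azs, target_az, max_index_replicas, has_master_nodes):
    """Return the list of violated precautions (an empty list means none).

    node_azs: the AZ name of every data node and cold data node in the
              cluster, including the node to be replaced (hypothetical input).
    target_az: the AZ of the node to be replaced.
    """
    problems = []
    total = len(node_azs)

    # Total data + cold data nodes must exceed max index replicas plus 1.
    if not total > max_index_replicas + 1:
        problems.append("not enough data/cold data nodes for the index replicas")

    # The target AZ needs at least one other data or cold data node.
    if Counter(node_azs)[target_az] < 2:
        problems.append("no other data/cold data node in the target AZ")

    # Without master nodes, at least three data + cold data nodes are needed.
    if not has_master_nodes and total < 3:
        problems.append("fewer than three data/cold data nodes and no master nodes")

    return problems
```

For example, `check_replace_constraints(["az1", "az1", "az2"], "az1", 1, False)` returns an empty list, while a two-node cluster with one replica per index fails the first check.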

Change Impact

Before replacing a node, it is essential to assess the potential impacts and review operational recommendations. This enables proper scheduling of the node replacement, minimizing service interruptions.

Impact on performance
Replacing a node does not interrupt services. However, data migration that occurs during this process consumes I/O performance, and taking individual nodes offline still has some impact on the overall cluster performance.
To minimize this impact, adjust the data migration rate based on the cluster's traffic cycle: increase the rate during off-peak hours to shorten the task duration, and decrease it before peak hours arrive to protect cluster performance. The data migration rate is determined by the indices.recovery.max_bytes_per_sec parameter. The default value of this parameter is the number of vCPUs multiplied by 8 MB/s. For example, for four vCPUs, the default data migration rate is 32 MB/s. You can adjust it based on service requirements.
```
PUT /_cluster/settings  
{  
  "transient": {  
    "indices.recovery.max_bytes_per_sec": "128MB"  
  }  
}
```
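As a rough illustration of the default described above, the following sketch computes the default rate from the vCPU count and builds the same request body as the console example. The helper names are hypothetical; only the parameter name and the 8 MB/s per vCPU figure come from this document.

```python
import json

MB_PER_SEC_PER_VCPU = 8  # default multiplier stated in this document


def default_recovery_rate_mb(vcpus):
    """Default migration rate in MB/s for a node with the given vCPU count."""
    return vcpus * MB_PER_SEC_PER_VCPU


def recovery_settings_body(rate_mb):
    """Body for PUT /_cluster/settings that sets the migration rate."""
    return {
        "transient": {
            "indices.recovery.max_bytes_per_sec": f"{rate_mb}MB"
        }
    }


if __name__ == "__main__":
    print(default_recovery_rate_mb(4))  # four vCPUs: 32 (MB/s)
    print(json.dumps(recovery_settings_body(128), indent=2))
```

Sending the body to the cluster is left to whichever client you already use.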
Impact on request handling
While a node is being replaced, requests sent to it may fail. To mitigate this impact, the following measures may be taken:
- Use a VPC endpoint or a dedicated load balancer to handle access requests to your cluster, which makes sure that requests are automatically routed to available nodes.
- Enable an exponential backoff & retry mechanism on the client (configure three retries).
- Perform this operation during off-peak hours.
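The client-side retry measure can be sketched as follows. This is an illustrative pattern under stated assumptions, not a CSS or OpenSearch client API: `request_fn` stands for any callable that sends one request, and `ConnectionError` stands in for whatever exception your client raises when a node is unavailable. The delay values are arbitrary examples.

```python
import random
import time


def call_with_retries(request_fn, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call request_fn, retrying up to max_retries times with exponential backoff.

    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...),
    plus a small random jitter so that clients do not retry in lockstep.
    """
    attempt = 0
    while True:
        try:
            return request_fn()
        except ConnectionError:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
            attempt += 1
```

With the default of three retries, a request is attempted at most four times before the error is raised to the caller.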
Characteristics of this process
Once started, a node replacement task cannot be stopped until it succeeds or fails. A task failure only impacts a single node, and does not interrupt services if there are data replicas, but the failed node still needs to be restored promptly.

Node Replacement Duration

The following formula can be used to estimate how long it will take to replace a specified node of a cluster:

Change duration (min) = 15 (min) + Data migration duration (min)

where 15 minutes is how long non-data-migration operations (such as initialization) typically take per node. It is an empirical value.

Data migration duration (min) = Total data size (MB)/[Total number of vCPUs of the data nodes x 8 (MB/s) x 60 (s)]

where 8 MB/s indicates that each vCPU can process 8 MB of data per second. It is an empirical value.
The formulas above use estimates under ideal conditions. The actual migration speed depends on cluster load.
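The estimate can be worked through in a few lines. The sketch below uses the 15-minute overhead and the 8 MB/s per vCPU figure given above; the function name and the sample cluster size are hypothetical.

```python
NON_MIGRATION_MINUTES = 15   # empirical overhead per node, from this document
MB_PER_SEC_PER_VCPU = 8      # empirical per-vCPU throughput, from this document


def estimate_replacement_minutes(total_data_mb, total_data_node_vcpus):
    """Estimate the change duration in minutes under ideal conditions."""
    mb_per_minute = total_data_node_vcpus * MB_PER_SEC_PER_VCPU * 60
    migration_minutes = total_data_mb / mb_per_minute
    return NON_MIGRATION_MINUTES + migration_minutes


if __name__ == "__main__":
    # Hypothetical cluster: 1,152,000 MB of data, 24 data-node vCPUs in total.
    # Migration: 1,152,000 / (24 x 8 x 60) = 100 min, so the total is 115 min.
    print(estimate_replacement_minutes(1_152_000, 24))
```

Treat the result as a lower bound for planning, since the actual migration speed depends on cluster load.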

Prerequisites

The cluster status is Available, and there are no ongoing tasks.
All mission-critical data has been backed up. For details, see Creating Snapshots to Back Up the Data of an OpenSearch Cluster.

Replacing a Specified Node

Log in to the CSS management console.
In the navigation pane on the left, choose Clusters > OpenSearch.
In the cluster list, find the target cluster, and choose More > Modify Configuration in the Operation column. The Modify Configuration page is displayed.
On the Modify Configuration page, click the Replace Node tab.

On the Replace Node tab, set the parameters as needed.

**Table 1** Replacing a specified node
Parameter	Description
Agency	After a node is replaced, NICs need to be reattached to the new node. This means you need to have VPC permissions. Select an IAM agency to grant the current account the permission to access and use VPC. This parameter is available only when the new IAM plane is connected. If you are configuring an agency for the first time, click Automatically Create IAM Agency to create css-upgrade-agency. If there is an IAM agency automatically created earlier, you can click One-click authorization to have the permissions associated with the VPC Administrator role or the VPC FullAccess system policy deleted automatically, and have the following custom policies added automatically instead to implement more refined permissions control: "vpc:subnets:get", "vpc:ports:". To use Automatically Create IAM Agency and One-click authorization, the following minimum permissions are required: "iam:agencies:listAgencies", "iam:roles:listRoles", "iam:agencies:getAgency", "iam:agencies:createAgency", "iam:permissions:listRolesForAgency", "iam:permissions:grantRoleToAgency", "iam:permissions:listRolesForAgencyOnProject", "iam:permissions:revokeRoleFromAgency", "iam:roles:createRole". To use an IAM agency, the following minimum permissions are required: "iam:agencies:listAgencies", "iam:agencies:getAgency", "iam:permissions:listRolesForAgencyOnProject", "iam:permissions:listRolesForAgency".
Node Type	Select the node you want to replace. You can expand a node type to check all the nodes under it.

Click Submit. In the data migration confirmation dialog box, choose to migrate data, and click OK.
During data migration, the system migrates all data from the to-be-replaced node to the remaining nodes, and replaces the node once the migration is complete. If the data on the to-be-replaced node has replicas on other nodes, data migration can be skipped so that the cluster change completes faster.
Click Back to Cluster List to go back to the Clusters page. The Task Status is Upgrading. When Cluster Status changes to Available, the node has been successfully replaced.