Node Overload During Cluster Upgrade or Node Specifications Change
Symptom
During a cluster upgrade (for example, a rolling upgrade) or node specifications change, an inappropriate recovery setting may cause the load of some or all of the nodes in the cluster to increase drastically. Typical symptoms include:
- High node load: During a cluster upgrade or node specifications change, the CPU and JVM memory usage of the cluster nodes increases drastically.
- Node disconnection: Some nodes are removed from the cluster due to resource exhaustion, and the cluster status changes to red.
- Task interruption: The execution time of the upgrade or node specifications change task increases significantly, and some tasks expire and fail.
- Resource bottlenecks: The monitoring data shows the CPU usage of some nodes exceeds 90% and the JVM memory usage keeps increasing.
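If the cluster is still reachable, you can confirm these symptoms through the standard Elasticsearch _cat and cluster health APIs that CSS clusters expose. The column selection below is only a suggestion; adjust it to the metrics you want to watch.
GET _cat/nodes?v&h=name,node.role,cpu,heap.percent,load_1m
GET _cluster/health
A node whose cpu or heap.percent stays close to 100, or a cluster status of red, matches the symptoms described above.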
Possible Causes
- Inappropriate recovery settings: The indices.recovery.max_bytes_per_sec setting is too high, so shard data is migrated too fast, driving up CPU and memory usage and overloading nodes.
- Too many pending tasks: During a cluster upgrade or node specifications change, shard migration and replica reallocation tasks are triggered automatically. With inappropriate concurrency settings, these tasks can back up, further increasing resource consumption.
- Insufficient resources: The cluster nodes have limited compute capacity, disk I/O, or network bandwidth and cannot support highly concurrent recovery, so tasks execute slowly.
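To check whether the recovery settings are set too high, you can dump the effective cluster settings. This uses the standard cluster settings API; the flat_settings and include_defaults parameters only make the output easier to scan for the indices.recovery.* and cluster.routing.allocation.* keys.
GET _cluster/settings?flat_settings=true&include_defaults=true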
Solutions
- Adjust the recovery settings to lower the data migration speed and reduce the communication overhead between nodes.
Slowing down data migration reduces node load and helps keep the cluster stable. The downside is that the upgrade or node specifications change may take longer.
PUT _cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "40mb",
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 2,
    "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 2
  }
}
Wait for 5 to 10 minutes, then check metrics such as CPU usage and JVM memory usage to see whether the load has dropped.
- Yes: Check whether the cluster status is normal and all nodes are healthy. If they are, the issue is fixed.
- No: Further reduce indices.recovery.max_bytes_per_sec. If the load is still high or the task has expired or failed, go to the next step.
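To see whether the reduced recovery speed has taken effect, you can list the shard recoveries that are currently in progress. This uses the standard _cat/recovery API; the column list is just one possible selection.
GET _cat/recovery?v&active_only=true&h=index,shard,stage,source_node,target_node,bytes_percent
If recoveries proceed more slowly while node load drops, the cluster is simply migrating data at the reduced rate, which is the expected trade-off.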
- Terminate the task. If it is an upgrade task, you can terminate it on the CSS console. However, you can only do that if none of the cluster nodes has been successfully upgraded. If the task cannot be terminated, contact technical support.
- Add more nodes or upgrade the node specifications to ensure there are sufficient resources to meet current service requirements.
After the capacity expansion is complete, wait for 10 to 15 minutes and make sure the new nodes have started functioning.
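One way to confirm that the new nodes have joined the cluster and are serving traffic is to check the cluster health. The number_of_nodes field should match the expected node count of your cluster, and the status should return to green once shard allocation finishes.
GET _cluster/health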
- After the capacity expansion is successful, run the upgrade or specifications change task again.
- If the task succeeds, the issue is fixed.
- Otherwise, contact technical support.
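If you lowered the recovery settings earlier as a temporary measure, you may also want to restore the defaults once the task has completed. Setting a transient setting to null removes the override; the example below assumes you want all four settings from the earlier step back at their default values.
PUT _cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": null,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": null,
    "cluster.routing.allocation.cluster_concurrent_rebalance": null,
    "cluster.routing.allocation.node_initial_primaries_recoveries": null
  }
}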