
Master Node Overload

Symptom

A master node overload in an Elasticsearch/OpenSearch cluster can cause the following issues:

  • The cluster status changes to yellow or red, and metadata operations cannot be performed.
  • Multiple nodes are disconnected from the cluster.
  • Messages such as master node failed and restarting discovery appear in the logs.
  • API calls return an error indicating that no master node can be found.
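
A quick way to confirm this state is to ask the cluster which node is currently the elected master (on newer OpenSearch versions, _cat/cluster_manager serves the same purpose). When no master is available, the request typically fails with a master-not-discovered error:

  GET _cat/master?v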

Possible Causes

  • Too many pending tasks: Pending tasks can accumulate, for example, during service changes that generate a large number of PUT mapping tasks, or due to task priority conflicts where urgent tasks block routine tasks like index creation.
  • Too many shards: An excessively large number of shards can cause high memory and CPU usage on master nodes and lead to increased task processing delays.
  • Metadata processing overload: Frequent metadata operations (such as index creation and shard allocation) or stale cluster state updates can cause the master nodes' CPU usage to stay above 90%, affecting cluster stability.
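
Each of these conditions can be checked with a few read-only requests before choosing one of the solutions below:

  # Queued cluster-state tasks and how long they have been waiting
  GET _cluster/pending_tasks

  # Cluster status, shard totals, and the number of pending tasks
  GET _cluster/health?pretty

  # Per-node CPU and heap usage, including the master-eligible nodes
  GET _cat/nodes?v&h=name,node.role,cpu,heap.percent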

Solutions

Scenario 1: Too many pending tasks

  • Solution to accumulation of pending tasks:
    1. Restart the cluster (all nodes) to clear the accumulated pending tasks and bring the cluster back to a working state.
    2. Temporarily block index writes and mapping updates to prevent further accumulation of pending tasks.
      PUT my_index/_settings
      {
        "index.blocks.metadata": true,
        "index.blocks.write": true,
        "index.blocks.read_only": true
      }
    3. After the cluster recovers, re-enable index writes and mapping updates to restore service availability.
      PUT my_index/_settings
      {
        "index.blocks.metadata": false,
        "index.blocks.write": false,
        "index.blocks.read_only": false
      }
    4. Verify the cluster status. If the cluster status is green, the fault has been rectified.
      GET _cluster/health?pretty
  • Solution to task priority conflicts:
    1. Identify the cause of index creation failures. For example, check the index template configuration, check whether the number of shards exceeds the cluster's capacity, and check whether the cluster has sufficient storage space (example requests follow this list).
    2. Rectify configuration issues. For example, adjust index template parameters (such as the number of shards and replicas), reclaim storage space, and adjust shard allocation policies.
    3. Resubmit index creation requests to verify fault rectification.
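
    The checks in step 1 can be made with read-only requests. A minimal sketch, assuming the failing index matches a composable index template named my_template (the name is a placeholder; clusters that predate composable templates expose the same information through the legacy _template API):

      # Inspect the template the failing index would match
      GET _index_template/my_template

      # Current shard distribution and per-node disk usage
      GET _cat/shards?v
      GET _cat/allocation?v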

Scenario 2: Too many shards

  • Solution 1: Upgrade master node specifications so they can handle more shards.

    Recommended shard-to-heap ratio: at most 200 shards per GB of heap memory. For example, a cluster with 4,000 shards needs at least 20 GB of heap memory on the master nodes.

  • Solution 2: Optimize shard policies to reduce the metadata processing load on the master nodes.

    For example, merge small shards (using the _shrink API, as sketched below), adjust the number of shards in the index template, and increase the number of replicas to spread the load.
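
    The following is a minimal _shrink sketch, assuming a source index my_index, a data node node-1 with enough disk space to hold a full copy of the index, and a target index my_index_shrunk (all names are placeholders). The source index is first made read-only and fully co-located on one node; once relocation completes, it is shrunk to fewer primary shards (the target primary shard count must be a factor of the source's):

      # Block writes and move a full copy of every shard onto node-1
      PUT my_index/_settings
      {
        "index.number_of_replicas": 0,
        "index.routing.allocation.require._name": "node-1",
        "index.blocks.write": true
      }

      # Shrink to one primary shard and clear the temporary allocation and write-block settings on the target
      POST my_index/_shrink/my_index_shrunk
      {
        "settings": {
          "index.number_of_shards": 1,
          "index.number_of_replicas": 1,
          "index.routing.allocation.require._name": null,
          "index.blocks.write": null
        }
      }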

Scenario 3: Metadata processing overload

  • Solution 1: Upgrade master node specifications to increase their capacity and prevent CPU overload.

    Increase the number of vCPUs to at least 8, and increase the memory capacity to at least 32 GB.

  • Solution 2: Reduce the frequency and complexity of metadata processing.

    For example, avoid frequent metadata changes, combine operations into batches (such as batch index creation), and optimize shard allocation policies.
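
    One common source of metadata churn is dynamic mapping updates triggered by writes, each of which becomes a PUT mapping task on the master. The sketch below avoids this with an index template that fixes the mappings up front; the template name, index pattern, and fields are placeholders, and "dynamic": "strict" makes the cluster reject documents with unmapped fields instead of updating the mapping (clusters that predate composable templates can use the legacy _template API instead):

      PUT _index_template/logs_template
      {
        "index_patterns": ["logs-*"],
        "template": {
          "settings": {
            "index.number_of_shards": 3,
            "index.number_of_replicas": 1
          },
          "mappings": {
            "dynamic": "strict",
            "properties": {
              "@timestamp": { "type": "date" },
              "message": { "type": "text" }
            }
          }
        }
      }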