What Can I Do If the Status of Elasticsearch Shards Becomes Down (Unassigned Shards)?

Symptom

The Elasticsearch cluster reports an error message indicating that an Elasticsearch instance has a primary shard or a replica shard in the down state.

Procedure

Log in to any EsNode and run the following command to list the shards that are in the down state. A down shard is an unassigned shard, and the unassigned.reason column shows why it is unassigned.

```
curl -XGET --tlsv1.2 --negotiate -k -u : "https://ip:httpport/_cat/shards/indexname?v&h=index,shard,prirep,state,node,unassigned.reason" | grep UNASSIGNED
```
- ip: indicates the IP address of any EsNode in the Elasticsearch cluster.
- httpport: indicates the HTTP port number of the Elasticsearch instance. To obtain the port number, log in to Manager, select the Elasticsearch service of the cluster to be operated, choose Configurations > All Configurations, and search for SERVER_PORT in the upper right corner. Use the port of the EsNodeX instance for access.
- indexname: indicates the name of the index of the shards to be queried.
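
When many shards are unassigned, it can help to group them by reason before investigating further. The following is a minimal sketch, not part of the product: count_unassigned_by_reason is a hypothetical helper name, and the sketch assumes the column order requested above (index, shard, prirep, state, node, unassigned.reason), with the node column empty for unassigned shards.

```shell
# Hypothetical helper: reads _cat/shards text on stdin and prints
# "<count> <reason>" for every unassigned.reason value, most frequent first.
# Assumes the 4th column is the shard state and the last column of an
# UNASSIGNED row is the reason (the node column is empty for such rows).
count_unassigned_by_reason() {
  awk '$4 == "UNASSIGNED" { count[$NF]++ }
       END { for (r in count) print count[r], r }' | sort -k1,1nr -k2,2
}

# Possible usage (ip, httpport, and indexname are the placeholders described above):
# curl -XGET --tlsv1.2 --negotiate -k -u : \
#   "https://ip:httpport/_cat/shards/indexname?v&h=index,shard,prirep,state,node,unassigned.reason" \
#   | count_unassigned_by_reason
```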

Locate the cause of the shard unassignment. The possible values of unassigned.reason are as follows:
1. INDEX_CREATED: The shard is unassigned because an index was created through the API.
2. CLUSTER_RECOVERED: The shard is unassigned because a full cluster recovery was performed.
3. INDEX_REOPENED: The shard is unassigned because a closed index was opened.
4. DANGLING_INDEX_IMPORTED: The shard is unassigned because a dangling index was imported.
5. NEW_INDEX_RESTORED: The shard is unassigned because data was restored from a snapshot to a new index.
6. EXISTING_INDEX_RESTORED: The shard is unassigned because data was restored from a snapshot to a closed index.
7. REPLICA_ADDED: The shard is unassigned because a replica was added explicitly.
8. ALLOCATION_FAILED: The shard is unassigned because its allocation failed.
9. NODE_LEFT: The shard is unassigned because the node that hosted it left the cluster.
10. REINITIALIZED: The shard moved from the started state back to the initializing state (for example, when shadow replicas are used).
11. REROUTE_CANCELLED: The shard is unassigned because its allocation was canceled by an explicit cancel reroute command.
12. REALLOCATED_REPLICA: A better location was identified for the replica, so the existing replica allocation was canceled and the shard became unassigned.

If the value of unassigned.reason is ALLOCATION_FAILED, run the following command to retry the failed allocation:

```
curl -XPOST --tlsv1.2 --negotiate -k -u : "https://ip:httpport/_cluster/reroute?retry_failed=true"
```
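
After the retry, one way to confirm whether it helped is to run the _cat/shards command again and filter for shards that still report ALLOCATION_FAILED. The following is a sketch under the same column-order assumption as the _cat/shards command in this procedure; still_allocation_failed is a hypothetical helper name.

```shell
# Hypothetical helper: reads _cat/shards text on stdin and prints
# "<index> <shard> <prirep>" for every shard that is still UNASSIGNED
# with reason ALLOCATION_FAILED. No output means no such shard remains.
# Assumes columns: index shard prirep state node unassigned.reason.
still_allocation_failed() {
  awk '$4 == "UNASSIGNED" && $NF == "ALLOCATION_FAILED" { print $1, $2, $3 }'
}

# Possible usage after the reroute retry:
# curl -XGET --tlsv1.2 --negotiate -k -u : \
#   "https://ip:httpport/_cat/shards/indexname?v&h=index,shard,prirep,state,node,unassigned.reason" \
#   | still_allocation_failed
```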

Call the cluster allocation explain API to view the detailed cause of the shard unassignment. The three request parameters can be taken from the output of the _cat/shards command that you ran first: the first column is the index name, the second column is the shard ID, and the third column indicates whether the shard is a primary shard.
```
curl -XGET --tlsv1.2 --negotiate -k  -u :  "https://ip:httpport/_cluster/allocation/explain?pretty" -H 'Content-Type:application/json' -d '{
    "index": "indexname",
    "shard": shardId,
    "primary": isPrimary
}'
```
- indexname: indicates the index name in the first column in the output of the _cat/shards command.
- shardId: indicates the shard ID in the second column in the output of the _cat/shards command.
- isPrimary: indicates whether the shard to be queried is a primary shard. If the value in the third column in the output of the _cat/shards command is p, the shard is a primary shard and you need to set this parameter to true. Otherwise, the shard is a replica shard and you need to set this parameter to false.
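
The mapping from a _cat/shards row to the request body can be sketched as a small shell function. This is only an illustration: explain_body is a hypothetical helper name, and it assumes the three arguments are passed exactly as they appear in the first three columns (index, shard, prirep).

```shell
# Hypothetical helper: builds the JSON body for _cluster/allocation/explain
# from the index name, shard ID, and prirep value (p or r) of one row.
# Runs in a subshell, so the variables do not leak into the caller.
explain_body() (
  index="$1"
  shard="$2"
  prirep="$3"
  # p means primary shard; any other value is treated as a replica.
  if [ "$prirep" = "p" ]; then primary=true; else primary=false; fi
  printf '{"index": "%s", "shard": %s, "primary": %s}\n' "$index" "$shard" "$primary"
)

# Possible usage for a row such as "myindex 0 r UNASSIGNED NODE_LEFT":
# curl -XGET --tlsv1.2 --negotiate -k -u : \
#   "https://ip:httpport/_cluster/allocation/explain?pretty" \
#   -H 'Content-Type:application/json' -d "$(explain_body myindex 0 r)"
```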

Analyze the output of the explain command. If the output contains an explanation field, check whether the related information includes a recommended solution. If it does, apply the recommended solution first.
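
The explain output can be long when the cluster has many nodes. A rough way to pull out only the explanation text is sketched below. list_explanations is a hypothetical helper name; the sketch assumes a grep that supports the -o option and matches any JSON string field whose name ends in "explanation", because the exact field names can differ between Elasticsearch versions.

```shell
# Hypothetical helper: reads the explain API response on stdin and prints
# each distinct '"...explanation" : "..."' pair found in it, one per line.
# The pattern tolerates escaped quotes inside the explanation text.
list_explanations() {
  grep -oE '"[a-z_]*explanation" *: *"([^"\\]|\\.)*"' | sort -u
}

# Possible usage (indexname, shardId, and isPrimary as described above):
# curl -XGET --tlsv1.2 --negotiate -k -u : \
#   "https://ip:httpport/_cluster/allocation/explain?pretty" \
#   -H 'Content-Type:application/json' \
#   -d '{"index": "indexname", "shard": shardId, "primary": isPrimary}' \
#   | list_explanations
```
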
Check whether the shards in the UNASSIGNED state in the output of the _cat/shards command are concentrated on a few EsNode instances. If they are, restart these Elasticsearch instances one at a time to trigger shard recovery.