Updated on 2022-08-31 GMT+08:00

What Do I Do If My Cluster Status Is Unavailable?

Symptom

The status of a CSS cluster is Unavailable.

Possible Causes

The CSS backend reports the cluster status to the console as Unavailable. The possible causes are as follows:

- The cluster is abnormal or faulty.
- The cluster background status is red.

Procedure

Check whether you can log in to Kibana.

If you can log in to Kibana, perform the following steps:

1. In the Operation column of the unavailable cluster, click Access Kibana.
2. In the navigation pane of Kibana, click Dev Tools.
3. Run the following command in Dev Tools to view the background status of the cluster:
```
GET _cluster/health?pretty
```
Figure 1 Viewing the cluster status

There are three possible background statuses of an Elasticsearch cluster:
- green: the cluster status is normal.
  The background cluster status is checked once every minute, so the cluster status on the Clusters page is not updated in real time. Wait for several minutes and check whether the cluster status changes to normal. If the status is still Unavailable, contact technical support.
- yellow: the replica shards of the cluster are abnormal.
  initializing_shards indicates the number of shards that are being initialized. unassigned_shards indicates the number of shards that have not been allocated.
- red: the primary shards of the cluster are abnormal.
  initializing_shards indicates the number of shards that are being initialized. unassigned_shards indicates the number of shards that have not been allocated.
Figure 2 Viewing shard information
If shards are being initialized, check whether the translog file is too large. When a primary shard starts, the translog file in its folder is loaded automatically, and a large translog file takes longer to load. Wait for about 10 minutes and check the cluster status again. If the status is still Unavailable, contact technical support.
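The triage described in this procedure can be sketched as a small Python function. This is only an illustration, not part of CSS or Elasticsearch: the function name `triage_health` and the action labels are hypothetical, and the input is assumed to be the JSON returned by `GET _cluster/health`, already parsed into a dictionary.

```python
def triage_health(health):
    """Map a parsed _cluster/health response to the next steps in this procedure.

    The action labels returned here are hypothetical names for illustration.
    """
    if health.get("status") == "green":
        # The console polls the background status about once a minute.
        return ["wait_for_console_refresh"]

    actions = []
    if health.get("initializing_shards", 0) > 0:
        # Shards are still starting; a large translog may slow this down.
        actions.append("wait_for_initialization")
    if health.get("unassigned_shards", 0) > 0:
        # Ask the cluster why shards are not allocated.
        actions.append("explain_allocation")

    # Yellow or red with nothing initializing or unassigned: escalate.
    return actions or ["contact_support"]


sample = {"status": "red", "initializing_shards": 2, "unassigned_shards": 3}
print(triage_health(sample))
```

The function returns a list because a yellow or red cluster can have shards that are initializing and shards that are unassigned at the same time.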
If there are shards not allocated, perform the following steps:
1. Run the following command to check the reason:
```
GET /_cluster/allocation/explain?pretty
```
  Possible reasons:
  - INDEX_CREATED: An API for creating an index is called. If the disk usage exceeds 85%, CSS will not assign new shards to the node. In this case, release storage space on the node.
  - CLUSTER_RECOVERED: A full cluster recovery is performed.
  - INDEX_REOPENED: A closed index is opened.
  - DANGLING_INDEX_IMPORTED: A dangling index is imported.
  - NEW_INDEX_RESTORED: Data is restored to a new index.
  - EXISTING_INDEX_RESTORED: Data is restored to a closed index.
  - REPLICA_ADDED: Replica shards are added explicitly.
  - ALLOCATION_FAILED: Shard assignment failed.
  - NODE_LEFT: The node that carries the shards is not in the cluster now.
  - REINITIALIZED: A shard moved from the started state back to the initializing state (for example, when shadow replica shards are used).
  - REROUTE_CANCELLED: The assignment is canceled by an explicit cancel reroute command.
  - REALLOCATED_REPLICA: A better replica location is identified, so the existing replica assignment is canceled.
2. Run the following command to re-allocate shards:
```
POST /_cluster/reroute?retry_failed=true
```
  Wait for about 15 minutes. If the cluster status changes to Available, the fault has been rectified. Otherwise, perform the next step.
3. If the reallocation fails because the shards are damaged and cannot be started, run the following command to allocate an empty primary shard to the cluster. Any data in the original shard is lost, which is why accept_data_loss must be set to true:
```
POST _cluster/reroute
{
    "commands": [
        {
            "allocate_empty_primary": {
                "index": "index-test", // Index name
                "shard": 13, // Shard number
                "node": "css-test-ess-esn-11-1", // Node name
                "accept_data_loss": true
            }
        }
    ]
}
```
  Wait for about 15 minutes. If the cluster status changes to Available, the fault has been rectified. Otherwise, contact technical support.
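The steps above can also be sketched in Python, which may help when the reroute request body is generated by a script. This is a minimal sketch under stated assumptions: the helper names `unassigned_reason` and `empty_primary_body` are hypothetical, the `explain` dictionary is a hand-written sample shaped like a `GET /_cluster/allocation/explain` response, and the node name is a placeholder. The code only builds data structures; it does not send anything to a cluster.

```python
import json


def unassigned_reason(explain):
    """Return the reason code from a parsed allocation explain response."""
    return explain.get("unassigned_info", {}).get("reason", "UNKNOWN")


def empty_primary_body(index, shard, node):
    """Build the request body for POST _cluster/reroute (step 3).

    Allocating an empty primary discards the shard's data, so this should
    only be used after retry_failed=true (step 2) has not helped.
    """
    return {
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


# Hand-written sample input, not real cluster output.
explain = {
    "index": "index-test",
    "shard": 13,
    "primary": True,
    "current_state": "unassigned",
    "unassigned_info": {"reason": "ALLOCATION_FAILED"},
}

print(unassigned_reason(explain))
body = empty_primary_body(explain["index"], explain["shard"], "example-node-1")
print(json.dumps(body, indent=2))
```

The printed JSON contains no comments, so it can be pasted into Dev Tools as the body of the reroute request after the placeholder node name is replaced with a real one.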

If you cannot log in to Kibana, perform the following steps:

If a node is faulty, CSS first starts the node daemon process to rectify the fault. If the rectification fails, CSS will report that the node is unavailable.

The following faults can cause the rectification to fail:

- The network between nodes (for example, eth1 and eth1, and eth2 and eth2) is faulty, and nodes cannot ping each other.
  Check the network between nodes.
- Heavy cluster load frequently causes nodes to go down.
  Locate the unavailable cluster and click More > View Metric in the Operation column to view its current and historical CPU, memory, and load usage. Check whether these metrics increased sharply or remained high for a long time. Such surges may be caused by a sudden increase in access to the cluster. You can view the number of HTTP connections to gauge cluster access. Nodes with high load, CPU, or memory usage may go offline.
- Too many shards (more than 50,000) exist, so the cluster cannot be started. When shards are started, their metadata is loaded into memory, so a large number of shards requires a large amount of memory. In addition, if a node goes offline or a new index is created, the master node has to use more computing resources to re-allocate such a large number of shards.
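As a preventive check while the cluster is still reachable, the shard count can be compared against the 50,000 figure mentioned above. The following Python sketch is an assumption-laden illustration: the function names and the `SHARD_LIMIT` constant are hypothetical, and the input is assumed to be a parsed `GET _cluster/health` response, whose `active_shards`, `initializing_shards`, and `unassigned_shards` fields are summed to estimate the total.

```python
SHARD_LIMIT = 50_000  # threshold quoted in this article, not an Elasticsearch setting


def total_shards(health):
    """Estimate the total shard count from a parsed _cluster/health response."""
    return (
        health.get("active_shards", 0)
        + health.get("initializing_shards", 0)
        + health.get("unassigned_shards", 0)
    )


def exceeds_shard_limit(health, limit=SHARD_LIMIT):
    """Return True if the cluster holds more shards than the given limit."""
    return total_shards(health) > limit


# Hand-written sample values, not real cluster output.
sample = {"active_shards": 49_990, "initializing_shards": 5, "unassigned_shards": 10}
print(total_shards(sample), exceeds_shard_limit(sample))
```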