
What Should I Do When a Cluster Is Unavailable?

Issue

A CSS cluster is in the Unavailable state.

Symptom

A cluster is displayed as Unavailable in the Cluster Status column, as shown in the following figure.

Possible Causes

When a cluster becomes unavailable, the CSS backend reports its status to the console. The possible causes are as follows:

  • The cluster is abnormal or faulty.
  • The value of the status parameter is red.

Procedure

  • If you can log in to Kibana or Cerebro:
    1. Log in to Kibana or Cerebro of the target cluster.
    2. Click Dev Tools in the navigation tree on the left of Kibana, as shown in the following figure.
      Figure 1 Clicking Dev Tools
    3. Run the GET _cluster/health?pretty command on the Dev Tools page and check the value of the status parameter.
      Figure 2 Checking the value of the status parameter
      • If the value of status is green, the cluster is normal. The cluster status is checked once a minute, so the status displayed on the console may lag behind the actual status. Wait several minutes and check again. If the cluster is still unavailable, an error has occurred. In this case, submit a service ticket or contact technical support.
      • If the value of status is red, shards are not properly assigned. In this case, note the numbers of initializing and unassigned shards, as shown in the following figure:

        A cluster can be in one of the following three statuses:

        • green: the cluster is normal.
        • yellow: one or more replica shards are abnormal.
        • red: one or more primary shards are abnormal, which usually makes the cluster unavailable.
    4. Run the GET /_cluster/allocation/explain?pretty command to check the reason why shards are unassigned.

      You can also view the number of unassigned or initializing shards in Cerebro, or run the preceding commands on the REST page of Cerebro. For details about how to log in to Cerebro, see Cerebro.

      In the following scenarios, a primary shard of a cluster may be unavailable:

      • Shards are not assigned. Possible reasons are as follows:
        • INDEX_CREATED: The create index API was called. In this case, check whether the disk usage exceeds 85%. If it does, CSS will not assign new shards to the node.
        • CLUSTER_RECOVERED: A full cluster restoration was performed.
        • INDEX_REOPENED: A closed index was reopened.
        • DANGLING_INDEX_IMPORTED: A dangling index was imported.
        • NEW_INDEX_RESTORED: Data was restored to a new index.
        • EXISTING_INDEX_RESTORED: Data was restored to a closed index.
        • REPLICA_ADDED: Replica shards were explicitly added.
        • ALLOCATION_FAILED: Shard assignment failed.
        • NODE_LEFT: The node hosting the shard has left the cluster.
        • REINITIALIZED: The shard moved from the started state back to initializing (for example, when a shadow replica shard was used).
        • REROUTE_CANCELLED: The assignment was cancelled because the reroute was explicitly cancelled.
        • REALLOCATED_REPLICA: A better replica location was determined, so the existing replica assignment was cancelled.
      • Shards are initializing. This situation rarely occurs. Generally, a large translog causes primary shards to stay in the initializing state: when a primary shard starts, it replays the translog file in its folder, and if the file is too large, this process takes a long time.
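The health check in the steps above can also be scripted outside Kibana. The following is a minimal sketch that maps a parsed GET _cluster/health?pretty response to the next troubleshooting step. The sample figures are hypothetical; the status, initializing_shards, and unassigned_shards fields are standard Elasticsearch cluster health fields.

```python
import json

def interpret_cluster_health(health: dict) -> str:
    """Map a _cluster/health response to the next troubleshooting step."""
    status = health.get("status")
    if status == "green":
        return "normal"
    if status == "yellow":
        return "replica shards abnormal"
    if status == "red":
        # Note the shard counts before running the allocation explain API.
        return (f"primary shards abnormal: "
                f"{health.get('initializing_shards', 0)} initializing, "
                f"{health.get('unassigned_shards', 0)} unassigned; "
                "run GET /_cluster/allocation/explain?pretty")
    return "unknown status"

# Hypothetical response body as returned by GET _cluster/health?pretty (truncated).
sample = json.loads("""
{
  "cluster_name": "my-cluster",
  "status": "red",
  "initializing_shards": 2,
  "unassigned_shards": 5
}
""")
print(interpret_cluster_health(sample))
```
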

  • If you cannot log in to Kibana or Cerebro:
    CSS is designed for cluster robustness. When a node is faulty, the node's daemon process attempts to rectify the fault, and an alarm indicating that the node is unavailable is reported only if rectification fails. The possible causes of a rectification failure are as follows:
    1. The cluster load is too heavy and nodes frequently go offline. Go to the Cloud Eye console and view the cluster monitoring metrics, such as the current and historical CPU usage, memory usage, and load. Focus on the trends: check for sharp spikes, or metrics that remain high for a long time. Spikes may be caused by a sudden increase in access to the cluster; you can check the number of HTTP connections to gauge cluster access. If the load, CPU usage, or memory usage is high, the node may go offline.
    2. There are too many shards in the cluster. When shards start, their metadata must be loaded into memory, so a large number of shards drives memory usage up. In addition, when a node goes offline or a new index is created, the master node consumes more resources to calculate shard assignment for the large number of shards, increasing the pressure on the cluster.
    3. If the fault persists after several attempts, submit a service ticket to get technical support.
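To gauge whether a cluster has too many shards, as described in cause 2 above, you can compare the active shard count with the number of data nodes; both are reported by GET _cluster/health. The sketch below uses hypothetical figures, and the threshold is a commonly cited, version-dependent rule of thumb rather than a CSS-documented limit.

```python
def shards_per_node(active_shards: int, data_nodes: int) -> float:
    """Average number of shards each data node must track in memory."""
    if data_nodes <= 0:
        raise ValueError("cluster has no data nodes")
    return active_shards / data_nodes

# Hypothetical figures: 6000 active shards spread across 3 data nodes.
avg = shards_per_node(6000, 3)

# A commonly cited rule of thumb keeps this well below roughly 1000 shards
# per node; far above that, shard metadata and assignment calculations
# strain node memory and the master node.
if avg > 1000:
    print(f"{avg:.0f} shards per node - consider shrinking or deleting indices")
```
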