
What Should I Do When a Cluster Is Unavailable?

Issue

A CSS cluster is in the Unavailable state.

Symptom

A cluster is displayed as Unavailable in the Cluster Status column, as shown in the following figure.

Possible Causes

When a cluster becomes unavailable, the CSS backend reports its status to the console. The possible causes are as follows:

  • The cluster is abnormal or faulty.
  • The value of the status parameter is red.

Procedure

  • If you can log in to Kibana or Cerebro:
    1. Log in to Kibana or Cerebro of the target cluster.
    2. Click Dev Tools in the navigation tree on the left of Kibana, as shown in the following figure.
      Figure 1 Clicking Dev Tools
    3. Run the GET _cluster/health?pretty command on the Dev Tools page and check the value of the status parameter.
      Figure 2 Checking the value of the status parameter
      • If the value of status is green, the cluster is normal. The cluster status is checked once a minute, so the status displayed on the console may lag behind the actual status. Wait several minutes and check again. If the cluster is still unavailable, an error has occurred. In this case, submit a service ticket or contact technical support.
      • If the value of status is red, shards are not properly assigned. In this case, note the numbers of initializing and unassigned shards, as shown in the following figure:

        A cluster can be in one of the following three statuses:

        • green: the cluster is normal.
        • yellow: one or more replica shards are abnormal.
        • red: one or more primary shards are abnormal, which usually makes the cluster unavailable.
    4. Run the GET /_cluster/allocation/explain?pretty command to check the reason why shards are unassigned.

      You can also view the number of unassigned or initializing shards in Cerebro, or run the preceding commands on the REST page of Cerebro. For details about how to log in to Cerebro, see Cerebro.

      In the following scenarios, a primary shard of a cluster may be unavailable:

      • Shards are not assigned. Possible reasons are as follows:
        • INDEX_CREATED: The create index API was called. In this case, check whether the disk usage exceeds 85%. If it does, CSS will not assign new shards to the node.
        • CLUSTER_RECOVERED: A full cluster restoration was performed.
        • INDEX_REOPENED: A closed index was reopened.
        • DANGLING_INDEX_IMPORTED: A dangling index was imported.
        • NEW_INDEX_RESTORED: Data was restored to a new index.
        • EXISTING_INDEX_RESTORED: Data was restored to a closed index.
        • REPLICA_ADDED: Replica shards were explicitly added.
        • ALLOCATION_FAILED: Shard assignment failed.
        • NODE_LEFT: The node hosting the shard has left the cluster.
        • REINITIALIZED: The shard moved from the started state back to initializing (for example, when a shadow replica shard was used).
        • REROUTE_CANCELLED: The assignment was cancelled because the reroute was explicitly cancelled.
        • REALLOCATED_REPLICA: A better replica location was determined, so the existing replica assignment was cancelled.
      • Shards are initializing. This situation rarely occurs. Generally, a large translog causes primary shards to stay in the initializing state: when a primary shard starts, it replays the translog file in its folder, and if the file is too large, this process takes a long time.
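The health check in the steps above can also be scripted outside Kibana. The following is a minimal sketch that maps a parsed GET _cluster/health?pretty response to the next troubleshooting step. The sample figures are hypothetical; the status, initializing_shards, and unassigned_shards fields are standard Elasticsearch cluster health fields.

```python
import json

def interpret_cluster_health(health: dict) -> str:
    """Map a _cluster/health response to the next troubleshooting step."""
    status = health.get("status")
    if status == "green":
        return "normal"
    if status == "yellow":
        return "replica shards abnormal"
    if status == "red":
        # Note the shard counts before running the allocation explain API.
        return (f"primary shards abnormal: "
                f"{health.get('initializing_shards', 0)} initializing, "
                f"{health.get('unassigned_shards', 0)} unassigned; "
                "run GET /_cluster/allocation/explain?pretty")
    return "unknown status"

# Hypothetical response body as returned by GET _cluster/health?pretty (truncated).
sample = json.loads("""
{
  "cluster_name": "my-cluster",
  "status": "red",
  "initializing_shards": 2,
  "unassigned_shards": 5
}
""")
print(interpret_cluster_health(sample))
```
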

  • If you cannot log in to Kibana or Cerebro:
    CSS is designed for cluster robustness. When a node is faulty, the node's daemon process attempts to rectify the fault, and an alarm indicating that the node is unavailable is reported only if rectification fails. The possible causes of a rectification failure are as follows:
    1. The cluster load is too heavy and nodes frequently go offline. Go to the Cloud Eye console and view the cluster monitoring metrics, such as the current and historical CPU usage, memory usage, and load. Focus on the trends: check for sharp spikes, or metrics that remain high for a long time. Spikes may be caused by a sudden increase in access to the cluster; you can check the number of HTTP connections to gauge cluster access. If the load, CPU usage, or memory usage is high, the node may go offline.
    2. There are too many shards in the cluster. When shards start, their metadata must be loaded into memory, so a large number of shards drives memory usage up. In addition, when a node goes offline or a new index is created, the master node consumes more resources to calculate shard assignment for the large number of shards, increasing the pressure on the cluster.
    3. If the fault persists after several attempts, submit a service ticket to get technical support.
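To gauge whether a cluster has too many shards, as described in cause 2 above, you can compare the active shard count with the number of data nodes; both are reported by GET _cluster/health. The sketch below uses hypothetical figures, and the threshold is a commonly cited, version-dependent rule of thumb rather than a CSS-documented limit.

```python
def shards_per_node(active_shards: int, data_nodes: int) -> float:
    """Average number of shards each data node must track in memory."""
    if data_nodes <= 0:
        raise ValueError("cluster has no data nodes")
    return active_shards / data_nodes

# Hypothetical figures: 6000 active shards spread across 3 data nodes.
avg = shards_per_node(6000, 3)

# A commonly cited rule of thumb keeps this well below roughly 1000 shards
# per node; far above that, shard metadata and assignment calculations
# strain node memory and the master node.
if avg > 1000:
    print(f"{avg:.0f} shards per node - consider shrinking or deleting indices")
```
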