Cluster unavailable because shards are not properly allocated

Symptom

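“Cluster Status” is “Unavailable”.

On the “Dev Tools” page in Kibana, run GET _cluster/health to check the cluster health. In the response, “status” is “red” and “unassigned_shards” is not 0. Alternatively, on the “Cerebro” visualization page, click “overview” to see how index shards are distributed across the data nodes. The cluster status is red and “unassigned shards” is not 0, which means that some index shards in the cluster cannot be allocated.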

Figure 1 Cluster health status

Figure 2 Cerebro visualization page
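The symptom check above can be expressed as a small predicate. This is only a sketch: the function name `has_unassigned_shards` and the abbreviated sample response are assumptions made for illustration, and only the “status” and “unassigned_shards” fields named in this section are read.

```python
def has_unassigned_shards(health: dict) -> bool:
    """Return True when a _cluster/health response shows the symptom described above."""
    # Both conditions from this section must hold: red status and a non-zero count.
    return health.get("status") == "red" and health.get("unassigned_shards", 0) != 0


# Hypothetical, abbreviated response body used only for illustration.
sample_health = {"status": "red", "unassigned_shards": 4}
print(has_unassigned_shards(sample_health))
```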

Cause analysis

The cluster is unavailable because some index shards have not been allocated.

Procedure

Step 1: Identify why the cluster is unavailable

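Connect to the faulty cluster through Kibana. On the “Dev Tools” page in Kibana, run GET /_recovery?active_only=true to check whether the cluster is recovering replicas:
- If the response resembles “{"index_name":{"shards":[{"id":25,"type":"...”, some indices are still recovering replicas. Wait until recovery finishes. If the cluster status is still “Unavailable”, go to the next step.
- If the response is “{ }”, no replica recovery is in progress. Go to the next step.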
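The two cases above can be told apart programmatically once the response has been parsed into a dictionary. The sketch below assumes the response shape shown in this step (index names as keys, each with a “shards” list); the function name `recovering_indices` is hypothetical.

```python
def recovering_indices(recovery: dict) -> list:
    """Return the names of indices that still list shards in the recovery response."""
    # An empty response ({}) yields an empty list, meaning no recovery is in progress.
    return sorted(name for name, body in recovery.items() if body.get("shards"))


# Hypothetical, abbreviated responses used only for illustration.
print(recovering_indices({}))
print(recovering_indices({"index_name": {"shards": [{"id": 25}]}}))
```

An empty result corresponds to the second case, so the allocation explanation can be queried next.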

Run GET _cluster/allocation/explain?pretty to find out why index shards are unassigned, then match the returned information against the following tables.

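Table 1 Parameter description
Parameter	Description
index	Index name
shard	Shard number
current_state	Current state of the shard
allocate_explanation	Explanation of the shard's allocation status
explanation	Detailed explanation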

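Table 2 Fault descriptions
Symptom	Cause	Handling
“explanation” contains “no allocations are allowed due to cluster setting [cluster.routing.allocation.enable=none]”	The cluster's current allocation policy prohibits allocation of all shards.	See “cluster.routing.allocat...” under “Incorrect shard allocation policy configuration”.
“explanation” contains “too many shards [3] allocated to this node for index [write08], index setting [index.routing.allocation.total_shards_per_node=3]”	The number of shards of a single index allowed on each data node is set too low to satisfy the index's shard allocation requirements.	See “index.routing.allocatio...” under “Incorrect shard allocation policy configuration”.
“explanation” contains “too many shards [31] allocated to this node, cluster setting [cluster.routing.allocation.total_shards_per_node=30]”	The number of shards, across all indices, allowed on each data node is set too low.	See “cluster.routing.allocat...” under “Incorrect shard allocation policy configuration”.
“explanation” contains “node does not match index setting [index.routing.allocation.include] filters [box_type:"hot"]”	The index shards must be placed on data nodes tagged “hot”, but no data node in the cluster carries that tag, so the shards cannot be allocated.	See “index.routing.allocatio...” under “Incorrect shard allocation policy configuration”.
“explanation” contains “node does not match index setting [index.routing.allocation.require] filters [box_type:"xl"]”	The index shards must be placed on data nodes with a specific tag, but no node in the cluster carries that tag, so the shards cannot be allocated.	See “index.routing.allocatio...” under “Incorrect shard allocation policy configuration”.
“explanation” contains “[failed to obtain in-memory shard lock]”	This usually happens when a node briefly leaves the cluster and rejoins immediately while a thread is still running a long operation on a shard, such as a bulk write or a scroll. When the node rejoins, the shard lock has not been released, so the master cannot allocate the shard.	See “Shard lock error”.
“explanation” contains “node does not match index setting [index.routing.allocation.include] filters [_tier_preference:"data_hot OR data_warm OR data_cold"]”	A setting on one of the cluster's indices is not supported by the cluster version.	See “Index setting incompatible with the cluster version”.
“explanation” contains “cannot allocate because all found copies of the shard are either stale or corrupt”	The index shard data in the cluster is corrupted.	See “Corrupted primary shard data”.
“explanation” contains “the node is above the high watermark cluster setting [cluster.routing.allocation.disk.watermark.high=90%], using more disk space than the maximum allowed [90.0%], actual free: [6.976380997419324%]”	The node's disk usage has exceeded the maximum allowed.	See “Disk usage too high”.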
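The lookup in Table 2 amounts to substring matching on the “explanation” text. The following is a minimal sketch of that matching, not part of any Elasticsearch tooling: the rule list, the function name `classify`, and the returned labels are all hypothetical, and the substrings are taken from the messages quoted in the table.

```python
# (substring to look for, fault category from Table 2)
# Order matters: the _tier_preference rule must come before the generic
# "include" rule, because both messages mention index.routing.allocation.include.
RULES = [
    ("cluster.routing.allocation.enable=none", "cluster.routing.allocation.enable"),
    ("index.routing.allocation.total_shards_per_node", "index.routing.allocation.total_shards_per_node"),
    ("cluster.routing.allocation.total_shards_per_node", "cluster.routing.allocation.total_shards_per_node"),
    ("_tier_preference", "index setting incompatible with cluster version"),
    ("index.routing.allocation.include", "index.routing.allocation.include"),
    ("index.routing.allocation.require", "index.routing.allocation.require"),
    ("failed to obtain in-memory shard lock", "shard lock error"),
    ("stale or corrupt", "corrupted primary shard data"),
    ("above the high watermark", "disk usage too high"),
]


def classify(explanation: str) -> str:
    """Map an allocation explanation message to a fault category, or 'unknown'."""
    for needle, category in RULES:
        if needle in explanation:
            return category
    return "unknown"


print(classify("[failed to obtain in-memory shard lock]"))
```

Messages that match none of the rules return “unknown” and still need to be read manually.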

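Step 2: Handle the fault according to the symptom

Incorrect shard allocation policy configuration
- cluster.routing.allocation.enable parameter
  1. If “explanation” in the response is as shown below, the cluster's current allocation policy prohibits allocation of all shards.
    Figure 3 Incorrect allocation.enable setting
  2. On the “Dev Tools” page in Kibana, run the following command to set “enable” to “all” so that all shards can be allocated.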
```
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation.enable": "all"
      }
    }
  }
}
```
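    An index-level setting overrides the cluster-level setting. The parameter values have the following meanings:
    
    all - (Default) Allocation is allowed for all kinds of shards.
    
    primaries - Only primary shards can be allocated.
    
    new_primaries - Only primary shards of newly created indices can be allocated.
    
    none - No shards can be allocated.
  3. Run POST _cluster/reroute?retry_failed=true to trigger shard allocation manually. Wait until the index shards are allocated and the cluster status becomes available.
- index.routing.allocation.total_shards_per_node parameter
  1. If “explanation” in the response is as shown below, the value of “index.routing.allocation.total_shards_per_node” is too small to satisfy the index's shard allocation requirements.
    Figure 4 Incorrect index total_shards_per_node setting
  2. On the “Dev Tools” page in Kibana, run the following command to change the number of shards of the index allowed on each node. Replace the example value with one that suits your cluster.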
```
PUT index_name/_settings
{
  "index": {
    "routing": {
      "allocation.total_shards_per_node": 3
    }
  }
}
```
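    Value of “index.routing.allocation.total_shards_per_node” = number of shards of index_name / (number of data nodes - 1)
    
    Set the value slightly higher than the bare minimum. Suppose a cluster has 10 nodes (5 data nodes, 2 client nodes, and 3 master nodes) and an index has 30 shards. If total_shards_per_node is set to 4, at most 4 × 5 = 20 shards can be allocated, so some shards remain unassigned. Spreading 30 shards over 5 data nodes requires at least 6 shards per node. To tolerate the loss of one data node, apply the formula above: 30 / (5 - 1) = 7.5, so each node should be allowed at least 8 shards.
  3. Run POST _cluster/reroute?retry_failed=true to trigger shard allocation manually. Wait until the index shards are allocated and the cluster status becomes available.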
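The sizing rule for “index.routing.allocation.total_shards_per_node” can be checked with a few lines of arithmetic. This sketch assumes the quotient is rounded up to the next integer, which is what the worked example (7.5 becoming 8) implies; both function names are hypothetical.

```python
import math


def min_total_shards_per_node(index_shards: int, data_nodes: int) -> int:
    """Apply the formula above: shards / (data nodes - 1), rounded up."""
    if data_nodes < 2:
        raise ValueError("the formula needs at least two data nodes")
    return math.ceil(index_shards / (data_nodes - 1))


def allocatable_shards(total_shards_per_node: int, data_nodes: int) -> int:
    """Upper bound on how many shards of the index the data nodes can hold."""
    return total_shards_per_node * data_nodes


print(allocatable_shards(4, 5))          # 20, fewer than the 30 shards needed
print(min_total_shards_per_node(30, 5))  # 8
```

With the numbers from the example, a limit of 4 leaves 10 shards unassigned, while a limit of 8 still fits all 30 shards on the 4 remaining data nodes if one fails.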
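- cluster.routing.allocation.total_shards_per_node parameter
  1. If “explanation” in the response is as shown below, the number of shards the cluster allows on each data node is set too low.
    Figure 5 Incorrect cluster total_shards_per_node setting
  2. “cluster.routing.allocation.total_shards_per_node” limits the number of shards that can be allocated to each data node in the cluster. Its default value is “1000”. On the “Dev Tools” page in Kibana, run the following command to set “cluster.routing.allocation.total_shards_per_node”.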
```
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation.total_shards_per_node": 1000
      }
    }
  }
}
```
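  3. In most cases this problem comes from using the wrong parameter: a value meant for “index.routing.allocation.total_shards_per_node” was set on “cluster.routing.allocation.total_shards_per_node” instead. Run the following command to set “index.routing.allocation.total_shards_per_node”: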
```
PUT index_name/_settings
{
  "index": {
    "routing": {
      "allocation.total_shards_per_node": 30
    }
  }
}
```
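    Both parameters limit the maximum number of shards that can be allocated to a single data node.
    
    “cluster.routing.allocation.total_shards_per_node” is the cluster-level limit.
    
    “index.routing.allocation.total_shards_per_node” is the index-level limit.
  4. Run POST _cluster/reroute?retry_failed=true to trigger shard allocation manually. Wait until the index shards are allocated and the cluster status becomes available.
- index.routing.allocation.include parameter
  1. If “explanation” in the response is as shown below, the index shards must be placed on data nodes tagged “hot”, but no data node in the cluster carries that tag, so the shards cannot be allocated.
    Figure 6 Incorrect include setting
  2. On the “Dev Tools” page in Kibana, run the following command to remove the setting: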
```
PUT index_name/_settings
{
  "index.routing.allocation.include.box_type": null
}
```
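  3. Run POST _cluster/reroute?retry_failed=true to trigger shard allocation manually. Wait until the index shards are allocated and the cluster status becomes available.
- index.routing.allocation.require parameter
  1. If “explanation” in the response is as shown below, the shards must be placed on data nodes with a specific tag, but no node in the cluster carries that tag, so the shards cannot be allocated.
    Figure 7 Incorrect require setting
  2. On the “Dev Tools” page in Kibana, run the following command to remove the setting: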
```
PUT index_name/_settings
{
  "index.routing.allocation.require.box_type": null
}
```
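  3. Run POST _cluster/reroute?retry_failed=true to trigger shard allocation manually. Wait until the index shards are allocated and the cluster status becomes available.
Shard lock error
1. “explanation” in the response contains “[failed to obtain in-memory shard lock]”. This usually happens when a node briefly leaves the cluster and rejoins immediately while a thread is still running a long operation on a shard, such as a bulk write or a scroll. When the node rejoins, the shard lock has not been released, so the master cannot allocate the shard.
2. This condition does not cause shard data loss. You only need to trigger allocation again. On the “Dev Tools” page in Kibana, run POST /_cluster/reroute?retry_failed=true to allocate the unassigned shards manually. Wait until the index shards are allocated and the cluster status becomes available.
Index setting incompatible with the cluster version
1. If the response contains the index name “index” and the explanation “explanation”: “node does not match index setting [index.routing.allocation.include] filters [_tier_preference:"data_hot OR data_warm OR data_cold"]”, a setting on one of the cluster's indices does not match the nodes.
  Figure 8 Index setting mismatch
2. Run GET index_name/_settings to view the index settings, and check whether the response contains index features that the cluster's version does not support.
  Figure 9 Index settings
  
  Take index.routing.allocation.include._tier_preference as an example. The cluster in this example runs version 7.9.3, but this index feature is supported only in version 7.10 and later. On an earlier version, shards of an index that uses the feature cannot be allocated, which makes the cluster unavailable.
3. Determine whether the cluster must use the unsupported feature.
  - If yes, create a cluster whose version supports the required index feature, then move the data from the old cluster to the new cluster through backup and restore.
  - If no, go to the next step.
4. Run the following command to remove the index feature that the cluster version does not support.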
```
PUT /index_name/_settings
{
  "index.routing.allocation.include._tier_preference": null
}
```
5. Run POST /_cluster/reroute?retry_failed=true to allocate the unassigned shards manually. Wait until the index shards are allocated and the cluster status becomes available.
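The version check in step 2 of this procedure can be sketched as a simple comparison. The function name `supports_tier_preference` is hypothetical, and the sketch assumes version strings in the “major.minor.patch” form used in the example (7.9.3), with 7.10 treated as the first supporting version.

```python
def supports_tier_preference(version: str) -> bool:
    """Return True if the version is 7.10 or later, per the example in this section."""
    # Compare (major, minor) as integers so that 7.10 sorts after 7.9.
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (7, 10)


print(supports_tier_preference("7.9.3"))
print(supports_tier_preference("7.10.2"))
```

Comparing the parts as integers matters here: a plain string comparison would wrongly rank “7.10” below “7.9”.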
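Corrupted primary shard data
1. If the response contains the index name “index”, the shard number “shard”, the allocation explanation “allocate_explanation”, and a “store_exception” whose “type” is “corrupt_index_exception”, the data of one shard of that index is corrupted.
  Figure 10 Corrupted index data
2. When index data is corrupted, or both the primary and replica copies of a shard are lost, the way to bring the cluster back to green is to allocate an empty shard. Run the following command to allocate an empty primary shard and specify the node it is assigned to.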
```
POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "index_name",
        "shard": 2,
        "node": "node_name",
        "accept_data_loss": true
      }
    }
  ]
}
```
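  Use this operation with extreme caution: it completely erases the data of the affected shard.
3. After the index shard is reallocated, the cluster status becomes available again.
Disk usage too high
1. If the response is as shown below, “allocate_explanation” indicates that the index's shards cannot be allocated to any data node, and “explanation” indicates that the node's disk usage has exceeded the maximum allowed.
  Figure 11 Result of the explain query
  - Disk usage above 85%: new shards cannot be allocated.
  - Disk usage above 90%: the cluster tries to relocate shards from that node to data nodes with lower disk usage. If relocation is not possible, the system forcibly sets the “read_only_allow_delete” attribute on every index in the cluster. Data can then no longer be written to the indices; they can only be read or deleted.
  - When disk usage is too high, nodes may drop out of the cluster. Even after such a node recovers automatically, the cluster may be under so much load that it does not respond when the monitoring system calls the Elasticsearch API to query the cluster status. The status then cannot be refreshed in time, and the cluster is reported as unavailable.
2. Increase the available disk capacity of the cluster.
  - On the “Dev Tools” page in Kibana, run DELETE index_name to delete data that is no longer needed and free up disk space.
  - Temporarily reduce the number of index replicas, then restore it after the disk capacity or the number of nodes has been expanded.
    1. On the “Dev Tools” page in Kibana, run the following command to temporarily reduce the number of index replicas.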
```
PUT index_name/_settings
{
  "number_of_replicas": 1
}
```
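      If the response is as shown below:
      Figure 12 Index in read-only-allow-delete state
      
      the disk usage has exceeded the maximum allowed and “read_only_allow_delete” has been forcibly set on all indices in the cluster. Run the following command to set the attribute to “null”, then rerun the command above to reduce the number of index replicas.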
```
PUT /_settings
{
  "index.blocks.read_only_allow_delete": null
}
```
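    2. Expand the cluster by adding nodes or increasing node storage capacity. For details, see “Scaling”.
    3. After the expansion completes, run the replica settings command from the first sub-step again with the original replica count. Once all index shards are allocated, the cluster status becomes available.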
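The disk thresholds described in step 1 of this procedure can be summarized in a short sketch. It uses only the 85% and 90% values stated in this section; the constant names, the function names, and the returned strings are hypothetical, and a real cluster may be configured with different watermarks.

```python
LOW_WATERMARK = 85.0   # above this, new shards are not allocated to the node
HIGH_WATERMARK = 90.0  # above this, the cluster tries to move shards away


def used_percent(actual_free_percent: float) -> float:
    """Convert the 'actual free' percentage from the explanation into disk usage."""
    return 100.0 - actual_free_percent


def disk_effect(used: float) -> str:
    """Describe the effect of a node's disk usage, following the thresholds above."""
    if used > HIGH_WATERMARK:
        return "relocate shards; indices may become read_only_allow_delete"
    if used > LOW_WATERMARK:
        return "no new shards allocated"
    return "normal"


# 'actual free' value quoted in the example explanation in Table 2.
print(disk_effect(used_percent(6.976380997419324)))
```

For the example message, roughly 93% of the disk is in use, which is above the 90% threshold.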