ES Error: failed shard on node [xxxxxx]: failed recovery

Today a node in the ES cluster went down and the cluster state suddenly turned red. After restarting the node I waited a long time, but several shards still could not be recovered. To find out why, I ran the following command:

curl -XGET localhost:9200/_cluster/allocation/explain?pretty        
{
  "index" : "twitter",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2018-11-06T06:11:15.562Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [CxXWE8BiQbS4ThB9AvvGQA]: failed recovery, failure RecoveryFailedException[[twitter][0]: Recovery failed on {node-1}{CxXWE8BiQbS4ThB9AvvGQA}{yYDvXMKnS9KhaIlzPEsJNg}{10.142.0.2}{10.142.0.2:9300}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp\"))]; ",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "CxXWE8BiQbS4ThB9AvvGQA",
      "node_name" : "node-1",
      "transport_address" : "10.142.0.2:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "gxegPAMyQa21MH5NxQEACw"
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-11-06T06:11:15.562Z], failed_attempts[5], delayed=false, details[failed shard on node [CxXWE8BiQbS4ThB9AvvGQA]: failed recovery, failure RecoveryFailedException[[twitter][0]: Recovery failed on {node-1}{CxXWE8BiQbS4ThB9AvvGQA}{yYDvXMKnS9KhaIlzPEsJNg}{10.142.0.2}{10.142.0.2:9300}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp\"))]; ], allocation_status[deciders_no]]]"
        }
      ]
    }
  ]
}

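The cause is that recovery of this shard on the node failed five times, which is the maximum number of retries reported in the output, so Elasticsearch stopped trying and left the shard unassigned. The underlying failure shown in the details is a CorruptIndexException on the translog checkpoint file translog-1228.ckp, which has length 0. The shard will not be recovered until a retry is requested manually.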
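To pick the blocking decider out of a long response like the one above, the JSON can be inspected programmatically. The following is only a minimal sketch: the sample is a trimmed copy of the response shown above, and the function names blocking_deciders and needs_manual_retry are hypothetical helpers, not part of any Elasticsearch client.

```python
import json

# Trimmed copy of the allocation-explain response shown above.
SAMPLE = json.loads("""
{
  "index": "twitter",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "ALLOCATION_FAILED",
    "failed_allocation_attempts": 5
  },
  "can_allocate": "no",
  "node_allocation_decisions": [
    {
      "node_name": "node-1",
      "node_decision": "no",
      "deciders": [{"decider": "max_retry", "decision": "NO"}]
    }
  ]
}
""")


def blocking_deciders(explain):
    """Return (node_name, decider) pairs for every decider that answered NO."""
    blocked = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                blocked.append((node.get("node_name"), decider.get("decider")))
    return blocked


def needs_manual_retry(explain):
    """True when the max_retry decider is among those blocking allocation."""
    return any(name == "max_retry" for _, name in blocking_deciders(explain))


print(blocking_deciders(SAMPLE))   # [('node-1', 'max_retry')]
print(needs_manual_retry(SAMPLE))  # True
```

When max_retry is the only decider answering NO, as it is here, the shard is waiting for a manual retry rather than being blocked by disk or allocation-filter rules.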

Solution:

POST /_cluster/reroute?retry_failed=true

This asks the cluster to retry allocation of the failed shards. In my case the cluster returned to green soon afterwards.
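The retry and the follow-up health check can also be scripted. The sketch below uses only the Python standard library and assumes the cluster is reachable without authentication at localhost:9200, as in the curl command earlier. The helper names, the number of attempts and the delay are illustrative assumptions, not recommended values.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: same host as the curl call


def retry_failed_request(base_url=BASE_URL):
    """Build the POST request that asks the cluster to retry failed allocations."""
    url = base_url + "/_cluster/reroute?retry_failed=true"
    return urllib.request.Request(url, method="POST")


def is_green(health):
    """Check the 'status' field of a _cluster/health response."""
    return health.get("status") == "green"


def wait_for_green(base_url=BASE_URL, attempts=30, delay=2.0):
    """Poll _cluster/health until it reports green or the attempts run out."""
    for _ in range(attempts):
        with urllib.request.urlopen(base_url + "/_cluster/health") as resp:
            if is_green(json.load(resp)):
                return True
        time.sleep(delay)
    return False


def retry_and_wait(base_url=BASE_URL):
    """Send the retry, then report whether the cluster reached green."""
    with urllib.request.urlopen(retry_failed_request(base_url)) as resp:
        resp.read()  # the reroute response body is not needed here
    return wait_for_green(base_url)
```

Calling retry_and_wait() returns True once the health status is green. If it returns False, running the allocation explain command again should show which shard is still failing and why.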