A Cluster Is Stuck in the Creating Snapshot State
If a cluster's task status is stuck in Creating snapshot, there are three possible causes:
- The cluster is under heavy load or contains large volumes of data, and snapshots take longer to create
When a cluster is under heavy load (indexing or query) or contains large volumes of data, snapshots take significantly longer to create. By default, both Elasticsearch and OpenSearch cap the per-node snapshot speed (typically 40 MB/s). The overall snapshot speed also depends on the resources available in the cluster, so snapshot creation slows down further under heavy load.
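To get a feel for how the per-node speed limit translates into snapshot duration, the following minimal sketch computes a rough lower bound. It assumes the data is evenly spread across data nodes and that every node sustains the full limit, which a loaded cluster will not. The function name and parameters are illustrative and are not part of any CSS, Elasticsearch, or OpenSearch API.

```python
def min_snapshot_seconds(data_mb: float, data_nodes: int,
                         per_node_mb_s: float = 40.0) -> float:
    """Rough lower bound on snapshot time.

    Assumptions (not from the product documentation): data is evenly
    distributed, every node uploads at the full per-node limit, and all
    data must be copied (incremental snapshots usually copy less).
    """
    if data_nodes <= 0 or per_node_mb_s <= 0:
        raise ValueError("data_nodes and per_node_mb_s must be positive")
    return data_mb / (data_nodes * per_node_mb_s)


# Example: 1,296,000 MB spread over 3 data nodes at 40 MB/s per node.
seconds = min_snapshot_seconds(1_296_000, 3)
print(seconds / 3600)  # 3.0 (hours)
```

The real duration is normally longer than this bound, because the cluster shares its resources with indexing and query traffic.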
Solution:
- Query the snapshot progress: Use an API to check the snapshot progress and estimate the remaining time.
GET _snapshot/repo_auto/{snapshot_name}
Replace {snapshot_name} with the actual snapshot name.
In the response, check the shards_stats field (or a similar field, depending on the version) to learn the number of completed shards and the total number of shards.
- Choose a solution.
- Wait: If the estimated remaining time is acceptable or the cluster load is expected to decrease, you can wait for the snapshot to complete.
- Terminate: If the estimated remaining time is too long, call the snapshot deletion API to terminate the task.
DELETE _snapshot/repo_auto/{snapshot_name}
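The remaining time can be estimated from the shard counts returned by the progress query. The sketch below is one possible approach, assuming shards complete at a roughly constant rate. The field names it reads (`shards_stats` with `done` and `total`, or `shards` with `successful` and `total`) are assumptions about the response layout and may differ by version. Both function names are hypothetical.

```python
from typing import Optional, Tuple


def shard_progress(snapshot: dict) -> Tuple[int, int]:
    """Return (completed_shards, total_shards) from one snapshot entry.

    Assumed layouts: {"shards_stats": {"done": n, "total": m}} or
    {"shards": {"successful": n, "total": m}}.
    """
    stats = snapshot.get("shards_stats")
    if stats is not None:
        return stats["done"], stats["total"]
    shards = snapshot["shards"]
    return shards["successful"], shards["total"]


def estimate_remaining_seconds(done: int, total: int,
                               elapsed_seconds: float) -> Optional[float]:
    """Linear estimate of the remaining time; None if no shard is done yet."""
    if done <= 0 or elapsed_seconds <= 0:
        return None  # not enough information to extrapolate
    rate = done / elapsed_seconds  # shards completed per second
    return (total - done) / rate


# Example: 30 of 120 shards finished after 30 minutes.
done, total = shard_progress({"shards_stats": {"done": 30, "total": 120}})
print(estimate_remaining_seconds(done, total, 1800))  # 5400.0 (seconds)
```

Shards vary in size, so treat the result as a rough guide when choosing between waiting and terminating the task.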
- The snapshot status failed to update
Both Elasticsearch and OpenSearch store the status of ongoing snapshots in the cluster state. After a snapshot is created, its task status must be updated in the cluster state. If the update fails (for example, because the cluster state update request is rejected by the thread pool or times out due to high memory usage in the cluster), the snapshot status in the cluster state remains stuck in the Creating state.
Solution:
Call the snapshot deletion API to delete the snapshot task. This will clear the residual status.
DELETE _snapshot/repo_auto/{snapshot_name}
- Temporary access credentials (AK/SK) have expired
When creating a snapshot repository, CSS obtains a pair of temporary access keys (AK/SK) through an agency mechanism and associates them with the repository to allow access to the designated OBS bucket. These temporary AK/SKs are typically valid for 24 hours.
If the snapshot creation task outlasts the validity period of the AK/SK, it fails with an authentication error. After the AK/SK expires, all operations on the snapshot repository (including querying the repository, deleting snapshots, and deleting the repository) also fail authentication. As a result, the status information of the failed snapshot (and possibly the associated repository information) remains in the cluster state and cannot be cleared using a regular API.
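Whether a running snapshot is at risk can be checked by comparing its expected finish time with the credential expiry time. The sketch below assumes the 24-hour validity mentioned above and assumes that you know when the credentials were issued (for example, the repository creation time). The function name and its parameters are illustrative only.

```python
from datetime import datetime, timedelta


def outlasts_credentials(issued_at: datetime, now: datetime,
                         remaining_seconds: float,
                         validity: timedelta = timedelta(hours=24)) -> bool:
    """Return True if the snapshot is expected to finish after the AK/SK expires.

    Assumption: the credentials are valid for `validity` from `issued_at`.
    """
    expires_at = issued_at + validity
    expected_finish = now + timedelta(seconds=remaining_seconds)
    return expected_finish > expires_at


issued = datetime(2024, 1, 1, 0, 0)
now = datetime(2024, 1, 1, 22, 0)
print(outlasts_credentials(issued, now, 3 * 3600))  # True: finishes after expiry
print(outlasts_credentials(issued, now, 3600))      # False: finishes in time
```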
Solution:
Perform a Quick Restart to clear the residual snapshot status information in the cluster state. A Rolling Restart will not work, because it relies on coordination between nodes and on the cluster state being correct. Only a Quick Restart reloads the persistent cluster state and thereby clears the invalid snapshot status information.