Updated on 2026-01-09 GMT+08:00

Logstash Write Failures

Symptom

Logstash is a data processing pipeline that ingests data from designated sources (such as Kafka or a file system), transforms the data, and writes the results to destinations like Elasticsearch or OpenSearch. Failures during the write phase can occur for various reasons, with typical symptoms including:

  • HTTP status codes 403 and 429, "out of disk" errors, or out-of-memory (OOM) exceptions are recorded in Logstash logs.
  • The Logstash pipeline fails to start.
  • The destination cluster contains frozen indexes, full disks, or nodes that are unavailable.

When Logstash write failures occur, data availability is affected for downstream applications.
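
For context, this write path is defined in the Logstash pipeline configuration file. The following minimal sketch assumes a Kafka source and an Elasticsearch destination; all hosts, topics, index names, and credentials are placeholders and must be adapted to the actual environment:

input {
  kafka {
    bootstrap_servers => "192.168.0.10:9092"    # Kafka broker address (placeholder)
    topics            => ["app-logs"]           # Source topic (placeholder)
  }
}
filter {
  json {
    source => "message"                         # Example transform: parse each event as JSON
  }
}
output {
  elasticsearch {
    hosts    => ["http://192.168.0.20:9200"]    # Destination cluster endpoint (placeholder)
    index    => "app-logs-%{+YYYY.MM.dd}"       # Daily destination index (placeholder pattern)
    user     => "admin"                         # Needed only for security-mode clusters
    password => "<password>"
  }
}

Most of the write failures discussed below surface in the elasticsearch output section of such a configuration.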

Possible Causes

  • Destination cluster issues: abnormal index status (for example, frozen indexes or index creation failures); insufficient cluster resources (for example, the shard count has reached the upper limit or disk space is insufficient); authentication failures or account lockout; loss of network connectivity; or abnormal node status (red or yellow).
  • Logstash issues: incorrect settings (for example, a large batch.size that leads to OOM); insufficient disk space (/opt/logstash/log is full); resource exhaustion (contention for CPU/memory/I/O resources); or pipeline startup or health check failures.
  • Issues in the data source or an intermediate link: unstable data source (for example, Kafka exception); incorrect data formats; or invalid output caused by a filter plugin failure.

Solutions

  1. Check whether the Logstash pipeline is started successfully.
    1. Log in to the CSS management console.
    2. In the navigation pane on the left, choose Clusters > Logstash.
    3. In the cluster list, click the name of the target cluster. The cluster information page is displayed.
    4. Click the Configuration Center tab.

    On the Configuration Center page, check whether the status of the Logstash pipeline is Running.

  2. Check the run logs of the Logstash configuration file.

    On the Configuration Center page, click Run Logs. Check the log file for error information. Pay attention to error messages containing keywords such as elasticsearch, action, pipeline, cluster_block_exception, disk, and out.

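    The exact wording varies with the Elasticsearch version, but a write failure caused by a read-only or blocked index typically includes a cluster_block_exception in the error body returned by the destination cluster, similar to this illustrative excerpt (the index name is a placeholder):

    "type": "cluster_block_exception",
    "reason": "index [app-logs] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"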

  3. Check the status of the destination cluster and determine whether the destination cluster is responsible for the failure.

    Commonly used commands for troubleshooting:

    GET /_cluster/health?pretty                   // Check the cluster's health status (green/yellow/red).
    GET /_cat/indices?v                           // Check whether the destination index exists, and check its status, health, and disk usage (Kibana Stack Monitoring shows the same information).
    GET /_nodes/stats?human&filter_path=**.fs.*   // Check the disk usage of the cluster nodes.
    GET /_cluster/settings                        // Check the cluster.max_shards_per_node setting.

Solutions for different issues:
  • Scenario 1: Loss of network connectivity

    Configure routes for the Logstash cluster: add routes to the private IP addresses or CIDR blocks of the source and destination clusters so that Logstash can connect to them.

  • Scenario 2: The number of existing shards has reached the upper limit in the destination cluster, and new indexes cannot be created.

    Increase the value of cluster.max_shards_per_node in the destination cluster.

    This is a cluster-level setting that remains effective until it is overwritten or cleared. Ensure that the new value is within the range supported by the hardware capacity of all nodes.

    PUT /_cluster/settings
    {
      "persistent": {
        "cluster.max_shards_per_node": "< New greater value > "
      }
    }

    After the change is made, wait for Logstash to try to create new indexes again.
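
    To confirm that the shard limit was the cause, you can compare the current number of open shards with the effective limit. Both requests below are standard Elasticsearch APIs; the filter_path values only trim the responses for readability:

    GET /_cluster/health?filter_path=active_shards   // Current number of active shards in the cluster.
    GET /_cluster/settings?include_defaults=true&filter_path=*.cluster.max_shards_per_node   // Effective per-node shard limit.

    The cluster-wide ceiling is roughly cluster.max_shards_per_node multiplied by the number of data nodes.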

  • Scenario 3: Frozen indexes

    Delete the frozen indexes. Logstash will automatically try to recreate the indexes and continue writing data into them.

    DELETE /<frozen_index_name>
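
    If it is unclear which indexes are frozen, and assuming the destination cluster runs an Elasticsearch version that still supports frozen indices (6.6 to 7.x), the search.throttled (sth) column of the cat indices API can help identify them:

    GET /_cat/indices?v&h=index,status,sth   // Frozen indexes report sth (search.throttled) as true.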
  • Scenario 4: Read-only indexes (full disks)
    1. Identify nodes with full disks.

      On the cluster's Intelligent O&M page, check the diagnostic item Data Node Disk Usage Check to identify nodes with full disks.

    2. Expand the disk capacity: Switch to larger disks for these nodes.
    3. Clear the disks: Delete obsolete snapshots and indexes, or manually delete files.
    4. After the disk space is reclaimed, the index status is restored to green and the indexes are no longer read-only. Wait for Logstash to resume writing data to them. If an index remains read-only, clear the write block manually, as shown below.
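
    On recent Elasticsearch versions, the write block that the disk flood-stage watermark places on an index is removed automatically once disk usage drops below the watermark; on older versions it may have to be cleared manually. A minimal sketch, with the index name as a placeholder:

    PUT /<index_name>/_settings   // Clear the read-only block that was set when the disk filled up.
    {
      "index.blocks.read_only_allow_delete": null
    }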
  • Scenario 5: Authentication failure or account lockout

    For security-mode clusters, update the usernames and passwords of the destination clusters in the Logstash configuration file so that Logstash can access them, and then restart the pipeline that uses this configuration file. If the relevant accounts have been locked out, contact technical support to unlock them.
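
    For reference, the credentials in question are the user and password options of the elasticsearch output in the pipeline configuration; the host, username, and password below are placeholders:

    output {
      elasticsearch {
        hosts    => ["https://<destination_cluster_ip>:9200"]
        user     => "<username>"        # Account with write permission on the destination index
        password => "<new_password>"    # Update after the cluster password is changed
      }
    }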

  • Scenario 6: Logstash node OOM

    Method 1: Expand the Logstash cluster. Add more nodes to the Logstash cluster to rebalance the write load and prevent OOM on individual nodes.

    Method 2: Modify the Logstash configuration file and tune the relevant runtime parameters. For example, reduce the value of pipeline.batch.size. A smaller batch size reduces the amount of data written per batch and therefore the peak memory usage; the downside is that more batches are needed to write the same amount of data, which may reduce throughput.
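
    For reference, the batch size corresponds to the pipeline.batch.size setting (the Logstash default is 125 events per worker per batch). In a standard Logstash deployment these parameters are set in logstash.yml or pipelines.yml; the values below are examples, not recommendations:

    # Pipeline tuning parameters (example values)
    pipeline.workers: 2          # Worker threads running the filter and output stages
    pipeline.batch.size: 125     # Events each worker collects per batch; lower values reduce peak memory
    pipeline.batch.delay: 50     # Milliseconds to wait before flushing an undersized batch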