Updated on 2026-03-23 GMT+08:00

Troubleshooting WAL Accumulation

Description

PostgreSQL uses Write-Ahead Logging (WAL) to ensure data durability and consistency. WAL logs record all changes to databases and play a key role in primary/standby replication, crash recovery, and logical replication.

You need to note the following key metrics:

  • Transaction Logs Usage (WAL logs): rds040_transaction_logs_usage
  • Inactive Logical Replication Slots: inactive_logical_replication_slot
  • Oldest Replication Slot Lag (WAL logs accumulated due to replication slot problems): rds045_oldest_replication_slot_lag
  • Transaction Logs Generation: rds044_transaction_logs_generations

Troubleshooting and Solution

  • Check whether the WAL log size is within its allowed range. If not, correct the issue.

    Check rds040_transaction_logs_usage to see how much storage the WAL logs occupy. Check whether the WAL log size has significantly increased recently or whether WAL logs occupy a large proportion of the storage space.

    You can also run the SQL statement below to check the WAL log size. If there are too many WAL logs, perform the subsequent steps to locate the fault.

    select pg_size_pretty(sum(size)) from pg_ls_waldir();
  • Check the replication slot statuses and the size of logs that are not cleared in a timely manner.

    Replication slots can block WAL recycling. If the value of inactive_logical_replication_slot is not 0 and the value of rds045_oldest_replication_slot_lag is large or even the same as that of rds040_transaction_logs_usage, replication slots are blocking WAL recycling.

    Alternatively, you can run the following SQL statement to query the slot statuses and WAL lag:

    select slot_name, active,
    pg_size_pretty(pg_wal_lsn_diff(b, a.restart_lsn)) as slot_latency
    from pg_replication_slots as a, pg_current_wal_lsn() as b;

    If the query result shows that there is any slot whose active field is f and slot_latency field is large, inactive replication slots are blocking WAL recycling. WAL logs are accumulated and cannot be cleared.

    Analyze whether this slot is still required. If not, run the following SQL statement to delete it:

    select pg_drop_replication_slot('slot_name');
  • Check the WAL log retention parameters.

    Check whether the wal_keep_segments, wal_keep_size, and max_wal_size parameters are set too high.

    select name, setting from pg_settings where name in ('wal_keep_segments', 'wal_keep_size', 'max_wal_size');
    • For RDS for PostgreSQL 12 or earlier versions, check the value of wal_keep_segments. For later versions, check the value of wal_keep_size.
    • The WAL log retention settings should be less than 10% of the total storage and greater than 4 GB. If they are set too low, the primary instance may delete WAL logs that are still needed by the standby instance, causing exceptions on the standby instance.
  • Assess how busy write activities are.

    Use the rds044_transaction_logs_generations metric to assess write activities. This metric represents the average amount of transaction logs (WAL logs) generated per second.

    If the value of this metric is large (greater than 50 MB/s on average), there are a large number of write services. In this case, the database kernel reserves more WAL logs for recycling, and the storage usage of WAL logs increases. You are advised to scale up storage to ensure storage redundancy.

WAL Logs Generated While Backups Are Created

If the WAL log size increases during backup creation and returns to normal afterward, the WAL logs are generated faster than the backups.

You can adjust the backup schedule to avoid heavy data writes during backup creation or scale up storage to ensure enough cache space.