Troubleshooting WAL Accumulation

Description

PostgreSQL uses Write-Ahead Logging (WAL) to ensure data durability and consistency. WAL logs record all changes to databases and play a key role in primary/standby replication, crash recovery, and logical replication.

You need to note the following key metrics:

Transaction Logs Usage (WAL logs): rds040_transaction_logs_usage
Inactive Logical Replication Slots: inactive_logical_replication_slot
Oldest Replication Slot Lag (WAL logs accumulated due to replication slot problems): rds045_oldest_replication_slot_lag
Transaction Logs Generation: rds044_transaction_logs_generations

Troubleshooting and Solution

Check whether the WAL log size is within its allowed range. If no, rectify the fault.
Check rds040_transaction_logs_usage to see how much storage the WAL logs occupy. Check whether the WAL log size has significantly increased recently or whether WAL logs occupy a large proportion of the storage space.

You can also run the SQL statement below to check the WAL log size. If there are too many WAL logs, perform the subsequent steps to locate the fault.
```
select pg_size_pretty(sum(size)) from pg_ls_waldir();
```
Check the replication slot statuses and the size of logs that are not cleared in a timely manner.
Replication slots can block WAL recycling. If the value of inactive_logical_replication_slot is not 0 and the value of rds045_oldest_replication_slot_lag is large or even the same as that of rds040_transaction_logs_usage, replication slots are blocking WAL recycling.

Alternatively, you can run the following SQL statement to query the slot statuses and WAL lag:
```
select slot_name, active,
pg_size_pretty(pg_wal_lsn_diff(b, a.restart_lsn)) as slot_latency
from pg_replication_slots as a, pg_current_wal_lsn() as b;
```
If the query result shows that there is any slot whose active field is f and slot_latency field is large, inactive replication slots are blocking WAL recycling. WAL logs are accumulated and cannot be cleared.

Analyze whether this slot is still required. If not, run the following SQL statement to delete it:
```
select pg_drop_replication_slot('slot_name');
```
Check the WAL log retention parameters.
Check whether the wal_keep_segments, wal_keep_size, and max_wal_size parameters are properly set.
```
select name, setting from pg_settings where name in ('wal_keep_segments', 'wal_keep_size', 'max_wal_size');
```
- For RDS for PostgreSQL 12 or earlier versions, check the value of wal_keep_segments. For later versions, check the value of wal_keep_size.
- The values of the WAL log retention parameters should be greater than 4 GB but less than 10% of the total storage. If they are too small, the primary instance may clear the WAL logs required by the standby instance, causing exceptions on the standby instance.
Check how busy write services are.
View the rds044_transaction_logs_generations metric to determine how busy write services are. This metric indicates the average size of transaction logs (WAL logs) generated per second.

If the value of this metric is large (greater than 50 MB/s on average), there are a large number of write services. In this case, the database kernel reserves more WAL logs for recycling, and the storage usage of WAL logs increases. You are advised to scale up storage to ensure storage redundancy.