Troubleshooting High Oldest Replication Slot Lag or Replication Lag

Description

Oldest replication slot lag: lagging size of the most lagging replica in terms of WAL data received. You can run the following SQL statement to view the lags of replication slots used by replicas:

select slot_name, temporary, active,restart_lsn, confirmed_flush_lsn, master_lsn, pg_size_pretty(pg_wal_lsn_diff(master_lsn, a.restart_lsn)) as latency from pg_replication_slots a, pg_current_wal_lsn() as master_lsn;

Replication lag: delay between the time when data is written to the primary instance and the time when data is replicated to a replica.

A high oldest replication slot lag or replication lag may have the following impacts:

The primary database server retains required WAL logs. WAL logs are stacked, occupying much storage space or even causing full storage space.
If the replication lag is high, WAL log playback on the replica is slower than log generation on the primary instance, so data cannot be synchronized to the replica in real time.

Cause Analysis

In RDS for PostgreSQL, pay attention to the oldest replication slot lag and replication lag between the primary instance and read replicas. The possible causes for increase in the metrics are the following:

Heavy workload on the primary instance
High playback delay for read replicas
Network delay between the primary instance and read replicas

Troubleshooting and Solution

Heavy workload on the primary instance
Check whether there are many data writes or updates in the primary instance.

On the Cloud Eye console, view the Transaction Logs Generation metric of the primary instance. Check whether the metric exceeds 40 MB/s in a given period. (Generally, WAL logs are replayed at a speed of 40 MB/s on read replicas.)
- If the metric exceeds this threshold for a long period of time, there is a heavy workload on the primary instance. In this case, optimize the workload.
  If the workload on the primary instance is heavy, the value of sent_lsn in the pg_stat_replication view is greatly different from the query result of select pg_current_wal_lsn(); executed on the primary instance. You can run the following SQL statements to check the difference:
  1. Confirm the read replica node information. Run the following SQL statement on the primary instance and record the value in the sent_lsn column as lsn1.
```
select * from pg_stat_replication;
```
  2. Query the current WAL write location on the primary instance and record it as lsn2.
```
select pg_current_wal_lsn();
```
  3. Calculate the distance between the WAL write location and the sent WAL location.
```
select  pg_size_pretty(pg_wal_lsn_diff(lsn1,lsn2));
```
- If the metric value does not exceed the threshold, go to the next step.
High playback delay for read replicas
If the read replica has long-running transactions or is heavily loaded, the query on the read replica conflicts with the log replay. The read replica fails to send the query result to the primary instance, causing a replay delay. You can download the error logs of the read replica and check whether such errors have been logged.
```
ERROR: canceling statement due to conflict with recovery
Detail:  User query might have needed to see row versions that must be removed
```
- If yes, perform the following operations:
  - Avoid long-running transactions. For details about how to troubleshoot long-running transactions, see Troubleshooting Long-Running Transactions.
  - Set hot_standby_feedback to on for the read replica to minimize query conflicts.
- If no, go to the next step.
Network delay between the primary instance and read replicas
If the value of sent_lsn queried on the primary instance is greatly different from that of pg_last_wal_receive_lsn queried on the read replica, there is a long network delay between the primary instance and read replica. In this case, contact Huawei Cloud customer service to locate the fault.

To check the difference, run the following SQL statements:
1. Run the following SQL statement on the primary instance and record the value in the sent_lsn column as lsn1:
```
select * from pg_stat_replication;
```
2. Query the value of pg_last_wal_receive_lsn() on the read replica and record it as lsn2.
```
select pg_last_wal_receive_lsn();
```
3. Calculate the distance between the sent WAL location of the primary instance and the received WAL location of the read replica.
```
select  pg_size_pretty(pg_wal_lsn_diff(lsn1,lsn2));
```