Help Center> Relational Database Service> User Guide> Working with RDS for PostgreSQL> Performance Tuning> Troubleshooting High Oldest Replication Slot Lag or Replication Lag
Updated on 2024-04-19 GMT+08:00

Troubleshooting High Oldest Replication Slot Lag or Replication Lag

Description

  • Oldest replication slot lag: lagging size of the most lagging replica in terms of WAL data received. You can run the following SQL statement to view the lags of replication slots used by multiple replicas:
    select slot_name, temporary, active,restart_lsn, confirmed_flush_lsn, master_lsn, pg_size_pretty(pg_wal_lsn_diff(master_lsn, a.restart_lsn)) as latency from pg_replication_slots a, pg_current_wal_lsn() as master_lsn;
  • Replication lag: delay between the time when data is written to the primary instance and the time when data is replicated to a replica.

High oldest replication slot lag or replication lag may have the following impacts:

  • The primary database server retains required WAL logs. WAL logs are stacked, occupying much storage space or even causing full storage space.
  • If the replication lag is high, WAL log playback on the replica is lower than log generation on the primary instance, affecting data timeliness.

Cause Analysis

In RDS for PostgreSQL, you need to pay attention to the oldest replication slot lag and replication lag between the primary instance and read replicas. The possible causes for abnormal lag metrics are as follows:

  • The workload on the primary instance is heavy.
  • The playback delay of the read replica is high.
  • There is a network delay between the primary instance and the read replica.

Troubleshooting and Solution

  1. The workload on the primary instance is heavy.

    Check whether there are many data writes or updates in the primary instance.

    On the Cloud Eye console, view the transaction logs generation metric of the primary instance. Check whether the metric value exceeds 40 MB/s in the target time period. (Generally, WAL logs are replayed at a speed of 40 MB/s on read replicas.)

    • If the metric value exceeds the threshold for a long period of time, the workload on the primary instance is heavy. In this case, optimize the workload.

      If the workload on the primary instance is heavy, the value of sent_lsn in the pg_stat_replication view is greatly different from the query result of select pg_current_wal_lsn(); executed on the primary instance. You can run the following SQL statements to check the difference:

      1. Confirm the read replica node information. Run the following SQL statement on the primary instance and record the value in the sent_lsn column as lsn1.
        select * from pg_stat_replication;
      2. Query the current WAL write location on the primary instance and record it as lsn2.
        select pg_current_wal_lsn();
      3. Calculate the distance between the WAL write location and the sent WAL location.
        select  pg_size_pretty(pg_wal_lsn_diff(lsn1,lsn2));
    • If the metric value does not exceed the threshold, go to the next step.
  2. The playback delay of the read replica is high.

    If the read replica has long-running transactions or is heavily loaded, the query on the read replica conflicts with the log replay. The read replica fails to send the query result to the primary instance, causing a replay delay. You can download the error logs of the read replica and check whether the error logs contain such errors.

    ERROR: canceling statement due to conflict with recovery
    Detail:  User query might have needed to see row versions that must be removed
    • If yes, perform the following operations:
      • Avoid long-running transactions. For details about how to troubleshoot long-running transactions, see Troubleshooting Long-Running Transactions.
      • Set hot_standby_feedback to on for the read replica to minimize query conflicts.
    • If no, go to the next step.
  3. There is a network delay between the primary instance and the read replica.

    If the value of sent_lsn queried on the primary instance is greatly different from that of pg_last_wal_receive_lsn queried on the read replica, there is a long network delay between the primary instance and read replica. In this case, contact Huawei Cloud customer service to locate the fault.

    To check the difference, run the following SQL statements:

    1. Run the following SQL statement on the primary instance and record the value in the sent_lsn column as lsn1:
      select * from pg_stat_replication;
    2. Query the value of pg_last_wal_receive_lsn() on the read replica and record it as lsn2.
      select pg_last_wal_receive_lsn();
    3. Calculate the distance between the sent WAL location of the primary instance and the received WAL location of the read replica.
      select  pg_size_pretty(pg_wal_lsn_diff(lsn1,lsn2));