Automatic Recovery of Extended Primary/Standby Replication Delay

Scenario

The primary/standby replication delay of a DB instance was high, kept increasing for a period of time, and then recovered on its own.

The following figure is an example showing how the real-time replication delay metric changes on the Cloud Eye console.

Possible Causes

According to Primary/Standby Replication Delay Scenarios and Solutions and How Primary/Standby Replication Works, this pattern is typically caused by large transactions or DDL operations.

You can analyze full logs or slow query logs to check whether there are large transactions or DDL operations.
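As a rough illustration of this check, the following Python sketch scans slow query log text for long-running DDL statements. It assumes the log follows the MySQL slow query log text layout (a `# Query_time:` header line followed by the statement), which this document does not confirm for your engine. The function name `find_slow_ddl`, the 60-second threshold, and the sample log entries are all hypothetical and are used only for demonstration.

```python
import re

# Statement prefixes treated as DDL in this sketch (illustrative, not exhaustive).
DDL_PATTERN = re.compile(r"^\s*(ALTER|CREATE|DROP|TRUNCATE|RENAME)\b", re.IGNORECASE)
# Header line that carries the execution time, assuming the MySQL slow log layout.
QUERY_TIME_PATTERN = re.compile(r"^# Query_time:\s*([\d.]+)")


def find_slow_ddl(log_text, min_seconds=60.0):
    """Return (query_time_seconds, statement) pairs for DDL entries
    whose execution time is at least min_seconds."""
    results = []
    query_time = None
    for line in log_text.splitlines():
        match = QUERY_TIME_PATTERN.match(line)
        if match:
            query_time = float(match.group(1))
            continue
        # Skip other header lines and any text before the first entry.
        if line.startswith("#") or query_time is None:
            continue
        if DDL_PATTERN.match(line) and query_time >= min_seconds:
            results.append((query_time, line.strip()))
            query_time = None  # avoid matching continuation lines of this entry
    return results


# Made-up sample entries, for demonstration only.
sample = """\
# Time: 2024-01-01T00:00:00.000000Z
# User@Host: app[app] @  [10.0.0.1]  Id: 42
# Query_time: 86012.500000  Lock_time: 0.000100 Rows_sent: 0  Rows_examined: 0
SET timestamp=1704067200;
ALTER TABLE orders ADD INDEX idx_created_at (created_at);
# Time: 2024-01-02T00:10:00.000000Z
# User@Host: app[app] @  [10.0.0.1]  Id: 43
# Query_time: 2.100000  Lock_time: 0.000050 Rows_sent: 10  Rows_examined: 5000
SET timestamp=1704154200;
SELECT * FROM orders WHERE created_at > '2024-01-01';
"""

for seconds, statement in find_slow_ddl(sample):
    print(f"{seconds:.0f}s  {statement}")
```

The sketch only inspects the first line of each statement, so it can miss DDL that begins with a comment or hint. Treat its output as a list of candidates to review, not as a definitive diagnosis.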

In the example shown in the following figure, the slow query logs recorded a DDL operation that added an index to a table containing hundreds of millions of records, and the operation took about one day to execute. While this DDL operation was being replayed on the read replica or standby node, the replication delay kept increasing. After the replay finished, the replication delay dropped back to the normal range.
