Updated on 2024-09-13 GMT+08:00

What Do I Do If Replay Speed of Standby DNs Cannot Catch Up with Write Speed of Primary DN?

Symptom

When a DB instance is under heavy load, the replay speed of the standby DNs cannot keep up with the write speed of the primary DN. After the system runs for a long time, unreplayed logs accumulate on the standby DNs. If the primary DN then fails, data restoration takes a long time and the database is unavailable in the meantime, severely affecting system availability.

Solution

GaussDB provides ultimate RTO to minimize the data recovery time after the primary DN fails and to improve availability.

To use ultimate RTO, submit an application by choosing Service Tickets > Create Service Ticket in the upper right corner of the console.

Precautions

  • Ultimate RTO focuses only on whether the RTO of the standby DN meets the requirements. Ultimate RTO has no inherent flow control and uses the recovery_time_target parameter for flow control instead.
  • Ultimate RTO uses multiple page redo threads to accelerate replay. When replay on the standby DN has caught up with the primary DN and the standby DN carries no load, the CPU usage of a single page redo thread is about 15% (the actual value depends on the hardware and parameter configuration). Total CPU usage of replay on the standby DN = CPU usage of a single page redo thread × Number of page redo threads. Because more threads are started, CPU and memory consumption is higher than with parallel replay or serial replay.
  • Ultimate RTO supports read on standby nodes. Because historical data pages are read, the query performance on the standby DNs is worse than that on the primary DN and worse than that of read on standby nodes during parallel redo. However, query blocking is alleviated.
  • The replay speed of DDL logs is much slower than that of page modification logs. Frequent DDL operations may increase the primary/standby latency.
  • Keep the I/O and CPU usage of each node at or below 70%. When the I/O or CPU usage of a node is too high, the performance of replay and of read on standby nodes deteriorates significantly.
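
The CPU formula in the precautions above can be turned into a quick estimate. The following Python sketch is illustrative only: the function names, the thread count of 8, and the core count of 16 are hypothetical values chosen for the example, and the conversion to a node-wide share assumes that the 15% figure is measured against a single CPU core, which the text above does not state.

```python
def total_replay_cpu(per_thread_cpu_pct: float, page_redo_threads: int) -> float:
    """Total replay CPU usage = usage of one page redo thread x number of threads."""
    if per_thread_cpu_pct < 0 or page_redo_threads < 0:
        raise ValueError("inputs must be non-negative")
    return per_thread_cpu_pct * page_redo_threads


def node_cpu_share(total_cpu_pct: float, cpu_cores: int) -> float:
    """Convert a per-core percentage into a share of the whole node.

    Assumption (not from the documentation): the per-thread figure is a
    percentage of one core, so dividing by the core count gives the node share.
    """
    if cpu_cores <= 0:
        raise ValueError("cpu_cores must be positive")
    return total_cpu_pct / cpu_cores


if __name__ == "__main__":
    # Hypothetical standby DN: 8 page redo threads at about 15% each, 16 cores.
    total = total_replay_cpu(15.0, 8)
    share = node_cpu_share(total, 16)
    print(f"replay CPU: {total:.1f}% of one core, {share:.1f}% of the node")
```

An estimate like this only covers the idle baseline after replay has caught up. The actual per-thread usage depends on the hardware and parameter configuration, so measured values should take precedence when checking against the recommended 70% limit.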