Updated on 2023-03-06 GMT+08:00

RDS for MySQL Backup Job Failure

Scenario

When a user runs mysqldump to back up RDS for MySQL data to an ECS in a different subnet from RDS, the backup job runs for 300 seconds and then fails.

Troubleshooting

Run the backup job from an ECS that is in the same subnet as RDS instead. The backup job succeeds. A comparison of the two environments shows the following:

  • Network: There are no differences in latency or bandwidth between the two ECSs.
  • Database: The net_write_timeout parameter is set to 300 on the RDS for MySQL database. If the server waits more than 300s for a block of data to be written to the client connection, it aborts the write and the connection between the ECS and RDS for MySQL is interrupted, regardless of whether the backup has finished.
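
The database-side finding can be confirmed from the ECS with the mysql client. The following is a minimal sketch, not the procedure used in this case: the function names are hypothetical, the host and user are supplied by the caller, and port 8635 is taken from the procedure in this article.

```python
import subprocess


def build_query_command(host, user, port=8635):
    """Return the mysql CLI argument list that prints net_write_timeout."""
    return [
        "mysql", "-h", host, "-P", str(port), "-u", user, "-p",
        "--batch", "--skip-column-names",
        "-e", "SHOW VARIABLES LIKE 'net_write_timeout'",
    ]


def parse_variables(output):
    """Parse tab-separated 'name<TAB>value' lines into a dict."""
    values = {}
    for line in output.splitlines():
        if "\t" in line:
            name, value = line.split("\t", 1)
            values[name.strip()] = value.strip()
    return values


def query_net_write_timeout(host, user, port=8635):
    """Run the mysql client (prompts for the password) and return the value."""
    result = subprocess.run(
        build_query_command(host, user, port),
        stdout=subprocess.PIPE, text=True, check=True,
    )
    return parse_variables(result.stdout).get("net_write_timeout")
```

If the returned value matches the time after which the backup job fails, the timeout is a likely trigger.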

Procedure

  1. Identify the backup data flow, protocol, and port.

    mysqldump uses TCP to connect to port 8635 of RDS. After the connection is established, the backup job starts.

  2. Compare the hardware configuration and OS version of the two ECSs.

    1. Both of them use the same hardware configuration: two cores and 6 GB of memory.
    2. Both of them use the same OS version: CentOS 7.4.

  3. Check whether the NIC rates are the same.
  4. Check whether the kernel parameter settings are the same. The result shows that the network parameters on the ECS where the backup job failed are not optimized.

  5. Set the kernel parameters of the ECS where the backup job failed to match those of the ECS where the backup job succeeded. Start the backup job again. The backup job succeeds.
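
The kernel parameter comparison in steps 4 and 5 can be sketched as follows. This is an illustration under assumptions: the article does not list which parameters differed, so the keys in TCP_KEYS are commonly tuned TCP buffer settings chosen for the example, and the function names are hypothetical.

```python
from pathlib import Path

# Assumed checklist of TCP-related parameters; the source does not name them.
TCP_KEYS = [
    "net.core.rmem_max",
    "net.core.wmem_max",
    "net.ipv4.tcp_rmem",
    "net.ipv4.tcp_wmem",
    "net.ipv4.tcp_window_scaling",
]


def read_sysctl(keys, root="/proc/sys"):
    """Read kernel parameters from /proc/sys; unreadable keys map to None."""
    values = {}
    for key in keys:
        path = Path(root) / key.replace(".", "/")
        try:
            # Collapse tabs and newlines so values compare cleanly.
            values[key] = " ".join(path.read_text().split())
        except OSError:
            values[key] = None
    return values


def diff_params(failing, working):
    """Return {key: (failing_value, working_value)} for keys that differ."""
    keys = sorted(set(failing) | set(working))
    return {
        key: (failing.get(key), working.get(key))
        for key in keys
        if failing.get(key) != working.get(key)
    }
```

Run read_sysctl(TCP_KEYS) on each ECS, then pass both results to diff_params to list the settings that need to be aligned.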

Solution

The backup transfers a large volume of data across subnets. The data write capability and TCP buffer size on the backup end (the ECS) do not match the rate at which RDS sends data, so RDS has to wait for the ECS to accept data. When the wait reaches the net_write_timeout threshold (300s), the backup job fails. To resolve this issue, increase the TCP buffer size by modifying the ECS kernel parameters.
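
As a hedged illustration of the kernel parameter change, the sketch below renders a sysctl configuration fragment. The values are placeholders for the example only, because this article does not state which values were applied; in practice, copy the values from the ECS where the backup job succeeds. The function and constant names are hypothetical.

```python
# Illustrative values only; the source does not state the values used.
EXAMPLE_TCP_BUFFER_SETTINGS = {
    "net.core.rmem_max": "16777216",
    "net.core.wmem_max": "16777216",
    "net.ipv4.tcp_rmem": "4096 87380 16777216",
    "net.ipv4.tcp_wmem": "4096 65536 16777216",
}


def render_sysctl_conf(settings):
    """Format settings as 'key = value' lines for a sysctl configuration file."""
    lines = [f"{key} = {value}" for key, value in sorted(settings.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_sysctl_conf(EXAMPLE_TCP_BUFFER_SETTINGS), end="")
```

The rendered text can be saved to a file under /etc/sysctl.d/ on the ECS and loaded with sysctl -p followed by the file path, after which the backup job can be started again.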