RDS for MySQL Backup Job Failure
Scenario
A user ran the mysqldump command to back up RDS for MySQL data to an ECS in a different subnet from the RDS instance. The backup job failed after running for 300 seconds.
Possible Causes
When the backup job was run on an ECS in the same subnet as RDS instead, it was executed successfully. The following was observed:
- Network: There is no difference in latency or bandwidth between the two ECSs.
- Database: The net_write_timeout parameter is set to 300 on the RDS for MySQL database. This parameter is the number of seconds the server waits for a block to be written to a connection before aborting the write. The connection between the ECS and RDS for MySQL is therefore interrupted once a write has been blocked for 300s, even if the backup has not finished.
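The configured value can be confirmed from the ECS with the standard mysql client. The following is a minimal sketch, not part of the original procedure: the helper names are hypothetical, port 8635 is taken from this article, and it assumes the mysql client is installed on the ECS and that the account can run SHOW VARIABLES.

```python
import subprocess


def build_query_cmd(host, user, port=8635):
    """Build a mysql client command that prints net_write_timeout as name<TAB>value."""
    return [
        "mysql", "-h", host, "-P", str(port), "-u", user, "-p",  # -p prompts for the password
        "--batch", "--skip-column-names",  # tab-separated output without a header row
        "-e", "SHOW VARIABLES LIKE 'net_write_timeout'",
    ]


def parse_timeout(output):
    """Extract the integer value from the batch-mode output of the command above."""
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) == 2 and parts[0] == "net_write_timeout":
            return int(parts[1])
    raise ValueError("net_write_timeout not found in output")


def read_timeout(host, user, port=8635):
    """Run the query against the instance and return net_write_timeout in seconds."""
    result = subprocess.run(
        build_query_cmd(host, user, port),
        check=True, stdout=subprocess.PIPE, text=True,
    )
    return parse_timeout(result.stdout)
```

Calling read_timeout with the instance address and a database user returns the timeout in seconds, which can then be compared with how long the backup job ran before it failed.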
Procedure
- Understand the backup data flow, protocol, and port.
mysqldump uses TCP to connect to port 8635 of RDS. After the connection is established, the backup job starts.
- Compare the hardware configuration and OS version of the two ECSs.
- Both of them use the same hardware configuration: two cores and 6 GB of memory.
- Both of them use the same OS version: CentOS 7.4.
- Check whether the NIC rates are the same.
- Check whether the kernel parameter settings are the same. The result shows that the network parameters on the ECS where the backup job failed are not optimized.
- Set the kernel parameters of the ECS where the backup job failed to match those of the ECS where the backup job succeeded, and then start the backup job again. The backup job is executed successfully.
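The kernel parameter comparison in the steps above can be done by hand, but a small script makes the differences easier to spot. The following is a minimal sketch, assuming the output of `sysctl -a` has been saved to a text file on each ECS. The function names and the choice to look only at `net.core.` and `net.ipv4.tcp_` keys are assumptions for illustration, not something this article specifies.

```python
def parse_sysctl(text):
    """Parse `sysctl -a` style output ("key = value" per line) into a dict."""
    params = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Collapse whitespace so tab-separated and space-separated values compare equal.
        params[key.strip()] = " ".join(value.split())
    return params


def diff_params(failed, succeeded, prefixes=("net.core.", "net.ipv4.tcp_")):
    """Return {key: (value_on_failed_ecs, value_on_succeeded_ecs)} for differing keys."""
    diffs = {}
    for key in sorted(set(failed) | set(succeeded)):
        if not key.startswith(prefixes):
            continue  # ignore parameters unrelated to networking
        a, b = failed.get(key), succeeded.get(key)
        if a != b:
            diffs[key] = (a, b)
    return diffs
```

To use it, read each saved file, pass the contents through parse_sysctl, and print the result of diff_params. A key that is missing on one ECS is reported with None on that side.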
Solution
A backup across subnets writes a large volume of data. The data write capability and TCP buffer on the backup end (the ECS) do not match the sending capability of RDS, so writes from RDS are blocked. When a write has been blocked for the preset timeout (300s), the connection is interrupted and the backup job fails. To resolve this issue, increase the TCP buffer by modifying the ECS kernel parameters.
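This article does not list which kernel parameters were changed. On Linux, the receive-side TCP buffer limits are commonly controlled by net.core.rmem_max and net.ipv4.tcp_rmem. The following is a minimal sketch that renders sysctl settings for a chosen maximum, assuming those two parameters are the relevant ones. The function name is hypothetical, and the minimum and default values are illustrative, not recommendations from this article.

```python
def render_tcp_buffer_settings(max_bytes, min_bytes=4096, default_bytes=87380):
    """Render sysctl.conf-style lines that raise the TCP receive buffer ceiling.

    The parameter selection and the min/default values are assumptions;
    copy the real values from the ECS where the backup job succeeded.
    """
    if max_bytes < default_bytes:
        raise ValueError("max_bytes must be at least default_bytes")
    return "\n".join([
        f"net.core.rmem_max = {max_bytes}",
        # tcp_rmem takes three values: minimum, default, and maximum size in bytes.
        f"net.ipv4.tcp_rmem = {min_bytes} {default_bytes} {max_bytes}",
    ])
```

The rendered lines can be appended to /etc/sysctl.conf on the ECS and loaded with `sysctl -p`. Taking the values from the ECS where the backup job succeeded keeps the change consistent with the procedure above.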