Runbook Procedure Design

Each step in the runbook specifies a clear procedure, the operation command or script, whether the step runs serially or in parallel, the operator, the confirmer, the estimated start and end times, and the expected duration. The steps depend on the cutover method, of which there are two: cutover with service interruption and cutover without service interruption. Cutover without service interruption requires major changes to the application structure, so cutover with service interruption is more common. The following uses cutover with service interruption as an example to list the key points for designing a runbook.
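
As an illustration, the step attributes listed above could be captured in a small record type. This is only a sketch: the class name, field names, the script `./stop_app.sh`, and the sample values are hypothetical and not part of any prescribed runbook format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Mode(Enum):
    SERIAL = "serial"      # must finish before the next step starts
    PARALLEL = "parallel"  # may run alongside neighbouring steps


@dataclass
class RunbookStep:
    procedure: str           # what the step does, in plain language
    command: str             # operation command or script to run
    mode: Mode               # serial or parallel marking
    operator: str            # person who executes the step
    confirmer: str           # person who verifies the result
    planned_start: datetime  # estimated start time
    planned_end: datetime    # estimated end time

    @property
    def expected_duration(self) -> timedelta:
        # Derived from the planned times so the three values cannot disagree.
        return self.planned_end - self.planned_start


step = RunbookStep(
    procedure="Stop order-service on the source end",
    command="./stop_app.sh order-service",  # hypothetical script
    mode=Mode.SERIAL,
    operator="operator-a",
    confirmer="confirmer-b",
    planned_start=datetime(2024, 1, 1, 0, 0),
    planned_end=datetime(2024, 1, 1, 0, 10),
)
print(step.expected_duration)  # 0:10:00
```

Deriving the duration from the planned times is one possible design choice; a runbook kept in a spreadsheet would typically store all three values explicitly.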

Designing the forward cutover procedure
Refine the forward cutover steps in the runbook based on the cutover solution. Consider the following aspects:
1. Announce the service interruption in advance to limit the impact on user experience.
2. Account for the system's availability mechanism during the service interruption. Some systems automatically restart applications when they detect that the applications have stopped. Disable the availability mechanism first; otherwise, the applications may not stay stopped.
3. Ensure data consistency during database cutover. Keep the data at the source end static and disconnect the incremental synchronization task to ensure consistency. Plan the data consistency check carefully: decide whether to compare only row counts or also the content, based on each table's importance and the time available for the cutover.
4. To make the source end data static, stop the applications, and also account for batch processing tasks and message queue consumption, which may otherwise continue to change data.
5. Applications and scheduled tasks usually must be started and stopped in a specific order. To prevent service issues, define the start and stop sequences of these applications and tasks.
6. Public DNS resolvers cache domain name records, so after the switch some traffic may still go to the source end. The runbook should address this DNS cache issue: keep the forwarding path from the source end to the destination end for a period of time, monitor the traffic, and then cut off the forwarding path.
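
The choice in point 3 between comparing row counts and comparing content can be sketched as follows. This is a minimal example under stated assumptions: it uses two in-memory SQLite databases to stand in for the source and destination ends, and the helper names, the `orders` table, and the `important` flag are hypothetical. A real migration would usually rely on the comparison feature of its migration tool.

```python
import hashlib
import sqlite3


def row_count(conn, table):
    # Table names come from a trusted runbook list, not from user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def content_digest(conn, table, key):
    """Hash every row in key order so that both ends are comparable."""
    digest = hashlib.sha256()
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key}"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def is_consistent(source, dest, table, key, important):
    # Cheap check first: row counts must match for every table.
    if row_count(source, table) != row_count(dest, table):
        return False
    # Less important tables stop here to save cutover time.
    if not important:
        return True
    # Important tables also get a full content comparison.
    return content_digest(source, table, key) == content_digest(dest, table, key)


def make_db(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


source = make_db([(1, 100), (2, 250)])
dest = make_db([(1, 100), (2, 999)])  # same row count, different content

print(is_consistent(source, dest, "orders", "id", important=False))  # True
print(is_consistent(source, dest, "orders", "id", important=True))   # False
```

The example shows the trade-off: a row count check is fast but misses the changed value, while the content check catches it at the cost of reading every row.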

Designing the rollback procedure
If a serious problem occurs during the cutover and cannot be fixed quickly, you must roll back to restore the system to its previous state. This prevents lasting damage to services. Consider the following points when designing the rollback steps:
1. Any forward operation can fail and trigger a rollback, so consider the rollback scenario for each forward operation.
2. A rollback is either lossless or lossy. A rollback is lossless when no new data has been written at the destination end, so the data can return to its original state. If new data has been written at the destination end and cannot be synchronized back to the source end, the rollback is lossy and that data is lost. To avoid data loss in this case, create a synchronization task from the destination end to the source end.
3. In large-scale service applications, a rollback can be full or partial, depending on the service impact. For example, if 10 application systems and 10 databases are switched on the same day and one database fails to switch, decide whether to roll back all databases or only that one. If only that database is rolled back, the service team must check whether the cross-cloud access latency between the applications and the database meets the requirements.
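
The distinction in point 2 can be expressed as a small decision function. This is a sketch only: the function name and parameters are hypothetical, and it assumes that a destination-to-source synchronization task, when present, has fully caught up before the rollback.

```python
from enum import Enum


class RollbackType(Enum):
    LOSSLESS = "lossless"
    LOSSY = "lossy"


def classify_rollback(new_rows_at_destination: int,
                      reverse_sync_enabled: bool) -> RollbackType:
    # No new data at the destination: the source is still complete.
    if new_rows_at_destination == 0:
        return RollbackType.LOSSLESS
    # New data exists but is synchronized back to the source
    # (assumption: the reverse sync task has no remaining lag).
    if reverse_sync_enabled:
        return RollbackType.LOSSLESS
    # New data exists only at the destination and would be lost.
    return RollbackType.LOSSY


print(classify_rollback(0, reverse_sync_enabled=False).value)    # lossless
print(classify_rollback(120, reverse_sync_enabled=True).value)   # lossless
print(classify_rollback(120, reverse_sync_enabled=False).value)  # lossy
```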

When designing a cutover runbook, design the rollback process at the same time. Create a clear rollback plan and procedure, assign specific operators, and follow the steps strictly. This ensures that if issues arise during the cutover, the rollback proceeds smoothly and further service disruption is avoided.
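
One common way to keep forward and rollback steps aligned is to pair each forward operation with its rollback operation and, on failure, undo the completed steps in reverse order. The sketch below illustrates that pattern. The reverse-order policy, the `Step` type, and the step names are assumptions made for illustration and are not prescribed by this guide.

```python
from typing import Callable, List, NamedTuple


class Step(NamedTuple):
    name: str
    forward: Callable[[], None]   # forward cutover operation
    rollback: Callable[[], None]  # operation that undoes it


def run_cutover(steps: List[Step]) -> bool:
    """Run forward steps in order; on failure, undo completed steps."""
    done: List[Step] = []
    for step in steps:
        try:
            step.forward()
            done.append(step)
        except Exception as exc:
            print(f"{step.name} failed: {exc}; rolling back")
            # Only fully completed steps are undone, most recent first.
            # Cleanup of the failed step itself is left to the operator.
            for finished in reversed(done):
                finished.rollback()
            return False
    return True


log: List[str] = []


def fail_switch() -> None:
    raise RuntimeError("database switch failed")


steps = [
    Step("stop-apps", lambda: log.append("stop apps"), lambda: log.append("restart apps")),
    Step("stop-sync", lambda: log.append("stop sync"), lambda: log.append("resume sync")),
    Step("switch-db", fail_switch, lambda: None),
]

ok = run_cutover(steps)
print(ok)   # False
print(log)  # ['stop apps', 'stop sync', 'resume sync', 'restart apps']
```

In an actual cutover, each rollback is executed and confirmed by the assigned operators rather than triggered automatically; the code only shows how the pairing keeps the rollback sequence unambiguous.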
