Questions and Checklists

When designing application resilience, use this checklist to identify areas for improvement and to guide the work on them. Each item in the checklist represents a best practice, which is explained in detail in the next section.

Question	Checklist/Best Practice
RES01 What redundancy technologies do you use to ensure the high availability of your application system?	High-availability deployment of application components; Multi-location deployment of application components; Anti-affinity for cloud servers
RES02 How do you back up critical data in your application?	Identify the critical data and back it up. Automatically back up data. Periodically restore data from backups.
RES03 How do you enable cross-AZ disaster recovery for your application?	Cross-AZ cluster deployment; Cross-AZ data synchronization; Interconnection with a DR arbitrator for automatic switchover; Disaster recovery management
RES04 How do you deploy cross-region or cross-cloud disaster recovery for your application?	Define the RPO and RTO for the application system. Deploy the DR system to meet the DR objectives. Automate the DR process. Regularly perform DR drills to confirm that the recovery can meet objectives.
RES05 How do you ensure high availability of the networks?	Ensure high availability for network connections. Avoid unnecessarily exposing network addresses. Isolate network bandwidth allocated to services using different traffic models. Reserve IP resources for expansion and high availability.
RES06 How do you detect faults?	Fault mode analysis; Fault detection; Subhealth detection
RES07 How do you monitor application system resources?	Define key metrics and thresholds and monitor such metrics. Monitor logs. Send notifications when exceptions are detected. Store and analyze monitoring data. Track requests end to end.
RES08 How do you reduce the impact of dependencies?	Reduce strong dependencies. Use loose coupling to reduce dependencies. Minimize the impacts of dependency failures.
RES09 How do you design a retry mechanism?	Design a retry mechanism for API calls and command executions. Decide whether the client should retry based on a comprehensive evaluation. Avoid putting excessive traffic pressure on the system with too many retries.
RES10 How do you isolate faults?	Isolate the control plane from the data plane. Deploy application components in multiple locations. Adopt a grid architecture. Configure health check and automatic isolation.
RES11 How do you perform reliability tests?	Chaos testing; Load testing; Long-term stability testing; DR drills; Red/blue team testing
RES12 How do you perform an emergency recovery?	Set up an emergency recovery team. Develop an emergency response plan. Periodically conduct emergency recovery drills. Restore services as soon as possible after a problem occurs. Conduct a post-incident review of each emergency recovery.
RES13 How do you implement overload protection to adapt to traffic changes?	Automatic elastic scaling; Application system load balancing; Overload detection and traffic throttling; Automatic capacity expansion; Quota limiting for automatic capacity expansion; Load testing
RES14 How do you configure error prevention?	Make changes foolproof. Automate changes. Back up data before changes. Provide runbooks to standardize changes.
RES15 How do you perform upgrades without interrupting services?	Automatic deployment and upgrade; Automatic checks; Automatic rollbacks; Canary deployments and upgrades
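
As an illustration of the RES09 items (retry API calls, but avoid creating traffic pressure from excessive retries), the following is a minimal Python sketch of a bounded retry with capped exponential backoff and random jitter. It is one common way to meet those items, not a method prescribed by this checklist. All names here (`TransientError`, `backoff_delays`, `retry_with_backoff`) and the default values are hypothetical and chosen only for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def backoff_delays(max_attempts, base_delay=0.1, max_delay=5.0, rng=random.random):
    """Yield one jittered delay per retry (max_attempts - 1 delays in total)."""
    for attempt in range(max_attempts - 1):
        # The cap doubles on every retry but never exceeds max_delay.
        cap = min(max_delay, base_delay * (2 ** attempt))
        # Jitter: scale the cap by a random factor so clients do not retry in lockstep.
        yield rng() * cap


def retry_with_backoff(call, max_attempts=4, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Run call(); on TransientError, wait and retry up to max_attempts calls."""
    delays = backoff_delays(max_attempts, base_delay, max_delay)
    while True:
        try:
            return call()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # attempt budget exhausted: surface the last error
            sleep(delay)


if __name__ == "__main__":
    attempts = []

    def flaky_call():
        # Assumed example dependency: fails twice, then succeeds.
        attempts.append(1)
        if len(attempts) < 3:
            raise TransientError("temporary failure")
        return "ok"

    print(retry_with_backoff(flaky_call), "after", len(attempts), "attempts")
```

Only errors explicitly marked as transient are retried, and the attempt budget is fixed, so a persistent failure is reported to the caller instead of generating unbounded retry traffic. The `sleep` and `rng` parameters are injectable purely to make the sketch easy to test.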