Updated on 2025-05-22 GMT+08:00

Questions and Checklists

When designing for application resilience, use this checklist to identify areas for improvement and guide the corresponding enhancements. Each item in the checklist represents a best practice, which is explained in detail in the next section.

Question

Checklist/Best Practice

RES01 What redundancy technologies do you use to ensure the high availability of your application system?

  1. High-availability deployment of application components
  2. Multi-location deployment of application components
  3. Anti-affinity for cloud servers

RES02 How do you back up critical data in your application?

  1. Identify the critical data that needs to be backed up.
  2. Automatically back up data.
  3. Periodically restore data from backups.

RES03 How do you enable cross-AZ disaster recovery for your application?

  1. Cross-AZ cluster deployment
  2. Cross-AZ data synchronization
  3. Interconnection with DR arbitrator for automatic switchover
  4. Disaster recovery management

RES04 How do you deploy cross-region or cross-cloud disaster recovery for your application?

  1. Define the RPO and RTO for the application system.
  2. Deploy the DR system to meet the DR objectives.
  3. Automate the DR process.
  4. Regularly perform DR drills to confirm that the recovery can meet objectives.
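
As a rough illustration of items 1 and 4 of RES04, the sketch below compares the outcome of a DR drill against the defined RPO and RTO. The names `DrObjectives` and `evaluate_drill` and the three timestamps are hypothetical and chosen only for this example. It assumes data loss is measured from the last successful replication to the failure, and downtime from the failure to service restoration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrObjectives:
    """Hypothetical container for the DR objectives of one application."""
    rpo: timedelta  # maximum tolerable data loss, expressed as time
    rto: timedelta  # maximum tolerable downtime


def evaluate_drill(objectives, last_replicated_at, failure_at, restored_at):
    """Compare the measured data loss and downtime of a drill with the objectives."""
    # Data written after the last replication point is assumed lost.
    data_loss = max(failure_at - last_replicated_at, timedelta(0))
    # Downtime runs from the (simulated) failure until service is restored.
    downtime = restored_at - failure_at
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }
```

A drill that returns `rto_met` as `False` suggests that the DR deployment or its automation needs rework before the objectives can be considered confirmed.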

RES05 How do you ensure high availability of the networks?

  1. Ensure high availability for network connections.
  2. Avoid unnecessarily exposing network addresses.
  3. Isolate network bandwidth allocated to services using different traffic models.
  4. Reserve IP resources for expansion and high availability.

RES06 How do you detect faults?

  1. Fault mode analysis
  2. Fault detection
  3. Subhealth detection

RES07 How do you monitor application system resources?

  1. Define key metrics and thresholds and monitor such metrics.
  2. Monitor logs.
  3. Send notifications when exceptions are detected.
  4. Store and analyze monitoring data.
  5. Track requests end to end.

RES08 How do you reduce the impact of dependencies?

  1. Reduce strong dependencies.
  2. Use loose coupling to reduce dependencies.
  3. Minimize the impacts of dependency failures.

RES09 How do you design a retry mechanism?

  1. Design a retry mechanism for API calls and command executions.
  2. Decide whether the client should retry based on a comprehensive evaluation of the failure.
  3. Avoid creating too much traffic pressure from excessive retries.
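
The RES09 items above can be sketched as a bounded retry with exponential backoff and jitter. This is a minimal sketch under stated assumptions, not a prescribed implementation: `TransientError`, `backoff_delays`, and `call_with_retry` are hypothetical names, and the sketch assumes the caller can tell retryable failures from permanent ones.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def backoff_delays(max_retries, base=0.1, cap=2.0, rng=random.random):
    """Yield one capped, jittered delay (in seconds) per allowed retry."""
    for attempt in range(max_retries):
        # Exponential growth, capped so that delays stay bounded.
        ceiling = min(cap, base * (2 ** attempt))
        # Jitter spreads retries out so that clients do not retry in lockstep.
        yield rng() * ceiling


def call_with_retry(operation, max_retries=3, sleep=time.sleep, **backoff):
    """Run operation; retry only TransientError, at most max_retries times."""
    delays = backoff_delays(max_retries, **backoff)
    while True:
        try:
            return operation()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # Retry budget exhausted: surface the failure.
            sleep(delay)
```

Capping both the number of retries and the delay keeps a failing dependency from being flooded with repeated requests, which is the concern raised in item 3. Any exception other than `TransientError` propagates immediately without a retry.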

RES10 How do you isolate faults?

  1. Isolate the control plane from the data plane.
  2. Deploy application components in multiple locations.
  3. Adopt a grid architecture.
  4. Configure health check and automatic isolation.
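
To illustrate item 4 of RES10, the sketch below isolates an instance after a run of failed health checks and readmits it after a run of successful ones. `HealthTracker` and its thresholds are hypothetical and chosen only for this example. In practice this logic is typically provided by a load balancer or service registry rather than written by hand.

```python
class HealthTracker:
    """Hypothetical tracker that isolates instances with repeated failed checks."""

    def __init__(self, fail_threshold=3, recover_threshold=2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self._streaks = {}  # instance -> (last result, length of current streak)
        self.isolated = set()

    def record(self, instance, healthy):
        """Record one health check result and update the isolation state."""
        last, count = self._streaks.get(instance, (healthy, 0))
        # Extend the streak if the result repeats, otherwise start a new one.
        count = count + 1 if last == healthy else 1
        self._streaks[instance] = (healthy, count)
        if not healthy and count >= self.fail_threshold:
            self.isolated.add(instance)
        elif healthy and count >= self.recover_threshold:
            self.isolated.discard(instance)

    def routable(self, instances):
        """Return the instances that should still receive traffic."""
        return [i for i in instances if i not in self.isolated]
```

Requiring consecutive results in both directions is a design choice that avoids flapping: a single failed or successful probe does not change where traffic is routed.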

RES11 How do you perform reliability tests?

  1. Chaos testing
  2. Load testing
  3. Long-term stability testing
  4. DR drills
  5. Red/blue team testing

RES12 How do you perform an emergency recovery?

  1. Set up an emergency recovery team.
  2. Develop an emergency response plan.
  3. Periodically conduct emergency recovery drills.
  4. Restore services as soon as possible after a problem occurs.
  5. Conduct a post-incident review of the emergency recovery.

RES13 How do you implement overload protection to adapt to traffic changes?

  1. Automatic elastic scaling
  2. Application system load balancing
  3. Overload detection and traffic throttling
  4. Automatic capacity expansion
  5. Quota limiting for automatic capacity expansion
  6. Load testing
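
Traffic throttling (item 3 above) is often implemented with a token bucket. The sketch below is one possible approach, not necessarily the mechanism this checklist has in mind. `TokenBucket`, `rate`, and `capacity` are hypothetical names, and the clock is injectable so that the behavior can be checked without waiting.

```python
import time


class TokenBucket:
    """Hypothetical rate limiter: admits bursts up to capacity, then rate per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True if the request may proceed, False if it should be throttled."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A request for which `allow()` returns `False` can be rejected quickly, for example with an HTTP 429 response, so that an overloaded system sheds excess load instead of queuing it.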

RES14 How do you configure error prevention?

  1. Make changes foolproof.
  2. Automate changes.
  3. Back up data before changes.
  4. Provide runbooks to standardize changes.

RES15 How do you perform upgrades without interrupting services?

  1. Automatic deployment and upgrade
  2. Automatic checks
  3. Automatic rollbacks
  4. Canary deployments and upgrades