OPS03-05 Performing Chaos Testing and Drills

Chaos engineering is a practice of deliberately injecting faults into a system to test its resilience and reliability.

Risk level
High

Key strategies
Chaos engineering introduces faults into a system to assess its robustness under various failure scenarios. This approach tests the system's capabilities in fault tolerance, monitoring, emergency response, fault demarcation and localization, and rapid recovery.
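
As an illustration of fault injection, the following minimal Python sketch wraps a function so that calls can be delayed or made to fail, then checks that a fallback path takes over. All names here (`inject_fault`, `InjectedFault`, `call_with_fallback`, `query_primary`, `query_replica`) are hypothetical and are not part of any cloud service API. A real drill would inject faults at the infrastructure or service level rather than in application code.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real failure during a drill."""


def inject_fault(failure_rate=0.0, delay_seconds=0.0, rng=None):
    """Wrap a function so that calls are delayed and may fail at random."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if delay_seconds > 0:
                time.sleep(delay_seconds)  # simulate added latency
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_fallback(primary, fallback):
    """Fault tolerance under test: use the fallback when the primary fails."""
    try:
        return primary()
    except InjectedFault:
        return fallback()


@inject_fault(failure_rate=1.0)  # always fail, to force the fallback path
def query_primary():
    return "primary"


def query_replica():
    return "replica"


result = call_with_fallback(query_primary, query_replica)
print(result)
```

Setting `failure_rate` to a value between 0 and 1 makes only a fraction of calls fail, which is closer to how partial outages behave.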

High availability design and verification: During the planning and design phase, the service system's architecture is designed for high availability (HA) and monitoring. Before launch, a production readiness review (PRR) and performance testing verify that the system can deliver stable and reliable services. Chaos engineering then constructs drill scenarios that examine multiple facets of the system, from application deployment architecture and service capacity to monitoring, alarms, and HA. These drills follow a structured progression: beginning with testing, advancing through attack-defense scenarios, and culminating in surprise attacks. Continuous drills verify critical capabilities online, such as architecture HA, monitoring, and PRR, to establish dynamic, ongoing risk governance. Chaos drills and HA design operate as dual engines to ensure enduring system stability.

System risk mitigation and rapid service recovery: This involves analyzing potential risks (fault scenarios) and designing contingency plans for them. Drills then check the fault scenario coverage and hit rate, and validate both the coverage and the execution efficiency of the contingency plans. The outcome is fewer faults and faster recovery, ultimately achieving deterministic recovery.
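
One way to check contingency plan coverage and execution efficiency is sketched below in Python. The data model (`FaultScenario`), the scenario names, the measured durations, and the 120-second recovery target are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaultScenario:
    name: str
    contingency_plan: Optional[str] = None   # None means no plan exists yet
    plan_duration_s: Optional[float] = None  # measured in the last drill


def plan_coverage(scenarios):
    """Fraction of identified fault scenarios that have a contingency plan."""
    if not scenarios:
        return 0.0
    covered = sum(1 for s in scenarios if s.contingency_plan)
    return covered / len(scenarios)


def uncovered(scenarios):
    """Names of scenarios that still need a contingency plan."""
    return [s.name for s in scenarios if not s.contingency_plan]


def slow_plans(scenarios, target_s):
    """Plans whose measured execution time exceeded the recovery target."""
    return [
        s.name for s in scenarios
        if s.plan_duration_s is not None and s.plan_duration_s > target_s
    ]


# Hypothetical scenarios and drill measurements, for illustration only.
scenarios = [
    FaultScenario("az-outage", "fail over to standby AZ", 240.0),
    FaultScenario("db-primary-down", "promote replica", 95.0),
    FaultScenario("cache-overload", None),
    FaultScenario("dns-failure", "switch resolver", 30.0),
]

print(f"plan coverage: {plan_coverage(scenarios):.0%}")
print("missing plans:", uncovered(scenarios))
print("over 120 s target:", slow_plans(scenarios, target_s=120.0))
```

Scenarios without a plan and plans that miss the recovery target are the natural candidates for the next round of drills.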

Fewer faults: Identifying potential risks early is critical. This entails assessing the severity and impact of each fault scenario and testing the ability to mitigate these risks through regular drills.

Fast recovery: By proactively introducing failures, O&M and development teams become well-acquainted with various fault scenarios and can validate contingency plans. This practice ultimately accelerates the recovery process.

Chaos engineering metrics:
- Fault scenario coverage: analyzes the extent to which fault scenarios are covered by drills. For example, the redundancy scenario coverage is 80% and the overload scenario coverage is 60%.
- Hit rate of a fault scenario: measures how often a fault scenario actually occurs.
- Contingency plan quality: assesses both the effectiveness and execution efficiency of contingency plans.
- Number and severity of identified risks: quantifies the risks that are proactively identified and assesses their severity on a regular basis, typically quarterly or annually.
- Number, severity, and type of mitigated risks: tracks risk mitigation, including the number of risks downgraded or mitigated, the contingency plans added, and the enhancements made to monitored items.
- Improvement rate of fault recovery duration: reflects how quickly the average recovery time improves as a result of chaos engineering drills.
- Year-over-year (YoY) fault decrease: compares the total number of faults in the current year with that of the previous year to show how much the number of faults has decreased.
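
Several of these metrics are simple ratios. The Python sketch below shows one plausible way to compute them. The formulas are assumptions (the source does not define them), and the input numbers are made up to reproduce the 80% and 60% coverage example above.

```python
def coverage(drilled, identified):
    """Fault scenario coverage: drilled scenarios / identified scenarios."""
    return drilled / identified if identified else 0.0


def recovery_improvement_rate(previous_avg_min, current_avg_min):
    """Relative reduction in average recovery time between two periods."""
    return (previous_avg_min - current_avg_min) / previous_avg_min


def yoy_fault_decrease(previous_year_faults, current_year_faults):
    """Relative reduction in fault count compared with the previous year."""
    return (previous_year_faults - current_year_faults) / previous_year_faults


# Hypothetical figures, for illustration only.
redundancy = coverage(drilled=8, identified=10)
overload = coverage(drilled=3, identified=5)
recovery_gain = recovery_improvement_rate(previous_avg_min=40.0, current_avg_min=30.0)
fault_drop = yoy_fault_decrease(previous_year_faults=20, current_year_faults=15)

print(f"redundancy scenario coverage: {redundancy:.0%}")
print(f"overload scenario coverage: {overload:.0%}")
print(f"recovery duration improvement: {recovery_gain:.0%}")
print(f"YoY fault decrease: {fault_drop:.0%}")
```

A negative result from either of the last two functions would indicate that recovery time or fault count got worse rather than better.
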

Related cloud services and tools
MAS: Chaos Engineering

COC: Chaos Drills