Updated on 2025-05-22 GMT+08:00

RES11-01 Chaos Testing

Chaos engineering is the practice of injecting faults into a system to test its stability and fault tolerance.

  • Risk level

    High

  • Key strategies
    • Perform chaos testing in a real environment.
    • Perform chaos testing in a CI/CD pipeline.
    • Proactively inject faults to find and fix weaknesses before they cause failures in the real world.
    • Inject faults in a controllable manner to limit the impact on customers.
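
As a minimal sketch of controllable fault injection, the Python example below wraps a dependency call, fails a fixed fraction of calls, and stops injecting once a cap is reached. The class `FaultInjector`, the functions `fetch_price` and `fetch_price_with_fallback`, and all thresholds are hypothetical names and values chosen for illustration. They are not part of MAS-CAST, COC, or any other tool.

```python
import random
from dataclasses import dataclass, field


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during a drill."""


@dataclass
class FaultInjector:
    # Hypothetical parameters, chosen only for illustration.
    failure_rate: float   # fraction of calls that receive a fault
    max_injected: int     # blast-radius cap: stop after this many faults
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    injected: int = 0
    aborted: bool = False

    def call(self, func, *args, **kwargs):
        """Run func, or raise an injected fault instead of calling it."""
        if not self.aborted and self.injected >= self.max_injected:
            self.aborted = True  # guardrail reached: stop the experiment
        if not self.aborted and self.rng.random() < self.failure_rate:
            self.injected += 1
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)


def fetch_price(item_id):
    # Stand-in for a real downstream call.
    return 100 + item_id


def fetch_price_with_fallback(injector, item_id):
    """Caller under test: it should degrade gracefully, not crash."""
    try:
        return injector.call(fetch_price, item_id)
    except InjectedFault:
        return None  # fallback path, for example a cached or default value


injector = FaultInjector(failure_rate=0.3, max_injected=5)
results = [fetch_price_with_fallback(injector, i) for i in range(100)]
fallbacks = results.count(None)
print(f"injected={injector.injected} fallbacks={fallbacks} aborted={injector.aborted}")
```

The cap on injected faults plays the role of a blast-radius limit: once it is reached, every later call passes through untouched, so a drill cannot keep degrading the service indefinitely.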

    Chaos engineering metrics:

    • Fault scenario coverage: the proportion of fault scenarios in each category that drills have exercised, for example, 80% of redundancy scenarios and 60% of overload scenarios.
    • Hit rate of a fault scenario: the rate at which the fault scenario actually occurs.
    • Contingency plan quality: the effectiveness and execution efficiency of contingency plans.
    • Number and severity of identified risks: the number of risks identified and their assessed severity, measured regularly, typically quarterly or annually.
    • Number, severity, and type of mitigated risks: the number of downgraded risks, the number of mitigated risks, the number of added contingency plans, and the number of improved monitoring items.
    • Improvement rate of fault recovery duration: how quickly the average recovery time improves as a result of chaos engineering drills.
    • Year-over-year (YoY) fault decrease: the decrease in the number of faults in the current year compared to the previous year.
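
Several of these metrics are simple ratios. The Python sketch below shows one plausible way to compute three of them. The formulas are assumptions, since this section does not define them precisely, and all input numbers are made up for illustration. Only the 80% and 60% coverage figures echo the example above.

```python
def coverage(exercised, total):
    """Share of known fault scenarios in a category that drills have exercised."""
    return exercised / total


def recovery_improvement_rate(before_minutes, after_minutes):
    """Relative reduction in average recovery time (assumed definition)."""
    return (before_minutes - after_minutes) / before_minutes


def yoy_fault_decrease(previous_year, current_year):
    """Relative decrease in fault count versus the previous year (assumed definition)."""
    return (previous_year - current_year) / previous_year


# Hypothetical inputs, not measured data.
redundancy_coverage = coverage(exercised=8, total=10)
overload_coverage = coverage(exercised=3, total=5)
recovery_gain = recovery_improvement_rate(before_minutes=40, after_minutes=30)
fault_drop = yoy_fault_decrease(previous_year=20, current_year=15)

print(f"redundancy coverage: {redundancy_coverage:.0%}")
print(f"overload coverage:   {overload_coverage:.0%}")
print(f"recovery time gain:  {recovery_gain:.0%}")
print(f"YoY fault decrease:  {fault_drop:.0%}")
```

Tracking these values per drill cycle, rather than as one-off numbers, makes it easier to see whether chaos testing is actually reducing risk over time.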
  • Related cloud services and tools
    • MAS-CAST: provides test tools and injection methods for cloud applications. It supports reliability testing, stress testing, random fault injection for chaos engineering, and fault drills in the production environment, with orchestration of faults and service processes.
    • COC: allows you to conduct automatic chaos drills covering risk identification, emergency plan management, fault injection, and review and improvement.