Creating and Managing Failure Modes

Scenarios

A failure mode is a specific type of problem or failure state that may occur while an application is running. Building a rich failure mode library, with prevention and recovery measures for each entry, helps you design a more highly available application system. Once potential faults are identified, you can run routine drills to verify that the recovery measures and fault impacts meet expectations, so that you are better prepared to respond. To create a failure mode, analyze the possible fault points of an application, describe the fault occurrence conditions, fault symptoms, and customer impacts, and then apply the failure mode to routine chaos drills.

Precautions

Verify that the enterprise project, application, incident level, and scenario category of the failure mode are correct.

Creating a Failure Mode

Log in to COC.
In the navigation tree on the left, choose Resilience Center > Chaos Drills.
On the Failure Modes tab page, click Create Failure Mode.

Set parameters for creating a failure mode.

**Table 1** Parameters for creating a failure mode
Parameter	Description
Failure Mode	Custom failure mode name
Scenario Category	The options are Node, Cluster, Network, DR, Container, and Service and Data. Node: Services become abnormal because the host CPU or memory is overloaded or a process is faulty. Cluster: Simulate abnormal scenarios by increasing the load or performing an active/standby switchover, for example, increasing the load on a container cluster or performing an active/standby switchover in a database cluster. Network: Inject network faults into hosts or clusters to verify the DR capability of your service. Such faults include packet loss, network latency, and intermittent disconnection at the link layer. DR: Simulate inter-region network exceptions or service unavailability in a single region to verify the self-recovery capabilities of services. Container: Simulate process and resource faults and network attacks on container instances, such as increased CPU and memory pressure, network attacks, system OOM, or process killing. Service and Data: Simulate service exceptions caused by database or file exceptions, such as database table deletion and database unavailability.
Incident Level	The options are P1, P2, P3, P4, and P5. By default, P1 incidents are the most critical, while P5 incidents are the least severe.
Source	The options are Failure modes detected proactively and Existing failure modes. Failure modes detected proactively: The failure mode is formed by proactively analyzing risks in the application architecture and running environment. Existing failure modes: The failure mode is formed by analyzing existing faults and incidents.
Alarm ID	(Optional) ID of the alarm that is triggered when a fault occurs.
Attack Scenario	(Optional) Select an attack scenario from the drop-down list. A maximum of 10 attack scenarios can be selected.
Enterprise Project	Select the enterprise project to which the failure mode resource belongs from the drop-down list.
Application	Select the application to which the drill target belongs from the drop-down list.
Contingency Plan Available	Whether a contingency plan is available for the failure mode. This toggle can be enabled or disabled.
Contingency Plan	This parameter is mandatory when Contingency Plan Available is enabled. Select a contingency plan from the drop-down list. If no suitable contingency plan is available, create one. For details, see Creating a Custom Contingency Plan.
Occurrence Conditions	Enter the conditions under which the fault may occur. The value can contain a maximum of 1,024 characters.
Fault Symptom	Enter the possible service symptom when the fault occurs. The value can contain a maximum of 1,024 characters.
Impact on Customer	Enter the impact of the failure on customers. The value can contain a maximum of 1,024 characters.

Click OK.

The failure mode is created.
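Before filling in the form, it can help to check a draft against the constraints in Table 1. The following is a minimal sketch of such a local check, not a COC API: the `FailureModeDraft` class, its field names, and `validate_draft` are hypothetical names introduced only for illustration. The option lists and limits (scenario categories, incident levels, at most 10 attack scenarios, at most 1,024 characters, and a contingency plan being mandatory when the toggle is enabled) come from Table 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Option lists and limits taken from Table 1.
SCENARIO_CATEGORIES = {"Node", "Cluster", "Network", "DR", "Container", "Service and Data"}
INCIDENT_LEVELS = {"P1", "P2", "P3", "P4", "P5"}
MAX_TEXT_LENGTH = 1024
MAX_ATTACK_SCENARIOS = 10


@dataclass
class FailureModeDraft:
    """Hypothetical local record of the form fields; not a COC data type."""
    name: str
    scenario_category: str
    incident_level: str
    enterprise_project: str
    application: str
    occurrence_conditions: str = ""
    fault_symptom: str = ""
    impact_on_customer: str = ""
    attack_scenarios: List[str] = field(default_factory=list)
    contingency_plan_available: bool = False
    contingency_plan: Optional[str] = None


def validate_draft(draft: FailureModeDraft) -> List[str]:
    """Return a list of problems; an empty list means the draft fits Table 1."""
    problems = []
    if draft.scenario_category not in SCENARIO_CATEGORIES:
        problems.append(f"unknown scenario category: {draft.scenario_category}")
    if draft.incident_level not in INCIDENT_LEVELS:
        problems.append(f"unknown incident level: {draft.incident_level}")
    if len(draft.attack_scenarios) > MAX_ATTACK_SCENARIOS:
        problems.append("more than 10 attack scenarios selected")
    # The plan itself is only required when the toggle is enabled.
    if draft.contingency_plan_available and not draft.contingency_plan:
        problems.append("contingency plan is required when the toggle is enabled")
    text_fields = {
        "Occurrence Conditions": draft.occurrence_conditions,
        "Fault Symptom": draft.fault_symptom,
        "Impact on Customer": draft.impact_on_customer,
    }
    for label, text in text_fields.items():
        if len(text) > MAX_TEXT_LENGTH:
            problems.append(f"{label} exceeds {MAX_TEXT_LENGTH} characters")
    return problems


if __name__ == "__main__":
    draft = FailureModeDraft(
        name="Database unavailable",          # example values, not real resources
        scenario_category="Service and Data",
        incident_level="P2",
        enterprise_project="default",
        application="example-app",
        occurrence_conditions="The primary database node restarts.",
        fault_symptom="Write requests fail.",
        impact_on_customer="Orders cannot be placed.",
    )
    print(validate_draft(draft))
```

Returning a list of problems rather than raising on the first one lets you correct every field in a single pass before clicking OK.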

Cloning a Failure Mode

Only excellent failure modes can be cloned. For details about the excellent failure modes, see pre-defined failure modes on COC.

Log in to COC.
In the navigation tree on the left, choose Resilience Center > Chaos Drills.
Choose Failure Modes > Excellent Failure Mode Cases.
Locate the failure mode you want to clone and click Clone in the Operation column.

Set parameters for cloning a failure mode.

**Table 2** Parameters for cloning a failure mode
Parameter	Description
Failure Mode	Custom failure mode name
Scenario Category	The options are Node, Cluster, Network, DR, Container, and Service and Data. Node: Services become abnormal because the host CPU or memory is overloaded or a process is faulty. Cluster: Simulate abnormal scenarios by increasing the load or performing an active/standby switchover, for example, increasing the load on a container cluster or performing an active/standby switchover in a database cluster. Network: Inject network faults into hosts or clusters to verify the DR capability of your service. Such faults include packet loss, network latency, and intermittent disconnection at the link layer. DR: Simulate inter-region network exceptions or service unavailability in a single region to verify the self-recovery capabilities of services. Container: Simulate process and resource faults and network attacks on container instances, such as increased CPU and memory pressure, network attacks, system OOM, or process killing. Service and Data: Simulate service exceptions caused by database or file exceptions, such as database table deletion and database unavailability.
Incident Level	The options are P1, P2, P3, P4, and P5. By default, P1 incidents are the most critical, while P5 incidents are the least severe.
Source	The options are Failure modes detected proactively and Existing failure modes. Failure modes detected proactively: The failure mode is formed by proactively analyzing risks in the application architecture and running environment. Existing failure modes: The failure mode is formed by analyzing existing faults and incidents.
Alarm ID	(Optional) ID of the alarm that is triggered when a fault occurs.
Attack Scenario	(Optional) Select an attack scenario from the drop-down list. A maximum of 10 attack scenarios can be selected.
Enterprise Project	Select the enterprise project to which the failure mode resource belongs from the drop-down list.
Application	Select the application to which the drill target belongs from the drop-down list.
Contingency Plan Available	Whether a contingency plan is available for the failure mode. This toggle can be enabled or disabled.
Contingency Plan	This parameter is mandatory when Contingency Plan Available is enabled. Select a contingency plan from the drop-down list. If no suitable contingency plan is available, create one. For details, see Creating a Custom Contingency Plan.
Occurrence Conditions	Enter the conditions under which the fault may occur. The value can contain 1 to 1,024 characters.
Fault Symptom	Enter the possible service symptom when the fault occurs. The value can contain 1 to 1,024 characters.
Impact on Customer	Enter the impact of the failure on customers. The value can contain 0 to 1,024 characters.
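Conceptually, cloning copies a pre-defined failure mode and lets you override individual fields before saving. The sketch below models that idea locally, assuming a hypothetical `FailureModeTemplate` record and `clone_failure_mode` helper (neither is part of COC). The character ranges are the ones listed in Table 2.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FailureModeTemplate:
    """Hypothetical stand-in for an excellent failure mode case."""
    name: str
    scenario_category: str
    incident_level: str
    occurrence_conditions: str
    fault_symptom: str
    impact_on_customer: str = ""


# (minimum, maximum) character counts from Table 2.
TEXT_RANGES = {
    "occurrence_conditions": (1, 1024),
    "fault_symptom": (1, 1024),
    "impact_on_customer": (0, 1024),
}


def clone_failure_mode(template: FailureModeTemplate, **overrides) -> FailureModeTemplate:
    """Copy the template, apply overrides, and check the Table 2 text ranges."""
    clone = replace(template, **overrides)  # the template itself is left unchanged
    for field_name, (low, high) in TEXT_RANGES.items():
        length = len(getattr(clone, field_name))
        if not low <= length <= high:
            raise ValueError(
                f"{field_name} must contain {low} to {high} characters, got {length}"
            )
    return clone


if __name__ == "__main__":
    template = FailureModeTemplate(
        name="Host CPU overload",            # example values only
        scenario_category="Node",
        incident_level="P3",
        occurrence_conditions="CPU usage stays high under peak traffic.",
        fault_symptom="Requests respond slowly.",
    )
    clone = clone_failure_mode(template, name="Host CPU overload (shop)", incident_level="P2")
    print(clone)
```

Keeping the template immutable (`frozen=True`) mirrors the fact that cloning produces a new failure mode and leaves the pre-defined case untouched.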