Performing a Fault Drill of High Memory Usage on COC
Scenarios
An e-commerce company has deployed a new application in the production environment and plans to open it to user traffic. However, the company's O&M is mainly reactive and lacks proactive O&M practices and tooling. Before go-live, the team has no effective way to identify availability issues, and after go-live, it cannot accurately track the availability status. The O&M team also lacks emergency response capabilities and hands-on experience. The company therefore wants to use chaos engineering to test the application's architectural resilience in the production environment before launch, to confirm that no major stability risks remain at the official launch.
Solutions
Chaos drills drive proactive O&M: Starting from the customer's actual business scenario, we provide end-to-end chaos drill capabilities based on four dimensions: risk analysis, contingency plan selection, drill execution, and review and improvement.
Failure mode library: We pioneered a fault scenario analysis method based on a fault-tolerance perspective and have built a failure mode library from the Huawei Cloud SRE team's years of experience. The library includes over 300 typical failure modes.
- Risk analysis: Analyze the application architecture to identify risks.
- Contingency plan: Define contingency plans for the identified risks.
- Fault drill: Based on the results of the risk analysis and the contingency plans, create a drill plan and conduct fault drills.
- Review and improvement: After the drill is completed, summarize the drill and output the drill report and improvement items.
Core Advantages
- Pioneered the FT-FMEA fault scenario analysis method based on a fault-tolerant perspective, gradually incorporating 300+ failure modes.
- Supports multi-dimensional attack scenarios, covering both virtualization and containerization.
- Supports custom attack process orchestration to meet individual customer business needs.
Prerequisites
- An application group has been created. For details, see Application Management.
- UniAgent has been installed on the resources required for chaos drills. For details, see Installing UniAgent.
Step 1: Create a Failure Mode
Check whether the application to which the target host or container belongs and the incident level are correct.
- Log in to COC.
- In the navigation pane, choose Resilience Center > Chaos Drills.
- On the Failure Modes tab page, click Create Failure Mode.
- Enter failure mode information. This example describes only the mandatory parameters. Retain the preset values for other parameters.
Table 1 Parameters for creating a failure mode

| Parameter | Description |
| --- | --- |
| Failure Mode | User-defined failure mode name, for example, Drill-Test. |
| Scenario Category | Select Node. Node faults include CPU or memory overload and process faults on the host, any of which may cause service exceptions. |
| Incident Level | Select P4. By default, P1 incidents are the most critical, while P5 incidents are the least severe. |
| Source | Select Proactive analysis. Proactively analyze risks in the application architecture and running environment to form a failure mode. |
| Attack Scenario | Select an attack scenario from the drop-down list, for example, memory usage increase. |
| Enterprise Project | Select the enterprise project to which the failure mode resource belongs from the drop-down list, for example, default. |
| Application | Select the application to which the drill target belongs from the drop-down list, for example, COC. |
| Contingency Plan Available | Select No. |
| Occurrence Condition | Enter the conditions under which the fault may occur, for example, the service volume increases sharply and exceeds the service design specifications. The information can contain a maximum of 1,024 characters. |
| Fault Symptom | Enter the possible service symptom when the fault occurs. For example, the memory usage of the VM where the service is located is too high, causing slow response to service requests. The information can contain a maximum of 1,024 characters. |
| Impact on Customer | Enter the impact on customers caused by the fault. For example, slow service request response or failure of some service requests during flow control. The information can contain a maximum of 1,024 characters. |
- Set Contingency Plan Available. If you select Yes, enter a contingency plan name to search for the plan, select the plan, and click Save.
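Before filling in the form, it can help to draft the failure mode offline and check it against the limits stated in Table 1. The following is a minimal sketch, not a COC API: the `FailureMode` class, its field names, and the `problems` helper are hypothetical and exist only to mirror the table. The only rules it encodes are the ones given above (levels P1 to P5 and the 1,024-character limit on the three free-text fields).

```python
from dataclasses import dataclass
from typing import List

MAX_TEXT_LEN = 1024  # limit stated in Table 1 for the free-text fields
INCIDENT_LEVELS = ("P1", "P2", "P3", "P4", "P5")  # P1 most critical, P5 least severe


@dataclass
class FailureMode:
    """Offline draft of the Table 1 parameters (hypothetical field names)."""
    name: str
    scenario_category: str
    incident_level: str
    source: str
    attack_scenario: str
    enterprise_project: str
    application: str
    contingency_plan_available: bool
    occurrence_condition: str
    fault_symptom: str
    impact_on_customer: str

    def problems(self) -> List[str]:
        """Return readable problems; an empty list means the draft passes these checks."""
        found = []
        if self.incident_level not in INCIDENT_LEVELS:
            found.append("incident_level must be one of P1 to P5")
        for field_name in ("occurrence_condition", "fault_symptom", "impact_on_customer"):
            value = getattr(self, field_name)
            if not value.strip():
                found.append(f"{field_name} is empty")
            elif len(value) > MAX_TEXT_LEN:
                found.append(f"{field_name} exceeds {MAX_TEXT_LEN} characters")
        return found


# Draft that uses the example values from Table 1.
drill_test = FailureMode(
    name="Drill-Test",
    scenario_category="Node",
    incident_level="P4",
    source="Proactive analysis",
    attack_scenario="memory usage increase",
    enterprise_project="default",
    application="COC",
    contingency_plan_available=False,
    occurrence_condition="The service volume increases sharply and exceeds the service design specifications.",
    fault_symptom="The memory usage of the VM where the service is located is too high.",
    impact_on_customer="Slow service request response.",
)
print(drill_test.problems())
```

Returning a list of problems rather than raising on the first one lets a reviewer see every issue in a draft at once.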
Step 2: Create a Contingency Plan
- In the navigation pane, choose Resilience Center > Contingency Plans.
- On the Customized Plan tab page, click Create.
Figure 1 Creating a contingency plan

- Configure the basic information. This example describes only the mandatory parameters. Retain the preset values for other parameters.
Table 2 Basic information parameters

| Parameter | Description |
| --- | --- |
| Contingency Plan | Enter a contingency plan name, for example, contingency plan 01. |
| Enterprise Project | Select the enterprise project to which the contingency plan belongs from the drop-down list, for example, default. |
| Application | Select the application to which the target instance, where the fault will be injected, belongs. For example, COC. |
- Set Troubleshooting.
- Contingency Plan Type: Automation Plan.
- Handling Method: Select Script and select the corresponding script from the drop-down list.
- Click OK.
Step 3: Create a Drill Plan
When you create a drill plan, you designate an executor. The executor receives the resulting service ticket and creates a drill task, which is associated with the failure mode and region specified in the plan.
- In the navigation pane, choose Resilience Center > Chaos Drills.
- On the Drill Plans tab page, click Create Drill Plan.
- Select the failure mode created in Step 1: Create a Failure Mode, and set the executor, region, and planned drill time.
Figure 2 Creating a drill plan

- Click OK.
Step 4: Create a Drill Task
- As the executor, choose Resilience Center > Chaos Drills and click the Drill Plans tab. Locate the drill plan created in Step 3: Create a Drill Plan and click Receive in the Operation column.
- Enter basic information about the drill task.
Table 3 Basic information parameters

| Parameter | Description |
| --- | --- |
| Drill Task | User-defined drill task name. In this example, the preset value is used. |
| Expected Recovery Duration (Minutes) | Expected time from fault occurrence to fault recovery, in minutes, for example, 3. This is the time within which the application is expected to recover automatically to the normal state during contingency plan execution after a fault is injected. This time does not affect the drill task. |
Figure 3 Basic information
- There is one attack task group by default in the drill task. Click Create Attack Task.
Figure 4 Creating an attack task

- Set the attack task by referring to Table 4.
Table 4 Parameters for adding an attack task

| Parameter | Description |
| --- | --- |
| Cloud Service Provider | Select a cloud vendor, for example, Huawei Cloud. |
| Source of Attack Target | Select the source of the target instance, for example, ECS. |
| Attack Task | The attack task name is automatically generated. |
| Attack Target | Select the target instance. |

- Click Next and select an attack scenario.
Table 5 Parameters for selecting an attack scenario

| Parameter | Description |
| --- | --- |
| Attack Type | Attack scenario type, for example, host type. |
| Attack Scenario | Select an attack scenario, for example, memory usage increase. |
| Attack Parameters | Configure attack parameters based on the attack scenario. In this example: Memory Usage (%): 80; Fault Duration (s): 60. |
- Click Next.
- Set a monitoring task.
Figure 5 Setting a monitoring task

- Select Memory Usage as the stable indicator. The threshold range is 1 to 96.
- Select the memory usage as the monitoring metric. The threshold range is 0 to 60.
- Click Finish.
- Click OK.
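To get a feel for what the example attack parameters (Memory Usage 80%, Fault Duration 60 s) mean for a specific host, you can estimate how much additional memory would have to be occupied to reach the target. The following is a rough sketch of that arithmetic only. It does not describe how COC injects the fault, and the function name and the host figures (8,192 MiB total, 2,048 MiB in use) are assumptions chosen for illustration.

```python
def extra_memory_needed_mib(total_mib: float, used_mib: float, target_percent: float) -> float:
    """Estimate the MiB that must be occupied to raise memory usage to target_percent.

    Returns 0.0 if the host is already at or above the target.
    """
    if total_mib <= 0:
        raise ValueError("total_mib must be positive")
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    target_mib = total_mib * target_percent / 100
    return max(0.0, target_mib - used_mib)


# Hypothetical host: 8,192 MiB of memory, 2,048 MiB currently in use.
TARGET_PERCENT = 80      # Memory Usage (%) from the attack parameters
FAULT_DURATION_S = 60    # Fault Duration (s) from the attack parameters

needed = extra_memory_needed_mib(total_mib=8192, used_mib=2048, target_percent=TARGET_PERCENT)
print(f"About {needed:.1f} MiB would be held for {FAULT_DURATION_S} s")
```

If the estimate is close to the memory the application needs for normal operation, consider starting with a lower target percentage before drilling in production.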
Step 5: Start the Drill Task
Start the drill for the created drill task.
- Select the drill task created in Step 4: Create a Drill Task and click Start Drill in the Operation column.
- In the displayed dialog box, click OK.
After the drill is started, the drill details page is displayed. The chaos drill platform automatically performs fault injection based on the drill task settings.
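While the fault is injected, the monitoring task configured in Step 4 compares memory usage against the threshold ranges you set. The sketch below illustrates one way to reason about that comparison. It assumes that a sample is flagged when it falls outside the configured range, which is an interpretation and not documented COC behavior. The function name and the sample values are invented for illustration.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[int, float]  # (seconds since drill start, memory usage in percent)


def out_of_range(samples: Iterable[Sample], low: float, high: float) -> List[Sample]:
    """Return the samples whose value lies outside the inclusive range [low, high]."""
    return [(t, value) for t, value in samples if not low <= value <= high]


# Invented samples: usage rises to about 80% during injection, then recovers.
samples = [(0, 35.0), (30, 81.2), (60, 79.8), (90, 36.1)]

# Threshold ranges from Step 4: 1 to 96 and 0 to 60.
print(out_of_range(samples, 1, 96))
print(out_of_range(samples, 0, 60))
```

Under this assumption, an 80% injection stays inside the 1 to 96 range but leaves the 0 to 60 range, so only the second check would flag the injected fault.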
Step 6: Create a Drill Report
Once a drill is finished, you can create a drill report.
- After the drill task is complete, click Drill Report in the upper right corner of the drill details page.
- Click Edit Duration to change the actual recovery duration.
Figure 6 Changing the actual recovery duration

Table 6 Parameters for modifying the actual recovery duration

| Parameter | Description |
| --- | --- |
| Fault Detection Duration (Minutes) | Enter the fault detection duration: the time from when the fault injection is complete to when the fault alarm is received. |
| Fault Demarcation Duration (Minutes) | Enter the fault demarcation duration: the time from when the alarm is reported to when the fault demarcation is complete. |
| Fault Rectification Duration (Minutes) | Enter the fault rectification duration: the time from when the fault demarcation is complete to when the fault is rectified. |
- Click OK.
After the actual recovery duration is changed, the system automatically generates a recovery capability score.
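The three durations in Table 6 describe consecutive phases, so their sum can be compared with the Expected Recovery Duration set in Step 4 when you review the drill. The following is a minimal sketch under that assumption. How COC calculates the recovery capability score is not described here, and this code does not attempt to reproduce it. The class name and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryDurations:
    """Durations in minutes, as entered in the Edit Duration dialog box."""
    detection_min: float      # injection complete -> alarm received
    demarcation_min: float    # alarm reported -> demarcation complete
    rectification_min: float  # demarcation complete -> fault rectified

    def total_min(self) -> float:
        """Sum of the three phases (assumed to equal the actual recovery duration)."""
        return self.detection_min + self.demarcation_min + self.rectification_min

    def within(self, expected_min: float) -> bool:
        """True if the summed duration does not exceed the expected recovery duration."""
        return self.total_min() <= expected_min


# Invented values for one drill, compared with the 3-minute example from Step 4.
observed = RecoveryDurations(detection_min=1.0, demarcation_min=1.0, rectification_min=0.5)
print(observed.total_min(), observed.within(expected_min=3))
```

Tracking the three phases separately shows where recovery time is spent, which helps decide whether an improvement ticket should target alarming, diagnosis, or the contingency plan itself.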
- Create an improvement ticket in the Improvement Tickets module.
Table 7 Parameters for creating an improvement ticket

| Parameter | Description |
| --- | --- |
| Improvement Task | Name of an improvement ticket. For example, drill test improvement. The name can contain a maximum of 64 characters, including letters, digits, hyphens (-), underscores (_), and spaces. It cannot start or end with a space. |
| Application | Select an application for which the improvement is performed from the drop-down list, for example, COC. |
| Type | Select an improvement type from the drop-down list, for example, O&M improvement. |
| Improvement Owner | Select an owner from the drop-down list. |
| Improvement Acceptor | Select an acceptance user from the drop-down list. |
| Expected Completion | Enter the expected completion time (accurate to day). The selected date cannot be earlier than today. |
| Symptom | Enter the incident-related problem symptom. The value can contain a maximum of 1,000 characters. |
| Improvement Ticket Closure Criteria | Enter the improvement closure criteria. The value can contain a maximum of 1,000 characters. |
- Click OK.
After the ticket is created, the improvement item list is expanded by default, including the processing information and verification information.
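If improvement tickets are drafted outside the console first, the constraints in Table 7 can be checked in advance. The sketch below encodes only the stated rules: the name format, the date not being earlier than today, and the 1,000-character limits. It assumes that "letters" means ASCII letters, which the table does not specify, and the function and parameter names are hypothetical.

```python
import re
from datetime import date
from typing import List, Optional

# 1 to 64 characters: letters, digits, hyphens, underscores, and spaces,
# with no space at the start or end. ASCII letters are an assumption.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-](?:[A-Za-z0-9_ -]{0,62}[A-Za-z0-9_-])?")
MAX_TEXT_LEN = 1000  # limit for Symptom and Improvement Ticket Closure Criteria


def ticket_problems(name: str, expected_completion: date, symptom: str,
                    closure_criteria: str, today: Optional[date] = None) -> List[str]:
    """Return readable problems; an empty list means the draft passes these checks."""
    today = today or date.today()
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append("name must be 1 to 64 allowed characters without leading or trailing spaces")
    if expected_completion < today:
        problems.append("expected_completion cannot be earlier than today")
    for label, text in (("symptom", symptom), ("closure_criteria", closure_criteria)):
        if len(text) > MAX_TEXT_LEN:
            problems.append(f"{label} exceeds {MAX_TEXT_LEN} characters")
    return problems


# Example draft; the fixed "today" keeps the check reproducible.
print(ticket_problems(
    name="drill test improvement",
    expected_completion=date(2030, 1, 2),
    symptom="Memory usage alarm was received late.",
    closure_criteria="Alarm is received within one minute of fault injection.",
    today=date(2030, 1, 1),
))
```

Passing `today` explicitly is a convenience for testing. In normal use, omit it so that the current date is used.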
