E2E Chaos Engineering
Scenario
An e-commerce company has deployed a new application in its production environment and plans to officially open it to user traffic. The company's traditional O&M model is mainly reactive and lacks proactive O&M practices and tooling. Before an application goes live, the team has no effective way to identify availability issues, and after it goes live, the team cannot accurately assess its availability status. The O&M team also lacks emergency response capabilities and hands-on experience. The company wants to use chaos engineering to test the application's architectural resilience in the production environment before the launch, to confirm that no major stability risks remain when the application officially goes live.
Solution
Chaos drills drive proactive O&M: Starting from the customer's actual business scenario, we provide end-to-end chaos drill capabilities that cover risk analysis, contingency planning, drill execution, and review and improvement.
Accumulated failure modes: We pioneered a fault scenario analysis method based on a fault-tolerance perspective and built a library of more than 300 typical failure modes from Huawei Cloud SRE's years of experience.
- Risk analysis: Analyze the application architecture to identify risks.
- Contingency plan: Define contingency plans for the identified risks.
- Fault drill: Based on the risk analysis results and the contingency plans, create a drill plan and conduct fault drills.
- Review and improvement: After the drill is completed, summarize it and produce a drill report and improvement items.
Core Advantages
- Pioneered the FT-FMEA fault scenario analysis method, which is based on a fault-tolerance perspective, and has built up a library of 300+ failure modes.
- Supports multi-dimensional attack scenarios, covering both virtualization and containerization.
- Supports custom attack process orchestration to meet individual customer business needs.
Prerequisites
An application group has been created on the application management page.
UniAgent has been installed on the resources that will be used in chaos drills. For details, see "Installing the UniAgent".
Step 1: Failure Mode
Check whether the application to which the target host or container belongs and the incident level are correct.
- Log in to COC.
- In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click Risk Management Tasks and then the Failure Modes tab.
Figure 1 Failure Modes tab page
- Enter failure mode information.
Figure 2 Creating a failure mode
Table 1 Failure mode parameters

| Parameter | Description |
| --- | --- |
| Failure Mode | Custom failure mode name |
| Application | Application the drill object belongs to |
| Incident Level | See the incident center page. |
| Source | The options are Failure modes detected proactively and Existing failure modes. |
| Contingency Plan | For details, see the contingency plan section. |
| Scenario Category | Failure scenario. The options are Redundancy, DR, Overloading, Configuration, and Dependencies. |
| Occurrence Conditions | Possible conditions that cause the failure |
- Set Contingency Plan Available. If you select Yes, enter a contingency plan name to search for the plan, select the plan, and click Save.
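Before entering failure modes in the console, a team may find it useful to draft them offline and check them against the allowed options. The following is a minimal sketch of such a check. The option lists are taken from Table 1; the `FailureMode` class, its field names, and the `problems` method are hypothetical and are not part of any COC API.

```python
from dataclasses import dataclass
from typing import List, Optional

# Option lists copied from Table 1. Everything else in this sketch is hypothetical.
SOURCES = ("Failure modes detected proactively", "Existing failure modes")
SCENARIO_CATEGORIES = ("Redundancy", "DR", "Overloading", "Configuration", "Dependencies")


@dataclass
class FailureMode:
    """Local planning record that mirrors the Table 1 fields (not a COC API object)."""
    name: str
    application: str
    incident_level: str            # value format depends on your incident center settings
    source: str
    scenario_category: str
    occurrence_conditions: str
    contingency_plan: Optional[str] = None  # None stands for "Contingency Plan Available: No"

    def problems(self) -> List[str]:
        """Return readable problems; an empty list means the draft passes these checks."""
        found = []
        if not self.name.strip():
            found.append("Failure Mode name is empty")
        if self.source not in SOURCES:
            found.append(f"Source must be one of {SOURCES}")
        if self.scenario_category not in SCENARIO_CATEGORIES:
            found.append(f"Scenario Category must be one of {SCENARIO_CATEGORIES}")
        return found
```

A draft with an unknown Source or Scenario Category is reported before anyone types it into the console, which keeps the failure mode library consistent across team members.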
Step 2: Contingency Plan
Select the application to which the fault injection target host belongs.
- Log in to COC.
- In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click Risk Management Tasks and then the Contingency Plans tab.
Figure 3 Contingency Plans tab page
- Enter basic information about the contingency plan.
Figure 4 Creating a contingency plan
Table 2 Contingency plan parameters

| Parameter | Description |
| --- | --- |
| Contingency Plan | Custom contingency plan name |
| Application | Application to which the target host or container belongs |
| Description | Description of the contingency plan |
| Contingency Plan Attachment | Emergency recovery guide for handling abnormal situations during the drill |
- Unexpected exceptions may occur during a drill, so prepare emergency measures and an emergency recovery guide in advance. Click Upload to upload the guide, and then click OK.
Step 3: Drill Planning
You can create a drill plan and designate an executor for it. The executor receives a service ticket, accepts it, and creates a drill task that is associated with the specified failure mode and region.
- Log in to COC.
- In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click Risk Management Tasks and then the Drill Plans tab.
Figure 5 Drill Plans tab page
- Click Create Drill Plan. In the displayed dialog box, set Failure Mode, Executed By, Region, and Planned Drill Time, and click OK.
Figure 6 Creating a drill plan
- The executor clicks Accept in the Operation column. The page for creating a drill task is displayed, and the drill task is associated with the specified failure mode and region. You can also track the progress of drill tasks.
Figure 7 Switching to the page for creating a drill task
Step 4: Drill Task
Create a drill task on COC.
- Log in to COC.
- In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click the Drill Tasks tab.
- Click Create Task.
Figure 8 Creating a drill task
- Enter basic information about the drill task, including Drill Task and Expected Recovery Duration (Minutes).
Figure 9 Basic information
- Select an attack task. By default, there is one attack task group. You can click Create Task Group to add a task group or click Create Attack Task to access the page for creating an attack task.
Figure 10 Selecting an attack task
- On the displayed Create Attack Task page, select Create Attack Task or Select from Existing. If you have not created any attack task before, select Create Attack Task. If you have, you can select Select from Existing.
- Creating an attack task: Select an attack target and then an attack scenario. Different attack targets correspond to different attack scenarios. Enter the attack task name. The attack target source can be Elastic Cloud Server (ECS) or Cloud Container Engine (CCE). If you select ECS, select the target server from the list below and click Next.
Figure 11 Selecting ECS as the attack target source
- Select an attack scenario, set attack parameters, and click OK. The scenarios include Host Resource, Host Process, and Host Network.
Figure 12 ECS attack scenarios
- If you select Cloud Container Engine (CCE) as the attack target source, select an application, and then select a cluster, namespace, workload type, and workload in sequence to locate the pods. You can specify either individual pods or the number of pods. Then click Next.
Figure 13 Selecting CCE as the attack target source and specifying a pod
Figure 14 Selecting CCE as the attack target source and specifying the quantity
- Select a CCE attack scenario, set attack parameters, and click OK. The scenarios include Weapons Attacking POD Instances, Weapons Attacking POD Processes, and Weapons Attacking the POD Network.
Figure 15 CCE attack scenarios
- If you select Select from Existing, select the created attack task from the task list below and click OK.
Figure 16 Selecting an existing attack task
- Click OK. The drill task is created.
Figure 17 Clicking OK
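The structure of a drill task described in this step (task groups that contain attack tasks, each with a target source and a matching scenario) can be sketched as a small local model. This is only an illustration for planning drills on paper: the scenario names come from this step, while the `AttackTask` and `DrillTask` classes, their fields, and the target names are hypothetical and do not represent a COC API.

```python
from dataclasses import dataclass, field
from typing import List

# Scenario names per attack target source, as listed in Step 4.
SCENARIOS = {
    "ECS": ("Host Resource", "Host Process", "Host Network"),
    "CCE": (
        "Weapons Attacking POD Instances",
        "Weapons Attacking POD Processes",
        "Weapons Attacking the POD Network",
    ),
}


@dataclass
class AttackTask:
    """Hypothetical planning record for one attack task."""
    name: str
    target_source: str   # "ECS" or "CCE"
    scenario: str
    targets: List[str]   # selected servers or pods; selection by pod count is not modelled

    def validate(self) -> None:
        if self.target_source not in SCENARIOS:
            raise ValueError(f"unknown attack target source: {self.target_source}")
        if self.scenario not in SCENARIOS[self.target_source]:
            raise ValueError(f"{self.scenario!r} is not a {self.target_source} scenario")
        if not self.targets:
            raise ValueError("select at least one target")


@dataclass
class DrillTask:
    """Hypothetical planning record for a drill task and its attack task groups."""
    name: str
    expected_recovery_minutes: int
    # One empty attack task group exists by default, as in the console.
    task_groups: List[List[AttackTask]] = field(default_factory=lambda: [[]])

    def validate(self) -> None:
        if self.expected_recovery_minutes <= 0:
            raise ValueError("Expected Recovery Duration (Minutes) must be positive")
        tasks = [task for group in self.task_groups for task in group]
        if not tasks:
            raise ValueError("add at least one attack task")
        for task in tasks:
            task.validate()
```

Validating the plan this way catches a mismatch such as an ECS target paired with a CCE scenario before the drill task is created in the console.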
Step 5: Drill Report
Once a drill is finished, you can create a drill report.
- Log in to COC.
- In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click the Drill Tasks tab.
Figure 18 Drill task list
- Locate the row containing the finished drill task and click Drill Record in the Operation column. In the displayed drill record list, locate the desired drill record and click Create Report or View Progress in the Operation column. On the displayed Drill Record Detail page, click Create Drill Report on the right.
Figure 19 Drill record list
Figure 20 Drill Record Detail page
- Go to the drill report page and update the report name.
Figure 21 Drill report details
- On the drill report details page, enter the drill duration and click OK.
Figure 22 Modify Drill Duration
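When reviewing a report, it can help to compare how long recovery actually took with the Expected Recovery Duration (Minutes) entered in Step 4. The sketch below shows one way to do that calculation yourself. The comparison is an illustration and not a documented COC feature, and the function names and the two timestamps (assumed to come from your own monitoring) are hypothetical.

```python
from datetime import datetime


def recovery_minutes(fault_injected_at: datetime, service_recovered_at: datetime) -> float:
    """Minutes between fault injection and observed recovery (timestamps from your own monitoring)."""
    if service_recovered_at < fault_injected_at:
        raise ValueError("recovery time is earlier than fault injection time")
    return (service_recovered_at - fault_injected_at).total_seconds() / 60


def met_expectation(actual_minutes: float, expected_recovery_minutes: float) -> bool:
    """True if recovery finished within the expected recovery duration set for the drill task."""
    return actual_minutes <= expected_recovery_minutes
```

A drill that misses its expected recovery duration is a natural candidate for an improvement item in the review stage.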
Step 6: Review and Improvement
After you create a drill report, you can create improvement tickets to record improvement suggestions.
- Log in to COC.
- In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click the Drill Tasks tab.
Figure 23 Drill task list
- Click Drill Record.
- Access the drill record list and click View Report or Create Report.
Figure 24 Drill record list
- Access the drill report details page, click Create Improvement Ticket on the right, and enter information about the improvement ticket.
Figure 25 Creating an improvement ticket
Table 3 Improvement ticket parameters

| Parameter | Description |
| --- | --- |
| Improvement Task | Improvement task name |
| Application | Application the improvement task belongs to |
| Type | Type of the improvement task |
| Improvement Owner | Owner of the improvement task |
| Expected Completion | Expected completion time of the improvement task |
| Symptom | Symptom |
| Improvement Ticket Closure Criteria | Criteria for the closure of the improvement ticket |