
Updated on 2024-04-19 GMT+08:00

E2E Chaos Engineering

Scenario

An e-commerce company has deployed a new application in the production environment and plans to open it to user traffic. The company's O&M model is mainly reactive and lacks proactive O&M practices and tooling. Before launch, the team has no effective way to identify availability issues, and after launch it cannot accurately assess the application's availability status. The O&M team also lacks emergency response capabilities and hands-on experience. The company wants to use chaos engineering to test the application's architectural resilience in the production environment before the official launch, to confirm that no major stability risks remain.

Solution

Chaos drills drive proactive O&M: Starting from the customer's actual business scenario, we provide end-to-end chaos drill capabilities that cover risk analysis, contingency planning, drill execution, and review and improvement.

Accumulated fault modes: We pioneered a fault scenario analysis method based on a fault-tolerance perspective and have built a library of more than 300 typical fault modes from Huawei Cloud SRE's years of experience.

Risk analysis: Analyze the application architecture to identify risks.
Contingency plan: Define contingency plans for the identified risks.
Fault drill: Based on the results of the risk analysis and the contingency plans, define the drill plan and conduct fault drills.
Review and improvement: After the drill is complete, summarize it and produce a drill report and improvement items.

Core Advantages

Pioneers the FT-FMEA fault scenario analysis method, which takes a fault-tolerance perspective and draws on an accumulated library of 300+ fault modes.
Supports multi-dimensional attack scenarios, covering both virtualization and containerization.
Supports custom attack process orchestration to meet individual customer business needs.

Prerequisites

An application group has been created on the application management page.

UniAgent has been installed on the resources that will be used for chaos drills. For details, see "Installing the UniAgent".

Step 1: Failure Mode

Create a failure mode. Make sure that the application to which the target host or container belongs and the incident level are correct.

Log in to COC.
In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click Risk Management Tasks and then the Failure Modes tab.
Figure 1 Failure Modes tab page

Enter failure mode information.

Figure 2 Creating a failure mode

**Table 1** Failure mode parameters
Parameter	Description
Failure Mode	Custom failure mode name
Application	Application the drill object belongs to
Incident Level	Incident level. For details, see the incident center page.
Source	Source of the failure mode. The options are Failure modes detected proactively and Existing failure modes.
Contingency Plan	For details, see the contingency plan section.
Scenario Category	Failure scenario category. The options are Redundancy, DR, Overloading, Configuration, and Dependencies.
Occurrence Conditions	Possible conditions that cause the failure

Set Contingency Plan Available. If you select Yes, enter a contingency plan name to search for the plan, select the plan, and click Save.
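
The parameters in Table 1 can be drafted and checked offline before they are entered in the console. The following is a minimal Python sketch under stated assumptions: it is not a COC API, and the class name `FailureMode`, the enum names, the validation rule, and all example values (including the incident level string) are hypothetical. Only the option labels are taken from Table 1.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Source(Enum):
    # Option labels copied from Table 1.
    DETECTED_PROACTIVELY = "Failure modes detected proactively"
    EXISTING = "Existing failure modes"


class ScenarioCategory(Enum):
    # Option labels copied from Table 1.
    REDUNDANCY = "Redundancy"
    DR = "DR"
    OVERLOADING = "Overloading"
    CONFIGURATION = "Configuration"
    DEPENDENCIES = "Dependencies"


@dataclass
class FailureMode:
    """Hypothetical local record mirroring Table 1; not a COC API object."""
    name: str
    application: str
    incident_level: str  # free text here; valid levels come from the incident center
    source: Source
    scenario_category: ScenarioCategory
    occurrence_conditions: str
    contingency_plan_available: bool = False
    contingency_plan: Optional[str] = None

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the record looks complete."""
        problems = []
        if not self.name.strip():
            problems.append("Failure Mode name is empty")
        if not self.application.strip():
            problems.append("Application is empty")
        if self.contingency_plan_available and not self.contingency_plan:
            problems.append("Contingency Plan Available is Yes but no plan is selected")
        return problems


# Example values are illustrative only.
mode = FailureMode(
    name="order-db-primary-down",
    application="order-service",
    incident_level="P2",
    source=Source.DETECTED_PROACTIVELY,
    scenario_category=ScenarioCategory.REDUNDANCY,
    occurrence_conditions="Primary database node becomes unreachable",
    contingency_plan_available=True,
)
print(mode.validate())
```

Returning a list of problems instead of raising on the first one lets a reviewer see every gap in a drafted failure mode at once.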

Step 2: Contingency Plan

Create a contingency plan for the application to which the target host (where the fault will be injected) belongs.

Log in to COC.
In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click Risk Management Tasks and then the Contingency Plans tab.
Figure 3 Contingency Plans tab page

Enter basic information about the contingency plan.

Figure 4 Creating a contingency plan

**Table 2** Contingency plan parameters
Parameter	Description
Contingency Plan	Custom contingency plan name
Application	Application to which the target host or container belongs
Description	Description about the contingency plan
Contingency Plan Attachment	Emergency recovery guide for handling abnormal situations during the drill

Unexpected exceptions may occur during a drill, so prepare emergency measures in advance and have the emergency recovery guide ready. Click Upload to upload the guide and then click OK.

Step 3: Drill Planning

You can create a drill plan and designate an executor for it. After receiving the service ticket, the executor creates a drill task, which is associated with the failure mode and region specified in the plan.

Log in to COC.
In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click Risk Management Tasks and then the Drill Plans tab.
Figure 5 Drill Plans tab page
Click Create Drill Plan. In the displayed dialog box, set Failure Mode, Executed By, Region, and Planned Drill Time, and click OK.
Figure 6 Creating a drill plan
The executor clicks Accept in the Operation column to go to the page for creating a drill task. The drill task is associated with the failure mode and region specified in the plan. You can also track the progress of drill tasks.
Figure 7 Switching to the page for creating a drill task

Step 4: Drill Task

Create a drill task on COC.

Log in to COC.
In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click the Drill Tasks tab.
Click Create Task.
Figure 8 Creating a drill task
Enter basic information about the drill task, including Drill Task and Expected Recovery Duration (Minutes).
Figure 9 Basic information
Select an attack task. By default, there is one attack task group. You can click Create Task Group to add a task group or click Create Attack Task to access the page for creating an attack task.
Figure 10 Selecting an attack task
On the displayed Create Attack Task page, select Create Attack Task or Select from Existing. If you have not created any attack tasks, select Create Attack Task. If you have created attack tasks before, you can select Select from Existing to reuse one.
Creating an attack task: Enter the attack task name, select an attack target, and then select an attack scenario. The available attack scenarios depend on the attack target. The attack target source can be Elastic Cloud Server (ECS) or Cloud Container Engine (CCE). If you select ECS, select the target server from the list and click Next.
Figure 11 Selecting ECS as the attack target source
Select an attack scenario, set attack parameters, and click OK. The scenarios include Host Resource, Host Process, and Host Network.
Figure 12 ECS attack scenarios
If you select Cloud Container Engine (CCE) as the attack target source, select an application and then the pods by choosing a cluster, namespace, workload type, and workload in sequence. You can either specify individual pods or specify the number of pods. Then click Next.
Figure 13 Selecting CCE as the attack target source and specifying a pod

Figure 14 Selecting CCE as the attack target source and specifying the quantity
Select a CCE attack scenario, set attack parameters, and click OK. The scenarios include Weapons Attacking POD Instances, Weapons Attacking POD Processes, and Weapons Attacking the POD Network.
Figure 15 CCE attack scenarios
If you select Select from Existing, select the created attack task from the task list below and click OK.
Figure 16 Selecting an existing attack task
Click OK. The drill task is created.
Figure 17 Clicking OK
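
The structure described in this step (a drill task that holds task groups, each holding attack tasks whose scenario must match the target source) can be sketched as a small data model. This is an illustrative Python sketch, not a COC API: the class names, the `check_attack_task` helper, and the example task and target names are hypothetical. The scenario labels and the default of one task group are taken from the text above.

```python
from dataclasses import dataclass, field

# Scenario labels per attack target source, as listed in this step.
SCENARIOS_BY_SOURCE = {
    "ECS": {"Host Resource", "Host Process", "Host Network"},
    "CCE": {
        "Weapons Attacking POD Instances",
        "Weapons Attacking POD Processes",
        "Weapons Attacking the POD Network",
    },
}


@dataclass
class AttackTask:
    name: str
    target_source: str   # "ECS" or "CCE"
    scenario: str
    targets: list[str]   # server names for ECS, pod names for CCE


@dataclass
class TaskGroup:
    attack_tasks: list[AttackTask] = field(default_factory=list)


@dataclass
class DrillTask:
    name: str
    expected_recovery_minutes: int
    # By default there is one attack task group.
    task_groups: list[TaskGroup] = field(default_factory=lambda: [TaskGroup()])


def check_attack_task(task: AttackTask) -> list[str]:
    """Return problems with an attack task; an empty list means it is consistent."""
    problems = []
    allowed = SCENARIOS_BY_SOURCE.get(task.target_source)
    if allowed is None:
        problems.append(f"Unknown target source: {task.target_source}")
    elif task.scenario not in allowed:
        problems.append(
            f"Scenario '{task.scenario}' is not offered for {task.target_source} targets"
        )
    if not task.targets:
        problems.append("No attack target selected")
    return problems


# Example names are illustrative only.
drill = DrillTask(name="checkout-resilience-drill", expected_recovery_minutes=30)
group = drill.task_groups[0]
group.attack_tasks.append(AttackTask("cpu-stress", "ECS", "Host Resource", ["ecs-web-01"]))
group.attack_tasks.append(AttackTask("pod-net", "CCE", "Host Network", ["cart-7d9f"]))

for task in group.attack_tasks:
    print(task.name, check_attack_task(task))
```

The second example task pairs a CCE target with an ECS scenario on purpose, to show the kind of mismatch that the console prevents by offering different scenarios for each target source.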

Step 5: Drill Report

Once a drill is finished, you can create a drill report.

Log in to COC.
In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click the Drill Tasks tab.
Figure 18 Drill task list
Locate the row containing the finished drill task and click Drill Record in the Operation column. In the displayed drill record list, locate the desired drill record and click Create Report or View Progress in the Operation column. On the displayed Drill Record Detail page, click Create Drill Report on the right.
Figure 19 Drill record list

Figure 20 Drill Record Detail page
Go to the drill report page and update the report name.
Figure 21 Drill report details
On the drill report details page, enter the drill duration and click OK.
Figure 22 Modify Drill Duration
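
When preparing the figures for a report, it can help to compare the measured drill duration with the Expected Recovery Duration (Minutes) set on the drill task. The sketch below is a hedged illustration only: how COC itself evaluates or uses these values is not described here, and the function names, the timestamps, and the "within expectation" rule are assumptions.

```python
from datetime import datetime


def duration_minutes(started_at: datetime, recovered_at: datetime) -> float:
    """Elapsed time in minutes between fault injection and recovery."""
    if recovered_at < started_at:
        raise ValueError("recovered_at must not be earlier than started_at")
    return (recovered_at - started_at).total_seconds() / 60


def summarize_drill(started_at: datetime, recovered_at: datetime,
                    expected_recovery_minutes: int) -> dict:
    """Compare the measured duration with the expected recovery duration."""
    actual = duration_minutes(started_at, recovered_at)
    return {
        "actual_minutes": round(actual, 1),
        "expected_minutes": expected_recovery_minutes,
        # Assumed rule: the drill meets expectations if recovery is not slower.
        "within_expectation": actual <= expected_recovery_minutes,
    }


# Illustrative timestamps only.
report = summarize_drill(
    started_at=datetime(2024, 4, 19, 10, 0),
    recovered_at=datetime(2024, 4, 19, 10, 42),
    expected_recovery_minutes=30,
)
print(report)
```

A result where `within_expectation` is false is a natural candidate for an improvement item in the review that follows the drill.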

Step 6: Review and Improvement

After creating a drill report, you can record improvement suggestions by creating improvement tickets.

Log in to COC.
In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click the Drill Tasks tab.
Figure 23 Drill task list
Click Drill Record.
Access the drill record list and click View Report or Create Report.
Figure 24 Drill record list

Access the drill report details page, click Create Improvement Ticket on the right, and enter information about the improvement ticket.

Figure 25 Creating an improvement ticket

**Table 3** Improvement ticket parameters
Parameter	Description
Improvement Task	Improvement task name
Application	Application the improvement task belongs to
Type	Type of the improvement task
Improvement Owner	Owner of the improvement task
Expected Completion	Expected completion time of the improvement task
Symptom	Description of the symptom
Improvement Ticket Closure Criteria	Criteria for the closure of the improvement ticket
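
The fields in Table 3 are enough to track improvement items outside the console, for example to list tickets that have passed their expected completion date. The following Python sketch is hypothetical and is not a COC API: the class name, the `closed` flag, the `overdue` helper, and the example values are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ImprovementTicket:
    """Hypothetical local record mirroring Table 3; not a COC API object."""
    task: str
    application: str
    type: str
    owner: str
    expected_completion: date
    symptom: str
    closure_criteria: str
    closed: bool = False  # assumed flag, set once the closure criteria are met


def overdue(tickets: list[ImprovementTicket], today: date) -> list[ImprovementTicket]:
    """Return open tickets whose expected completion date has passed."""
    return [t for t in tickets if not t.closed and t.expected_completion < today]


# Example values are illustrative only.
tickets = [
    ImprovementTicket(
        task="Add automatic database failover",
        application="order-service",
        type="Architecture",
        owner="alice",
        expected_completion=date(2024, 5, 1),
        symptom="Orders failed for 42 minutes after the primary node went down",
        closure_criteria="Repeat drill recovers within 30 minutes",
    ),
    ImprovementTicket(
        task="Document manual failover steps",
        application="order-service",
        type="Process",
        owner="bob",
        expected_completion=date(2024, 4, 25),
        symptom="Runbook was missing the failover command",
        closure_criteria="Runbook reviewed and attached to the contingency plan",
        closed=True,
    ),
]

for ticket in overdue(tickets, today=date(2024, 5, 10)):
    print(ticket.task, ticket.owner)
```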