Updated on 2024-04-19 GMT+08:00

E2E Chaos Engineering

Scenario

An e-commerce company has deployed a new application in the production environment and plans to officially open it to user traffic. However, the company's traditional O&M is mainly reactive and lacks proactive O&M concepts and tools. Before an application goes live, there is no effective way to identify availability issues, and after it goes live, its availability status cannot be accurately assessed. The O&M team also lacks emergency response capabilities and practical experience. The company wants to use chaos engineering to test the application's architectural resilience in the production environment before launch, to confirm that there are no major stability risks at the official launch.

Solution

Chaos drills drive proactive O&M: Starting from the customer's actual business scenario, we provide end-to-end chaos drill capabilities that cover risk analysis, contingency plans, drill execution, and review and improvement.

Failure mode library: We have pioneered a fault scenario analysis method based on a fault-tolerance perspective and have built a library of more than 300 typical failure modes, accumulated from Huawei Cloud SRE's years of experience.

  • Risk analysis: Analyze the application architecture to identify risks.
  • Contingency plan: Define contingency plans for the identified risks.
  • Fault drill: Based on the results of the risk analysis and the contingency plans, specify the drill plan and conduct fault drills.
  • Review and improvement: After the drill is completed, summarize the drill and output the drill report and improvement items.

Core Advantages

  • Pioneered the FT-FMEA fault scenario analysis method based on a fault-tolerant perspective, gradually incorporating 300+ fault modes.
  • Supports multi-dimensional attack scenarios, covering both virtualization and containerization.
  • Supports custom attack process orchestration to meet individual customer business needs.

Prerequisites

An application group has been created on the application management page.

The resources for conducting chaos drills have UniAgent installed. For details, see "Installing the UniAgent".

Step 1: Failure Mode

Create a failure mode, and check that the application to which the target host or container belongs and the incident level are correct.

  1. Log in to COC.
  2. In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click Risk Management Tasks and then the Failure Modes tab.
    Figure 1 Failure Modes tab page
  3. Enter failure mode information.
    Figure 2 Creating a failure mode
    Table 1 Failure mode parameters

    Parameter              Description
    Failure Mode           Custom failure mode name
    Application            Application the drill object belongs to
    Incident Level         See the incident center page.
    Source                 The options are Failure modes detected proactively and Existing failure modes.
    Contingency Plan       For details, see the contingency plan section.
    Scenario Category      Failure scenario. The options are Redundancy, DR, Overloading, Configuration, and Dependencies.
    Occurrence Conditions  Possible conditions that cause the failure

  4. Set Contingency Plan Available. If you select Yes, enter a contingency plan name to search for the plan, select the plan, and click Save.
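
The parameters in Table 1 can be thought of as a simple record. The following Python sketch is illustrative only: the FailureMode class and its field names are hypothetical, it does not call any COC API, and the example values (including the incident level label) are assumptions. Only the option lists for Source and Scenario Category are taken from Table 1.

```python
from dataclasses import dataclass
from typing import List, Optional

# Option lists copied from Table 1.
SOURCES = {"Failure modes detected proactively", "Existing failure modes"}
SCENARIO_CATEGORIES = {"Redundancy", "DR", "Overloading", "Configuration", "Dependencies"}


@dataclass
class FailureMode:
    """Hypothetical local record mirroring the Table 1 parameters."""
    name: str
    application: str
    incident_level: str
    source: str
    scenario_category: str
    occurrence_conditions: str
    contingency_plan: Optional[str] = None  # None models Contingency Plan Available = No

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means no problem was found."""
        problems = []
        if not self.name.strip():
            problems.append("Failure Mode name is empty")
        if not self.application.strip():
            problems.append("Application is empty")
        if self.source not in SOURCES:
            problems.append(f"Unknown Source: {self.source!r}")
        if self.scenario_category not in SCENARIO_CATEGORIES:
            problems.append(f"Unknown Scenario Category: {self.scenario_category!r}")
        return problems


# Example values are made up for illustration.
mode = FailureMode(
    name="order-service single AZ outage",
    application="order-service",
    incident_level="P2",  # assumed label; see the incident center page for real levels
    source="Existing failure modes",
    scenario_category="DR",
    occurrence_conditions="All instances in one AZ become unreachable",
)
print(mode.validate())
```

Checking a record like this before entering it on the console is one way to catch a mistyped category early; the console itself remains the source of truth for the allowed values.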

Step 2: Contingency Plan

Create a contingency plan and select the application that the fault injection target host belongs to.

  1. Log in to COC.
  2. In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click Risk Management Tasks and then the Contingency Plans tab.
    Figure 3 Contingency Plans tab page
  3. Enter basic information about the contingency plan.
    Figure 4 Creating a contingency plan
    Table 2 Contingency plan parameters

    Parameter                    Description
    Contingency Plan             Custom contingency plan name
    Application                  Application to which the target host or container belongs
    Description                  Description of the contingency plan
    Contingency Plan Attachment  Emergency recovery guide for handling abnormal situations during the drill

  4. Unexpected exceptions may occur during a drill, so prepare emergency measures and an emergency recovery guide in advance. Click Upload to upload the guide, and then click OK.

Step 3: Drill Planning

You can designate an executor when you create a drill plan. The executor accepts the resulting service ticket and creates a drill task, which is associated with the failure mode and region specified in the plan.

  1. Log in to COC.
  2. In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click Risk Management Tasks and then the Drill Plans tab.
    Figure 5 Drill Plans tab page
  3. Click Create Drill Plan. In the displayed dialog box, set Failure Mode, Executed By, Region, and Planned Drill Time, and click OK.
    Figure 6 Creating a drill plan
  4. The executor clicks Accept in the Operation column. The page for creating a drill task is displayed. The drill task is associated with the specified failure mode and region. Moreover, you can track the progress of drill tasks.
    Figure 7 Switching to the page for creating a drill task
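
The hand-off in this step, from a drill plan to a drill task that inherits the failure mode and region, can be sketched as follows. This is a conceptual model only: DrillPlan, DrillTaskDraft, and accept are hypothetical names, no COC API is involved, and the rule that only the designated executor can accept the plan is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DrillPlan:
    """Hypothetical record of the fields set in the Create Drill Plan dialog box."""
    failure_mode: str
    executed_by: str
    region: str
    planned_drill_time: datetime
    accepted: bool = False


@dataclass
class DrillTaskDraft:
    """Hypothetical draft that is opened when the executor accepts the plan."""
    failure_mode: str
    region: str
    created_by: str


def accept(plan: DrillPlan, user: str) -> DrillTaskDraft:
    # Assumption: only the designated executor accepts the plan, and only once.
    if user != plan.executed_by:
        raise PermissionError(f"{user} is not the executor of this plan")
    if plan.accepted:
        raise ValueError("plan has already been accepted")
    plan.accepted = True
    # The draft inherits the failure mode and region from the plan.
    return DrillTaskDraft(plan.failure_mode, plan.region, created_by=user)


# Example values are made up for illustration.
plan = DrillPlan(
    failure_mode="order-service single AZ outage",
    executed_by="alice",
    region="example-region",
    planned_drill_time=datetime(2024, 5, 1, 10, 0),
)
draft = accept(plan, "alice")
print(draft)
```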

Step 4: Drill Task

Create a drill task on COC.

  1. Log in to COC.
  2. In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click the Drill Tasks tab.
  3. Click Create Task.
    Figure 8 Creating a drill task
  4. Enter basic information about the drill task, including Drill Task and Expected Recovery Duration (Minutes).
    Figure 9 Basic information
  5. Select an attack task. By default, there is one attack task group. You can click Create Task Group to add a task group or click Create Attack Task to access the page for creating an attack task.
    Figure 10 Selecting an attack task
  6. On the displayed Create Attack Task page, select Create Attack Task or Select from Existing. If you have not created any attack tasks yet, select Create Attack Task. Otherwise, you can select Select from Existing.
  7. Creating an attack task: Enter the attack task name, and select an attack target and then an attack scenario. Different attack targets correspond to different attack scenarios. The attack target source can be Elastic Cloud Server (ECS) or Cloud Container Engine (CCE). If you select ECS, select the target server from the list and click Next.
    Figure 11 Selecting ECS as the attack target source
  8. Select an attack scenario, set attack parameters, and click OK. The scenarios include Host Resource, Host Process, and Host Network.
    Figure 12 ECS attack scenarios
  9. If you select Cloud Container Engine (CCE) as the attack target source, you will need to select an application and pod (select a cluster, namespace, workload type, and workload in sequence). You can specify pods or the number of pods, and click Next.
    Figure 13 Selecting CCE as the attack target source and specifying a pod
    Figure 14 Selecting CCE as the attack target source and specifying the quantity
  10. Select a CCE attack scenario, set attack parameters, and click OK. The scenarios include Weapons Attacking POD Instances, Weapons Attacking POD Processes, and Weapons Attacking the POD Network.
    Figure 15 CCE attack scenarios
  11. If you select Select from Existing, select the created attack task from the task list below and click OK.
    Figure 16 Selecting an existing attack task
  12. Click OK. The drill task is created.
    Figure 17 Clicking OK
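
The structure built in this step (a drill task that contains one or more attack task groups, each holding attack tasks whose scenario depends on the target source) can be sketched as below. The class names and fields are hypothetical and the code does not interact with COC; only the scenario names are copied from steps 8 and 10.

```python
from dataclasses import dataclass, field
from typing import List

# Scenario names per attack target source, copied from steps 8 and 10.
SCENARIOS = {
    "ECS": {"Host Resource", "Host Process", "Host Network"},
    "CCE": {
        "Weapons Attacking POD Instances",
        "Weapons Attacking POD Processes",
        "Weapons Attacking the POD Network",
    },
}


@dataclass
class AttackTask:
    """Hypothetical model of one attack task."""
    name: str
    target_source: str   # "ECS" or "CCE"
    scenario: str
    targets: List[str]   # server names for ECS, pod names for CCE (illustrative)

    def __post_init__(self):
        allowed = SCENARIOS.get(self.target_source)
        if allowed is None:
            raise ValueError(f"unknown target source: {self.target_source}")
        if self.scenario not in allowed:
            raise ValueError(
                f"scenario {self.scenario!r} is not available for {self.target_source}"
            )
        if not self.targets:
            raise ValueError("select at least one target")


@dataclass
class DrillTask:
    """Hypothetical model of a drill task; it starts with one empty task group."""
    name: str
    expected_recovery_minutes: int
    task_groups: List[List[AttackTask]] = field(default_factory=lambda: [[]])


# Example values are made up for illustration.
task = DrillTask(name="pre-launch resilience drill", expected_recovery_minutes=15)
task.task_groups[0].append(
    AttackTask("cpu-pressure", "ECS", "Host Resource", ["ecs-order-01"])
)
task.task_groups.append(
    [AttackTask("pod-kill", "CCE", "Weapons Attacking POD Instances", ["order-pod-0"])]
)
print(len(task.task_groups), "task groups")
```

Validating the scenario against the target source in the constructor mirrors the console behavior described above, where the available scenarios change with the selected attack target.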

Step 5: Drill Report

Once a drill is finished, you can create a drill report.

  1. Log in to COC.
  2. In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click the Drill Tasks tab.
    Figure 18 Drill task list
  3. Locate the row containing the finished drill task and click Drill Record in the Operation column. In the displayed drill record list, locate the desired drill record and click Create Report or View Progress in the Operation column. On the displayed Drill Record Detail page, click Create Drill Report on the right.
    Figure 19 Drill record list
    Figure 20 Drill Record Detail page
  4. Go to the drill report page and update the report name.
    Figure 21 Drill report details
  5. On the drill report details page, enter the drill duration and click OK.
    Figure 22 Modify Drill Duration

Step 6: Review and Improvement

Once you have created a drill report, you can create improvement tickets to record and track suggested improvements.

  1. Log in to COC.
  2. In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click the Drill Tasks tab.
    Figure 23 Drill task list
  3. Click Drill Record.
  4. Access the drill record list and click View Report or Create Report.
    Figure 24 Drill record list
  5. Access the drill report details page, click Create Improvement Ticket on the right, and enter information about the improvement ticket.
    Figure 25 Creating an improvement ticket
    Table 3 Improvement ticket parameters

    Parameter                            Description
    Improvement Task                     Improvement task name
    Application                          Application the improvement task belongs to
    Type                                 Type of the improvement task
    Improvement Owner                    Owner of the improvement task
    Expected Completion                  Expected completion time of the improvement task
    Symptom                              Symptom of the problem to be improved
    Improvement Ticket Closure Criteria  Criteria for the closure of the improvement ticket
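
If improvement items are also tracked outside the console, the fields in Table 3 map naturally onto a small record. The sketch below is a hypothetical local model, not a COC API: the ImprovementTicket class, the closed flag, the overdue helper, and all example values are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ImprovementTicket:
    """Hypothetical record mirroring the Table 3 parameters."""
    improvement_task: str
    application: str
    task_type: str
    owner: str
    expected_completion: date
    symptom: str
    closure_criteria: str
    closed: bool = False  # assumed flag; not a Table 3 parameter


def overdue(tickets: List[ImprovementTicket], today: date) -> List[ImprovementTicket]:
    """Return open tickets whose expected completion date has passed."""
    return [t for t in tickets if not t.closed and t.expected_completion < today]


# Example values are made up for illustration.
tickets = [
    ImprovementTicket(
        improvement_task="Add cross-AZ failover for order-service",
        application="order-service",
        task_type="Architecture",
        owner="alice",
        expected_completion=date(2024, 5, 31),
        symptom="Orders failed for 12 minutes during the AZ outage drill",
        closure_criteria="Repeat drill recovers within the expected recovery duration",
    ),
    ImprovementTicket(
        improvement_task="Update recovery guide",
        application="order-service",
        task_type="Process",
        owner="bob",
        expected_completion=date(2024, 5, 10),
        symptom="Guide referenced a retired host",
        closure_criteria="Guide reviewed and uploaded again",
        closed=True,
    ),
]
late = overdue(tickets, today=date(2024, 6, 15))
print([t.improvement_task for t in late])
```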