
E2E Chaos Engineering

Updated on 2024-04-19 GMT+08:00

Scenario

An e-commerce company has deployed a new application in the production environment and plans to open it to user traffic. The company's traditional O&M model is largely reactive and lacks both a proactive O&M mindset and the tooling to support one. Before go-live, the team has no effective way to identify availability issues; after go-live, it cannot accurately track the application's availability status. The O&M team also lacks emergency response capabilities and hands-on experience. The company therefore wants to use chaos engineering to test the application's architectural resilience in the production environment before the official launch, to ensure that no major stability risks remain.

Solution

Chaos drills drive proactive O&M: Starting from the customer's actual business scenario, we provide end-to-end chaos drill capabilities that cover risk analysis, contingency plans, fault drills, and review and improvement.

Fault mode library: We pioneered a fault scenario analysis method based on a fault-tolerance perspective and have built a library of more than 300 typical fault modes from Huawei Cloud SRE's years of experience.

  • Risk analysis: Analyze the application architecture to identify risks.
  • Contingency plan: Designate contingency plans for the identified risks.
  • Fault drill: Based on the risk analysis results and the contingency plans, define the drill plan and conduct fault drills.
  • Review and improvement: After the drill is completed, summarize the drill and output the drill report and improvement items.

Core Advantages

  • Pioneered the FT-FMEA fault scenario analysis method based on a fault-tolerant perspective, gradually incorporating 300+ fault modes.
  • Supports multi-dimensional attack scenarios, covering both virtualization and containerization.
  • Supports custom attack process orchestration to meet individual customer business needs.

Prerequisites

An application group has been created on the application management page.

The resources for conducting chaos drills have UniAgent installed. For details, see "Installing the UniAgent".

Step 1: Failure Mode

Check whether the application to which the target host or container belongs and the incident level are correct.

  1. Log in to COC.
  2. In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click Risk Management Tasks and then the Failure Modes tab.
    Figure 1 Failure Modes tab page
  3. Enter failure mode information.
    Figure 2 Creating a failure mode
    Table 1 Failure mode parameters

    | Parameter | Description |
    |---|---|
    | Failure Mode | Custom failure mode name |
    | Application | Application the drill object belongs to |
    | Incident Level | See the incident center page. |
    | Source | The options are Failure modes detected proactively and Existing failure modes. |
    | Contingency Plan | For details, see the contingency plan section. |
    | Scenario Category | Failure scenario. The options are Redundancy, DR, Overloading, Configuration, and Dependencies. |
    | Occurrence Conditions | Possible conditions that cause the failure |


  4. Set Contingency Plan Available. If you select Yes, enter a contingency plan name to search for the plan, select the plan, and click Save.
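
Before entering failure modes on the console, it can help to draft them offline and check them against the options in Table 1. The following is a minimal sketch under stated assumptions: the FailureMode class and its field names are hypothetical and are not part of any COC API or SDK. Only the Source and Scenario Category option lists come from Table 1.

```python
from dataclasses import dataclass
from typing import List, Optional

# Option lists copied from Table 1. Everything else in this sketch is hypothetical.
SOURCES = {"Failure modes detected proactively", "Existing failure modes"}
SCENARIO_CATEGORIES = {"Redundancy", "DR", "Overloading", "Configuration", "Dependencies"}


@dataclass
class FailureMode:
    """Offline draft of one failure mode entry (not a COC API object)."""
    name: str
    application: str
    incident_level: str          # valid values are defined on the incident center page
    source: str
    scenario_category: str
    occurrence_conditions: str
    contingency_plan: Optional[str] = None  # None models "Contingency Plan Available: No"

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the draft is consistent."""
        problems = []
        if not self.name.strip():
            problems.append("Failure Mode name is empty")
        if self.source not in SOURCES:
            problems.append(f"unknown Source: {self.source!r}")
        if self.scenario_category not in SCENARIO_CATEGORIES:
            problems.append(f"unknown Scenario Category: {self.scenario_category!r}")
        return problems
```

Calling validate() on each draft before a review meeting surfaces misspelled categories or missing names early, so that the values entered in the console match the documented options.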

Step 2: Contingency Plan

Select the application that the target host (the host where the fault will be injected) belongs to.

  1. Log in to COC.
  2. In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click Risk Management Tasks and then the Contingency Plans tab.
    Figure 3 Contingency Plans tab page
  3. Enter basic information about the contingency plan.
    Figure 4 Creating a contingency plan
    Table 2 Contingency plan parameters

    | Parameter | Description |
    |---|---|
    | Contingency Plan | Custom contingency plan name |
    | Application | Application to which the target host or container belongs |
    | Description | Description of the contingency plan |
    | Contingency Plan Attachment | Emergency recovery guide for handling abnormal situations during the drill |

  4. Unexpected exceptions may occur during a drill, so prepare emergency measures and an emergency recovery guide in advance. Click Upload to upload the guide, and then click OK.

Step 3: Drill Planning

You can create a drill plan and designate an executor for it. The executor accepts the resulting service ticket and creates a drill task, which is associated with the specified failure mode and region.

  1. Log in to COC.
  2. In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click Risk Management Tasks and then the Drill Plans tab.
    Figure 5 Drill Plans tab page
  3. Click Create Drill Plan. In the displayed dialog box, set Failure Mode, Executed By, Region, and Planned Drill Time, and click OK.
    Figure 6 Creating a drill plan
  4. The executor clicks Accept in the Operation column. The page for creating a drill task is displayed. The drill task is associated with the specified failure mode and region. Moreover, you can track the progress of drill tasks.
    Figure 7 Switching to the page for creating a drill task

Step 4: Drill Task

Create a drill task on COC.

  1. Log in to COC.
  2. In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click the Drill Tasks tab.
  3. Click Create Task.
    Figure 8 Creating a drill task
  4. Enter basic information about the drill task, including Drill Task and Expected Recovery Duration (Minutes).
    Figure 9 Basic information
  5. Select an attack task. By default, there is one attack task group. You can click Create Task Group to add a task group or click Create Attack Task to access the page for creating an attack task.
    Figure 10 Selecting an attack task
  6. On the displayed Create Attack Task page, select Create Attack Task or Select from Existing. If you have not created an attack task before, select Create Attack Task. If you have, you can select Select from Existing instead.
  7. To create an attack task, select an attack target and then an attack scenario. Different attack targets correspond to different attack scenarios. Enter the attack task name. The attack target source can be Elastic Cloud Server (ECS) or Cloud Container Engine (CCE). If you select ECS, select the target server from the list below and click Next.
    Figure 11 Selecting ECS as the attack target source
  8. Select an attack scenario, set attack parameters, and click OK. The scenarios include Host Resource, Host Process, and Host Network.
    Figure 12 ECS attack scenarios
  9. If you select Cloud Container Engine (CCE) as the attack target source, select an application and pod (select a cluster, namespace, workload type, and workload in sequence). You can specify either individual pods or the number of pods. Then click Next.
    Figure 13 Selecting CCE as the attack target source and specifying a pod
    Figure 14 Selecting CCE as the attack target source and specifying the quantity
  10. Select a CCE attack scenario, set attack parameters, and click OK. The scenarios include Weapons Attacking POD Instances, Weapons Attacking POD Processes, and Weapons Attacking the POD Network.
    Figure 15 CCE attack scenarios
  11. If you select Select from Existing, select the created attack task from the task list below and click OK.
    Figure 16 Selecting an existing attack task
  12. Click OK. The drill task is created.
    Figure 17 Clicking OK
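
The relationship between a drill task, its attack task groups, and the scenarios available for each attack target source can be summarized as a small data model. This is an illustrative sketch only: the class names, field names, and the SCENARIOS_BY_SOURCE mapping are hypothetical and do not represent a COC API. The scenario names are the ones listed in steps 8 and 10.

```python
from dataclasses import dataclass, field
from typing import List

# Scenario names as listed in steps 8 and 10. The mapping itself is an assumption
# made for illustration.
SCENARIOS_BY_SOURCE = {
    "ECS": {"Host Resource", "Host Process", "Host Network"},
    "CCE": {
        "Weapons Attacking POD Instances",
        "Weapons Attacking POD Processes",
        "Weapons Attacking the POD Network",
    },
}


@dataclass
class AttackTask:
    name: str
    target_source: str  # "ECS" or "CCE"
    scenario: str

    def __post_init__(self) -> None:
        # Different attack targets correspond to different attack scenarios (step 7).
        allowed = SCENARIOS_BY_SOURCE.get(self.target_source)
        if allowed is None:
            raise ValueError(f"unknown attack target source: {self.target_source!r}")
        if self.scenario not in allowed:
            raise ValueError(f"{self.scenario!r} is not a {self.target_source} scenario")


@dataclass
class DrillTask:
    name: str
    expected_recovery_minutes: int
    # One attack task group exists by default, mirroring step 5.
    task_groups: List[List[AttackTask]] = field(default_factory=lambda: [[]])

    def add_group(self) -> int:
        """Add an empty task group and return its index."""
        self.task_groups.append([])
        return len(self.task_groups) - 1

    def add_attack(self, task: AttackTask, group: int = 0) -> None:
        self.task_groups[group].append(task)
```

Validating the scenario when the attack task is constructed reflects the console behavior described above, where the scenario list depends on the selected attack target source.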

Step 5: Drill Report

Once a drill is finished, you can create a drill report.

  1. Log in to COC.
  2. In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click the Drill Tasks tab.
    Figure 18 Drill task list
  3. Locate the row containing the finished drill task and click Drill Record in the Operation column. In the displayed drill record list, locate the desired drill record and click Create Report or View Progress in the Operation column. On the displayed Drill Record Detail page, click Create Drill Report on the right.
    Figure 19 Drill record list
    Figure 20 Drill Record Detail page
  4. Go to the drill report page and update the report name.
    Figure 21 Drill report details
  5. On the drill report details page, enter the drill duration and click OK.
    Figure 22 Modify Drill Duration
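
When reviewing a report, one natural check is whether recovery took longer than the Expected Recovery Duration (Minutes) entered when the drill task was created. The source does not state that COC performs this comparison, so the sketch below is an assumption about how a team might do it offline. The DrillOutcome class and the observed_recovery_minutes field are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DrillOutcome:
    drill_task: str
    expected_recovery_minutes: float  # entered when the drill task was created
    observed_recovery_minutes: float  # hypothetical: measured by the team during the drill

    @property
    def overrun_minutes(self) -> float:
        """Minutes by which recovery exceeded the expectation (0 if within target)."""
        return max(0.0, self.observed_recovery_minutes - self.expected_recovery_minutes)

    @property
    def exceeded_expectation(self) -> bool:
        return self.overrun_minutes > 0
```

A drill whose recovery overran the expectation is a reasonable candidate for follow-up in the review and improvement stage.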

Step 6: Review and Improvement

Once you have created a drill report, you can include suggestions for improvement.

  1. Log in to COC.
  2. In the navigation pane on the left, choose Resilience Center > Chaos Drills. On the displayed page, click the Drill Tasks tab.
    Figure 23 Drill task list
  3. Click Drill Record.
  4. Access the drill record list and click View Report or Create Report.
    Figure 24 Drill record list
  5. Access the drill report details page, click Create Improvement Ticket on the right, and enter information about the improvement ticket.
    Figure 25 Creating an improvement ticket
    Table 3 Improvement ticket parameters

    | Parameter | Description |
    |---|---|
    | Improvement Task | Improvement task name |
    | Application | Application the improvement task belongs to |
    | Type | Type of the improvement task |
    | Improvement Owner | Owner of the improvement task |
    | Expected Completion | Expected completion time of the improvement task |
    | Symptom | Symptom |
    | Improvement Ticket Closure Criteria | Criteria for the closure of the improvement ticket |
