Updated on 2025-09-11 GMT+08:00

Creating and Managing Drill Tasks

Scenarios

Drill tasks let you simulate software or hardware faults to test the system's fault recovery capability. You can create drill tasks, manage existing chaos drill tasks, and view drill records. Setting up a drill task includes setting the basic information, adding an attack task group, selecting an attack task, and selecting an attack scenario. A drill task also involves configuring monitoring tasks and performing post-drill review and improvement. This helps you identify and apply suitable optimization measures for the system under different kinds of stress.

Automatic Task Termination Mechanism

  • Automatic termination upon timeout: If a drill task fails and you do not manually close the task within 48 hours, the system automatically terminates the drill task.
  • Automatic termination upon exceptions: During the drill, if a pod exception (for example, the pod has been deleted) is detected or a resource O&M ticket is manually closed, the system automatically terminates the current task immediately.
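
The two rules above can be read as a small decision function. The following Python sketch is illustrative only: the `DrillState` fields and the returned reason strings are hypothetical names chosen for this example, not part of any COC API. Only the 48-hour timeout and the two exception conditions come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Timeout stated above: a failed task that is not closed manually
# is terminated automatically after 48 hours.
FAILED_TASK_TIMEOUT = timedelta(hours=48)


@dataclass
class DrillState:
    """Hypothetical snapshot of a drill task (field names are assumptions)."""
    failed_at: Optional[datetime] = None  # when the task entered the failed state
    manually_closed: bool = False         # the user has already closed the task
    pod_exception: bool = False           # for example, the target pod was deleted
    om_ticket_closed: bool = False        # the resource O&M ticket was closed manually


def termination_reason(state: DrillState, now: datetime) -> Optional[str]:
    """Return why the task would be terminated automatically, or None."""
    if state.manually_closed:
        return None  # nothing left to terminate
    # Exceptions terminate the current task immediately.
    if state.pod_exception:
        return "exception: pod"
    if state.om_ticket_closed:
        return "exception: O&M ticket closed"
    # A failed task left open reaches the 48-hour timeout.
    if state.failed_at is not None and now - state.failed_at >= FAILED_TASK_TIMEOUT:
        return "timeout"
    return None
```

In this sketch, exceptions are checked before the timeout because the text says they take effect immediately, while the timeout applies only to a failed task that stays open.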

Creating a Drill Task

  1. Log in to COC.
  2. In the navigation tree on the left, choose Resilience Center > Chaos Drills.
  3. Click the Drill Tasks tab.
  4. Click Create Task.

    You can also open the page for creating a drill task by accepting a drill plan ticket. For details, see Creating and Managing Drill Plans.

  5. Configure the basic information.

    Table 1 Parameters in the basic information

    | Parameter | Description | Example Value |
    | --- | --- | --- |
    | Drill Task | Name of the drill task. Set it according to the naming rules. | test-drill |
    | Expected Recovery Duration (Minutes) | Expected time from the fault occurrence to the fault recovery, in minutes. This is the time expected for an application to automatically recover to the normal state during emergency plan execution after a fault is injected. This time does not affect the drill task. | 3 |

  6. Click Add Attack Task.

    By default, there is one attack task group. You can click Add Task Group to add a task group. After adding an attack task, you can click Add Attack Task to add another attack task.
    • Task groups are executed serially, one after another. Tasks in the same task group are executed in parallel.
    • Currently, multiple fault injection operations on the same resource in a task group are not supported.
    1. Set parameters for adding an attack task.
      • To add an existing task, click Select from Existing, select the existing task, and click OK.
      • To add a new attack task, perform the following steps.
        Table 2 Parameters for adding an attack task

        | Parameter | Description | Example Value |
        | --- | --- | --- |
        | Vendor | Select a cloud vendor type. | Huawei Cloud |
        | Source of Attack Target | Select the source of the target instance. If CCE instances are used, you can select attack targets by selecting instances, pods, or a specified number of targets. | Elastic Cloud Server (ECS) |
        | Attack Task | Customize the name of the attack task based on the naming rules. | test-attacktask |
        | Attack Target | Select the target instance. | - |

    2. Click Next.
    3. Set parameters for selecting an attack scenario.
      For details, see Attack Scenarios.
      Table 3 Parameters for selecting an attack scenario

      | Parameter | Description | Example Value |
      | --- | --- | --- |
      | Attack Type | Select an attack type. Attack scenarios are grouped by type. | Host Resource |
      | Attack Scenario | Select an attack scenario. | CPU usage increase |
      | Attack Parameters | Configure attack parameters based on the selected attack scenario. | CPU Usage (%): 80; Fault Duration (s): 60 |

    4. Click Next.
    5. (Optional) Set parameters in the Configure Monitoring Tasks step.
      Table 4 Parameters for configuring a monitoring task

      | Parameter | Description |
      | --- | --- |
      | Steady-State Metrics | Select the target resource, performance metric, lower limit, and upper limit from the drop-down lists in sequence. A steady-state metric is a performance metric whose value stays within a known range while the service runs normally and stably. If the metric value is outside that range before a drill starts, the drill is canceled. |
      | Metric | Select the target resource, monitoring metric, lower limit, and upper limit from the drop-down lists in sequence. These service metrics monitor the corresponding service data during a fault drill. If a metric value stays within the allowed range, the service is normal. Otherwise, you can decide whether to stop the drill. |
      | Automatic Rollback | Select whether to enable automatic rollback. With automatic rollback, fault injection is rolled back automatically and the target is restored to its state before fault injection. Automatic rollback cannot be configured for disruptors that do not support fault termination. If automatic rollback is enabled and the value of a steady-state metric leaves the stable range during a drill, the corresponding fault injection stops automatically. |

    6. Click Finish. The attack task is added.

  7. Click OK. The drill task is created, and its status is to be drilled.

    If you click Save as Draft instead, the task is saved with the draft status. A drill task in the draft status cannot be started.
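
The execution rules noted in step 6 (task groups run serially, tasks in a group run in parallel, and a resource can be attacked only once per group) can be modeled in a few lines. The following Python sketch is a simplified illustration, not how COC schedules drills. `AttackTask`, `resource_id`, and the `inject` callback are hypothetical names introduced for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AttackTask:
    name: str         # for example, "test-attacktask"
    resource_id: str  # hypothetical identifier of the attack target


def duplicate_resources(group: List[AttackTask]) -> List[str]:
    """Return resource IDs targeted by more than one task in the same group."""
    seen, dupes = set(), []
    for task in group:
        if task.resource_id in seen and task.resource_id not in dupes:
            dupes.append(task.resource_id)
        seen.add(task.resource_id)
    return dupes


def run_drill(groups: List[List[AttackTask]],
              inject: Callable[[AttackTask], str]) -> List[List[str]]:
    """Run groups one after another and the tasks inside each group in parallel."""
    # Validate every group before injecting anything.
    for group in groups:
        dupes = duplicate_resources(group)
        if dupes:
            raise ValueError(f"resources attacked more than once in a group: {dupes}")
    results = []
    for group in groups:  # serial across groups
        with ThreadPoolExecutor(max_workers=max(1, len(group))) as pool:
            # Parallel within a group; the next group starts only after
            # every task in this group has finished.
            results.append(list(pool.map(inject, group)))
    return results
```

The same resource may appear in two different groups, because those groups never run at the same time. Only a repeat inside one group is rejected.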

Modifying a Drill Task

If drill records have been generated for a drill task, the task cannot be modified.

  1. Log in to COC.
  2. In the navigation tree on the left, choose Resilience Center > Chaos Drills.
  3. Click the Drill Tasks tab.
  4. Locate the drill task you want to modify and choose More > Modify in the Operation column.
  5. Modify the drill task as required.

    • To add a task group, click Add Task Group.
    • To add an attack task, click Add Attack Task.
    • To delete an attack task, click Delete in the row of the task.

  6. Click OK.

    The drill task is modified.

Deleting a Drill Task

If a created drill task is no longer needed, you can delete it. Do not delete the drill task in the following scenarios:

  • The drill task has generated drill records.
  • The drill task is associated with a drill plan.
  1. Log in to COC.
  2. In the navigation tree on the left, choose Resilience Center > Chaos Drills.
  3. Click the Drill Tasks tab.
  4. Locate the drill task you want to delete and choose More > Delete in the Operation column.
  5. Click OK.

    The drill task is deleted.

Starting a Drill Task

Start a drill task.

  1. Log in to COC.
  2. In the navigation tree on the left, choose Resilience Center > Chaos Drills.
  3. Click the Drill Tasks tab.
  4. Locate the drill task you want to start and click Start in the Operation column.
  5. Click OK.

    The drill starts. On the drill details page, you can view the attack progress, including installing probes, performing drills, and clearing the environment. The system automatically performs the drill task. The execution time depends on the attack time of the disruptor.

    In the probe installation step, a probe is installed on the target machine. The probe runs in the system to receive disruptor commands for attack, query, and clearance. After the drill is complete or terminated, the environment clearing step stops all probe operations in the system and removes the probe.

  6. During drill execution, you can perform the following operations:

    • Terminate: During a drill, click Terminate in the upper right corner to stop a task that is waiting to be executed or a task that is abnormal.
    • Forcibly terminate: Use Terminate first. If termination fails, wait 5 to 10 minutes and then use forcible termination. Forcible termination only closes the current drill ticket and does not automatically clear data in the environment. You need to clear the data manually. For details, see Manually Clearing Data in the Environment.
    • Retry: If some or all attack tasks fail to check instances, install probes, clear environments, or perform steady-state detection, or the drill times out, expand the failed attack task and click Retry to retry the task.
    • Skip: If some or all attack tasks fail to be executed during the drill, expand a failed attack task and click Skip to skip the task and execute the next task.
    • Details: Expand an attack task and click Details to view the attack details.
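
The choice between these operations can be summarized as a simple rule set. The following Python sketch is an illustration under stated assumptions: the phase names in `RETRYABLE_FAILURES` and the action strings are hypothetical labels for the cases listed above, and the wait uses the lower bound of the advised 5 to 10 minutes.

```python
from datetime import datetime, timedelta
from typing import Optional

# Failure cases for which the list above says Retry is available
# (the identifiers themselves are assumptions).
RETRYABLE_FAILURES = {
    "check_instances",
    "install_probes",
    "clear_environment",
    "steady_state_detection",
    "timeout",
}

# The text advises waiting 5 to 10 minutes after a failed Terminate;
# this sketch uses the lower bound.
FORCE_TERMINATE_WAIT = timedelta(minutes=5)


def can_retry(failure: str) -> bool:
    """Return True if the failure case is one that Retry covers."""
    return failure in RETRYABLE_FAILURES


def next_stop_action(terminate_failed_at: Optional[datetime], now: datetime) -> str:
    """Suggest how to stop a drill, preferring Terminate over forcible termination."""
    if terminate_failed_at is None:
        return "terminate"  # always try the normal termination first
    if now - terminate_failed_at < FORCE_TERMINATE_WAIT:
        return "wait"       # give the failed termination time to settle
    # Forcible termination closes the ticket only, so manual clearing follows.
    return "force_terminate_then_clear_manually"
```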

Viewing Drill Records

View the drill records of a drill task. A drill task that has never been started has no drill records.

  1. Log in to COC.
  2. In the navigation tree on the left, choose Resilience Center > Chaos Drills.
  3. Click the Drill Tasks tab.
  4. Locate the target drill task and click Drill Record in the Operation column.

    The basic information about the drill task includes the drill task name, drill task ID, attack details, and failure mode. All drill records include the drill record ID, execution status, executor, drill start time, and drill end time.

  5. Locate the drill record to be viewed and click View Progress in the Operation column.

    View the attack progress, attack details, and monitoring details of the current drill task.

    • The drill record module displays attack task details, including the progress, task information, and execution time.
    • The attack details module displays the attack status of instances in the application of the current task. BMSs, FlexusL (HCSS) instances, and CSS instances are not supported.
    • The monitoring details module displays real-time monitoring data of attack targets. Monitoring data is available only if you configured a drill monitoring task when creating the attack task. The view can be expanded to landscape full-screen mode.

  6. Click Drill Report on the right.

    Create or view a drill report. For details, see Creating a Drill Report.

Manually Clearing Data in the Environment

Forcibly terminating a drill task only closes the current drill ticket and does not automatically clear data in the environment. You need to manually clear the data.

The clearing methods vary depending on probes. For details, see Table 5.

Table 5 Manually clearing data in the environment

| Probe Type | Disruptor Type | How to Clear |
| --- | --- | --- |
| CFE | All | 1. Log in to the host. 2. Switch to the /usr/local/cdr_probe and /usr/local/COC-CDR-Probe directories. 3. Delete all files in the directories. |
| CSS | All | No residual data exists. |
| DCS | DCS_REDIS_AZSHUTDOWN (AZ power-off of DCS instances) | Start the DCS instance by referring to DCS Guide. |
| DCS | Other | No residual data exists. |
| DDS | All | No residual data exists. |
| Platform | All | No residual data exists. |
| RDS | RDS_FAILOVER (switching over RDS primary and standby nodes) | No residual data exists. |
| RDS | RDS_SHUTDOWN (stopping an RDS instance) | Start the RDS instance by referring to RDS Guide. |
| Script | SCRIPT_FAULT (custom scripts) | 1. Cancel the ticket that is not closed (choose Task Management > Execution Records > Script Tickets). 2. Execute the clean method in the custom script. |
| CCE | All | Delete the namespace and related components from the Kubernetes cluster, as listed after this table. |

  • Delete the ClusterRoleBinding whose name is cdrprobe.
  • Delete the ClusterRole whose name is cdrprobe.
  • Delete the OperatorCrd whose name is cdrprobes.cdrprobe.io.
  • Delete the Namespace whose name is cdrprobe.
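
For the CCE probe, the four deletions above could be scripted with kubectl. The following Python sketch rests on several assumptions: kubectl is installed and already configured for the target cluster, and the "OperatorCrd" named above is an ordinary Kubernetes CustomResourceDefinition. Neither is confirmed by this page, so check the resource types in your cluster before running anything. The function names are hypothetical, and the default is a dry run that only returns the command lines.

```python
import subprocess
from typing import List

PROBE_NAME = "cdrprobe"                # name used for the RBAC objects and the namespace
PROBE_CRD = "cdrprobes.cdrprobe.io"    # assumed to be a CustomResourceDefinition


def cce_cleanup_commands() -> List[List[str]]:
    """Build the kubectl commands in the same order as the list above."""
    return [
        ["kubectl", "delete", "clusterrolebinding", PROBE_NAME, "--ignore-not-found"],
        ["kubectl", "delete", "clusterrole", PROBE_NAME, "--ignore-not-found"],
        ["kubectl", "delete", "customresourcedefinition", PROBE_CRD, "--ignore-not-found"],
        ["kubectl", "delete", "namespace", PROBE_NAME, "--ignore-not-found"],
    ]


def run_cleanup(dry_run: bool = True) -> List[str]:
    """Return the command lines; execute them only when dry_run is False."""
    lines = []
    for cmd in cce_cleanup_commands():
        lines.append(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first command that fails.
            subprocess.run(cmd, check=True)
    return lines
```

Calling `run_cleanup()` with the default argument lets you review the commands first. The `--ignore-not-found` flag keeps a rerun from failing on objects that were already deleted.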