Creating and Managing Drill Tasks
Scenarios
Drill tasks allow you to simulate software or hardware faults to test the system's fault recovery capability. Drill task operations include creating and managing chaos drill tasks and viewing drill records. Setting up a drill task includes setting the basic information, adding an attack task group, selecting an attack task, and selecting an attack scenario. In addition, a drill task involves monitoring task configuration and post-drill review and improvement. This helps you identify and apply suitable optimization measures for the system under different fault conditions.
Automatic Task Termination Mechanism
- Automatic termination upon timeout: If a drill task fails and you do not manually close the task within 48 hours, the system automatically terminates the drill task.
- Automatic termination upon exceptions: During the drill, if a pod exception (for example, the pod has been deleted) is detected or a resource O&M ticket is manually closed, the system automatically terminates the current task immediately.
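The two termination rules above can be summarized in a short sketch. The function and parameter names are hypothetical and are not part of COC. The sketch also assumes that the 48-hour window is measured from the time the task failed, which the text does not state explicitly.

```python
from datetime import datetime, timedelta
from typing import Optional

# 48-hour limit taken from the timeout rule above.
FAILED_TASK_TIMEOUT = timedelta(hours=48)


def auto_termination_reason(
    now: datetime,
    failed_at: Optional[datetime] = None,
    manually_closed: bool = False,
    pod_exception: bool = False,
    ticket_closed: bool = False,
) -> Optional[str]:
    """Return why a task would be terminated automatically, or None.

    Hypothetical model: all names are illustrative, not a COC API.
    """
    # Exceptions detected during the drill terminate the task immediately.
    if pod_exception:
        return "exception: pod exception detected"
    if ticket_closed:
        return "exception: resource O&M ticket closed manually"
    # A failed task that nobody closed is terminated once 48 hours have passed.
    # Assumption: the window starts when the task failed.
    if failed_at is not None and not manually_closed:
        if now - failed_at >= FAILED_TASK_TIMEOUT:
            return "timeout: failed task not closed within 48 hours"
    return None
```

In this model, a failed task that was closed manually never reaches the timeout branch, regardless of how much time has passed.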
Creating a Drill Task
- Log in to COC.
- In the navigation tree on the left, choose Resilience Center > Chaos Drills.
- Click the Drill Tasks tab.
- Click Create Task.
     
     You can also open the page for creating a drill task by accepting a drill plan ticket. For details, see Creating and Managing Drill Plans. 
- Configure the basic information.
     
     Table 1 Parameters in the basic information

     | Parameter | Description | Example Value |
     |---|---|---|
     | Drill Task | Name of the drill task. Set it according to the naming rules. | test-drill |
     | Expected Recovery Duration (Minutes) | Expected time from the fault occurrence to the fault recovery, in minutes. This is the time expected for an application to automatically recover to the normal state during emergency plan execution after a fault is injected. This time does not affect the drill task. | 3 |
- Click Add Attack Task.
     
     By default, there is one attack task group. You can click Add Task Group to add a task group. After adding an attack task, you can click Add Attack Task to add another attack task.
     - Tasks in different task groups are executed in serial mode, and tasks in the same task group are executed in parallel mode.
     - Currently, multiple fault injection operations on the same resource in a task group are not supported.
 - Set parameters for adding an attack task. 
       - To add an existing task, click Select from Existing, select the existing task, and click OK.
- To add a new attack task, perform the following steps. 
         Table 2 Parameters for adding an attack task

         | Parameter | Description | Example Value |
         |---|---|---|
         | Vendor | Select a cloud vendor type. | Huawei Cloud |
         | Source of Attack Target | Select the source of the target instance. You can select attack targets by selecting instances, pods, or a specified number of targets if CCE instances are used. | Elastic Cloud Server (ECS) |
         | Attack Task | Customize the name of the attack task based on the naming rule. | test-attacktask |
         | Attack Target | Select the target instance. | - |
 
- Click Next.
- Set parameters for selecting an attack scenario. 
       For details, see Attack Scenarios.

       Table 3 Parameters for selecting an attack scenario

       | Parameter | Description | Example Value |
       |---|---|---|
       | Attack Type | Attack scenarios are classified based on attack scenario types. | Host Resource |
       | Attack Scenario | Select an attack scenario. | CPU usage increase |
       | Attack Parameters | Configure attack parameters based on attack scenarios. | CPU Usage (%): 80; Fault Duration (s): 60 |
- Click Next.
- (Optional) Set the parameters in Configure Monitoring Tasks.
       Table 4 Parameters for configuring a monitoring task

       | Parameter | Description |
       |---|---|
       | Steady-State Metrics | Select the target resource, performance metric, lower limit, and upper limit from the drop-down lists one by one. If a service performs well and stably while a performance metric stays within a certain value range, the metric is called a steady-state metric. If the metric value is outside that range before a drill, the drill is canceled. |
       | Metric | Select the target resource, monitoring metric, lower limit, and upper limit from the drop-down lists one by one. These service metrics monitor the corresponding service data during fault drills. If the value of such a metric is within the allowed value range, the service is normal. Otherwise, you can decide whether to stop the drill. |
       | Automatic Rollback | Select whether to enable automatic rollback. When it is enabled, fault injection is automatically rolled back and the target is restored to its status before fault injection. If the value of a steady-state metric falls outside the stable value range during a drill, the corresponding fault injection stops automatically. Automatic rollback cannot be configured for disruptors that do not support fault termination. |
- Click Finish. The attack task is added.
 
- Click OK. The drill task is created, and its status is "to be drilled".
     
     If you click Save as Draft, the task is saved in the draft status. A drill task in the draft status cannot be started. 
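The rules configured in the steps above can be illustrated with a small model: task groups run one after another, tasks within a group run in parallel, a resource can be attacked only once per group, and the drill is canceled if a steady-state metric is out of range beforehand. This is a sketch only. The class and function names are hypothetical, the range limits are assumed to be inclusive, and `inject` stands in for whatever fault injection COC actually performs.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class AttackTask:
    name: str
    resource_id: str
    inject: Callable[[], str]  # placeholder for the real fault injection


@dataclass
class SteadyStateMetric:
    name: str
    lower: float
    upper: float

    def in_range(self, value: float) -> bool:
        # Assumption: both limits are inclusive.
        return self.lower <= value <= self.upper


def validate_group(group: Sequence[AttackTask]) -> None:
    """Reject a group that targets the same resource more than once."""
    seen = set()
    for task in group:
        if task.resource_id in seen:
            raise ValueError(f"resource {task.resource_id} is attacked twice in one group")
        seen.add(task.resource_id)


def run_drill(
    groups: Sequence[Sequence[AttackTask]],
    metric: SteadyStateMetric,
    read_metric: Callable[[], float],
) -> List[List[str]]:
    """Run groups serially and the tasks inside each group in parallel."""
    # Validate every group before injecting any fault.
    for group in groups:
        validate_group(group)
    # Cancel the drill if the steady-state metric is out of range beforehand.
    if not metric.in_range(read_metric()):
        raise RuntimeError(f"drill canceled: {metric.name} is outside its steady-state range")
    results: List[List[str]] = []
    for group in groups:
        # The with-block waits for all tasks in this group before the next group starts.
        with ThreadPoolExecutor(max_workers=max(1, len(group))) as pool:
            results.append(list(pool.map(lambda task: task.inject(), group)))
    return results
```

In this model the same resource may appear in two different groups, because groups never run at the same time. Only duplicates within a single group are rejected.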
Modifying a Drill Task
If drill records have been generated for a drill task, the task cannot be modified.
- Log in to COC.
- In the navigation tree on the left, choose Resilience Center > Chaos Drills.
- Click the Drill Tasks tab.
- Locate the drill task you want to modify, click More in the Operation column, and choose Modify.
- Modify the drill task as required.
    
    - Click Add Task Group to add a task group.
    - Click Add Attack Task to add an attack task.
    - Click Delete in the row of a task to delete the attack task.
 
- Click OK.
    
    The drill task is modified. 
Deleting a Drill Task
If a created drill task is no longer needed, you can delete it. Do not delete the drill task in the following scenarios:
- The drill task has generated drill records.
- The drill task is associated with a drill plan.
- Log in to COC.
- In the navigation tree on the left, choose Resilience Center > Chaos Drills.
- Click the Drill Tasks tab.
- Locate the drill task you want to delete, click More in the Operation column, and choose Delete.
- Click OK.
    
    The drill task is deleted. 
Starting a Drill Task
- Log in to COC.
- In the navigation tree on the left, choose Resilience Center > Chaos Drills.
- Click the Drill Tasks tab.
- Locate the drill task you want to start and click Start in the Operation column.
- Click OK.
    
    The drill starts. On the drill details page, you can view the attack progress, including installing probes, performing drills, and clearing the environment. The system automatically performs the drill task. The execution time depends on the attack time of the disruptor.

    In the probe installation step, a probe is installed on the target machine. The probe runs in the system to receive disruptor commands for attack, query, and clearance. After the drill is complete or terminated, the environment clearing step stops all probe operations in the system and removes the probe. 
- For drill execution, the following operations are supported:
    
    - Terminate: During a drill, click Terminate in the upper right corner to stop tasks that are waiting to be executed or are abnormal.
- Forcible termination: You are advised to use the termination function first. If the termination function fails, you can use the forcible termination function after 5 to 10 minutes. Note that the forcible termination function only closes the current drill ticket and does not automatically clear data in the environment. You need to manually clear data in the environment. For details, see Manually Clearing Data in the Environment.
- Retry: If some or all attack tasks fail to check instances, install probes, clear environments, or perform steady-state detection, or the drill times out, expand the failed attack task and click Retry to retry the task.
- Skip: If some or all attack tasks fail to be executed during the drill, expand a failed attack task and click Skip to skip the task and execute the next task.
- Details: Expand an attack task and click Details to view the attack details.
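The Retry and Skip operations can be pictured as a loop over attack tasks in which a failed task is either run again or passed over so that the next task can run. The following sketch uses hypothetical names (`run_attack_tasks`, `execute`, `decide`) and models the operator's click as a callback. It is an illustration of the behavior described above, not how COC is implemented.

```python
from typing import Callable, Dict, Sequence


def run_attack_tasks(
    task_names: Sequence[str],
    execute: Callable[[str], bool],
    decide: Callable[[str], str],
) -> Dict[str, str]:
    """Run tasks in order and ask `decide` what to do when one fails.

    `execute` returns True on success. `decide` returns "retry" or "skip";
    any other answer is treated here as terminating the drill.
    """
    outcome: Dict[str, str] = {}
    for name in task_names:
        while True:
            if execute(name):
                outcome[name] = "succeeded"
                break
            choice = decide(name)
            if choice == "retry":
                continue  # run the same task again
            if choice == "skip":
                outcome[name] = "skipped"
                break  # move on to the next task
            # Terminate: stop here and leave the remaining tasks unexecuted.
            outcome[name] = "terminated"
            return outcome
    return outcome
```

Note that this loop retries for as long as `decide` keeps answering "retry", so a real caller would want to bound the number of attempts.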
 
Viewing Drill Records
You can view the drill records of a drill task. A drill task that has not been executed has no drill records.
- Log in to COC.
- In the navigation tree on the left, choose Resilience Center > Chaos Drills.
- Click the Drill Tasks tab.
- Locate the target drill task and click Drill Record in the Operation column.
    
    The basic information about the drill task includes the drill task name, drill task ID, attack details, and failure mode. All drill records include the drill record ID, execution status, executor, drill start time, and drill end time. 
- Locate the drill record to be viewed and click View Progress in the Operation column.
    
    View the attack progress, attack details, and monitoring details of the current drill task.
    - The drill record module displays attack task details, including the progress, task information, and execution time.
    - The attack details module displays the attack status of instances in the application of the current task. BMSs, FlexusL (HCSS) instances, and CSS instances are not supported.
    - The monitoring details module displays real-time monitoring data of attack targets. You need to configure a drill monitoring task when creating an attack task. The screen can be zoomed in to the horizontal full-screen mode.
 
- Click Drill Report on the right.
    
    Create or view a drill report. For details, see Creating a Drill Report. 
Manually Clearing Data in the Environment
Forcibly terminating a drill task only closes the current drill ticket and does not automatically clear data in the environment. You need to manually clear the data.
The clearing methods vary depending on probes. For details, see Table 5.
| Probe Type | Disruptor Type | How to Clear |
|---|---|---|
| CFE | All | 1. Log in to the host. 2. Switch to the /usr/local/cdr_probe and /usr/local/COC-CDR-Probe directories. 3. Delete all files in the directories. |
| CSS | All | No residual data exists. |
| DCS | DCS_REDIS_AZSHUTDOWN (AZ power-off of DCS instances) | Start the DCS instance by referring to DCS Guide. |
| DCS | Other | No residual data exists. |
| DDS | All | No residual data exists. |
| Platform | All | No residual data exists. |
| RDS | RDS_FAILOVER (switching over RDS primary and standby nodes) | No residual data exists. |
| RDS | RDS_SHUTDOWN (stopping an RDS instance) | Start the RDS instance by referring to RDS Guide. |
| Script | SCRIPT_FAULT (custom scripts) | 1. Cancel the ticket that is not closed (choose Task Management > Execution Records > Script Tickets). 2. Execute the clean method in the custom script. |
| CCE | All | Delete the namespace and related components from the Kubernetes cluster. |
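If cleanup needs to be looked up from a runbook or script, the table above can be expressed as a simple mapping. The sketch below is only a convenience illustration: the names `CLEANUP_STEPS` and `cleanup_steps` are hypothetical, the step texts are condensed from the table, and the fallback order (exact disruptor, then "Other", then "All") is an assumption about how the rows are meant to be read.

```python
from typing import Dict, Optional, Tuple

NO_RESIDUE = "No residual data exists."

# (probe type, disruptor type) -> condensed cleanup instruction from the table.
CLEANUP_STEPS: Dict[Tuple[str, str], str] = {
    ("CFE", "All"): "Log in to the host and delete all files in "
                    "/usr/local/cdr_probe and /usr/local/COC-CDR-Probe.",
    ("CSS", "All"): NO_RESIDUE,
    ("DCS", "DCS_REDIS_AZSHUTDOWN"): "Start the DCS instance by referring to DCS Guide.",
    ("DCS", "Other"): NO_RESIDUE,
    ("DDS", "All"): NO_RESIDUE,
    ("Platform", "All"): NO_RESIDUE,
    ("RDS", "RDS_FAILOVER"): NO_RESIDUE,
    ("RDS", "RDS_SHUTDOWN"): "Start the stopped instance by referring to RDS Guide.",
    ("Script", "SCRIPT_FAULT"): "Cancel the unclosed script ticket, then execute "
                                "the clean method in the custom script.",
    ("CCE", "All"): "Delete the namespace and related components from the "
                    "Kubernetes cluster.",
}


def cleanup_steps(probe_type: str, disruptor_type: str) -> Optional[str]:
    """Return the cleanup instruction, or None if the table has no matching row."""
    # Assumed reading order: exact disruptor first, then "Other", then "All".
    for key in ((probe_type, disruptor_type), (probe_type, "Other"), (probe_type, "All")):
        if key in CLEANUP_STEPS:
            return CLEANUP_STEPS[key]
    return None
```

A return value of `None` means the combination is not covered by the table, in which case the lookup should not be taken as evidence that nothing needs to be cleared.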