Updated on 2024-04-19 GMT+08:00

Standardized Fault Management

Scenario

An O&M engineer for an intelligent customer service application handles incidents inefficiently because there are no standardized incident handling procedures, no clearly defined teams for collaborating on fault recovery, and no contingency plans. Similar faults recur, O&M experience is not accumulated, and deterministic faults cannot be recovered automatically. Alarms come in multiple severities, but without a standardized procedure they are processed slowly. A standardized incident process is needed so that alarms and incidents are handled consistently.

Solution

End-to-end incident handling process: Define standardized incident handling procedures, coordinate multiple O&M roles through WarRoom requests, and improve incident handling efficiency through response plans.

COC helps users manage alarms in a unified manner. Incident forwarding rules convert raw alarms into incident tickets or alarm tickets. When a raw alarm matches an incident forwarding rule, an incident or alarm is created and the corresponding owner is notified according to the shift configured in scheduling management. The owner can handle the alarm or convert it into an incident. After the issue is located and rectified, the alarm is cleared. If the alarm cannot be cleared, it can be escalated to an incident or handled through a WarRoom request. The result is a standardized alarm handling process that prevents irregular alarm handling.

The standardized incident handling process includes the following steps:

  1. Integrate and manage access to raw alarm data.
  2. Configure incident forwarding rules to clean and process alarms.
  3. Configure notification templates, and select notification recipients and methods in notification management according to the notification scenario.
  4. Handle or convert alarms in the integrated alarm system.
  5. Handle alarms that have been converted into incidents in the incident center. Incidents can be forwarded, escalated, de-escalated, or handled through WarRoom requests.
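The alarm part of this flow can be pictured as a small state model. The following Python sketch is illustrative only: the class, state, and field names are assumptions made for this example and are not part of any COC API. It shows the two outcomes described above, clearing an alarm after the issue is rectified or converting it into an incident when it cannot be cleared.

```python
from dataclasses import dataclass, field
from enum import Enum


class AlarmState(Enum):
    # Hypothetical state names chosen for this sketch.
    ACTIVE = "active"
    CLEARED = "cleared"
    CONVERTED = "converted_to_incident"


@dataclass
class Alarm:
    name: str
    owner: str
    state: AlarmState = AlarmState.ACTIVE
    history: list = field(default_factory=list)

    def _require_active(self):
        # Only an active alarm can be cleared or converted.
        if self.state is not AlarmState.ACTIVE:
            raise ValueError(f"alarm is already {self.state.value}")

    def clear(self):
        """The owner located and rectified the issue, so the alarm is cleared."""
        self._require_active()
        self.state = AlarmState.CLEARED
        self.history.append("cleared")

    def convert_to_incident(self, level):
        """The alarm cannot be cleared, so it is escalated to an incident."""
        self._require_active()
        self.state = AlarmState.CONVERTED
        self.history.append(f"converted to {level} incident")
        # The incident ticket is modeled as a plain dict in this sketch.
        return {"title": self.name, "level": level, "owner": self.owner}
```

For example, `Alarm("Disk full", "bob").convert_to_incident("P2")` returns an incident dictionary carrying the alarm name, the chosen level, and the owner.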

Prerequisites

An application group has been created on the application management page.

Personnel information has been added on the personnel management page.

A shift has been created on the scheduling management page.

Step 1: Integrate and Manage Access to Raw Alarm Data

  1. Log in to COC.
  2. In the navigation pane on the left, choose Incident Management > Data Source Integration.
  3. On the displayed page, select the data source to be accessed based on service requirements and click Access integration.
    Figure 1 Clicking Access integration
  4. On the displayed page, copy the endpoint URL.
    Figure 2 Endpoint URL
  5. Switch to the SMN console. In the navigation pane on the left, choose Topic Management > Subscriptions. On the displayed page, locate a desired description and click Add Subscription in the Operation column. In the displayed dialog box, click Select Topic to select a topic for Topic Name, set Protocol to HTTPS, paste the copied endpoint URL to Endpoint, and click OK.
    Figure 3 Add Subscription
  6. Log in to the Cloud Eye console. In the navigation pane on the left, choose Alarm Management > Alarm Rules. On the displayed page, click Create Alarm Rule, enable Alarm Notification, and select Topic subscription for Notification Recipient.
    Figure 4 Creating an alarm
  7. Return to COC, confirm the integration, and click Integrate.
    Figure 5 Clicking Integrate

Step 2: Create a Forwarding Rule to Clean Raw Alarm Data

  1. Log in to COC.
  2. In the navigation pane on the left, choose Incident Management > Incident Forwarding Rules.
  3. In the upper part of the list, click Create Incident Forwarding Rule.
    Figure 6 Creating a forwarding rule
  4. Enter basic information such as the rule name and application name as prompted.
  5. In the Trigger Rules area, select a trigger type, select a monitoring source for Data Source, and set triggering conditions and trigger criteria.
    Figure 7 Trigger criteria
  6. You can configure a response plan for the corresponding incident or alarm in the forwarding rule. You can select scripts or jobs.
    Figure 8 Response plan
  7. In the Assignment Details area, select an owner and click Submit.
    Figure 9 Assignment Details area
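To make the role of a forwarding rule concrete, here is a minimal Python sketch of how a raw alarm might be matched against rules and turned into a ticket. It is a simplified model, not COC's implementation: the field names, the `"incident"`/`"alarm"` trigger type values, and the equality-only trigger conditions are all assumptions made for illustration. Real trigger criteria may be richer than simple field equality.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForwardingRule:
    name: str
    application: str
    data_source: str      # monitoring source the rule listens to
    trigger_type: str     # "incident" or "alarm" (assumed values)
    conditions: dict      # field -> required value; all must match
    owner: str            # owner chosen in the Assignment Details area
    response_plan: Optional[str] = None  # name of a script or job, if any


def route(raw_alarm: dict, rules: list) -> Optional[dict]:
    """Return the ticket created by the first matching rule, or None."""
    for rule in rules:
        # A rule only applies to alarms from its own data source.
        if raw_alarm.get("source") != rule.data_source:
            continue
        # Every trigger condition must be satisfied by the raw alarm.
        if all(raw_alarm.get(k) == v for k, v in rule.conditions.items()):
            return {
                "type": rule.trigger_type,
                "application": rule.application,
                "owner": rule.owner,
                "response_plan": rule.response_plan,
                "rule": rule.name,
            }
    return None
```

A raw alarm that matches no rule yields `None` in this sketch, which corresponds to no incident or alarm ticket being created.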

Step 3: Configure the Notification Scenario, Recipient, and Method

  1. Log in to COC.
  2. In the navigation pane on the left, choose Basic Configurations > Notification Management. On the displayed page, click Create Notification.
    Figure 10 Clicking Create Notification
  3. In the displayed dialog box, set the parameters based on Table 1 and click OK.
    Figure 11 Clicking OK
    Table 1 Notification parameters

    | Parameter | Mandatory | Radio Button/Checkbox | Description |
    |---|---|---|---|
    | Name | Yes | / | Name of a notification instance. Fuzzy search can be performed based on the notification name. |
    | Type | Yes | Radio button | Level-1 category of incident notifications, which is classified by application type. |
    | Template | Yes | Checkbox | Notification content template, which is built into the system. The template list varies depending on the notification type. After a template is selected, the template is displayed when you hover the cursor over it. |
    | Notification Scope | Yes | Checkbox | Services the notification applies to. For example, if you select Service A and an incident ticket also indicates Service A, the subscription instance takes effect (other matching rules aside) and notifications are sent based on that subscription instance. |
    | Recipient | Yes | If you select Shift, you can select a single scenario and multiple roles. If you select Individual, you can select multiple users. | Recipient who receives notifications. When set to Shift, the notification module automatically retrieves the list of personnel under the current schedule and sends notifications to them. When set to Individual, notifications are sent directly to the selected individuals. |
    | Notification Rule | / | / | For example, if rule A is set to a and the value of rule A in an incident ticket is also a, the subscription instance takes effect (other matching rules aside) and a notification is sent based on the subscription instance. If the value of rule A in the incident ticket is b, the subscription instance does not take effect and no notification is sent. |
    | Notification Rule - Level | No | Checkbox | Level of an incident ticket. There are five levels: P1 to P5. For details about the incident ticket levels, see Incident Levels. |
    | Notification Rule - Incident Category | No | Checkbox | Category of an incident ticket. Multiple options are available. |
    | Notification Rule - Source | No | Checkbox | Source of an incident ticket. Manual creation indicates that the incident ticket is created in the incident ticket center. Transfer creation indicates that the incident ticket is generated during the transfer. |
    | Notification Rule - Region | No | Checkbox | Region of an incident ticket. Multiple regions can be selected. |
    | Method | Yes | Checkbox | Notification channel. |
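The matching behavior described for Notification Scope and Notification Rule can be sketched as follows. This Python example is an illustration under stated assumptions, not the product's implementation: the class and field names are hypothetical, an unset optional rule is assumed to mean "not considered", and all configured rules are assumed to be combined with AND.

```python
from dataclasses import dataclass, field


@dataclass
class Notification:
    name: str
    scope: set        # services the notification instance covers
    recipients: list
    methods: list     # notification channels
    # Optional rules. An empty set is assumed to mean "not considered".
    levels: set = field(default_factory=set)
    categories: set = field(default_factory=set)
    sources: set = field(default_factory=set)
    regions: set = field(default_factory=set)

    def applies_to(self, ticket: dict) -> bool:
        """Return True if this subscription instance takes effect for the ticket."""
        # The ticket's service must fall within the notification scope.
        if ticket["service"] not in self.scope:
            return False
        rules = {
            "level": self.levels,
            "category": self.categories,
            "source": self.sources,
            "region": self.regions,
        }
        # Each configured rule must contain the ticket's value for that field.
        return all(
            not allowed or ticket.get(key) in allowed
            for key, allowed in rules.items()
        )
```

With `levels={"P1"}` and a scope of Service A, a P1 ticket for Service A triggers the notification, while a P3 ticket for Service A or any ticket for Service B does not.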

Step 4: Handle Alarms

  1. Log in to COC.
  2. In the navigation pane on the left, choose Incident Management > Alarms.
  3. In the alarm list, you can clear alarms, convert alarms to incidents, handle alarms, and view historical alarms.
    Figure 12 Alarm list
  4. On the automatic alarm handling page, select scripts or jobs and select target instances for automatic alarm handling.
    Figure 13 Automatic alarm handling
  5. Click Convert Alarms to Incidents. In the displayed dialog box, set fields such as Application, Incident Level, and Owner, and click OK. The system sends notifications to the owner according to the notification rule.
    Figure 14 Converting an alarm to an incident
  6. Click Clear to clear the current alarm. The cleared alarm is then displayed on the Historical Alarms tab.
    Figure 15 Clearing alarms

Step 5: Handle Incidents

  1. Log in to COC.
  2. In the navigation pane on the left, choose Incident Management > Incident Center. On the displayed page, click the Pending tab and click the incident ticket number to access the incident details page.
    Figure 16 Clicking an incident ticket number
  3. Click Acknowledge.
    Figure 17 Clicking Acknowledge
  4. Click Change Owner.
    Figure 18 Clicking Change Owner
  5. Enter the forwarding information and click Submit.
    Figure 19 Entering forwarding information
  6. Click Upgrade/Downgrade.
    Figure 20 Clicking Upgrade/Downgrade
  7. Enter the upgrade or downgrade information and click Submit.
    Figure 21 Entering upgrade and downgrade information
  8. Click Start WarRoom.
    Figure 22 Clicking Start WarRoom
  9. Enter war room information and click Submit.
    Figure 23 Entering war room information
  10. Click Handle Incident.
    Figure 24 Clicking Handle Incident
  11. Enter incident handling information and click Submit.
    Figure 25 Entering incident handling information
  12. Click Verify Incident Closure.
    Figure 26 Clicking Verify Incident Closure
  13. Enter verification information and click OK.
    Figure 27 Entering verification information
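The incident operations in this step can be summarized as a simple lifecycle model. The Python sketch below is illustrative only: the status names, the rule that an incident must be acknowledged before other operations, and the class itself are assumptions made for this example, not documented COC behavior. Only the level names P1 to P5 come from the table above.

```python
LEVELS = ("P1", "P2", "P3", "P4", "P5")


class Incident:
    """Hypothetical model of an incident ticket's lifecycle."""

    def __init__(self, number, level, owner):
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        self.number = number
        self.level = level
        self.owner = owner
        self.status = "pending"
        self.war_room_started = False

    def _require(self, *allowed):
        # Guard: the operation is only valid in the given statuses.
        if self.status not in allowed:
            raise ValueError(f"operation not allowed while {self.status}")

    def acknowledge(self):
        self._require("pending")
        self.status = "acknowledged"

    def change_owner(self, new_owner):
        # Forward the incident to another owner.
        self._require("acknowledged")
        self.owner = new_owner

    def change_level(self, new_level):
        # Upgrade or downgrade the incident to another level.
        self._require("acknowledged")
        if new_level not in LEVELS:
            raise ValueError(f"unknown level: {new_level}")
        self.level = new_level

    def start_war_room(self):
        self._require("acknowledged")
        self.war_room_started = True

    def handle(self):
        self._require("acknowledged")
        self.status = "handled"

    def verify_closure(self):
        self._require("handled")
        self.status = "closed"
```

In this model, changing the owner, changing the level, and starting a war room are optional operations between acknowledgment and handling, while closure can only be verified after the incident has been handled.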