Updated on 2024-11-20 GMT+08:00

O&M Situational Awareness

COC provides O&M situation awareness capabilities through monitoring of changes, incidents, alarms, security compliance, service level objectives (SLOs), production readiness reviews (PRRs), and more. In this module, you can view the overall O&M situation from macro to micro on an enterprise-level O&M sandbox.

  • The dedicated O&M BI dashboard caters to various O&M roles, aiding in O&M optimization, insights, and decision-making.
  • 30+ O&M metrics are preset, presenting O&M situations of your cloud resources or applications on 7 perspective-based dashboards and a comprehensive enterprise-level O&M sandbox.
  • Organization administrators or delegated administrators can view the O&M situation data of organization member accounts across accounts, and aggregate data of multiple regions and applications across accounts.

Prerequisites

If you use the O&M situation awareness function in the single-account scenario, skip this section and see Procedure.

If you use the O&M situation awareness function across accounts, the following prerequisites must be met:

1. Cross-account management has been enabled for the current account, and the account is an organization or delegated administrator account.

2. The COC service has been enabled for the organization member accounts of the current account.

Scenarios

View O&M situation data of your applications on COC.

Procedure

  1. Log in to COC.
  2. On the Overview page of COC, click O&M Situation Awareness.
  3. On the O&M Situation Awareness sandbox, filter the O&M data by region, application, or a specified duration as required.
  4. In the cross-account scenario, filter O&M situation information by organization account, region, application, and date.

    Figure 1 Filtering data by organization account

    In the cross-account scenario, if no account is selected, the O&M situation data of the current account is displayed by default.

    Figure 2 Application data aggregation in cross-account scenarios

O&M Overview

The O&M overview page consists of four modules: overview, risk reporting, PRR summary, and top 5 incidents. The overview module enables you to observe the O&M situation from the global perspective, facilitating O&M optimization, insights, and decision-making. The risk reporting module displays the O&M statuses and risks reported through P3 or more severe incident tickets, WarRoom requests, faults triggered by changes, and critical alarms. The PRR summary module provides the review statuses of your applications before they are released or put into commercial use. The top 5 incidents module displays the top 5 incidents that have the most severe impacts on your services to help you quickly identify major fault scenarios. For details about the metrics included, see Table 1.

Figure 3 O&M overview
Table 1 Metrics in the O&M overview

| Module | Metric | Data Source | Metric Definition | Calculation Rule | Statistical Period | Measurement Unit |
| --- | --- | --- | --- | --- | --- | --- |
| Overview | Incidents | Incident center | Collects the incident ticket quantity trend. | Collect the number of incident tickets created in a selected period. | Day or month | Count |
| Overview | Alarms | Alarm center | Collects the alarm quantity trend. | Collect the number of alarms generated in a selected period. | Day or month | Count |
| Overview | War Rooms | War rooms | Collects the WarRoom request quantity trend. | Collect the number of WarRoom requests initiated in a selected period. | Day or month | Count |
| Overview | Monitoring Discovery Rate | Alarm center | Collects the proportion of incidents that trigger specified alarms. | Monitoring discovery rate = Number of incidents that meet the filter criteria and trigger specified alarms/Total number of incidents that meet the filter criteria x 100% | Day or month | % |
| Overview | Changes | Change management | Collects the change ticket quantity trend. | Collect the number of change tickets created in a selected period. | Day or month | Count |
| Overview | Cloud Service SLO | SLO management | Collects the change trend of the actual SLO value of a cloud service. | Cloud service SLO = (1 – Unavailability duration of the cloud service/Total duration of the cloud service) x 100% | Day or month | % |
| Risk reporting | Change-triggered Incidents | Incident management | Collects the number of incidents caused by changes. | Collect the number of incident tickets whose incident type is change. | Day or month | Count |
| Risk reporting | Critical Alarms in Last 7 Days | Alarm center | Collects the number of critical alarms in the last 7 days. | Collect the number of critical alarms in the last 7 days. | Last 7 days | Count |
| Risk reporting | P3 or More Severe Incidents | Incident management | Collects the number of P3 or more severe incidents. | Collect the total number of P1, P2, and P3 incidents, including unhandled incidents. | Day or month | Count |
| Risk reporting | WarRoom Requests | Alarm center | Collects the number of WarRoom requests. | Collect the number of WarRoom requests initiated in a selected period. | Day or month | Count |
| PRR summary | PRR | PRR | Collects the number of services that are covered by a PRR. | Collect the number of services that are covered by a PRR. | Day or month | Count |
| PRR summary | PRR passing | PRR | Collects the number of services that passed or failed a PRR in each PRR phase. | Collect the number of services that passed or failed a PRR in each PRR phase. | Day or month | Count |
| Top 5 incidents | Top 5 Incidents | Incident management | Collects the top 5 most severe incidents. | Collect the handled P3 or more severe incidents in a specified period, then rank them by severity first and by interruption duration second to obtain the top 5 most severe incidents. | Day or month | Incident information |
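
The ranking rule for the Top 5 Incidents metric can be illustrated with a short Python sketch. This is not COC's implementation: the `Incident` class, its field names, and the numeric level encoding (1 = P1) are hypothetical, and the sketch assumes that a longer interruption ranks higher within the same severity level, which the table does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    ticket_id: str
    level: int             # hypothetical encoding: 1 = P1 (most severe), 2 = P2, ...
    interruption_min: int  # interruption duration in minutes
    handled: bool


def top_incidents(incidents, limit=5):
    """Return handled P3-or-more-severe incidents, ranked by severity, then duration."""
    eligible = [i for i in incidents if i.handled and i.level <= 3]
    # Lower level number means more severe; within a level, the longer
    # interruption comes first (an assumption of this sketch).
    eligible.sort(key=lambda i: (i.level, -i.interruption_min))
    return eligible[:limit]


sample = [
    Incident("INC-1", 3, 40, True),
    Incident("INC-2", 1, 10, True),
    Incident("INC-3", 2, 90, True),
    Incident("INC-4", 2, 120, True),
    Incident("INC-5", 4, 300, True),  # excluded: less severe than P3
    Incident("INC-6", 1, 5, False),   # excluded: not handled yet
]
ranked = [i.ticket_id for i in top_incidents(sample)]
print(ranked)
```

With this sample data, the P1 incident comes first, followed by the two P2 incidents ordered by interruption duration, and then the P3 incident.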

Changes

The Changes page consists of three modules: data overview, change risks, and change overhead. Together they display the change status of your applications or cloud services using core change metrics. The data overview module covers metrics including change duration, success rate, and automated change rate, and presents the overall change statistics of your services on change trend charts. The change risks module displays the faults caused by changes and provides the change success rate, as well as the change level and change method distribution charts. The change overhead module shows the trends of the labor required and time consumed by changes to your services in a specified period so that you can control your change overhead as required. For details about the metrics included, see Table 2.

Figure 4 Changes
Table 2 Metrics on the Changes page

| Metric | Data Source | Metric Definition | Calculation Rule | Statistical Period | Measurement Unit |
| --- | --- | --- | --- | --- | --- |
| Change-caused Incidents on the Live Network | Change management | Collects the number of change-caused incidents of each level on the live network. | Collect the number of incident tickets created for each level of incidents that are caused by changes within a selected time range. | Day or month | Count |
| Change Level | Change management | Collects the number of change tickets for each level of changes. | Collect the number of change tickets for each level of changes in a selected period. | Day or month | Count |
| Change Method | Change management | Collects the number of change tickets for each change method, such as automated and manual changes. | Collect the number of change tickets for each change method. | Day or month | Count |
| Total Changes | Change management | Collects the number of change tickets. | Collect the number of change tickets completed in a selected period. | Day or month | Count |
| Change Success Rate | Change management | Collects the success rate of change tickets. | Change success rate = Number of successfully handled change tickets/Total number of handled change tickets (successful and failed) x 100% | Day or month | % |
| Average Change Duration | Change management | Collects the average duration for handling change tickets. | Average change duration = Total duration required by handled change tickets in a selected period/Number of handled change tickets | Day or month | ddhhmm |
| Automatic Change Rate | Change management | Collects the proportion of automatic changes in all change tickets. | Automatic change rate = Number of automatic changes/Total number of change tickets x 100% | Day or month | % |
| Change Trend | Change management | Collects the number of successful and failed changes and the change success rate trend. | Collect the number of successful and failed changes and the change success rate trend. | Day or month | Count |
| Change Manpower | Change management | Collects the number of O&M engineers required in changes. | Change manpower = Number of change coordinators + Number of change implementers | Day or month | Person-time |
| Change Duration | Change management | Collects the average handling duration of change tickets. | Average change handling duration = Total duration required by handled change tickets in a selected period/Number of handled change tickets | Day or month | ddhhmm |
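
As a worked illustration of the rate and duration metrics on the Changes page, the following Python sketch computes a success rate, an automatic change rate, and an average duration from a list of handled change tickets. The `ChangeTicket` class and its fields are hypothetical, and the `format_ddhhmm` helper shows only one possible rendering of the ddhhmm unit; neither reflects how COC stores or displays this data.

```python
from dataclasses import dataclass


@dataclass
class ChangeTicket:
    succeeded: bool      # True if the change completed successfully
    automated: bool      # True if the change used an automated method
    duration_min: float  # handling duration in minutes


def change_metrics(tickets):
    """Compute rates (in percent) and the average duration over handled tickets."""
    total = len(tickets)
    if total == 0:
        # No tickets in the period: the ratios are undefined.
        return {"success_rate": None, "automatic_rate": None, "avg_duration_min": None}
    succeeded = sum(t.succeeded for t in tickets)
    automated = sum(t.automated for t in tickets)
    return {
        "success_rate": succeeded / total * 100,
        "automatic_rate": automated / total * 100,
        "avg_duration_min": sum(t.duration_min for t in tickets) / total,
    }


def format_ddhhmm(minutes):
    """Render minutes as days, hours, and minutes (an assumed display format)."""
    days, rem = divmod(int(round(minutes)), 24 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days:02d}d{hours:02d}h{mins:02d}m"


sample = [
    ChangeTicket(True, True, 30),
    ChangeTicket(True, False, 90),
    ChangeTicket(False, True, 1500),
    ChangeTicket(True, True, 60),
]
metrics = change_metrics(sample)
print(metrics, format_ddhhmm(metrics["avg_duration_min"]))
```

In this sample, three of four tickets succeeded and three of four were automated, so both rates are 75%, and the average duration is 420 minutes (7 hours).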

Fault Management

Fault Management consists of three modules: incident statistics, WarRoom, and backtracking and improvement. These modules use core metrics from the entire incident management process to help you manage and handle incidents efficiently. Backed by metrics such as incident quantity, closure rate, handling duration, and number of affected applications, the incident statistics module presents incident risks of your cloud services and applications on incident risk trend charts and top/bottom ranking charts with change data marked. The WarRoom module covers the affected applications, levels, and time windows of incidents that trigger WarRoom requests, highlighting major fault scenarios and showing how the faults were handled. The backtracking and improvement module includes the closure rates and trend analysis of fault backtracking and improvement tickets, so that experience in handling known faults is accumulated and the frequency and handling duration of similar faults are reduced. For details about the metrics included, see Table 3.

Figure 5 Fault management
Table 3 Incident management data dictionary

| Module | Metric | Data Source | Metric Definition | Calculation Rule | Statistical Period | Measurement Unit |
| --- | --- | --- | --- | --- | --- | --- |
| Incident statistics | Total Incidents | Incident management | Collects the total number of incident tickets. | Collect the number of incident tickets created in a selected period. | Day or month | Count |
| Incident statistics | Incident Level | Incident management | Collects the number of incident tickets of each type and level. | Collect the number of incident tickets of each type and level within a selected time range. | Day or month | Count |
| Incident statistics | Incident Closure Rate | Incident management | Collects the closure rate of incident tickets. | Incident ticket closure rate = Number of closed incident tickets within a selected time range/Total number of incident tickets x 100% | Day or month | % |
| Incident statistics | Incident Duration | Incident management | Collects the average handling duration of incident tickets. | Incident handling duration = Total handling duration of closed incidents/Number of closed incidents | Day or month | ddhhmm |
| Incident statistics | Affected Applications | Incident management | Collects the number of applications affected by an incident ticket. | Collect the number of affected applications (including deleted applications) of an incident ticket after deduplication. | Day or month | Count |
| War rooms | WarRoom Requests | War rooms | Collects the number of all WarRoom requests. | Collect the number of WarRoom requests initiated in a selected period. | Day or month | Count |
| War rooms | Fault Level | Incident management | Collects the number of incidents of each level for a WarRoom request. | Calculate the number of incidents of each level for a WarRoom request. | Day or month | Count |
| War rooms | Affected Applications | War rooms | Collects the number of affected applications for a WarRoom request. | Calculate the number of affected applications for a WarRoom request after deduplication. | Day or month | Count |
| War rooms | Average Recovery Duration | War rooms | Collects the average duration for fault recovery from a WarRoom request. | Average WarRoom recovery duration = Total duration required by handled WarRoom requests within a selected time range/Number of handled WarRoom requests | Day or month | ddhhmm |
| War rooms | Distribution of Handling Time Windows | War rooms | Collects the number of times WarRoom requests are initiated in each time window. | Collect the number of times WarRoom requests are initiated in each time window. | Day or month | Count |
| Backtracking and improvement | Backtracking Tickets | Issue Management | Collects the number of backtracking tickets. | Collect the total number of backtracking tickets in a statistical period. | Day or month | Count |
| Backtracking and improvement | Closure Rate of Backtracking Tickets | Issue Management | Collects the closure rate of backtracking tickets. | Closure rate of backtracking tickets = Number of closed backtracking tickets/Total number of backtracking tickets x 100% | Day or month | % |
| Backtracking and improvement | Total Improvement Tickets | Issue Management | Collects the number of improvement tickets. | Collect the total number of improvement tickets in a statistical period. | Day or month | Count |
| Backtracking and improvement | Improvement Ticket Closure Rate | Issue Management | Collects the closure rate of improvement tickets. | Closure rate of improvement tickets = Number of closed improvement tickets/Total number of improvement tickets x 100% | Day or month | % |
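
The closure rate and the deduplicated count of affected applications can be sketched in a few lines of Python. The ticket dictionaries, their keys, and the application names below are hypothetical sample data used only to make the arithmetic concrete.

```python
def closure_rate(tickets):
    """Closed tickets divided by all tickets in the period, as a percentage."""
    if not tickets:
        return None  # undefined when there are no tickets in the period
    closed = sum(1 for t in tickets if t["status"] == "closed")
    return closed / len(tickets) * 100


def affected_applications(tickets):
    """Count distinct applications across tickets (deduplicated)."""
    apps = set()
    for t in tickets:
        apps.update(t["applications"])
    return len(apps)


sample = [
    {"status": "closed", "applications": ["shop", "pay"]},
    {"status": "open", "applications": ["pay"]},
    {"status": "closed", "applications": ["search", "shop"]},
]
rate = closure_rate(sample)
app_count = affected_applications(sample)
print(f"closure rate: {rate:.1f}%, affected applications: {app_count}")
```

Two of the three sample tickets are closed, and although five application references appear in total, only three distinct applications are counted.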

Monitoring and Alerting

The monitoring and alerting package displays alarm information in charts, helping O&M engineers quickly learn about the overall service status. It consists of three modules that reflect the core metrics of alarm management: alarm analysis, alarm costs, and alarm quality. Alarm analysis provides the metrics for calculating the total number of alarms, alarm severity, top 10 applications, alarm reduction, and alarm trend. By analyzing historical alarm data, O&M supervisors can understand the trends and patterns of service alarms and detect potential performance problems or faults. The alarm cost statistics include the alarm manpower and automatic handling rate, which O&M supervisors can use to control labor costs. The alarm quality statistics cover the alarm detection rates for incident tickets and war rooms, helping O&M supervisors evaluate the validity of current alarms and optimize alarm configurations in a timely manner. For details about the metrics included, see Table 4.

Figure 6 Monitoring and alerting
Table 4 Monitoring alarm data dictionary

| Module | Metric | Data Source | Metric Definition | Calculation Rule | Statistical Period | Measurement Unit |
| --- | --- | --- | --- | --- | --- | --- |
| Alarm analysis | Alarms | Alarms | Collects the total number of alarms. | Collect the number of alarms generated in a selected period. | Day or month | Count |
| Alarm analysis | Alarm Severity | Alarms | Collects the number of alarms of each severity. | Number of alarms of each severity within the selected time range | Day or month | Count |
| Alarm analysis | Alarm Trend | Alarms | Collects the trend of the number of alarms of each severity within the selected time range. | Number of alarms of each severity within the selected time range | Day or month | Count |
| Alerting Cost | Persons Involved | Alarms | Collects the number of alarm handling participants. | Number of owners (deduplicated) for integrated alarms | Day or month | Person |
| Alerting Cost | Alarms Handled Per Capita | Alarms | Collects the number of alarms handled per person. | Total number of alarms in the selected time range/Number of alarm handling participants in the selected time range | Day or month | Person |
| Alerting Cost | Automatic Alarm Handling Rate | Alarms | Collects the proportion of alarms that are handled automatically. | Number of automatically handled alarms in the selected time range/Total number of alarms x 100% | Day or month | % |
| Alarm Quality | Fault Alarm Detection Rate | Incident Management | Collects the proportion of incident tickets triggered by alarms. | Number of incident tickets converted from alarms in the selected time range/Total number of incident tickets in the selected time range x 100% | Day or month | % |
| Alarm Quality | War Room Alarm Detection Rate | War rooms | Collects the proportion of war rooms triggered by alarms. | Number of war rooms triggered by incidents converted from alarms in the selected time range/Total number of war rooms x 100% | Day or month | % |
| Alarms Reported | Alarms Reported | Alarms | Displays alarm risks reported by application. | Weighted calculation and sorting based on the severity and quantity of alarms reported for an application | Day or month | N/A |
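
The alerting cost metrics follow directly from the alarm list, as this Python sketch shows. The alarm records, the `owner` and `auto_handled` keys, and the owner names are hypothetical; the sketch also assumes every alarm has exactly one owner.

```python
def alerting_cost(alarms):
    """Derive persons involved, alarms per capita, and automatic handling rate."""
    total = len(alarms)
    owners = {a["owner"] for a in alarms}  # deduplicated alarm owners
    auto = sum(1 for a in alarms if a["auto_handled"])
    return {
        "persons_involved": len(owners),
        "alarms_per_capita": total / len(owners) if owners else None,
        "auto_handling_rate": auto / total * 100 if total else None,
    }


sample = [
    {"owner": "alice", "auto_handled": True},
    {"owner": "bob", "auto_handled": False},
    {"owner": "alice", "auto_handled": True},
    {"owner": "carol", "auto_handled": False},
    {"owner": "bob", "auto_handled": True},
    {"owner": "alice", "auto_handled": False},
]
cost = alerting_cost(sample)
print(cost)
```

Six alarms spread across three distinct owners give two alarms per person, and three automatically handled alarms out of six give a 50% automatic handling rate.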

Security Compliance

The security compliance module includes patch scanning statistics and account management data (coming soon). Patch scanning allows you to view instance compliance data by region, application, and OS, and displays the number of scanned instances by time range.

Figure 7 Security compliance

| Module | Metric | Data Source | Metric Definition | Calculation Rule | Statistical Period | Measurement Unit |
| --- | --- | --- | --- | --- | --- | --- |
| Patch Management | Instance Scanning Status | Patch management/CloudCMDB | Number of ECSs under a tenant account that have and have not been scanned for patches | Unscanned instances = Total instances – Scanned instances | Region and application | Count |
| Patch Management | Instance Compliance Status | Patch management | Number of compliant and non-compliant instances among the scanned instances | Collect the number of instances in each compliance status in patch management. | Region and application | Count |
| Patch Management | Last Scan Time | Patch management | Collects the latest scanning time range of scanned instances. | Collect the latest scanning time range of scanned instances. | Region and application | Count |
| Account Management | Managed Instances | Account management | Number of managed cloud service instances in account management | Number of managed cloud service instances in account management | Region and application | Count |
| Account Management | Management Rate | Account management | Proportion of the managed cloud service instances to all instances | Management rate = Number of managed instances/Total number of instances x 100% | Region and application | % |
| Account Management | Managed Instance Statistics | Account management | Displays the instance management trend by time period. | Displays the instance management trend by time period. | Region and application | - |
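
The two calculation rules in this table reduce to simple arithmetic, shown here as a Python sketch with made-up instance counts. The function names are illustrative only.

```python
def unscanned_instances(total, scanned):
    """Unscanned instances = total instances minus scanned instances."""
    return total - scanned


def management_rate(managed, total):
    """Managed instances as a percentage of all instances."""
    return None if total == 0 else managed / total * 100


# Hypothetical counts for one region and application.
unscanned = unscanned_instances(total=40, scanned=28)
rate = management_rate(managed=30, total=40)
print(unscanned, rate)
```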

SLO Dashboard

The service level objective (SLO) dashboard covers the overall SLO achievement, application-dimension SLO statistics, and error budget management. In the Overall SLO Achievement area, you can view SLO values by year and month and the overall service level trend. In the SLO Statistics by Application area, you can view SLO values by time and application and evaluate the service level of each application. The Error Budgets module shows the error budget based on the SLO values of each application to provide guidance for changes or other high-risk operations. For details about the metrics included, see Table 5.

Figure 8 SLO dashboard
Table 5 SLO dashboard data dictionary

| Module | Metric | Data Source | Metric Definition | Calculation Rule | Statistical Period | Measurement Unit |
| --- | --- | --- | --- | --- | --- | --- |
| SLO achievement | Annual Expected SLO Value | SLO management | Expected SLO value of applications in a year | Expected SLO value = Expected SLO value set in the SLO management module; Expected SLO value of multiple applications = Average expected SLO value of applications | Year | % |
| SLO achievement | Annual Actual SLO Value | SLO management | Collects the actual SLO achievement of an application in a year. | Actual SLO value in a year = (1 – Annual service unavailability duration/Total application duration in a year) x 100%; Actual SLO value of multiple applications in a region = Average actual SLO value of these applications in a year; Actual SLO value of an application in several regions in a year = Minimum actual SLO value of the application in multiple regions in a year; Actual SLO value of multiple applications in multiple regions = Average actual SLO value of these applications in multiple regions in a year | Day or month | % |
| SLO achievement | Applications That Do Not Meet Expectations | SLO management | Collects the number of applications that do not meet SLO expectations. | Calculate the number of applications that fail to achieve the SLO expectation. If all regions are selected and the actual SLO value of an application in any region in a year is less than the annual expected SLO value, the SLO expectation is not met. | Day or month | Count |
| SLO achievement | Monthly Expected SLO Value | SLO management | Collects the expected SLO achievement of an application in a month. | Expected SLO value = Expected SLO value set in the SLO management module; Expected SLO value of multiple applications = Average expected SLO value of applications | Day or month | % |
| SLO achievement | Monthly Actual SLO Value | SLO management | Collects the actual SLO achievement in a month. | Actual SLO value in a month = (1 – Monthly service unavailability duration/Total service duration in a month) x 100%; Actual monthly SLO value of multiple applications in a region = Average actual SLO value of these applications in a month; Actual SLO value of an application in several regions = Minimum actual SLO value of the application in multiple regions in a month; Actual SLO value of multiple applications in multiple regions = Average actual SLO value of these applications in multiple regions in a month | Day or month | % |
| SLO statistics by application | SLO Statistics by Application | SLO management | Collects SLO statistics by application. | Collect the monthly actual SLO value by application. Actual SLO value in a month = (1 – Monthly service unavailability duration/Total service duration in a month) x 100%; Actual SLO value of an application in several regions in a month = Minimum actual SLO value of the application in multiple regions in a month | Day or month | % |
| Error budgets | Error Budgets | SLO management | Measures the difference between the actual performance and the expected performance and provides the error budgets. | If the actual SLO value is greater than the expected SLO value: Error budgets = (Actual annual SLO value – Expected annual SLO value) x Total service duration in a year (minutes). If the actual SLO value is less than or equal to the expected SLO value, the error budget is 0. | Day or month | Minute |
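
The SLO and error budget rules can be made concrete with a small Python sketch. The function names and sample durations are hypothetical. The sketch also makes two assumptions that the table leaves open: SLO values are handled as percentages and divided by 100 before being multiplied by the service duration, and the value for multiple applications in multiple regions is taken as the average of each application's minimum across regions.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def actual_slo(unavailable_min, total_min):
    """Actual SLO value as a percentage: (1 - unavailable/total) x 100."""
    return (1 - unavailable_min / total_min) * 100


def app_slo_across_regions(region_values):
    """One application in several regions: take the minimum regional value."""
    return min(region_values)


def multi_app_slo(app_values):
    """Several applications: take the average of their values."""
    return sum(app_values) / len(app_values)


def error_budget_minutes(actual_pct, expected_pct, total_min=MINUTES_PER_YEAR):
    """Remaining error budget in minutes; 0 if the expectation is not exceeded."""
    if actual_pct <= expected_pct:
        return 0.0
    # Percentage points are converted to a fraction before multiplying
    # by the total service duration (an assumption of this sketch).
    return (actual_pct - expected_pct) / 100 * total_min


slo = actual_slo(unavailable_min=52.56, total_min=MINUTES_PER_YEAR)
worst_region = app_slo_across_regions([99.99, 99.95])
combined = multi_app_slo([worst_region, 99.99])
budget = error_budget_minutes(actual_pct=99.99, expected_pct=99.9)
print(round(slo, 4), worst_region, round(combined, 4), round(budget, 2))
```

With 52.56 minutes of unavailability in a year, the actual SLO value is about 99.99%. Against an expected value of 99.9%, that leaves roughly 473 minutes of error budget for changes or other high-risk operations.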

PRR Dashboard

The PRR dashboard encompasses the review service summary, evaluation radar distribution, service review, and improvement task closure. The review service summary module shows the review phase of each service before the service is put into production and the review status. The evaluation radar distribution module shows the distribution of review items that do not meet service requirements. The service review and improvement module presents the rectification statuses of the items that do not meet the review requirements. For details about the metrics included, see Table 6.

Figure 9 PRR dashboard
Table 6 PRR dashboard data dictionary

| Module | Metric | Data Source | Metric Definition | Calculation Rule | Statistical Period | Measurement Unit |
| --- | --- | --- | --- | --- | --- | --- |
| Service PRR summary | Total Review Services | PRR | Collects the number of services that are included in the PRR. | Collect the total number of services that are covered by the PRR within a selected time range. | Day or month | Count |
| Service PRR summary | Service PRR Summary | PRR | Collects the number of services that are included in each PRR phase and the approval status. | Collect the number of services included in each PRR phase and the approval status within a selected time range. | Day or month | Count |
| Evaluation radar distribution chart | Evaluation Radar Distribution | PRR | Collects the distribution of PRR items that fail to be met. | Collect the number of review items that are not met in a selected time range. | Day or month | Count |
| Service review | Services to Be Reviewed | PRR | Collects the total number of services to be reviewed and the approval status. | Collect the total number of services to be reviewed and the service approval status within a selected time range. | Day or month | Count |
| Closure of improvement tasks | Task Closure Statistics | PRR | Collects the number of improvement tasks and their closure statuses. | Collect the number of improvement tasks and the closure statuses of the tasks within a selected time range. | Day or month | Count |
| Closure of improvement tasks | Improvement Tasks | PRR | Collects the number of improvement tasks in each dimension and their closure statuses. | Collect the number of improvement tasks by review item and the closure statuses of these tasks. | Day or month | Count |