Help Center/ Cloud Operations Center/ User Guide/ Overview/ Viewing O&M Metrics on O&M BI Dashboards
Updated on 2025-08-08 GMT+08:00

Viewing O&M Metrics on O&M BI Dashboards

Prerequisites

If you use O&M BI dashboard in single-account scenarios, skip this section and go to O&M BI Dashboards.

If you use O&M BI dashboards in cross-account scenarios, ensure that the following conditions are met:

1. Cross-account management has been enabled for the current account, and the account is an organization or delegated administrator account.

2. The COC service has been enabled for the organization member accounts of the current account.

Scenarios

This module provides O&M BI dashboards with diverse metrics, including those for monitoring incidents, alarms, security compliance, service level objectives (SLOs), and production readiness reviews (PRRs). It offers an enterprise-grade O&M sandbox to help you understand your overall O&M situation from both bird's eye and ground level views.

  • The dedicated O&M BI dashboard caters to various O&M personnel, aiding in O&M optimization, insights, and decision-making.
  • COC provides 30+ preset O&M metrics, delivering insights into your cloud resources across seven-perspective BI dashboards and a comprehensive enterprise-grade O&M sandbox.
  • Organization administrators or delegated administrators can view the O&M situation data of organization member accounts across accounts, and aggregate data of multiple regions and applications across accounts.

O&M BI Dashboards

  1. Log in to COC.

    Choose Overview.

  2. Click O&M BI in the upper right corner.

    The O&M BI page is displayed. Filter O&M situation information by organization account, region, application, and date.
    Figure 1 Filtering data by organization account

    In the cross-account scenario, if no account is selected, the O&M situation data of the current account is displayed by default.

    Except the raw alarm data, other O&M BI dashboard data is updated at T+1.

    Figure 2 Application data aggregation in cross-account scenarios

O&M Overview

The O&M Overview page consists of overview data, risks detected, PRR statistics, and top 5 faults. The overview module enables you to observe the O&M situation from the global perspective, facilitating O&M optimization, insights, and decision-making. The risks detected module displays the number of risk faults, war rooms, and change-caused faults. The PRR statistics module provides the review statuses of your applications before they are released or put into commercial use. The top 5 faults module displays the incidents that have the most severe impacts on your services to help you quickly identify major fault scenarios. For details about the metrics included, see Table 1.

Figure 3 O&M overview
Table 1 Metrics on the O&M Overview page

Module

Metric

Data Source

Metric Definition

Calculation Rule

Statistical Period

Unit

Overview

Incident

Incident center

Collects the trend of the incident ticket quantity.

Collect the number of incident tickets created in a selected period.

Day or month

Count

Alarm

Alarm center

Collects the trend of the aggregated alarm quantity.

Collect the number of aggregated alarms created in a selected period.

Day or month

Count

Issue

Issue management

Collects the trend of the number of issue tickets created

Collect the number of issue tickets created in a selected period.

Day or month

Count

WarRoom

War rooms

Collects the war room quantity trend.

Collect the number of all war rooms set up in a selected period.

Day or month

Count

Change

Change management

Collects the trend of the number of change tickets created

Collect the number of change tickets created in a selected period.

Day or month

Count

Service SLO

SLO management

Collects the change trend of the actual SLO value of a cloud service.

Actual SLO value = 1 – (Service unavailability duration/Total cloud service duration) × 100%

Day or month

%

Risks Detected

Change-caused Faults

Incident management

Collects the number of incidents caused by changes.

Collect the number of incident tickets whose incident type is change operation.

Day or month

Count

Risk Faults

Incident management

Calculates the number of P3 or more severe incidents.

Collect the total number of P1, P2, and P3 incidents, including unhandled incidents.

Day or month

Count

WarRoom

Alarm center

Collects the number of war rooms.

Collect the number of all war rooms set up in a selected period.

Day or month

Count

PRR Statistics

Applications covered by PRR

PRR

Collects the number of applications that are covered by a PRR.

Collect the number of applications that are covered by a PRR.

Day or month

Count

Applications Passed in PRR Statistics

PRR

Collects the number of services passed or failed a PRR in each PRR phase.

Collect the number of services passed or failed a PRR in each PRR phase.

Day or month

Count

Top 5 Faults

Top 5 Faults

Incident management

Collects the top 5 most severe incidents.

Collect the number of handled P3 or more severe incidents in a specified period, rank the incidents by severity first and then by interruption duration to obtain the top 5 most severe incidents.

Day or month

Incident information

Change Management

The Changes page consists of change overview, change overhead, and change risks, which are key change metrics. The change overview area displays key change metrics, such as the average change duration and success rate, and provides Period over Period (PoP) data and trend charts to show the overall change status. The change risks area displays the faults caused by changes and provides the change level distribution charts. The change overhead area displays the trends of the persons required and time consumed by your services in a specified period so that you can control your change overhead as required. For details about the metrics included, see Table 2.

Figure 4 Change management
Table 2 Metrics on the Change page

Module

Metric

Data Source

Metric Definition

Calculation Rule

Statistical Period

Unit

Change Overview

Total Changes

Change management

Collects the number of change tickets.

Collect the number of change tickets completed in a selected period.

Day or month

Count

Change Success Rate

Change management

Collects the success rate of change tickets.

Metric value = Number change tickets that are handled/Total number of change tickets that are handled and failed × 100%

Day or month

%

Average Change Duration

Change management

Collects the average duration for handling change tickets.

Metric value = Total duration required by handled change tickets in a selected period/Number of handled change tickets × 100%

Day or month

ddhhmm

Change Trend

Change management

Collects the number of successful and failed changes and change success rate trend.

Collect the number of successful and failed changes and change success rate trend.

Day or month

Count

Change Overhead

Change Manpower

Change management

Collects the number of O&M engineers required in changes.

Collect the number of change coordinators and the number of change implementers.

Day or month

Person-time

Change Duration

Change management

Collects the average handling duration of change tickets.

Metric value = Total duration required by handled change tickets in a selected period/Number of handled change tickets × 100%

Day or month

ddhhmm

Change Risks

Change-caused Live-Network Faults

Change management

Collects the number of change-caused incidents of each level on the live network.

Collect the number of incident tickets created for each level of incidents that are caused by changes within a selected time range.

Day or month

Count

Change Level

Change management

Collects the number of change tickets for each level of changes.

Collect the number of change tickets for each level of changes in a selected period.

Day or month

Count

Fault Management

The Fault Management page consists of incident statistics, war rooms, and problem improvement, which are key metrics of incident management. The incident statistics area displays incident quantity, closure rate, rectification duration, affected applications, and SLA fulfillment the rate. Incident risks are analyzed in forms of PoP changes, trend charts, and top or bottom incidents. The WarRoom area displays affected applications, incident level statistics, and handling time window distribution, which are key metrics for the occurrence and improvement of major faults. The problem improvement area includes the issue closure rate and trend analysis of problem improvement to ensure that experience in handling known faults is accumulated, reducing the frequency and handling duration of similar faults. For details about the metrics included, see Table 3.

Figure 5 Fault management page

Table 3 Metrics on the Incident Management page

Module

Metric

Data Source

Metric Definition

Calculation Rule

Statistical Period

Unit

Incident statistics

Total Incidents

Incident management

Collects the total number of incident tickets.

Collect the number of incident tickets created in a selected period.

Note: Incident tickets in the draft or closed state are not counted.

Day or month

Count

Incident Level

Incident management

Collects the number of incident tickets of each type and level.

Collect the number of incident tickets of each type and level within a selected time range.

Day or month

Count

Incident Closure Rate

Incident management

Collects the closure rate of incident tickets.

Metric value = Number of incident tickets in the "Completed" status in the selected time period/Total number of incident tickets × 100%

Day or month

%

Average Incident Duration

Incident management

Collects the average handling duration of incident tickets.

Metric value = Total handling duration of closed incidents/Number of closed incidents × 100%

Day or month

ddhhmm

Affected Applications

Incident management

Collects the number of applications affected by an incident ticket.

Collect the number of affected applications (including deleted applications) of an incident ticket after deduplication.

Day or month

Count

Incident SLA Fulfillment Rate

SLA management

Collects statistics on the SLA fulfillment status of incident tickets.

Metric value = Number of incident tickets that do not violate the SLA/Total number of incident tickets included in the statistics × 100%

Day or month

%

War Rooms

WarRoom

War rooms

Collects the number of all war rooms set up.

Collect the number of all war rooms set up in a selected period.

Day or month

Count

Incident Level

Incident management

Collects the number of incidents of each level for setting up war rooms.

Collect the number of incidents of each level for setting up war rooms.

Day or month

Count

Affected Applications

War rooms

Collects the number of affected applications for setting up war rooms.

Calculate the number of affected applications for setting up war rooms after deduplication.

Day or month

Count

Average Handling Duration

War rooms

Collects the average duration for fault recovery from setting up war rooms.

Collect the total handling time of stopped war rooms in the selected period/Number of stopped war rooms.

Day or month

ddhhmm

Handling Time Window Distribution

War rooms

Collects the number of times war rooms are set up in each time window.

Collect the number of times war rooms are set up in each time window.

Day or month

Count

Problem Improvement

Total Issues

Issue management

Collects the number of issue tickets.

Collect the number of all issue tickets within a specified period except those in the canceled and to-be-submitted states.

Day or month

Count

Issue Closure Rate

Issue management

Collects the closure rate of issue tickets.

Metric value = Number of completed issue tickets/Total number of issue tickets × 100%

Day or month

%

Total Improvement Tickets

Improvement ticket management

Collects the number of improvement tickets.

Collect the number of all improvement tickets within a specified period except those in the draft state.

Day or month

Count

Improvement Ticket Completion Rate

Improvement ticket management

Collects the closure rate of improvement tickets.

Metric value = Number of completed improvement tickets/Total number of improvement tickets × 100%

Day or month

%

Monitoring and Alerting

The Alarm page displays alarm analysis, alarm costs, and alarm quality. These are key metrics displayed in charts. O&M personnel can quickly learn about the overall service status. The alarm analysis area displays the total number of alarms, alarm severity, top applications, alarm reduction, and alarm trend. By analyzing historical alarm data, the O&M supervisor can understand the trend and mode of service alarms and identify potential faults or performance deterioration. The alarm cost statistics include the alarm closure rate and automatic alarm handling rate. The O&M supervisor can effectively control the labor cost of changes based on the alarm cost. The alarm quality statistics collects incident ticket- and war room-triggered alarm detection rates, helping O&M supervisors evaluate the validity of current alarms and optimize alarm configurations in a timely manner. For details about the metrics included, see Table 4.

Figure 6 Monitoring and alarm reporting
Table 4 Metrics on the Alarm page

Module

Metric

Data Source

Metric Definition

Calculation Rule

Statistical Period

Unit

Alarm analysis

Total Alarms

Alarm management

Collects the number of alarms in the alarming state.

Collect the number of alarms within a specified period in the alarming state.

Day or month

Count

Alarm Severity

Alarm management

Collects the number of alarms of each severity in the alarming state.

Collect the number of alarms of each severity within a specified period in the alarming state.

Day or month

Count

Alarm Trend

Alarm management

Collects the trend of the number of alarms of each severity within the selected time range.

Collect the number of alarms of each severity within a selected time range

Day or month

Count

Alarm Cost

Alarm Closure Rate

Alarm management

Collects statistics on alarm closure.

Metric value = Number of closed alarms in the selected time range/Total number of alarms × 100%

Day or month

%

Automatic Alarm Handling Rate

Alarm management

Collects statistics on automatic alarm handling.

Metric Value = Number of automatically handled alarms in the selected time range/Total number of alarms ×100%

Day or month

%

Original Alarm > Alarm Quality

Fault Alarm Detection Rate

Incident management

Collects statistics on the number of incident tickets triggered by alarms.

Metric value = Number of incident tickets converted from alarms in the selected time range/Total number of incident tickets in the selected time range × 100%

Day or month

%

WarRoom Alarm Detection Rate

War rooms

Collects the number of war rooms triggered by alarms.

Metric value = Number of war rooms triggered by incidents converted from alarms in the selected time range/Total number of war rooms × 100%

Day or month

%

Aggregated Alarms > Alarm SLA

Alarm SLA Fulfillment Rate

SLA management

Collects statistics on the SLA fulfillment status of aggregated alarms.

Metric value = Number of alarm tickets that do not violate the SLA/Total number of alarm tickets included in the statistics × 100%

Day or month

%

For the Total Alarms metric, the bubbles on the page indicate the applications with top alarms of each severity. Top 1 is critical alarms, top 5 is major alarms, top 10 is minor alarms, and another top 10 is warning alarms.

Security Compliance

The Security Compliance page consists of statistics on scanned patches and account management. Patch scanning allows you to view instance compliance data by region, application, and OS, and display the number of scanned instances by time range.

Figure 7 Security Compliance page
Table 5 Metrics on the Security Compliance page

Module

Metric

Data Source

Metric Definition

Calculation Rule

Statistical Period

Unit

Patch Management

Instances Scan Statistics

Patch management/CloudCMDB

Collects the number of ECSs where patches have been scanned and have not been scanned under a tenant account.

The number of unscanned instances is the total number of instances minus the number of scanned instances.

Area and application

Count

Instance Compliance Status

Patch management

Collects the number of compliant and non-compliant instances in the scanned instances

Collect the number of instances in each compliance status in patch management.

Area and application

Count

Last Scan Time

Patch management

Collects statistics on the latest scanning time range of scanned instances.

Collect statistics on the latest scanning time range of scanned instances.

Area and application

Count

Account Management

Number of Managed Instances

Account management

Collects the number of managed cloud service instances in account management

Collect the number of managed cloud service instances in account management.

Area and application

Count

Management Rate

Account management

Collects the proportion of the managed cloud service instances to all instances

Management rate = Number of managed instances/Total number of instances × 100%.

Area and application

%

Managed Instance Statistics

Account management

Collects the statistics on the instance management trend by time period.

Collect the statistics on instance management trend by time period.

Area and application

-

Service Level Objective (SLO)

The SLO page covers the overall SLO achievement, SLO statistics by application, and error budget management. The overall SLO achievement area displays average annual and monthly SLO data and the overall SLO trend. The SLO statistics by application area displays SLO values by time and application for you to evaluate the service level of each application. The error budget area shows the error cost calculated on the SLO data of each application for you to evaluate changes or other high-risk operations. For details about the metrics included, see Table 6.

Figure 8 SLO dashboard
Table 6 Metrics on the SLO page

Module

Metric

Data Source

Metric Definition

Calculation Rule

Statistical Period

Unit

Overall SLO Achievement

Expected Annual SLO

SLO management

Collects the expected SLO value of an application in a year.

Expected SLO value = Expected SLO value set in the SLO management module

Expected SLO value of multiple applications = Average expected SLO value of applications

Year

%

Annual Actual SLO Value

SLO management

Collects the actual SLO achievement of an application in a year.

Actual SLO value in a year = 1 – (Annual service unavailability duration/Total application duration in a year) × 100%

Actual SLO value of multiple applications in a region = Yearly average SLO value of an application in a region

Actual SLO value of an application in multiple regions = Yearly minimum SLO value of an application in a region

Actual SLO value of multiple applications in multiple regions = Average actual SLO value of an application in multiple regions

Day or month

%

Non-compliant Applications

SLO management

Collects the number of applications that do not meet SLO expectations.

Calculate the number of applications that fail to achieve the SLO expectation. If all regions are selected and the actual SLO value of applications in any region in a year is less than the annual expected SLO value, the SLO exception is not met.

Day or month

Count

Expected Monthly SLO

SLO management

Collects the expected SLO achievement of an application in a month.

Expected SLO value = Expected SLO value set in the SLO management module

Expected SLO value of multiple applications = Average expected SLO value of applications

Day or month

%

Actual Monthly SLO

SLO management

Collects the actual SLO achievement in a month.

Actual SLO value in a month = 1 – (Monthly service unavailability duration/Total service duration in a month) × 100%

Actual SLO value of multiple applications in a region = Monthly average SLO value of an application in a region

Actual SLO value of an application in several regions = Monthly minimum SLO value of an application in a region

Actual SLO value of multiple applications in multiple regions = Average actual SLO value of an application in multiple regions

Day or month

%

SLO Statistics by Application

SLO Statistics by Application

SLO management

Collects SLO statistics by application.

Collect the monthly SLO actual value by application.

Actual SLO value in a month = 1 – (Monthly service unavailability duration/Total service duration in a month) × 100%

Actual SLO value of an application in several regions = Monthly minimum SLO value of an application in a region

Day or month

%

Error Budget

Error Budget

SLO management

Measures the difference between the actual performance and the expected performance and provides the error budgets.

If the actual SLO value is greater than the expected SLO value:

Error budgets = (Actual annual SLO value – Expected annual SLO value) × Total service duration in a year (minutes)

If the actual SLO value is less than or equal to the expected SLO value, the error budget is 0.

Day or month

Minute

PRR Dashboard

The PRR page consists of PRR statistics, evaluation radar profiles, and improvement task closure. The PRR statistics area shows the review phase of each service before the service is put into production and the review status. The evaluation radar profiles area shows the distribution of review items that do not meet service requirements. The improvement task closure area presents the rectification statuses of the items that do not meet the review requirements. For details about the metrics included, see Table 7.

Figure 9 PRR dashboard
Table 7 Metrics on the PRR page

Module

Metric

Data Source

Metric Definition

Calculation Rule

Statistical Period

Unit

PRR Statistics

Services Reviewed

PRR

Collects the number of services that are covered by the PRR.

Collect the total number of services are covered by the PRR within a selected time range. (Deduplicated services are excluded.)

Day or month

Count

Applications Reviewed

PRR

Collects the number of applications that are included in each PRR phase and the approval status.

Collect the number of applications included in each PRR phase and the review status within a selected time range. (Deduplicated applications are collected.)

Day or month

Count

Evaluation Radar Profiles

Evaluation Radar Profiles

PRR

Collects the distribution of PRR items that fail to be met.

Collect the number of review items that are not met in a selected time range.

Day or month

Count

Closure of Improvement Tasks

Improvement Task Distribution by Status

PRR

Collects the number of improvement tasks and their statuses.

Collect the number of improvement tasks and the statuses of the tasks within a selected time range.

Day or month

Count

Improvement Tasks

PRR

Collects the number of improvement tasks in each dimension and their closure statuses.

Collect the number of improvement tasks by review item and the statuses of these tasks.

Day or month

Count