O&M Situational Awareness
COC provides O&M situation awareness capabilities through monitoring of changes, incidents, alarms, security compliance, service level objectives (SLOs), production readiness reviews (PRRs), and more. In this module, you can view the overall O&M situation from macro to micro on an enterprise-level O&M sandbox.
- The dedicated O&M BI dashboard caters to various O&M roles, aiding in O&M optimization, insights, and decision-making.
- 30+ O&M metrics are preset, presenting O&M situations of your cloud resources or applications on 7 perspective-based dashboards and a comprehensive enterprise-level O&M sandbox.
- Organization administrators or delegated administrators can view the O&M situation data of organization member accounts, and aggregate data from multiple regions and applications across accounts.
Prerequisites
If you use the O&M situation awareness function in the single-account scenario, skip this section and see Procedure.
If you use the O&M situation awareness function across accounts, the following prerequisites must be completed:
1. Cross-account management has been enabled for the current account, and the account is an organization administrator account or a delegated administrator account.
2. The COC service has been enabled for the organization member accounts of the current account.
Scenarios
View O&M situation data of your applications on COC.
Procedure
1. Log in to COC.
2. On the Overview page of COC, click O&M Situation Awareness.
3. On the O&M Situation Awareness sandbox, filter the O&M data by region, application, and date range as required. In the cross-account scenario, you can also filter by organization account.
Figure 1 Filtering data by organization account
In the cross-account scenario, if no account is selected, the O&M situation data of the current account is displayed by default.
Figure 2 Application data aggregation in cross-account scenarios
O&M Overview
The O&M overview page consists of four modules: overview, risk reporting, PRR summary, and top 5 incidents. The overview module enables you to observe the O&M situation from the global perspective, facilitating O&M optimization, insights, and decision-making. The risk reporting module displays the O&M statuses and risks reported through P3 or more severe incident tickets, WarRoom requests, faults triggered by changes, and critical alarms. The PRR summary module provides the review statuses of your applications before they are released or put into commercial use. The top 5 incidents module displays the top 5 incidents that have the most severe impacts on your services to help you quickly identify major fault scenarios. For details about the metrics included, see Table 1.
| Module | Metric | Data Source | Metric Definition | Calculation Rule | Statistical Period | Measurement Unit |
|---|---|---|---|---|---|---|
| Overview | Incidents | Incident center | Collects the trend of the incident ticket quantity. | Collect the number of incident tickets created in a selected period. | Day or month | Count |
|  | Alarms | Alarm center | Collects the alarm quantity trend. | Collect the number of alarms generated in a selected period. | Day or month | Count |
|  | War Rooms | War rooms | Collects the WarRoom request quantity trend. | Collect the number of WarRoom requests initiated in a selected period. | Day or month | Count |
|  | Monitoring Discovery Rate | Alarm center | Collects the proportion of incidents that trigger specified alarms. | Monitoring discovery rate = Number of incidents that meet the filter criteria and trigger specified alarms/Total number of incidents that meet the filter criteria x 100% | Day or month | % |
|  | Changes | Change management | Collects the change ticket quantity trend. | Collect the number of change tickets created in a selected period. | Day or month | Count |
|  | Cloud Service SLO | SLO management | Collects the change trend of the actual SLO value of a cloud service. | Cloud service SLO = (1 – Unavailability duration of the cloud service/Total duration of the cloud service) x 100% | Day or month | % |
| Risk reporting | Change-triggered Incidents | Incident management | Collects the number of incidents caused by changes. | Collect the number of incident tickets whose incident type is change. | Day or month | Count |
|  | Critical Alarms in Last 7 Days | Alarm center | Collects the number of critical alarms in the last 7 days. | Collect the number of critical alarms in the last 7 days. | Last 7 days | Count |
|  | P3 or More Severe Incidents | Incident management | Collects the number of P3 or more severe incidents. | Collect the total number of P1, P2, and P3 incidents, including unhandled incidents. | Day or month | Count |
|  | WarRoom Requests | Alarm center | Collects the number of WarRoom requests. | Collect the number of WarRoom requests initiated in a selected period. | Day or month | Count |
| PRR summary | PRR | PRR | Collects the number of services that are covered by a PRR. | Collect the number of services that are covered by a PRR. | Day or month | Count |
|  | PRR Passing | PRR | Collects the number of services that passed or failed a PRR in each PRR phase. | Collect the number of services that passed or failed a PRR in each PRR phase. | Day or month | Count |
| Top 5 incidents | Top 5 Incidents | Incident management | Collects the top 5 most severe incidents. | Collect the handled P3 or more severe incidents in a specified period, then rank them by severity first and then by interruption duration to obtain the top 5 most severe incidents. | Day or month | Incident information |
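The calculation rules for Top 5 Incidents, Monitoring Discovery Rate, and Cloud Service SLO can be sketched in a few lines of Python. This is only an illustration of the rules as worded in Table 1, not COC's implementation. The `Incident` record and its field names are hypothetical, and two points are assumptions: within the same severity, the longer interruption is ranked first, and an empty incident set yields a rate of 0.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative, not a COC API."""
    ticket_id: str
    level: int                 # 1 = P1 (most severe), 2 = P2, 3 = P3, ...
    interruption_minutes: int
    handled: bool
    alarm_triggered: bool


def top_incidents(incidents, limit=5):
    """Handled P3-or-more-severe incidents, ranked by severity, then interruption duration."""
    candidates = [i for i in incidents if i.handled and i.level <= 3]
    # Assumption: for equal severity, the longer interruption ranks higher.
    candidates.sort(key=lambda i: (i.level, -i.interruption_minutes))
    return candidates[:limit]


def monitoring_discovery_rate(incidents):
    """Percentage of the (already filtered) incidents that triggered specified alarms."""
    if not incidents:
        return 0.0  # Assumption: report 0 when no incident matches the filter.
    hits = sum(1 for i in incidents if i.alarm_triggered)
    return hits / len(incidents) * 100


def cloud_service_slo(unavailable_minutes, total_minutes):
    """Cloud service SLO as a percentage: (1 - unavailable/total) x 100."""
    return (1 - unavailable_minutes / total_minutes) * 100
```

For example, a service that is unavailable for 30 minutes out of 1,000 has `cloud_service_slo(30, 1000)` of about 97%.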
Changes
The Changes page consists of three modules: data overview, change overhead, and change risks. Together they display the change statuses of your applications or cloud services using core change metrics. The data overview module covers metrics including change duration, success rate, and automated change rate, and presents the overall change statistics of your services on change trend charts. The change risks module displays the faults caused by changes and provides the change success rate, as well as the change level and change method distribution charts. The change overhead module shows the trends of the manpower required and time consumed by changes to your services in a specified period so that you can control your change overhead. For details about the metrics included, see Table 2.
| Metric | Data Source | Metric Definition | Calculation Rule | Statistical Period | Measurement Unit |
|---|---|---|---|---|---|
| Change-caused Incidents on the Live Network | Change management | Collects the number of change-caused incidents of each level on the live network. | Collect the number of incident tickets created for each level of incidents that are caused by changes within a selected time range. | Day or month | Count |
| Change Level | Change management | Collects the number of change tickets for each level of changes. | Collect the number of change tickets for each level of changes in a selected period. | Day or month | Count |
| Change Method | Change management | Collects the number of change tickets for each change method, such as automated and manual changes. | Collect the number of change tickets for each change method. | Day or month | Count |
| Total Changes | Change management | Collects the number of change tickets. | Collect the number of change tickets completed in a selected period. | Day or month | Count |
| Change Success Rate | Change management | Collects the success rate of change tickets. | Change success rate = Number of change tickets that are handled/Total number of change tickets that are handled or failed x 100% | Day or month | % |
| Average Change Duration | Change management | Collects the average duration for handling change tickets. | Average change duration = Total duration required by handled change tickets in a selected period/Number of handled change tickets | Day or month | ddhhmm |
| Automatic Change Rate | Change management | Collects the proportion of automatic changes in all change tickets. | Automatic change rate = Number of automatic changes/Total number of change tickets x 100% | Day or month | % |
| Change Trend | Change management | Collects the number of successful and failed changes and the change success rate trend. | Collect the number of successful and failed changes and the change success rate trend. | Day or month | Count |
| Change Manpower | Change management | Collects the number of O&M engineers required in changes. | Change manpower = Number of change coordinators + Number of change implementers | Day or month | Person-time |
| Change Duration | Change management | Collects the average handling duration of change tickets. | Average change handling duration = Total duration required by handled change tickets in a selected period/Number of handled change tickets | Day or month | ddhhmm |
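The rate and duration rules in Table 2 can be illustrated with a short Python sketch. It is a reading of the table under stated assumptions, not COC's implementation: the `ChangeTicket` record, its field names, and the status values `"handled"` and `"failed"` are hypothetical, and the `ddhhmm` rendering shown here is one plausible interpretation of that unit.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ChangeTicket:
    """Hypothetical change ticket; fields and status values are illustrative."""
    status: str            # "handled" or "failed"
    automated: bool
    duration: timedelta
    coordinators: int
    implementers: int


def change_success_rate(tickets):
    """Handled tickets as a percentage of handled plus failed tickets."""
    finished = [t for t in tickets if t.status in ("handled", "failed")]
    if not finished:
        return 0.0
    return sum(t.status == "handled" for t in finished) / len(finished) * 100


def automatic_change_rate(tickets):
    """Automated changes as a percentage of all change tickets."""
    if not tickets:
        return 0.0
    return sum(t.automated for t in tickets) / len(tickets) * 100


def average_change_duration(tickets):
    """Mean duration of handled tickets (a duration, so no percentage factor)."""
    handled = [t for t in tickets if t.status == "handled"]
    if not handled:
        return timedelta(0)
    return sum((t.duration for t in handled), timedelta(0)) / len(handled)


def change_manpower(tickets):
    """Coordinators plus implementers, summed over all tickets (person-times)."""
    return sum(t.coordinators + t.implementers for t in tickets)


def format_ddhhmm(duration):
    """Render a duration as days, hours, and minutes (assumed display format)."""
    total_minutes = int(duration.total_seconds() // 60)
    days, rem = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rem, 60)
    return f"{days:02d}d{hours:02d}h{minutes:02d}m"
```

Keeping the average as a `timedelta` until display avoids mixing units when ticket durations span more than one day.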
Fault Management
Fault Management consists of three modules: incident statistics, WarRoom, and backtracking and improvement. These modules use core metrics from the entire incident management process to help you manage and handle incidents efficiently. Backed by metrics such as incident quantity, closure rate, handling duration, and number of affected applications, the incident statistics module presents the incident risks of your cloud services and applications on incident risk trend charts and top/bottom ranking charts, with change data marked. The WarRoom module covers the affected applications, levels, and time windows of the incidents that trigger WarRoom requests, alerting you to major fault scenarios and showing how the faults were handled. The backtracking and improvement module includes the closure rate and trend analysis of fault backtracking and improvement tickets, ensuring that experience in handling known faults is accumulated and reducing the frequency and handling duration of similar faults. For details about the metrics included, see Table 3.
| Module | Metric | Data Source | Metric Definition | Calculation Rule | Statistical Period | Measurement Unit |
|---|---|---|---|---|---|---|
| Incident statistics | Total Incidents | Incident management | Collects the total number of incident tickets. | Collect the number of incident tickets created in a selected period. | Day or month | Count |
|  | Incident Level | Incident management | Collects the number of incident tickets of each type and level. | Collect the number of incident tickets of each type and level within a selected time range. | Day or month | Count |
|  | Incident Closure Rate | Incident management | Collects the closure rate of incident tickets. | Incident ticket closure rate = Number of closed incident tickets within a selected time range/Total number of incident tickets x 100% | Day or month | % |
|  | Incident Duration | Incident management | Collects the average handling duration of incident tickets. | Incident handling duration = Total handling duration of closed incidents/Number of closed incidents | Day or month | ddhhmm |
|  | Affected Applications | Incident management | Collects the number of applications affected by an incident ticket. | Collect the number of affected applications (including deleted applications) of an incident ticket after deduplication. | Day or month | Count |
| War rooms | WarRoom Requests | War rooms | Collects the number of all WarRoom requests. | Collect the number of WarRoom requests initiated in a selected period. | Day or month | Count |
|  | Fault Level | Incident management | Collects the number of incidents of each level for a WarRoom request. | Collect the number of incidents of each level for a WarRoom request. | Day or month | Count |
|  | Affected Applications | War rooms | Collects the number of affected applications for a WarRoom request. | Collect the number of affected applications for a WarRoom request after deduplication. | Day or month | Count |
|  | Average Recovery Duration | War rooms | Collects the average duration for fault recovery from a WarRoom request. | Average WarRoom recovery duration = Total duration required by handled WarRoom requests within a selected time range/Number of handled WarRoom requests | Day or month | ddhhmm |
|  | Distribution of Handling Time Windows | War rooms | Collects the number of times WarRoom requests are initiated in each time window. | Collect the number of times WarRoom requests are initiated in each time window. | Day or month | Count |
| Backtracking and improvement | Backtracking Tickets | Issue Management | Collects the number of backtracking tickets. | Collect the total number of backtracking tickets in a statistical period. | Day or month | Count |
|  | Closure Rate of Backtracking Tickets | Issue Management | Collects the closure rate of backtracking tickets. | Closure rate of backtracking tickets = Number of closed backtracking tickets/Total number of backtracking tickets x 100% | Day or month | % |
|  | Total Improvement Tickets | Issue Management | Collects the number of improvement tickets. | Collect the total number of improvement tickets in a statistical period. | Day or month | Count |
|  | Improvement Ticket Closure Rate | Issue Management | Collects the closure rate of improvement tickets. | Closure rate of improvement tickets = Number of closed improvement tickets/Total number of improvement tickets x 100% | Day or month | % |
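Three of the rules in Table 3 (closure rate, deduplicated affected applications, and the distribution of WarRoom requests over time windows) are sketched below in Python. This is an illustration only. The `IncidentTicket` record is hypothetical, and the source does not state how time windows are defined, so the 6-hour time-of-day buckets used here are an assumption.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class IncidentTicket:
    """Hypothetical incident ticket; field names are illustrative."""
    closed: bool
    applications: list = field(default_factory=list)


def closure_rate(tickets):
    """Closed tickets as a percentage of all tickets in the selected range."""
    if not tickets:
        return 0.0
    return sum(t.closed for t in tickets) / len(tickets) * 100


def affected_application_count(tickets):
    """Number of distinct applications across tickets (deduplicated)."""
    return len({app for t in tickets for app in t.applications})


def warroom_time_windows(start_times, window_hours=6):
    """Count WarRoom requests per time-of-day window.

    Assumption: windows are fixed-size buckets of the day; the real
    window boundaries are not given in the source.
    """
    counts = Counter()
    for ts in start_times:
        start = ts.hour // window_hours * window_hours
        counts[f"{start:02d}:00-{start + window_hours:02d}:00"] += 1
    return dict(counts)
```

Using a set for the application names is what implements the "after deduplication" wording: an application hit by several tickets is counted once.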
Monitoring and Alerting
The alerting and monitoring package displays alarm information in charts, helping O&M engineers quickly learn about the overall service status. It consists of three modules: alarm analysis, alarm costs, and alarm quality, reflecting the core metrics of alarm management. The alarm analysis module provides metrics such as the total number of alarms, alarm severity, top 10 applications, alarm reduction, and alarm trend. By analyzing historical alarm data, O&M supervisors can understand the trends and patterns of service alarms and detect potential performance problems or faults. The alarm costs module covers alarm handling manpower and the automatic handling rate, which O&M supervisors can use to control the labor cost of alarm handling. The alarm quality module collects the alarm detection rates of incident tickets and war rooms, helping O&M supervisors evaluate the validity of current alarms and optimize alarm configurations in a timely manner. For details about the metrics included, see Table 4.
| Module | Metric | Data Source | Metric Definition | Calculation Rule | Statistical Period | Measurement Unit |
|---|---|---|---|---|---|---|
| Alarm analysis | Alarms | Alarms | Collects the total number of alarms. | Collect the number of alarms generated in a selected period. | Day or month | Count |
|  | Alarm Severity | Alarms | Collects the number of alarms of each severity. | Number of alarms of each severity within the selected time range | Day or month | Count |
|  | Alarm Trend | Alarms | Collects the trend of the number of alarms of each severity within the selected time range. | Number of alarms of each severity within the selected time range | Day or month | Count |
| Alerting Cost | Persons Involved | Alarms | Collects the number of alarm handling participants. | Number of owners (deduplicated) for integrated alarms | Day or month | Person |
|  | Alarms Handled Per Capita | Alarms | Collects the number of alarms handled per person. | Total number of alarms in the selected time range/Number of alarm handling participants in the selected time range | Day or month | Person |
|  | Automatic Alarm Handling Rate | Alarms | Collects statistics on automatic alarm handling. | Number of automatically handled alarms in the selected time range/Total number of alarms x 100% | Day or month | % |
| Alarm Quality | Fault Alarm Detection Rate | Incident Management | Collects the proportion of incident tickets that are triggered by alarms. | Number of incident tickets converted from alarms in the selected time range/Total number of incident tickets in the selected time range x 100% | Day or month | % |
|  | War Room Alarm Detection Rate | War rooms | Collects the proportion of war rooms that are triggered by alarms. | Number of war rooms triggered by incidents converted from alarms in the selected time range/Total number of war rooms x 100% | Day or month | % |
| Alarms Reported | Alarms Reported | Alarms | Displays alarm risks reported by application. | Weighted calculation and sorting based on the severity and quantity of alarms reported for an application | Day or month | N/A |
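The alerting cost metrics and the weighted ranking behind Alarms Reported can be sketched as follows. This is a hedged illustration, not COC's algorithm: alarms are modeled as plain dictionaries with hypothetical keys, and because the source does not publish the weights used for the ranking, the `SEVERITY_WEIGHTS` values below are invented purely for the example.

```python
from collections import defaultdict

# Hypothetical severity weights; the real weights are not given in the source.
SEVERITY_WEIGHTS = {"critical": 8, "major": 4, "minor": 2, "info": 1}


def alarms_per_capita(alarms):
    """Total alarms divided by the number of distinct (deduplicated) owners."""
    owners = {a["owner"] for a in alarms if a.get("owner")}
    return len(alarms) / len(owners) if owners else 0.0


def automatic_handling_rate(alarms):
    """Automatically handled alarms as a percentage of all alarms."""
    if not alarms:
        return 0.0
    return sum(1 for a in alarms if a["auto_handled"]) / len(alarms) * 100


def rank_applications_by_alarm_risk(alarms, weights=SEVERITY_WEIGHTS):
    """Sum severity weights per application and sort by descending score."""
    scores = defaultdict(int)
    for a in alarms:
        scores[a["application"]] += weights.get(a["severity"], 0)
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Summing one weight per alarm makes both severity and quantity contribute to the score, which matches the "severity and quantity" wording in Table 4 without claiming to reproduce the actual formula.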
Security Compliance
The security compliance module includes statistics on the number of scanned patches and account management data (coming soon). Patch scanning allows you to view instance compliance data by region, application, and OS, and to display the number of scanned instances by time range.
| Module | Metric | Data Source | Metric Definition | Calculation Rule | Statistical Period | Measurement Unit |
|---|---|---|---|---|---|---|
| Patch Management | Instance Scanning Status | Patch management/CloudCMDB | Numbers of ECSs under a tenant account that have and have not been scanned for patches | Unscanned instances = Total instances – Scanned instances | Region and application | Count |
|  | Instance Compliance Status | Patch management | Number of compliant and non-compliant instances among the scanned instances | Collect the number of instances in each compliance status in patch management. | Region and application | Count |
|  | Last Scan Time | Patch management | Collects statistics on the latest scanning time range of scanned instances. | Collect statistics on the latest scanning time range of scanned instances. | Region and application | Count |
| Account Management | Managed Instances | Account management | Number of managed cloud service instances in account management | Number of managed cloud service instances in account management | Region and application | Count |
|  | Management Rate | Account management | Proportion of the managed cloud service instances to all instances | Management rate = Number of managed instances/Total number of instances x 100% | Region and application | % |
|  | Managed Instance Statistics | Account management | Displays the instance management trend by time period. | Displays the instance management trend by time period. | Region and application | - |
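The two formulas in the table above reduce to simple set and ratio arithmetic. The Python sketch below is illustrative only. It assumes, since the source lists Patch management/CloudCMDB as the data source without saying which part comes from where, that the full instance inventory and the scanned instance IDs arrive as two separate collections. The function and parameter names are hypothetical.

```python
def scanning_status(all_instance_ids, scanned_instance_ids):
    """Scanned and unscanned counts: unscanned = total - scanned."""
    total = set(all_instance_ids)
    # Ignore scan records for instances that are no longer in the inventory.
    scanned = set(scanned_instance_ids) & total
    return {"scanned": len(scanned), "unscanned": len(total) - len(scanned)}


def management_rate(managed_instances, total_instances):
    """Managed instances as a percentage of all instances."""
    if total_instances == 0:
        return 0.0
    return managed_instances / total_instances * 100
```

For example, with four instances in the inventory and two of them scanned, `scanning_status` reports two scanned and two unscanned instances.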
SLO Dashboard
The service level objective (SLO) dashboard covers the overall SLO achievement, application-dimension SLO statistics, and error budget management. In the Overall SLO Achievement area, you can view SLO values by year and month and the overall service level trend. In the SLO Statistics by Application area, you can view SLO values by time and application and evaluate the service level of each application. The Error Budgets module shows the error budget based on the SLO values of each application to provide guidance for changes or other high-risk operations. For details about the metrics included, see Table 5.
| Module | Metric | Data Source | Metric Definition | Calculation Rule | Statistical Period | Measurement Unit |
|---|---|---|---|---|---|---|
| SLO achievement | Annual Expected SLO Value | SLO management | Expected SLO value of applications in a year | Expected SLO value = Expected SLO value set in the SLO management module. Expected SLO value of multiple applications = Average expected SLO value of applications. | Year | % |
|  | Annual Actual SLO Value | SLO management | Collects the actual SLO achievement of an application in a year. | Actual SLO value in a year = (1 – Annual service unavailability duration/Total application duration in a year) x 100%. Actual SLO value of multiple applications in a region = Average actual SLO value of these applications in a year. Actual SLO value of an application in several regions in a year = Minimum actual SLO value of the application in multiple regions in a year. Actual SLO value of multiple applications in multiple regions = Average actual SLO value of these applications in multiple regions in a year. | Day or month | % |
|  | Applications That Do Not Meet Expectations | SLO management | Collects the number of applications that do not meet SLO expectations. | Calculate the number of applications that fail to achieve the SLO expectation. If all regions are selected and the actual SLO value of an application in any region in a year is less than the annual expected SLO value, the SLO expectation is not met. | Day or month | Count |
|  | Monthly Expected SLO Value | SLO management | Collects the expected SLO achievement of an application in a month. | Expected SLO value = Expected SLO value set in the SLO management module. Expected SLO value of multiple applications = Average expected SLO value of applications. | Day or month | % |
|  | Monthly Actual SLO Value | SLO management | Collects the actual SLO achievement in a month. | Actual SLO value in a month = (1 – Monthly service unavailability duration/Total service duration in a month) x 100%. Actual monthly SLO value of multiple applications in a region = Average actual SLO value of these applications in a month. Actual SLO value of an application in several regions = Minimum actual SLO value of the application in multiple regions in a month. Actual SLO value of multiple applications in multiple regions = Average actual SLO value of these applications in multiple regions in a month. | Day or month | % |
| SLO statistics by application | SLO Statistics by Application | SLO management | Collects SLO statistics by application. | Collect the actual monthly SLO value by application. Actual SLO value in a month = (1 – Monthly service unavailability duration/Total service duration in a month) x 100%. Actual SLO value of an application in several regions in a month = Minimum actual SLO value of the application in multiple regions in a month. | Day or month | % |
| Error budgets | Error Budgets | SLO management | Measures the difference between the actual performance and the expected performance and provides the error budgets. | If the actual SLO value is greater than the expected SLO value: Error budgets = (Actual annual SLO value – Expected annual SLO value) x Total service duration in a year (minutes). If the actual SLO value is less than or equal to the expected SLO value, the error budget is 0. | Day or month | Minute |
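The SLO aggregation rules and the error budget formula in Table 5 can be worked through in Python. The sketch below is one reading of the table, not COC's implementation. Function names are hypothetical, SLO values are expressed as percentages, and two points are assumptions: for multiple applications in multiple regions, each application is first reduced to its minimum across regions and the results are then averaged, and a year is taken as 365 days.

```python
from statistics import mean

MINUTES_PER_YEAR = 365 * 24 * 60  # Assumption: non-leap year.


def actual_slo(unavailable_minutes, total_minutes):
    """Actual SLO as a percentage: (1 - unavailable/total) x 100."""
    return (1 - unavailable_minutes / total_minutes) * 100


def application_slo(slo_by_region):
    """One application in several regions: the minimum regional SLO."""
    return min(slo_by_region.values())


def multi_application_slo(slo_by_app_and_region):
    """Several applications: average of each application's SLO.

    Assumption: each application is reduced to its regional minimum first.
    """
    return mean(application_slo(regions) for regions in slo_by_app_and_region.values())


def error_budget_minutes(actual_pct, expected_pct, total_minutes=MINUTES_PER_YEAR):
    """Remaining error budget in minutes; 0 when the expectation is not exceeded."""
    if actual_pct <= expected_pct:
        return 0.0
    return (actual_pct - expected_pct) / 100 * total_minutes
```

As a worked case, an application with an expected annual SLO of 99.9% and an actual value of 99.95% has (99.95 – 99.9)/100 x 525,600 = 262.8 minutes of error budget left, while an application running at 99.8% has none.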
PRR Dashboard
The PRR dashboard encompasses the review service summary, evaluation radar distribution, service review, and improvement task closure. The review service summary module shows the review phase of each service before the service is put into production and the review status. The evaluation radar distribution module shows the distribution of review items that do not meet service requirements. The service review and improvement module presents the rectification statuses of the items that do not meet the review requirements. For details about the metrics included, see Table 6.
| Module | Metric | Data Source | Metric Definition | Calculation Rule | Statistical Period | Measurement Unit |
|---|---|---|---|---|---|---|
| Service PRR summary | Total Review Services | PRR | Collects the number of services that are included in the PRR. | Collect the total number of services that are covered by the PRR within a selected time range. | Day or month | Count |
|  | Service PRR Summary | PRR | Collects the number of services that are included in each PRR phase and the approval status. | Collect the number of services included in each PRR phase and the approval status within a selected time range. | Day or month | Count |
| Evaluation radar distribution chart | Evaluation Radar Distribution | PRR | Collects the distribution of PRR items that fail to be met. | Collect the number of review items that are not met in a selected time range. | Day or month | Count |
| Service review | Services to Be Reviewed | PRR | Collects the total number of services to be reviewed and the approval status. | Collect the total number of services to be reviewed and the service approval status within a selected time range. | Day or month | Count |
| Closure of improvement tasks | Task Closure Statistics | PRR | Collects the number of improvement tasks and their closure statuses. | Collect the number of improvement tasks and the closure statuses of the tasks within a selected time range. | Day or month | Count |
|  | Improvement Tasks | PRR | Collects the number of improvement tasks in each dimension and their closure statuses. | Collect the number of improvement tasks by review item and the closure statuses of these tasks. | Day or month | Count |