Functions
This section describes the main functions of Cloud Operations Center (COC).
Overview
The COC overview page contains multiple modules, including the O&M probability, resource dashboard, resource monitoring, security overview, quick configuration center, and O&M BI. You can view and perform operations on work items with ease on the overview page, enjoying simplified and highly efficient O&M. For more information, see Overview.
Resource Management
In the Information Technology Infrastructure Library (ITIL) process, the infrastructure resource-oriented management approach can cause problems such as data isolation and information inconsistency between O&M services. The resource management module of COC can centrally manage core resources of Huawei Cloud and other clouds and offline IDC resources, quickly providing accurate and consistent resource configuration data for features such as change management and batch O&M. COC leverages the following mechanisms to implement unified resource management:
- Resource discovery and identification: COC can automatically discover and identify resources on Huawei Cloud, other vendors' clouds, and offline IDCs, and manage them centrally.
- Resource monitoring and management: Through a unified monitoring page, O&M engineers can monitor resource usage in real time and adjust it dynamically.
- Data synchronization and consistency: COC supports data synchronization to ensure data consistency and accuracy between O&M services.
For more information, see Resource Management Overview.
Application Management
COC provides an application-centric resource management view built on the ability to model the association between applications and resources. With this feature, you can manage your resources by application, region, resource group, or resource model, query resources in a resource list by tag, and install the UniAgent components. You can also manage resources by group and manage the relationship between cloud service objects and applications. The management scope includes core resources of Huawei Cloud, other clouds (currently, Alibaba Cloud, AWS, and Azure are supported), and on-premises IDC resources. Application management provides unified and reliable resource group information for functions such as chaos drills, change management, and account management.
For more information, see Application Management Overview.
Batch Resource Processing
COC delivers the batch resource operation capability that allows you to centrally manage multiple types of resources, such as Elastic Cloud Servers (ECSs), Relational Database Service (RDS) DB instances, FlexusL instances, and Bare Metal Servers (BMSs). It supports a variety of operation scenarios, including batch start, stop, and restart, OS reinstallation, and OS change, meeting resource operation requirements in different O&M phases. For more information, see Batch Resource Processing Overview.
Script Management
The script management function of COC is a core tool for O&M automation. It provides efficient and accurate solutions for complex or repetitive O&M tasks. With script execution tools, you do not need to perform a large number of complex manual operations, configure devices one by one, or execute the same tasks repeatedly. Instead, you can create scripts that complete these tasks in one go. This greatly shortens task handling time and helps avoid human misoperations, improving the efficiency and accuracy of O&M. You can create, modify, and delete scripts, and execute your own scripts and public scripts on VMs. For more information, see Script Management Overview.
Job Management
Job management is a core tool for operation automation. It orchestrates atomic actions (such as restarting instances and executing scripts) in a structured process to form a reusable, manageable, and standard operation set, which is called a job. The core capabilities include job lifecycle management and cross-instance batch execution. It aims to help you efficiently complete repeated operations, reduce manual error risks, and implement standardized and version-based management of operation processes.
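The idea of orchestrating atomic actions into a reusable job can be sketched as follows. This is a minimal illustration, not the COC API: the `Job` class, the `restart_instance` and `run_health_script` functions, and the instance IDs are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Job:
    """A named, versioned sequence of atomic actions (hypothetical model)."""
    name: str
    version: int
    steps: List[Callable[[str], str]] = field(default_factory=list)

    def run(self, instance_ids: List[str]) -> Dict[str, List[str]]:
        """Run every step, in order, against each instance in the batch."""
        results: Dict[str, List[str]] = {}
        for instance_id in instance_ids:
            results[instance_id] = [step(instance_id) for step in self.steps]
        return results


def restart_instance(instance_id: str) -> str:
    # Placeholder atomic action; a real one would call a cloud API.
    return f"restarted {instance_id}"


def run_health_script(instance_id: str) -> str:
    # Placeholder atomic action; a real one would execute a script on the host.
    return f"health check passed on {instance_id}"


job = Job(name="restart-and-verify", version=1,
          steps=[restart_instance, run_health_script])
report = job.run(["ecs-001", "ecs-002"])
```

Keeping the step list and version on the job object is what makes the operation set reusable and reviewable: the same definition can be run against any batch of instances.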
For more information, see Job Management Overview.
Scheduled O&M
Scheduled O&M is an important module of COC for automatic scheduling of O&M tasks. This module clearly displays scheduled task details (such as the task name, type, execution time, and status) and task execution records (including the execution time, result, and logs). You can create scheduled tasks and manage them, such as modifying, pausing, enabling, and deleting tasks.
For more information, see Scheduled O&M Overview.
Account Management
You can centrally manage human-machine accounts of Huawei Cloud ECSs, RDS DB instances, GaussDB instances, and middleware. We collect multiple accounts in one place to avoid risks like forgetting passwords or having them leaked. You can get host passwords using account management. With security controls, you can log in to Linux hosts and run commands without entering passwords.
For more information, see Account Management Overview.
Parameter Center
The parameter center provides secure and reliable parameter storage and full-lifecycle management and control through centralized, standardized management, resolving pain points such as scattered data, security risks, and complex referencing. You can manage parameters in regions throughout the whole service lifecycle and continuously monitor their correctness and consistency. You can also quickly reference parameters in O&M scenarios such as job orchestration.
For more information, see Parameter Center Overview.
OS Version Change
OS version change is a functional module that focuses on host OS upgrade management in COC. It provides convenient and efficient OS version change capabilities for hosts. With this function, you can easily create an OS version change task to upgrade multiple hosts in batches, greatly improving the OS upgrade efficiency.
For more information, see OS Version Change Overview.
Fault Management
COC fault management provides quick fault demarcation, locating, and recovery. It supports ingestion of alarms from multiple sources. COC aggregates raw alarms, performs noise reduction on them, and then converts them to incidents or aggregated alarms. Faults reported by alarms or incidents are quickly demarcated through the application topology diagnosis tool or war rooms, and then swiftly rectified based on online response plans, shortening the MTTR. All faults and their handling processes are reviewed for service improvement. In addition, fault management continuously builds up the O&M knowledge base and improves risk resistance.
| Module | Description | Operation Guide |
|---|---|---|
| Alarm Management | You can collect, aggregate, and convert alarm data, and configure and manage alarm rules. | |
| Incident Management | The incident management module manages all incidents of applications, including incident acceptance and rejection, ticket conversion, processing, and closing. Incidents can be generated based on alarm conversion rules, created by users, or created based on alarms. | |
| War Room | When there is a major or critical fault, a war room can be set up to quickly convene experts such as fault analysis members and application SRE engineers to rectify the fault. This improves the efficiency of collaborative communication, fault diagnosis and demarcation, and fault handling. War rooms also enable you to quickly detect and respond to incidents, shortening the MTTR. | |
| Improvement Ticket Management | Improvement ticket management is the process of tracking and closing improvement tickets for product, O&M, or management issues found during incident or war room handling, or during drills. | |
| Issue Ticket Management | Issue management is the process of identifying issues such as product function defects and poor performance found while software products are in use, recording their root causes, and resolving them. It is mainly used to reduce the number of product or service faults on the live network. This improves overall service quality, promotes the continuous improvement of product or application quality, and prevents issues from recurring. | |
| Alarm Conversion Rules | Alarm conversion rules suppress, denoise, deduplicate, and route all received raw alarms. Vertical suppression and horizontal convergence across multiple monitoring sources are supported for multi-dimensional noise reduction. When configuring an incident forwarding rule, you can specify default objects for assigning incidents and configure a notification policy for accurate notification. | |
| Data Source Management | Data source management provides an easy and quick way to interconnect COC with existing and third-party monitoring systems, such as Huawei Cloud Eye, AOM, and other monitoring tools. Its core value is to centrally collect the alarm information of a service that is scattered across different monitoring systems, preventing the monitoring blind spots and management complexity caused by alarm data spread over different platforms. | |
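The aggregation and deduplication of raw alarms described above can be illustrated with a minimal sketch. It assumes that alarms for the same resource and alarm name are duplicates, and that only groups at or above a severity threshold become incidents. The `RawAlarm` fields, the severity scale, and the source names are hypothetical and do not reflect the actual COC rule model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical severity scale, lowest to highest.
SEVERITY_RANK = {"info": 0, "minor": 1, "major": 2, "critical": 3}


@dataclass(frozen=True)
class RawAlarm:
    source: str    # monitoring system that raised the alarm
    resource: str  # affected resource
    name: str      # alarm name
    severity: str  # one of SEVERITY_RANK's keys


def aggregate(raw_alarms: List[RawAlarm]) -> Dict[Tuple[str, str], dict]:
    """Collapse alarms with the same (resource, name) into one aggregated entry."""
    aggregated: Dict[Tuple[str, str], dict] = {}
    for alarm in raw_alarms:
        key = (alarm.resource, alarm.name)
        entry = aggregated.setdefault(
            key, {"count": 0, "sources": set(), "severity": alarm.severity})
        entry["count"] += 1
        entry["sources"].add(alarm.source)
        # Keep the highest severity seen for this group.
        if SEVERITY_RANK[alarm.severity] > SEVERITY_RANK[entry["severity"]]:
            entry["severity"] = alarm.severity
    return aggregated


def select_incidents(aggregated: Dict[Tuple[str, str], dict],
                     min_severity: str = "major") -> List[Tuple[str, str]]:
    """Return the groups severe enough to be converted to incidents."""
    threshold = SEVERITY_RANK[min_severity]
    return [key for key, entry in aggregated.items()
            if SEVERITY_RANK[entry["severity"]] >= threshold]


raw = [
    RawAlarm("monitor-a", "ecs-001", "cpu_high", "minor"),
    RawAlarm("monitor-b", "ecs-001", "cpu_high", "critical"),
    RawAlarm("monitor-a", "rds-007", "disk_full", "minor"),
]
groups = aggregate(raw)
incidents = select_incidents(groups)
```

In this sample, the two `cpu_high` alarms from different sources collapse into one group that is promoted to an incident, while the single minor `disk_full` alarm stays an aggregated alarm.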
Change Management
Change management is the core module for ensuring secure and orderly O&M operations. Its core function is to build safe production capabilities covering the entire lifecycle of O&M operations. This module uses systematic process design and multi-level risk control mechanisms to accurately identify potential risks and develop countermeasures in advance. This effectively reduces risks during change operations and provides solid assurance for the stable running of the O&M system. This module manages the core services of the change process. It integrates key capabilities such as change calendar, change center, change configuration, and change control. These capabilities work together to form a closed-loop change management system including planning, execution, configuration, and monitoring.
For more information, see Change Management Overview.
Chaos Drills
COC allows you to perform automatic chaos drills covering risk identification, emergency plan management, fault injection, and review and improvement. Drawing on years of Huawei Cloud SRE best practices in chaos drills, COC helps you proactively identify, mitigate, and verify risks of cloud applications, improving their resilience.
For more information, see Chaos Drill Overview.
To-Do Center
The to-do center records and tracks daily to-do tasks and reminds you of them.
In the COC to-do center, you can create a to-do task and assign it to a specified engineer for processing. You can set the deadline and enter the recommended solution for the to-do task. After the to-do task is created, the owner can be notified by SMS messages or emails.
For more information, see To-Do Center Overview.
Personnel Management
You can centrally manage O&M engineers on COC using this feature. On the page, you can manage users who log in through different login methods, including IAM users, IAM federated users, and IAM Identity Center users. User data on this page serves as the basic user data of COC, which authorized users draw on in basic functional modules such as to-do task creation, scheduled O&M, notification management, and incident center.
For more information, see Personnel Management Overview.
Shift Management
You can set up unified, multi-dimensional, and multi-form personnel management on COC to suit your own requirements. This function is widely used in scenarios that involve owners, such as service review and service ticket transfer. On the shift schedule management page, you can create and manage shift scenarios and roles, and add personnel managed on the O&M Personnel Management page to them as required.
- When you need to configure or obtain O&M engineers in a shift, go to the shift management page to configure or query a shift.
- Created shifts can be directly used to configure personnel parameters when using O&M service modules such as alarm conversion rules, incident center, automated O&M, notification management, and change ticket management.
For more information, see Shift Management Overview.
Notification Management
You can use notification templates for changes, incidents, issues, and alarms with various notification modes in different service scenarios and process phases. You can subscribe to notifications as required to avoid missing important information. When an incident ticket, issue ticket, alarm ticket, or change ticket is generated, it is matched against the corresponding notification rules. The system then parses the matched rules to obtain the recipients, notification content, and notification method, and sends the corresponding notifications. Notifications are classified into incident, issue, change, and alarm notifications.
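The matching flow described above (ticket generated, rules matched, recipients and content resolved, notification sent) can be sketched as follows. This is an illustrative model only: the `NotificationRule` fields, the ticket dictionary keys, and the sample addresses are hypothetical and are not taken from the COC API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NotificationRule:
    ticket_type: str            # "incident", "issue", "alarm", or "change"
    conditions: Dict[str, str]  # field values the ticket must have to match
    recipients: List[str]
    method: str                 # for example "sms" or "email"
    template: str               # content template with {field} placeholders


def build_notifications(ticket: Dict[str, str],
                        rules: List[NotificationRule]) -> List[Dict[str, str]]:
    """Match a ticket against the rules and resolve who gets what, and how."""
    notifications = []
    for rule in rules:
        if rule.ticket_type != ticket["type"]:
            continue
        # Every condition on the rule must equal the ticket's field value.
        if any(ticket.get(key) != value for key, value in rule.conditions.items()):
            continue
        content = rule.template.format(**ticket)
        for recipient in rule.recipients:
            notifications.append(
                {"to": recipient, "method": rule.method, "content": content})
    return notifications


rules = [
    NotificationRule("incident", {"level": "P1"}, ["oncall@example.com"],
                     "email", "Incident {id} ({level}) on {application}"),
    NotificationRule("change", {}, ["change-board@example.com"],
                     "email", "Change {id} created"),
]
ticket = {"type": "incident", "id": "INC-1", "level": "P1", "application": "shop"}
sent = build_notifications(ticket, rules)
```

Only the first rule matches the sample ticket, so a single email notification is produced for the on-call recipient.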
For more information, see Notification Management.
Mobile Application Management
You can manage configurations of third-party mobile apps and configure parameters for a war room when an incident requires the war room on a third-party mobile app. For more information, see Mobile App Management.
SLA Management
Service Level Agreement (SLA) is generally used to measure the service quality in the industry. It defines the quality standard, delivery method, and acceptable performance level of a service. The SLA management function of COC provides the service ticket validity period management capability. When a service ticket triggers an SLA rule, COC records the SLA trigger details for the service ticket and notifies the corresponding users to follow up and handle the service ticket in a timely manner.
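As a rough illustration of how an SLA rule might be triggered, the sketch below assumes the rule is a simple time limit on how long a service ticket may stay open. The `check_sla` function and the shape of the returned trigger details are hypothetical; the actual COC rule configuration may differ.

```python
from datetime import datetime, timedelta
from typing import Optional


def check_sla(created_at: datetime, now: datetime,
              time_limit: timedelta) -> Optional[dict]:
    """Return trigger details if the ticket has been open longer than time_limit."""
    elapsed = now - created_at
    if elapsed <= time_limit:
        return None  # still within the allowed period, nothing to record
    return {"triggered_at": now, "overdue_by": elapsed - time_limit}


created = datetime(2024, 1, 1, 10, 0)
on_time = check_sla(created, datetime(2024, 1, 1, 10, 45), timedelta(hours=1))
breached = check_sla(created, datetime(2024, 1, 1, 11, 30), timedelta(hours=1))
```

The recorded details (here, when the rule fired and by how much the limit was exceeded) are what a follow-up notification to the ticket owner would be based on.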
For more information, see SLA Management.
SLO Management
As a core performance metric widely recognized in the industry, service level objective (SLO) is a key quantitative standard for measuring the quality of services and applications. The core value of the SLO is to provide a unified and measurable service quality evaluation benchmark for service and technical teams, ensuring that service capabilities are aligned with service requirements.
For more information, see SLO Management.
Process Management
You can customize the incident process, issue process, and change scenario. You can use the customized process management configuration for the fault management and change management modules as needed.
For more information, see Process Management.
Report Subscription
The report subscription function is used by O&M personnel to collect O&M data and report service statuses. It provides automatic and periodic O&M data statistics reports. This feature addresses the issues of inefficiency in traditional manual collection and sorting of O&M data, as well as the high labor costs associated with statistical analysis.
The report data comes from the O&M BI dashboards delivered by COC. When creating a subscription report, you can set subscription parameters such as the report sending frequency, report content, and recipients. Then, recipients can periodically receive the subscribed report in their email addresses. You can also view and download historical reports on the report subscription page.
For more information, see Subscribing to a Report.