Fault Management Overview
COC fault management helps you quickly demarcate, locate, and recover from faults. It ingests alarms from multiple sources, aggregates the raw alarms, reduces alarm noise, and then converts the resulting alarms into incidents or aggregated alarms. Faults reported by alarms or incidents can be quickly demarcated using the application topology diagnosis tool or war rooms, and then rectified using online response plans, which shortens the MTTR. All faults and their handling processes are reviewed to improve services. COC also continuously builds up a fault management O&M knowledge base, which improves risk resistance.
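The aggregation and noise reduction step can be pictured with a small sketch. This is not COC's implementation: the `RawAlarm` fields, the grouping key, and the 300-second window are all assumptions chosen for illustration. The sketch groups raw alarms that share the same source, resource, and name, as long as they arrive within a fixed window of the first alarm in the group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawAlarm:
    # All field names are hypothetical, not COC's actual alarm schema.
    source: str       # monitoring system, e.g. "Prometheus"
    resource: str     # identifier of the affected resource
    name: str         # alarm name
    timestamp: float  # arrival time in seconds


def aggregate(raw_alarms, window_seconds=300.0):
    """Group alarms with the same (source, resource, name) key.

    An alarm joins the current group for its key if it arrives within
    window_seconds of that group's first alarm; otherwise it starts a
    new group. Returns a list of groups (lists of RawAlarm).
    """
    groups = []
    current = {}  # key -> index of the most recent group for that key
    for alarm in sorted(raw_alarms, key=lambda a: a.timestamp):
        key = (alarm.source, alarm.resource, alarm.name)
        idx = current.get(key)
        if idx is not None and alarm.timestamp - groups[idx][0].timestamp <= window_seconds:
            groups[idx].append(alarm)
        else:
            current[key] = len(groups)
            groups.append([alarm])
    return groups
```

Each returned group stands for one aggregated alarm, so repeated notifications about the same condition collapse into a single item to handle.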

Core Features
- Monitoring system integration: Alarm data from multiple monitoring platforms are ingested into COC for central management of raw alarms. Currently, Huawei Cloud Eye, AOM, APM, LTS, Alibaba Cloud CloudMonitor, Alibaba Cloud Simple Log Service (SLS), Prometheus, Grafana, Zabbix, and user-defined service monitoring systems (connected through open APIs) are supported.
- Alarm conversion rules: Before you can use alarm conversion rules, alarm data sources must already be connected through the integration management module. An alarm conversion rule converts raw alarms into aggregated alarms or incident tickets in COC based on configuration items such as triggering conditions and triggering rules. You can assign owners to aggregated alarms or incident tickets, and preset response plans for faults.
- Alarm management: displays raw alarms and aggregated alarms, and allows you to perform operations on aggregated alarms, including clearing alarms, converting alarms to incidents, and executing response plans.
- Incident management: manages the entire lifecycle of incident tickets, including manually creating, accepting, rejecting, transferring, handling, escalating, and downgrading incident tickets, and starting a war room.
- War room: This module applies to scenarios where a major fault occurs and personnel in different roles need to be gathered quickly to locate and rectify it. The war room page brings together the affected applications, related alarms, incidents, change information, and recovery progress notifications. You can execute response plans, diagnose applications, and start communication groups in mainstream third-party OA software.
- Issue management: You can discover, record, and resolve product function defects and poor performance during software product use.
- Improvement ticket management: You can monitor and close product, O&M, and management improvement items identified during troubleshooting through online improvement tickets.
- Fault diagnosis: You can use quick diagnosis tools to check the status of ECS, RDS DB, DCS, DMS, and ELB instances with only a few clicks. The tools detect potential problems in a timely manner and provide professional rectification suggestions and solutions for abnormal metrics.
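To make the alarm conversion rule feature above more concrete, here is a minimal sketch of how such a rule might be evaluated. It is an illustration under stated assumptions, not the COC rule engine or API: the `ConversionRule` fields, the numeric severity scale, and the occurrence threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConversionRule:
    # Every field here is a hypothetical stand-in for a COC configuration item.
    sources: set                         # data sources the rule applies to
    min_severity: int                    # triggering condition (higher = more severe)
    incident_threshold: int              # occurrences needed to open an incident
    owner: str                           # owner assigned to the result
    response_plan: Optional[str] = None  # preset response plan, if any


def convert(rule, source, severity, count):
    """Return the converted item, or None if the rule does not trigger."""
    if source not in rule.sources or severity < rule.min_severity:
        return None
    kind = "incident" if count >= rule.incident_threshold else "aggregated_alarm"
    return {"kind": kind, "owner": rule.owner, "response_plan": rule.response_plan}


rule = ConversionRule(
    sources={"Prometheus"},
    min_severity=2,
    incident_threshold=3,
    owner="oncall-team",
    response_plan="restart-service",
)
result = convert(rule, "Prometheus", severity=3, count=5)
```

In this sketch, an alarm that matches the rule but occurs fewer times than the threshold becomes an aggregated alarm, while a frequent one becomes an incident ticket. Both carry the owner and response plan preset on the rule.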