RES07-01 Defining Key Metrics and Thresholds and Monitoring Such Metrics
Before monitoring resources, you need to define key metrics and thresholds to quickly and effectively detect service performance and system status. In this way, you can intervene and rectify faults as soon as possible or locate and fix system defects.
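As a concrete illustration, key metrics and their thresholds can be captured as data and evaluated programmatically. The following is a minimal sketch; the metric names, threshold values, and severity levels are assumptions for illustration, not values from any Huawei Cloud service:

```python
# Illustrative only: metric names and thresholds are assumptions,
# not defaults from Cloud Eye, AOM, or APM.
THRESHOLDS = {
    # metric name: (warning threshold, critical threshold, direction)
    "request_success_rate": (0.99, 0.95, "below"),  # alert when value drops below
    "p99_latency_ms": (500, 1000, "above"),         # alert when value rises above
}

def evaluate(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for one metric reading."""
    warn, crit, direction = THRESHOLDS[metric]
    if direction == "below":
        if value < crit:
            return "critical"
        if value < warn:
            return "warning"
    else:
        if value > crit:
            return "critical"
        if value > warn:
            return "warning"
    return "ok"

print(evaluate("request_success_rate", 0.97))  # warning
print(evaluate("p99_latency_ms", 1200))        # critical
```

Keeping thresholds in one structure like this, rather than scattered across alarm rules, makes them easy to review when the system's baseline performance changes.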
- Risk level
Medium
- Key strategies
- Ensure key metrics are related to key performance indicators (KPIs) of workloads in a system and can be used to identify early warning signals of system performance deterioration. For example, the number of API requests processed by the system and the request success rate reflect system performance issues more accurately than basic metrics such as CPU usage and memory usage.
- Application systems should be monitored from at least three perspectives — business status, service status, and resource status — to ensure availability, effectiveness, and simplicity. Based on your business scale, you can choose Cloud Eye to monitor IaaS services, or Application Operations Management (AOM) or Application Performance Management (APM) to monitor PaaS services. Alternatively, you can use Prometheus, Zabbix, or Zipkin to develop your own monitoring system, or use Grafana to visualize monitoring data and align time series.
1. Business monitoring
The following four golden metrics are summarized based on the experience of monitoring a large number of distributed services and can be used as a reference for business monitoring:
- Latency: Both successful and failed requests have latency. It is vital to track the latencies of successful and failed requests separately, because failed requests can distort the overall latency distribution.
- Traffic: monitors the system service load.
- Error rate: There are explicit failures (such as HTTP 500 errors) and implicit failures (such as an HTTP 200 response with incorrect content). It is vital to differentiate between explicit and implicit failures.
- Saturation: focuses on monitoring bottleneck resources with the most limited capacity in a system.
For Java application systems, Huawei Cloud users can use APM to monitor the latency and error rate based on traces. FunctionGraph and Cloud Service Engine (CSE) can help monitor the traffic, latency, and error rate. Applications that use API Gateway to expose APIs can use the traffic, latency, and error rate monitoring capabilities provided by API Gateway. If the capabilities of cloud services cannot meet system requirements, you can use their open APIs to develop your own monitoring system, or use Zipkin to trace requests and monitor latency and traffic.
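The four golden signals can be derived from raw request records, whatever tool collects them. The sketch below is illustrative; the record fields and sample values are assumptions, and saturation is omitted because it is measured on the bottleneck resource rather than per request:

```python
# Illustrative computation of golden signals from request records;
# the (latency, status, correct) schema is an assumption, not a
# cloud-service log format.
from statistics import median

requests = [
    # (latency in ms, HTTP status, response content considered correct?)
    (120, 200, True), (95, 200, True), (310, 500, True),
    (88, 200, False),  # HTTP 200 but wrong content: an implicit failure
]

traffic = len(requests)  # request count in the sampling window
explicit_errors = sum(1 for _, status, _ in requests if status >= 500)
implicit_errors = sum(1 for _, status, ok in requests if status < 500 and not ok)
error_rate = (explicit_errors + implicit_errors) / traffic

# Compute latency over successful requests only: failed requests often
# return very fast or very slowly and would skew a single distribution.
success_latency = median(lat for lat, status, ok in requests if status < 500 and ok)

print(traffic, error_rate, success_latency)
```

Note that counting only HTTP 5xx responses would miss the implicit failure in the sample, halving the reported error rate.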
2. Service monitoring
Because of the redundancy configuration of service instances and the fault tolerance built into application systems, normal business metrics do not necessarily indicate that every service instance is functioning normally. For example, in a VM cluster behind a load balancer, the load balancer proactively isolates faulty nodes. Requests are then distributed among the healthy nodes, but the overall processing capacity of the application system is still diminished. For this reason, cloud services themselves must also be monitored.
Cloud service metrics vary with functions and features. As function providers, cloud services require metrics such as the latency, traffic, error rate, and utilization. In addition, key reliability events of service instances, such as dynamic scaling, overload control, fault self-healing, and migration, also indicate service robustness. If there is an exception, manual intervention is required. You can use Cloud Eye or develop your own monitoring service to monitor key events.
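Flagging key reliability events for manual intervention can be sketched as a simple filter over an event stream. The event type and status names below are illustrative assumptions, not Cloud Eye event definitions:

```python
# Illustrative event filter; "type" and "status" values are assumptions.
KEY_EVENTS = {"scale_out", "scale_in", "overload_control", "self_healing", "migration"}

def needs_attention(event: dict) -> bool:
    """Flag key reliability events that did not complete successfully."""
    return event["type"] in KEY_EVENTS and event["status"] != "success"

events = [
    {"type": "scale_out", "status": "success"},
    {"type": "self_healing", "status": "failed"},
]
flagged = [e for e in events if needs_attention(e)]
print(flagged)  # only the failed self-healing event
```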
Cloud Eye monitors metrics of IaaS services (such as ECS, EVS, OBS, VPC, ELB and AS), RDS databases, and high-availability middleware (such as DCS and DMS), and allows users to report custom metrics. If you set up a monitoring system by yourself, you can also use the Cloud Eye SDKs to obtain metrics of a cloud service.
AOM monitors key metrics of microservice applications and nodes. You can view key metrics of CCE workloads in CSE, and key metrics of FunctionGraph on the FunctionGraph console.
3. Resource monitoring
Resource monitoring is used to identify resource bottlenecks and analyze system performance issues. Before monitoring application system resources, you need to define key metrics and thresholds of resources to quickly and effectively detect service performance and system status. In this way, you can intervene and rectify faults as soon as possible or locate and fix system defects.
The Utilization, Saturation, and Errors (USE) method monitors each resource along three dimensions:
- Utilization: the proportion of time a resource is busy, covering system resources including but not limited to CPUs, memory, networks, and disks.
- Saturation: the degree to which a resource has queued work it cannot service, such as the CPU run-queue length. This saturation must be distinguished from the saturation golden signal used in business monitoring.
- Errors: resource processing errors, such as the network packet loss rate.
Cloud Eye provides fine-grained monitoring for VMs, and other cloud services also expose utilization and error metrics. If the existing capabilities of cloud services cannot meet system requirements, you can create custom monitoring metrics using Cloud Eye or AOM. If you set up a monitoring system by yourself, the monitoring system must cover host resources, network devices, and third-party components such as Apache, Java, and MySQL. Open-source Zabbix is a common choice.
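Classifying one resource sample along the USE dimensions can be sketched as follows; the thresholds and field names are illustrative assumptions rather than values from any monitoring product:

```python
# A minimal USE-method sketch over a synthetic sample. The 80%
# utilization threshold and run-queue comparison are assumptions.
def use_report(util: float, queue_len: int, errors: int, cores: int = 4) -> dict:
    """Classify one resource sample along the USE dimensions."""
    return {
        # Utilization: fraction of time the resource was busy
        "utilization_high": util > 0.80,
        # Saturation: queued work beyond capacity, e.g. a CPU run-queue
        # longer than the number of cores
        "saturated": queue_len > cores,
        # Errors: any resource-level error, e.g. dropped network packets
        "has_errors": errors > 0,
    }

print(use_report(util=0.92, queue_len=6, errors=0))
```

Applying the same three questions to every resource type, rather than only to CPUs, is what makes the USE method useful for locating bottlenecks quickly.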
- Related cloud services and tools
- Cloud Eye
- Application Operations Management (AOM)
- Application Performance Management (APM)