Overview

A service level objective (SLO) is a widely used quantitative standard for measuring the quality of services and applications. It gives service and technical teams a unified, measurable benchmark for evaluating service quality, so that service capabilities stay aligned with service requirements.

The actual SLO value reflects service stability in terms of availability. It is calculated as follows: Actual SLO value = (1 – Application unavailability duration/Total application duration) × 100%. The application unavailability duration is the accumulated time during which the service cannot properly respond to service requests (excluding pre-registered planned downtime). The total application duration is the full length of the statistical period (such as a day, a week, or a month). Both durations must use the same unit. For example, if an application is unavailable for 10 minutes in a day (1,440 minutes), the actual SLO value is (1 – 10/1,440) × 100% ≈ 99.31%. A larger value indicates higher service availability and a lower risk of service interruption.
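The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not part of COC: the function name actual_slo and its parameters are hypothetical, and both durations are assumed to be given in minutes with planned downtime already excluded.

```python
def actual_slo(unavailable_minutes: float, total_minutes: float) -> float:
    """Return the actual SLO value as a percentage.

    Both arguments are assumed to be in the same unit (minutes here), and
    planned downtime is assumed to be excluded from unavailable_minutes.
    """
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= unavailable_minutes <= total_minutes:
        raise ValueError("unavailable_minutes must be between 0 and total_minutes")
    # (1 - unavailable / total) x 100%
    return (1 - unavailable_minutes / total_minutes) * 100


# 10 minutes of unavailability in one day (1,440 minutes)
print(f"{actual_slo(10, 1440):.2f}%")  # prints 99.31%
```

Taking the ratio first and multiplying by 100 only at the end keeps the result a percentage between 0 and 100.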

In the O&M management system of COC, three types of core O&M service tickets are directly used for SLO calculation: war rooms, alarm service tickets, and incident tickets with specific attributes. The impact logic is as follows:

War rooms:
A war room is a service ticket that COC starts for major service faults. When a large number of services are unavailable or core service links are interrupted, a war room can be created to trigger a cross-team collaborative response. A fault that requires a war room usually causes prolonged application unavailability, and the application unavailability duration is the most significant factor affecting the SLO. For example, when a core service application is inaccessible to a large number of users because of a server cluster fault that lasts 2 hours, the O&M team creates a war room to start emergency response. If the statistical period is one day, the actual SLO value drops to (1 – 2/24) × 100% ≈ 91.67%, which directly lowers the overall service quality.
Alarm service tickets:
An alarm service ticket is a warning ticket that COC triggers when a monitoring metric exceeds its threshold. It covers various service exception scenarios, such as high CPU usage, memory overflow, and excessive network latency. Not all alarm service tickets affect the SLO. The alarm duration is counted toward the application unavailability duration only when the exception reported by the alarm prevents the service from functioning properly (that is, the service is unavailable). For example, if an alarm is triggered because the maximum number of database connections is reached and user requests cannot be processed for 30 minutes, those 30 minutes are counted toward the application unavailability duration and lower the SLO. If an alarm reports that the disk usage is close to the threshold but the service response is not affected, the SLO is not affected.
Incident tickets with the Service Interrupted attribute set to Yes:
An incident ticket is a basic service ticket that COC uses to record the various running incidents of a service. The Service Interrupted attribute is the core basis for determining whether the ticket affects the SLO. If this attribute is set to Yes, the incident has made a service function unavailable (for example, user login failure or order submission failure). In this case, the duration from the time when the incident is reported to the time when it is resolved is used as the application unavailability duration. If this attribute is set to No (for example, the backend service logs are abnormal but the frontend functions are not affected), the duration is recorded only for O&M purposes and is not used to calculate the SLO.
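The rules for the three ticket types can be sketched as a small filter that decides which ticket durations count toward the application unavailability duration. This is only an illustration under stated assumptions, not the COC implementation: the Ticket class, its field names, and the kind labels are hypothetical, war rooms are assumed to always count, and tickets are assumed not to overlap in time (overlapping tickets would be double-counted by this simple sum).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Ticket:
    kind: str                  # hypothetical labels: "war_room", "alarm", "incident"
    start: datetime            # e.g. time the incident was reported
    end: datetime              # e.g. time the incident was resolved
    service_unavailable: bool  # for incidents: Service Interrupted = Yes


def counts_toward_slo(ticket: Ticket) -> bool:
    # Assumption: a war room always represents service unavailability.
    if ticket.kind == "war_room":
        return True
    # Alarms and incidents count only if the service was actually unavailable.
    return ticket.service_unavailable


def unavailable_duration(tickets: list[Ticket]) -> timedelta:
    # Assumption: tickets do not overlap, so durations can simply be summed.
    return sum(
        (t.end - t.start for t in tickets if counts_toward_slo(t)),
        timedelta(),
    )


day = datetime(2024, 1, 1)
tickets = [
    # Database connections exhausted, requests fail for 30 minutes: counted.
    Ticket("alarm", day.replace(hour=9), day.replace(hour=9, minute=30), True),
    # Disk usage near threshold, service unaffected: not counted.
    Ticket("alarm", day.replace(hour=11), day.replace(hour=12), False),
    # Incident with Service Interrupted = Yes, 20 minutes: counted.
    Ticket("incident", day.replace(hour=15), day.replace(hour=15, minute=20), True),
    # Incident with Service Interrupted = No (abnormal backend logs): not counted.
    Ticket("incident", day.replace(hour=18), day.replace(hour=19), False),
]

unavailable = unavailable_duration(tickets)
period = timedelta(days=1)
slo = (1 - unavailable / period) * 100
print(unavailable, f"{slo:.2f}%")
```

In this example only the 30-minute alarm and the 20-minute incident are counted, so 50 of the 1,440 minutes in the day are treated as unavailable.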