Availability and SLO

Availability is a comprehensive measure of a system's reliability and resilience.

Availability is the percentage of time (for example, 99.9%) that services are up over a given period, typically a month or a year.

Availability = (Time available for use / Total time) × 100%

Availability is typically represented by a number of 9s or a combination of 9s and 5s. For instance, three 9s indicate an availability of 99.9%, while three 9s and one 5 indicate an availability of 99.95%.

The system availability objective is defined by a service level objective (SLO). Different application systems have different availability objectives. Defining the objectives is essential for assessing the resilience of an application system. The following table describes the SLOs of common IT systems:

SLO	Maximum Unavailability (per Year)	Typical IT Services
99%	3.65 days	Batch processing, background tasks, and data extraction
99.9%	8.76 hours	Internal knowledge management systems and project tracking systems
99.95%	4.38 hours	Customer account management services and information management services
99.99%	52.56 minutes	E-commerce, B2B web services, and heavy-traffic media/content websites
99.999%	5.26 minutes	Banking, investment, finance, government, telecoms, and critical enterprise applications
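The downtime figures in the table follow directly from the availability formula. As a minimal sketch, assuming a 365-day year (which is what the table's figures correspond to), the yearly downtime budget for an SLO can be computed as shown here. The function name `max_downtime_hours` is illustrative only.

```python
# Sketch: convert an availability SLO into a yearly downtime budget.
# Assumption: a year is 365 days, matching the figures in the table.

HOURS_PER_YEAR = 365 * 24  # 8760 hours


def max_downtime_hours(slo_percent: float) -> float:
    """Return the maximum unavailability per year, in hours, for an SLO."""
    return (1 - slo_percent / 100) * HOURS_PER_YEAR


for slo in (99.0, 99.9, 99.95, 99.99, 99.999):
    hours = max_downtime_hours(slo)
    print(f"{slo}% -> {hours:.2f} hours ({hours * 60:.2f} minutes)")
```

For a monthly objective, the same calculation applies with the number of hours in the month in place of `HOURS_PER_YEAR`.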

The system availability depends on the availability of each service unit in the system. There are two typical reliability models for service units:

Serial model: If any unit fails, the entire system fails.

Mathematical model for reliability: Rs = R1 × R2 × … × Rn, where Ri is the availability of unit i and n is the number of units.

For example, if a system consists of two units connected in series, each with an availability of 99.9%, the system's overall availability is calculated as follows: Rs = 99.9% × 99.9% ≈ 99.8%.

The availability of the serial system is lower than that of a single unit in the system. To improve the system availability:
- Minimize the number of units in the system.
- Improve the availability of each unit. Reduce their failure rates as much as possible.
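The effect of adding units in series can be checked numerically. The following is a minimal sketch that assumes unit failures are independent, so that the system availability is the product of the unit availabilities. The function name `serial_availability` is illustrative only.

```python
from math import prod


def serial_availability(unit_availabilities):
    """Availability of units in series: the product of the unit availabilities.

    Assumption: units fail independently of each other.
    """
    return prod(unit_availabilities)


# Two units at 99.9% each, as in the example above.
two_units = serial_availability([0.999, 0.999])
# Five units at 99.9% each: every extra unit lowers the overall result.
five_units = serial_availability([0.999] * 5)

print(f"2 units in series: {two_units:.4%}")
print(f"5 units in series: {five_units:.4%}")
```

With five such units the overall availability drops to roughly 99.5%, which is why reducing the number of serial units matters as much as improving each one.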

Parallel model: The system fails only if all of its units fail.

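Mathematical model for reliability: Rs = 1 – (1 – R1) × (1 – R2) × … × (1 – Rn), where Ri is the availability of unit i and n is the number of units.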

For example, if a system consists of two units connected in parallel, each with an availability of 99.9%, the system's overall availability is calculated as follows: Rs = 1 – (1 – 99.9%) × (1 – 99.9%) = 99.9999%.

Parallel connections can greatly enhance system availability. Common parallel technologies include active/standby, clusters, active-active, and multi-active setups.
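The parallel calculation can be sketched in the same way. This example assumes that the redundant units fail independently. In practice, shared dependencies (such as a common power supply or network) can make failures correlated, and the real figure would then be lower. The function name `parallel_availability` is illustrative only.

```python
from math import prod


def parallel_availability(unit_availabilities):
    """Availability of redundant units in parallel.

    The system is down only when every unit is down at the same time.
    Assumption: units fail independently of each other.
    """
    return 1 - prod(1 - a for a in unit_availabilities)


# Two units at 99.9% each, as in the example above.
print(f"2 x 99.9% in parallel: {parallel_availability([0.999, 0.999]):.4%}")
# Even two modest 99% units reach four 9s when deployed redundantly.
print(f"2 x 99% in parallel:   {parallel_availability([0.99, 0.99]):.4%}")
```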

To achieve an application system's availability objectives, you need to systematically define the availability requirements for its components and dependent components.

Availability requirements for dependent components: The SLO of a key dependent component should be one 9 higher than that of the services that depend on it. For instance, if an application system's SLO is 99.9%, its key dependent components must achieve an SLO of 99.99%.
Breakdown of the application system SLO: Break down the system SLO, fault frequency, and cloud service SLA to determine how long application components can tolerate an interruption. Then, further break down these requirements to establish fault detection, manual intervention, and recovery time targets.
Enhancement of weak links in the application system:
- If the cloud service SLA is insufficient, strengthen protection and enhancements at the application layer.
- Enhance availability with component redundancy and fallback mechanisms. For example, set up an ELB cluster to add redundancy for ELB, and configure the system to fall back temporarily to SMN if there is ever a problem with DMS access.
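The SLO breakdown described above can be illustrated with a rough calculation. The sketch below is one possible approach, not a prescribed method. It assumes that the downtime permitted by the cloud service SLA is subtracted from the system's yearly downtime budget, that the remainder is spread evenly over the expected number of faults, and that each fault's allowance is split by fixed shares. The fault frequency (4 per year), the split ratios, and all function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def downtime_budget_minutes(availability_percent):
    """Yearly downtime, in minutes, permitted by an availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


def per_fault_targets(system_slo, cloud_sla, faults_per_year, split):
    """Split the tolerable interruption per fault into phase targets (minutes)."""
    # Budget left for the application after reserving what the cloud SLA allows.
    app_budget = downtime_budget_minutes(system_slo) - downtime_budget_minutes(cloud_sla)
    if app_budget <= 0:
        # The cloud SLA alone consumes the budget: the application layer
        # needs extra protection, such as redundancy or a fallback path.
        raise ValueError("cloud service SLA is insufficient for this system SLO")
    per_fault = app_budget / faults_per_year
    return {phase: per_fault * share for phase, share in split.items()}


targets = per_fault_targets(
    system_slo=99.9,
    cloud_sla=99.95,
    faults_per_year=4,  # hypothetical fault frequency
    split={"detection": 0.2, "manual_intervention": 0.3, "recovery": 0.5},  # hypothetical shares
)
for phase, minutes in targets.items():
    print(f"{phase}: {minutes:.1f} min")
```

Under these assumptions, each fault can interrupt the service for about 65.7 minutes in total. When the function raises an error, the cloud service SLA by itself cannot satisfy the system SLO, which is the case where the application layer has to be strengthened.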