Concepts

Concept	Description
Resilience	Resilience is a measure of how well a system withstands failures and recovers from them, remaining in a known operational state (possibly a degraded one) when faults occur. It also indicates how likely it is that core functions and data can be restored so that services recover.
Reliability	Reliability is the probability that a product performs its required functions for a specified period of time under predefined conditions.
Availability	Availability is the percentage of time that a product is available for use.
Service-level indicator (SLI)	An SLI is a quantitative measurement of a particular aspect of a service's performance. You can use an SLI to measure your service's response to requests.
Service-level objective (SLO)	An SLO is a target or goal set for an SLI. For example, an SLO might specify that the request response success rate should be greater than xx% within a certain period of time, or the percentage of uptime should be higher than xx%.
Service-level agreement (SLA)	An SLA is a formal contract between a service provider and its customers that defines expected service performance, including penalties or compensations if the agreed-upon levels are not met.
Recovery point objective (RPO)	RPO is the maximum acceptable amount of data loss in the event of an incident, expressed as a length of time: the period before the incident for which data may be lost.
Recovery time objective (RTO)	RTO is the maximum acceptable length of time that an application can be unavailable after an incident. In other words, it is the upper limit on the time allowed for a system to recover from a fault.
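
The measurable concepts in the table above (SLI, SLO, availability, RPO, and RTO) can be illustrated with a minimal Python sketch. The `Window` type, the function names, and all numeric values are hypothetical and chosen only for illustration. The sketch also assumes that a window with no traffic counts as fully successful and that an SLO is met when the SLI is strictly greater than the target.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Hypothetical measurements collected for one service over one period."""
    total_requests: int
    successful_requests: int
    total_seconds: float
    downtime_seconds: float


def success_rate_sli(w: Window) -> float:
    """SLI: fraction of requests that succeeded during the window."""
    if w.total_requests == 0:
        return 1.0  # assumption: no traffic means nothing failed
    return w.successful_requests / w.total_requests


def availability(w: Window) -> float:
    """Availability: fraction of the window in which the service was usable."""
    return (w.total_seconds - w.downtime_seconds) / w.total_seconds


def meets_slo(sli_value: float, target: float) -> bool:
    """SLO check: the measured SLI must be greater than the target."""
    return sli_value > target


def meets_recovery_objectives(data_loss_seconds: float, outage_seconds: float,
                              rpo_seconds: float, rto_seconds: float) -> bool:
    """An incident is within objectives if the span of lost data stays within
    the RPO and the outage duration stays within the RTO."""
    return data_loss_seconds <= rpo_seconds and outage_seconds <= rto_seconds


if __name__ == "__main__":
    # Illustrative 30-day window; the figures are made up.
    window = Window(total_requests=10_000, successful_requests=9_990,
                    total_seconds=30 * 24 * 3600, downtime_seconds=1_296)
    sli = success_rate_sli(window)
    print(f"success-rate SLI: {sli:.4%}")
    print(f"availability:     {availability(window):.4%}")
    print(f"meets SLO target of 99.5%: {meets_slo(sli, 0.995)}")
    # Incident with 5 minutes of lost data and a 20-minute outage,
    # checked against an RPO of 15 minutes and an RTO of 30 minutes.
    print(meets_recovery_objectives(300, 1200, rpo_seconds=900, rto_seconds=1800))
```

An SLA is not modeled here because it is a contract rather than a measurement. In practice it would refer to SLO-style targets such as the ones checked above.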

There is no industry-wide, agreed-upon definition of resilience. In a narrow sense, resilience is the ability to recover from faults automatically or quickly. In a broader sense, resilience also includes fault tolerance.

Fault tolerance is the ability of a system to continue providing services when one or more of its components fail.

Reliability can likewise be understood in a narrow sense and a broad sense. In a narrow sense, reliability engineering focuses on enhancing a system's ability to operate fault-free. In a broad sense, reliability engineering covers not only fault-free operation but also fault recovery (known as maintainability), along with other fault-related attributes such as availability and supportability.

In the broad sense, resilience and reliability cover largely the same ground and differ mainly in focus. Reliability focuses on minimizing faults and ensuring a system operates without failures. Resilience, on the other hand, acknowledges that faults are inevitable and focuses on minimizing their impact and recovering quickly.
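
The difference in focus can be made concrete with the standard steady-state availability relation, availability = MTBF / (MTBF + MTTR). Mean time between failures (MTBF) and mean time to repair (MTTR) are not terms used in this document. They are introduced here only as an assumption for illustration, and the numbers are made up. In this simplified model, a reliability-focused improvement raises MTBF and a resilience-focused improvement lowers MTTR.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time in service: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: one fault every 1000 hours, 2 hours to recover.
baseline = steady_state_availability(1000.0, 2.0)

# Reliability focus: faults occur half as often (MTBF doubles).
fewer_faults = steady_state_availability(2000.0, 2.0)

# Resilience focus: recovery is twice as fast (MTTR halves).
faster_recovery = steady_state_availability(1000.0, 1.0)

if __name__ == "__main__":
    print(f"baseline:        {baseline:.6f}")
    print(f"fewer faults:    {fewer_faults:.6f}")
    print(f"faster recovery: {faster_recovery:.6f}")
```

Under this model, halving the fault rate and halving the recovery time yield the same availability, which is one way to see why the two disciplines are described as differing mainly in focus.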
