
Design Principles

Faults, such as hardware failures, software errors, network delays, and anomalies triggered by traffic surges, are inevitable. High availability (HA) application systems must therefore be designed with the possibility of failure in mind for all hardware and software at the IaaS, PaaS, and SaaS layers, as well as in other application systems. The objective of resilience design is not to prevent these faults, but to minimize their impact on the system and ensure continuous, stable operation. The following design principles should be observed:

HA

A single point of failure (SPOF) may cause an entire system to break down, affect main functions, delay tasks, or cause major faults. To address this problem, the HA design of a system is critical.

HA design mainly relies on redundancy, or even multi-level redundancy including remote DR, to ensure that no single point of failure exists even in the case of a disaster.

  • Redundancy mechanism: Redundancy, or even multi-level redundancy (such as 1+1, N+1, or N-way redundancy), should be built into key components wherever possible (see the sketch below).
  • Remote DR: Remote DR, such as two-site three-center DR, ensures that services can still be provided if a disaster occurs.
  • Data redundancy: Periodic backup and multi-copy backup can be used to improve data durability and ensure data consistency.

More redundancy means higher costs, a trade-off that must be weighed during HA design.
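
As an illustration of the redundancy mechanism above, the following Go sketch shows N+1 redundancy at the application level: any healthy member of a replica pool can serve, so a single failed instance is not a single point of failure. The Replica and Pool types and the Healthy probe are hypothetical, not a specific product API.

```go
// redundancy.go: a minimal sketch of N+1 redundancy at the application
// level. Types and probes here are illustrative assumptions; a real
// system would pair this with the fault detection described below.
package main

import (
	"errors"
	"fmt"
)

// Replica is one of N+1 identical service instances.
type Replica struct {
	Addr    string
	healthy bool
}

// Healthy reports whether the replica passed its last health probe.
func (r *Replica) Healthy() bool { return r.healthy }

// Pool holds N active replicas plus one spare; any healthy member can serve.
type Pool struct {
	replicas []*Replica
}

// Pick returns the first healthy replica, so one failed instance never
// becomes a single point of failure.
func (p *Pool) Pick() (*Replica, error) {
	for _, r := range p.replicas {
		if r.Healthy() {
			return r, nil
		}
	}
	return nil, errors.New("no healthy replica: escalate to DR")
}

func main() {
	pool := &Pool{replicas: []*Replica{
		{Addr: "10.0.0.1", healthy: false}, // active instance has failed
		{Addr: "10.0.0.2", healthy: true},  // +1 spare takes over
	}}
	if r, err := pool.Pick(); err == nil {
		fmt.Println("routing traffic to", r.Addr)
	}
}
```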

Comprehensive Fault Detection

Fault detection is the prerequisite for fault management. Both comprehensive detection and fast detection are important; generally, comprehensiveness matters more than speed.

Fault detection involves the following aspects:

  • Detection scope: All components are identified and tracked, with a focus on detecting faults that have a major impact.
    • Subhealth detection: Subhealth exceptions, such as increased network latency, slow disks, and memory leaks, do not cause outright system faults but degrade system or service KPIs; they also need to be detected (see the sketch below).
    • Standby module detection: In a redundancy system, faults of both the active and standby modules need to be detected to avoid silent faults.
    • Components with lifespan limits: The health status of components with lifespan limits (such as local hard disks) must be monitored in a timely manner. Maintenance measures must be taken based on warnings to prevent serious impacts caused by unexpected faults.
  • Detection speed: An appropriate detection speed needs to be determined based on overall service requirements.
  • Detection impact: The interval of periodic fault detection must balance its impact on CPU usage against the effect of detection delay on the service recovery speed.
  • Streamlined detection: The system and modules used for fault detection need to be simpler than the detected system and modules.

After a fault is detected, it should be reported to the monitoring system and rectified in a timely manner.
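
The following Go sketch illustrates the detection aspects above: a periodic probe that distinguishes hard faults from subhealth (latency above a KPI threshold) and covers both the active and standby instances, then reports findings for rectification. The probe URLs, interval, latency threshold, and report function are illustrative assumptions, not a specific monitoring API.

```go
// healthcheck.go: a minimal sketch of periodic fault and subhealth
// detection. Targets and thresholds are assumptions for illustration.
package main

import (
	"log"
	"net/http"
	"time"
)

const (
	probeInterval    = 10 * time.Second       // balance CPU cost vs. detection delay
	subhealthLatency = 500 * time.Millisecond // KPI deterioration threshold
)

// report stands in for pushing an event to the monitoring system.
func report(severity, msg string) { log.Printf("[%s] %s", severity, msg) }

// probe classifies one endpoint as faulty, subhealthy, or healthy.
func probe(url string) {
	start := time.Now()
	resp, err := http.Get(url)
	elapsed := time.Since(start)
	if err != nil {
		report("FAULT", "endpoint unreachable: "+err.Error())
		return
	}
	defer resp.Body.Close()
	switch {
	case resp.StatusCode >= 500:
		report("FAULT", "endpoint returned "+resp.Status)
	case elapsed > subhealthLatency:
		report("SUBHEALTH", "latency "+elapsed.String()+" exceeds threshold")
	}
}

func main() {
	// Probe the active and the standby instance alike, so silent faults
	// in the redundancy system are caught before a switchover is needed.
	targets := []string{
		"http://active.example:8080/health",
		"http://standby.example:8080/health",
	}
	for range time.Tick(probeInterval) {
		for _, t := range targets {
			probe(t)
		}
	}
}
```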

Rapid Fault Recovery

Fault recovery refers to the capability of restoring a product's intended functionality after a fault occurs. Generally, the faster the recovery, the smaller the impact.

A proper fault recovery solution needs to be designed based on service requirements, technical implementation difficulty, solution complexity, and cost.

  • Automatic recovery: If a fault would affect services, the system should be able to recover from it automatically through means such as protection switching, partial resets, or service restarts.
  • Priority recovery: Faults that are both likely to occur and high in impact are rectified first.
  • Hierarchical reset: Resets are designed hierarchically so that the lowest-level resource that can clear the fault is reset first, minimizing the impact on services (see the sketch after this list).
  • Non-coupling recovery: The system startup should not be affected by partial system faults or component startup sequences.
  • Layered protection: Network layering must be considered for system fault protection. The protection switching at the lower layer must be more sensitive than that at the upper layer to avoid ping-pong switching.

Monitoring the system status or key load metrics helps determine whether a fault has occurred in the system. If one has, it can be rectified automatically.
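
As a sketch of hierarchical reset with automatic recovery, the following Go code escalates from the least disruptive reset to broader ones only while a health check keeps failing. The reset steps and health check here are stubbed assumptions; real systems would plug in component, process, and node recovery actions.

```go
// recovery.go: a minimal sketch of hierarchical reset. Step actions and
// the health check are hypothetical stubs for illustration.
package main

import (
	"errors"
	"log"
)

// step pairs a reset action with the scope of service it disrupts.
type step struct {
	scope  string
	action func() error // e.g. restart component, restart process, fail over node
}

// recoverHierarchically tries the least disruptive reset first and
// escalates only while the health check keeps failing.
func recoverHierarchically(healthy func() bool, steps []step) error {
	for _, s := range steps {
		log.Printf("attempting %s-level reset", s.scope)
		if err := s.action(); err != nil {
			log.Printf("%s-level reset failed: %v", s.scope, err)
			continue // try the next, broader reset
		}
		if healthy() {
			log.Printf("service restored at %s level", s.scope)
			return nil
		}
	}
	return errors.New("automatic recovery exhausted: start emergency recovery")
}

func main() {
	faultCleared := false
	steps := []step{
		{"component", func() error { faultCleared = true; return nil }}, // stub: clears the fault
		{"process", func() error { return nil }},
		{"node", func() error { return nil }},
	}
	if err := recoverHierarchically(func() bool { return faultCleared }, steps); err != nil {
		log.Fatal(err)
	}
}
```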

Fault analysis methods can be used to classify fault types, assess their impacts and hazards, and design corresponding reliability and availability solutions that provide capabilities such as redundancy, isolation, degradation, and elasticity. In addition, fault injection testing (FIT) can be used to verify the effectiveness of these solutions, maximizing service reliability and availability.

For some faults, services may still be interrupted even after redundancy and automatic recovery have been applied, requiring manual intervention such as restoration from backup or DR. Therefore, an efficient emergency recovery process and platform must be established to quickly restore services and reduce the impact of faults.

Overload Control

When the request load exceeds the system capacity, requests may fail due to resource saturation. In the cloud, system and resource usage can be monitored, so resources can be automatically added or removed to stay at the optimal level that meets service requirements without over-provisioning or under-provisioning.

Generally, service traffic is adjusted through dynamic resource management. Static thresholds are not recommended for preventing overload, because they can waste a large amount of resources. The following aspects must be considered during overload control design:

  • Dynamic rate limiting: The rate limit is dynamically adjusted based on system resource usage (see the sketch after this list).
  • Elastic scaling: The system automatically monitors resource usage and adds or removes resources as needed.
  • Load balancing before overload control: When multiple processing units are deployed, load balancing is preferentially considered to reduce service impacts caused by resource insufficiency of a single processing unit. Then, overload control is performed to maximize the processing capability of the entire system.
  • Early control: When the system is overloaded, service access should be controlled at the earliest possible stage of the process, in the front-end processing modules, or at the underlying protocol level to avoid unnecessary performance overhead caused by mid-process control.
  • Priority assurance: When the system is overloaded, services with higher priorities are processed first to maximize overall benefit.
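
A minimal Go sketch of dynamic rate limiting with early control and priority assurance follows. The CPU-usage thresholds, the limit curve, and the reserved priority headroom are illustrative assumptions rather than recommended values; a production system would derive them from measured metrics.

```go
// ratelimit.go: a minimal sketch of dynamic rate limiting. Thresholds
// and the 20% priority headroom are illustrative assumptions.
package main

import (
	"fmt"
	"sync/atomic"
)

type limiter struct {
	inflight int64
	limit    int64 // current admission limit, adjusted dynamically
}

// adjust recomputes the limit from observed CPU usage instead of a
// static threshold: admitted load shrinks as the system nears overload.
func (l *limiter) adjust(cpu float64, maxInflight int64) {
	switch {
	case cpu > 0.9:
		atomic.StoreInt64(&l.limit, maxInflight/4)
	case cpu > 0.7:
		atomic.StoreInt64(&l.limit, maxInflight/2)
	default:
		atomic.StoreInt64(&l.limit, maxInflight)
	}
}

// admit rejects at the entry point, before any mid-process work is done.
// Normal traffic may use only 80% of capacity; the rest is reserved for
// high-priority requests.
func (l *limiter) admit(highPriority bool) bool {
	limit := atomic.LoadInt64(&l.limit)
	if !highPriority {
		limit = limit * 8 / 10 // reserve headroom for priority traffic
	}
	if atomic.AddInt64(&l.inflight, 1) > limit {
		atomic.AddInt64(&l.inflight, -1) // early rejection
		return false
	}
	return true
}

// done releases the slot when request processing finishes.
func (l *limiter) done() { atomic.AddInt64(&l.inflight, -1) }

func main() {
	l := &limiter{}
	l.adjust(0.85, 1000) // observed CPU at 85% -> halve the limit
	fmt.Println("admit normal:", l.admit(false))
	fmt.Println("admit priority:", l.admit(true))
}
```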

Change Error Prevention

When the system is upgraded or its configuration is changed, human errors must be prevented so that they do not degrade or break the system and its services.

Generally, mistake proofing is used to reduce human errors. Mistake proofing is a behavior-constraint method that prevents errors by enabling operators to perform operations intuitively and accurately, without intense concentration, extensive experience, or professional knowledge. It improves efficiency and user experience in many scenarios, helps reduce damage or replacement costs, and is a basic, common feature of excellent products.

The following solutions are usually used to prevent change errors:

  • Role constraints: Permission control restricts the configuration scope of each role, preventing errors caused by unauthorized configuration.
  • Separation of query and modification: Configuration options are layered in the product interface design, with query and modification pages separated to reduce the risk of inadvertent changes.
  • Configuration validation: A configuration validation mechanism ensures that the necessary checks are performed before a configuration takes effect, so that no erroneous configuration can be applied; automating configuration changes further reduces the possibility of manual errors (see the sketch below).
  • Deletion protection: A protection mechanism prevents resources from being deleted by mistake; examples include status checks before deletion, resource locking, and a recycle bin mechanism.
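
The following Go sketch combines two of the mistake-proofing solutions above: configuration validation (checks run before a change takes effect) and deletion protection (status check, resource lock, and recycle bin). The Config fields, checks, and resource states are illustrative assumptions, not a specific product's interface.

```go
// changeguard.go: a minimal sketch of configuration validation and
// deletion protection. All fields and checks are illustrative.
package main

import (
	"errors"
	"fmt"
)

type Config struct {
	Replicas int
	Region   string
}

// validate runs every check before the change takes effect, so an
// invalid configuration can never be applied.
func validate(c Config) error {
	if c.Replicas < 2 {
		return errors.New("replicas < 2 would create a single point of failure")
	}
	if c.Region == "" {
		return errors.New("region must be set")
	}
	return nil
}

type Resource struct {
	Name   string
	Locked bool   // deletion protection flag
	State  string // e.g. "in-use", "idle"
}

// safeDelete moves the resource to a recycle bin only after status and
// lock checks pass, preventing accidental deletion.
func safeDelete(r *Resource, recycleBin *[]Resource) error {
	if r.Locked {
		return fmt.Errorf("%s is locked against deletion", r.Name)
	}
	if r.State == "in-use" {
		return fmt.Errorf("%s is still in use", r.Name)
	}
	*recycleBin = append(*recycleBin, *r) // recoverable for a grace period
	return nil
}

func main() {
	if err := validate(Config{Replicas: 1, Region: "eu-west"}); err != nil {
		fmt.Println("change rejected:", err)
	}
	var bin []Resource
	if err := safeDelete(&Resource{Name: "vol-01", Locked: true}, &bin); err != nil {
		fmt.Println("deletion blocked:", err)
	}
}
```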