RES06-02 Fault Detection

All fault scenarios must be detected automatically so that faults can be identified and rectified in a timely manner.

Risk level
High
Key strategies
- Detect all faults.
- Detect faults by region, AZ, service, method, instance, or container ID. Keep detection dimensions aligned with fault recovery modes.
- Ensure that every detected fault either generates an alarm or is rectified automatically.
Based on detection types, fault detection can be classified into resource detection, function detection, and service detection.
- Resource detection monitors the virtualized physical hardware resources and corresponding software resources in a cloud environment, including CPU, memory, network, and disk resources.
- Function detection checks the internal modules of a product system to determine whether each module meets its design requirements. A function is considered faulty when it does not work as expected. Developers and testers must repeatedly verify module functionality before launching a product. Function detection can be performed using traditional log tracing and call chain technologies, such as Huawei Cloud APM.
- Service detection simulates user operations to obtain performance and result data for those operations. It is implemented using dialing test technologies. Because dialing tests consume network resources, long-period dialing tests are usually run during off-peak hours as sampling tests, whereas short-period dialing tests (for example, 5-minute dialing tests) can be run regularly. Call chains can also be used for service detection.
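As a rough illustration of service detection, the following Python sketch runs one simulated user operation and records whether it succeeded and how long it took. The names `ProbeResult`, `run_probe`, and `max_latency_s` are hypothetical and not part of any Huawei Cloud API. The simulated operation is passed in as a callable, so a real dialing test would supply its own request logic.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    ok: bool          # True if the operation succeeded within the latency budget
    latency_s: float  # elapsed time of the simulated operation, in seconds
    error: str        # empty string when ok is True


def run_probe(operation: Callable[[], bool], max_latency_s: float) -> ProbeResult:
    """Run one simulated user operation and record its result and latency."""
    start = time.monotonic()
    try:
        succeeded = operation()
    except Exception as exc:  # any exception counts as a failed probe
        return ProbeResult(False, time.monotonic() - start, repr(exc))
    latency = time.monotonic() - start
    if not succeeded:
        return ProbeResult(False, latency, "operation reported failure")
    if latency > max_latency_s:
        return ProbeResult(False, latency, "latency budget exceeded")
    return ProbeResult(True, latency, "")
```

A scheduler could call `run_probe` at a short interval for regular checks and reserve heavier, longer-running operations for off-peak hours.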
Fault detection methods vary by fault type. The following describes some common fault detection methods for high availability systems.
- Value range check: In most applications, the result of an operation must be within a certain range. Such a range helps you verify whether the data meets the expected requirements.
- Data integrity check: Data can be corrupted during transfer, especially between hardware units. Because the software layer can hide the difference between local memory transfers and transfers across remote links, data integrity checks must occur at multiple points. There are many ways to verify data integrity, most of which rely on redundancy or summary information carried with the data. Some methods include enough redundancy to both detect and correct errors, but most include only enough extra information to detect whether the data is valid. Typical methods include parity check and cyclic redundancy check (CRC).
- Comparison test: If there is a redundant system, the two systems run the same calculation in parallel and their results are compared. If the results differ, there is a fault. This concept is also called voting. Comparisons can be made at any level of the system, from cycle-by-cycle comparisons on a memory bus to comparisons of results transferred over a network.
- Time detection: A basic fault detection method that identifies a fault when an expected event does not occur within a specified time. A specialized form, the heartbeat method, uses periodic message handshakes to verify that a unit or subsystem is still functioning.
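The four methods above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: all function and class names (`in_range`, `attach_crc`, `crc_ok`, `results_agree`, `HeartbeatMonitor`) are hypothetical, CRC-32 from the standard `zlib` module stands in for whatever checksum a real system uses, and the heartbeat timeout is an arbitrary parameter.

```python
import time
import zlib


def in_range(value: float, low: float, high: float) -> bool:
    """Value range check: the result must lie within [low, high]."""
    return low <= value <= high


def attach_crc(payload: bytes) -> bytes:
    """Append the CRC-32 of the payload as a 4-byte big-endian trailer."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(frame: bytes) -> bool:
    """Data integrity check: recompute the CRC-32 and compare it with the trailer."""
    if len(frame) < 4:
        return False
    payload, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == trailer


def results_agree(a, b) -> bool:
    """Comparison test: two redundant units must produce the same result."""
    return a == b


class HeartbeatMonitor:
    """Time detection: suspect a fault if no heartbeat arrives within timeout_s."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable clock, useful for testing
        self.last_seen = clock()

    def beat(self) -> None:
        """Record that a heartbeat message was received."""
        self.last_seen = self.clock()

    def is_alive(self) -> bool:
        """Return False once the time since the last heartbeat exceeds timeout_s."""
        return self.clock() - self.last_seen <= self.timeout_s
```

Note that a CRC only detects corruption and does not correct it, and that comparing two results reveals a disagreement without showing which unit is wrong. Identifying the faulty unit generally needs a third result or an additional check.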