Updated on 2025-05-22 GMT+08:00

RES06-03 Subhealth Detection

A component in a system can fail completely or enter a subhealth state. Subhealth means that the overall service of the system remains within the threshold, but the service of some instances exceeds it. Subhealth is a relative concept: it is judged by comparing current performance with historical data or with the overall performance of the system, so the detection methods and criteria vary by scenario. Once subhealth is detected, the affected component needs to be isolated or recovered promptly to prevent service disruptions.

  • Risk level

    High

  • Key strategies

    Subhealth detection predicts system faults based on subhealth symptoms. A typical example is a memory leak. A memory leak does not cause an immediate system failure. Instead, memory usage keeps increasing, and the system slows down as memory and then swap space run short. Monitoring the memory usage of each instance is therefore necessary. If the memory usage exceeds the threshold, an alarm is generated, and manual intervention is required to quickly rectify the fault before services are interrupted.
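
    The memory check described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `check_memory`, the 80% threshold, and the "never decreases across the window" rule for suspecting a leak are all assumptions chosen for the example, and the samples are assumed to be collected by an existing monitoring agent.

```python
def check_memory(samples, threshold_pct=80.0, min_growth_pct=1.0):
    """Evaluate memory usage samples (percentages, oldest first).

    Returns a list of alarm names. Both the threshold and the growth
    rule are hypothetical values for illustration only.
    """
    alarms = []
    if not samples:
        return alarms

    # Threshold alarm: the latest sample exceeds the configured limit.
    if samples[-1] > threshold_pct:
        alarms.append("threshold")

    # Leak suspicion: usage never drops across the window and has grown
    # by at least min_growth_pct in total. Needs at least three samples.
    if len(samples) >= 3:
        never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
        if never_drops and samples[-1] - samples[0] >= min_growth_pct:
            alarms.append("rising")

    return alarms


if __name__ == "__main__":
    print(check_memory([50, 55, 61, 68]))  # rising usage, still below threshold
    print(check_memory([70, 75, 82, 85]))  # above threshold and still rising
```

    Reporting the rising trend separately from the threshold breach gives operators a chance to intervene before the threshold alarm fires.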

    Typical subhealth scenarios include packet loss or packet errors, hard disk performance deterioration, and CPU or memory overload. If a component in an application system is in a subhealth state, the service success rate of the application system may decrease.

    Because subhealth is not a fault, it does not surface as a failure. Thresholds are therefore set for service monitoring metrics. When a metric exceeds its threshold, an alarm is generated and recovery is required.
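
    The distinction between overall service health and per-instance health can be sketched as a threshold check on a single metric. The example below is an assumption-laden illustration: the function name `classify`, the use of error rate as the metric, the 5% threshold, and the unweighted average as the "overall" value (which presumes instances carry similar traffic) are all hypothetical.

```python
def classify(instance_error_rates, threshold=0.05):
    """Classify a service from per-instance error rates (0.0 to 1.0).

    instance_error_rates: dict mapping instance name to error rate.
    Returns a dict with the state and the instances over the threshold.
    """
    if not instance_error_rates:
        return {"state": "unknown", "suspects": []}

    rates = list(instance_error_rates.values())
    # Assumption: unweighted mean stands in for the overall service metric.
    overall = sum(rates) / len(rates)

    # Instances whose own metric exceeds the threshold.
    suspects = sorted(
        name for name, rate in instance_error_rates.items() if rate > threshold
    )

    if overall > threshold:
        state = "fault"      # the service as a whole is over the threshold
    elif suspects:
        state = "subhealth"  # overall is fine, but some instances are not
    else:
        state = "healthy"
    return {"state": state, "suspects": suspects}


if __name__ == "__main__":
    print(classify({"node-a": 0.01, "node-b": 0.02, "node-c": 0.09}))
```

    In this sketch, the `suspects` list identifies the instances to isolate or recover while the service as a whole is still within its threshold.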