RES06-01 Fault Mode Analysis
Fault mode analysis is a design method that analyzes all potential fault modes of each component and their impact on product functions, classifies each potential fault mode based on its severity, identifies single points of failure and product weaknesses, and proposes preventive measures during system analysis and design to improve product reliability.
Huawei Cloud manages infrastructure faults. If your application systems are deployed on Huawei Cloud, you do not need to pay much attention to detecting and rectifying faults in infrastructure such as data centers, power supplies, environments, compute servers, storage devices, and network switches. However, you must consider the impact of these infrastructure faults on your application systems and plan the corresponding recovery measures. Such faults include data center disasters (AZ- or region-level), compute server failures or restarts, hard disk faults or subhealth, network communication interruptions, and packet loss. For application-related fault modes, such as software system faults, data faults, communication faults, overloads, and human errors, you must provide sufficient analysis as well as detection and recovery measures.
- Risk level
High
- Key strategies
- Define severity levels.
Severity measures the adverse impact of a fault on a system. There are four severity levels: critical, major, minor, and warning.
- Critical: Faults of this level will cause a system crash or severely affect the key functions of the system.
- Major: Faults of this level will affect the key functions of a system, cause task delay, or cause minor system damage or severe potential faults.
- Minor: Faults of this level will impair the secondary functions of a system or make these functions unavailable, but will not affect key functions. Immediate rectification is required.
- Warning: Faults of this level will impair some secondary functions of a system (regular maintenance is required), but will not affect the key functions of the system. These faults include alarms or indicator failures.
Critical and major faults are also called single points of failure (SPOFs). Critical faults may also raise security concerns and cause failures in all or most functions. Major faults mainly affect key functions. Minor faults require immediate rectification.
Generally, a fault that cannot be detected is treated as a potential fault, and its severity should be raised by one level.
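The four levels and the one-level increase for undetectable faults can be sketched in Python as follows. This is an illustrative sketch only: the names `Severity` and `effective_severity` are hypothetical, and capping the result at critical is an assumption, because the text does not say how an undetectable critical fault is handled.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The four severity levels, ordered from least to most severe."""
    WARNING = 1
    MINOR = 2
    MAJOR = 3
    CRITICAL = 4


def effective_severity(base: Severity, detectable: bool) -> Severity:
    """Raise the severity by one level when the fault cannot be detected.

    Assumption: the result is capped at CRITICAL, since no higher level exists.
    """
    if detectable:
        return base
    return Severity(min(base + 1, Severity.CRITICAL))
```

For example, a minor fault with no detection mechanism would be handled as a major fault under this sketch.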
- Identify all components and functional modules in a system.
Identify all components and external dependencies of an application system, such as providers and third-party services.
- Identify faults.
Identify potential faults for each component. A single component may have multiple fault modes, and each fault mode must be analyzed. All fault modes must be included. If a fault mode is missing, it may not be considered in the design, leaving it unmonitored with no recovery solutions in place.
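One way to reduce the risk of a missing fault mode is to keep the analysis in a simple catalog and check it for gaps. The following Python sketch assumes a hypothetical `FaultMode` record and a hypothetical `find_gaps` helper; neither is part of any Huawei Cloud API.

```python
from dataclasses import dataclass


@dataclass
class FaultMode:
    """One fault mode of one component (hypothetical record layout)."""
    component: str
    name: str
    detection: str = ""  # how the fault is detected; empty means none defined
    recovery: str = ""   # how the fault is rectified; empty means none defined


def find_gaps(components: list[str], fault_modes: list[FaultMode]) -> list[str]:
    """Report components with no analyzed fault modes and fault modes
    that lack a detection or recovery measure."""
    gaps = []
    covered = {fm.component for fm in fault_modes}
    for component in components:
        if component not in covered:
            gaps.append(f"{component}: no fault modes analyzed")
    for fm in fault_modes:
        if not fm.detection:
            gaps.append(f"{fm.component}/{fm.name}: no detection measure")
        if not fm.recovery:
            gaps.append(f"{fm.component}/{fm.name}: no recovery measure")
    return gaps
```

Running such a check during design reviews makes an unanalyzed component or an unmonitored fault mode visible before it reaches production.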
- Analyze fault impact scope (blast radius).
Analyze the occurrence frequency and impact of each fault mode to determine the severity level. For fault modes of components with single points of failure, the severity must be high. Common cloud service fault modes include CPU overload, memory overload, high disk usage, data issues (such as accidental deletion), AZ failures, and region failures.
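The text does not prescribe a scoring formula, so the following Python sketch is only one possible way to combine frequency and impact. The 1 to 3 scales, the score thresholds, and the reading of "high" as "at least major" for single points of failure are all assumptions made for illustration.

```python
def classify(frequency: int, impact: int, single_point: bool) -> str:
    """Map a fault mode to a severity level.

    Assumptions (not from the source): frequency and impact are rated
    on a 1..3 scale, and their product is compared to fixed thresholds.
    """
    if not (1 <= frequency <= 3 and 1 <= impact <= 3):
        raise ValueError("frequency and impact must be between 1 and 3")

    score = frequency * impact
    if score >= 6:
        level = "critical"
    elif score >= 4:
        level = "major"
    elif score >= 2:
        level = "minor"
    else:
        level = "warning"

    # A single point of failure is never rated below major (assumed
    # interpretation of "the severity must be high").
    if single_point and level in ("minor", "warning"):
        level = "major"
    return level
```

Under these assumed thresholds, a rare, low-impact fault on a single-instance component is still rated major because the component has no redundancy.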
- Provide fault detection and mitigation measures.
- For each fault mode, analyze how to detect and rectify the fault, and propose improvements. Weigh system complexity and cost, and resolve fault modes with higher severity levels first.
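The prioritization described above can be sketched as a sort: higher severity first, and within the same severity, cheaper fixes first. The dictionary keys `severity` and `cost` and the tie-breaking by cost are assumptions for illustration; the text only requires that higher-severity fault modes be resolved preferentially.

```python
# Lower rank means the fault mode is handled earlier.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "warning": 3}


def prioritize(fault_modes: list[dict]) -> list[dict]:
    """Order fault modes by severity, then by estimated cost (assumed field)."""
    return sorted(
        fault_modes,
        key=lambda fm: (SEVERITY_RANK[fm["severity"]], fm["cost"]),
    )


backlog = [
    {"name": "high disk usage", "severity": "minor", "cost": 1},
    {"name": "AZ failure", "severity": "critical", "cost": 5},
    {"name": "CPU overload", "severity": "major", "cost": 2},
    {"name": "accidental deletion", "severity": "critical", "cost": 3},
]
ordered = [fm["name"] for fm in prioritize(backlog)]
```

Here `ordered` lists the two critical fault modes first (the cheaper one ahead), followed by the major and then the minor fault mode.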
- Related cloud services and tools
- Cloud Operations Center (COC): supports fault mode management.