Updated on 2025-05-22 GMT+08:00

OPS07-01 Creating Alarms

  • Risk level

    High

  • Key strategies

    Create and respond to alarms based on the operability principle: every alarm should call for an action. Transient events, such as a sudden spike in disk I/O or CPU usage, are not meaningful on their own and do not need a response. Following this principle prevents many false alarms. In addition, periodically collect and analyze alarm frequency statistics to identify high-frequency alarms, resolve their causes, and clear false alarms.
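
The periodic frequency analysis described above can be sketched as follows. This is a minimal illustration, not an AOM or Cloud Eye API: it assumes the alarm history has already been exported as a plain list of alarm names, and the function name `high_frequency_alarms`, the `min_count` cutoff, and the sample alarm names are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def high_frequency_alarms(
    alarm_names: Iterable[str], min_count: int = 3
) -> List[Tuple[str, int]]:
    """Return (alarm name, occurrences) pairs seen at least min_count times.

    Results are ordered from most to least frequent, so the noisiest
    alarms are reviewed first.
    """
    counts = Counter(alarm_names)
    return [(name, n) for name, n in counts.most_common() if n >= min_count]


# Hypothetical alarm history for one review period (names are made up).
history = [
    "cpu_spike", "disk_full", "cpu_spike",
    "cpu_spike", "disk_full", "net_drop",
]

for name, occurrences in high_frequency_alarms(history, min_count=2):
    print(f"{name}: {occurrences} occurrences")
```

Alarms that appear at the top of this list in every review period are candidates for either a root-cause fix or, if they turn out to be false alarms, a threshold or rule adjustment.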

  • Design suggestions
    • Optimize alarm thresholds: Raise the memory, CPU, and network I/O alarm thresholds appropriately so that normal fluctuations do not trigger alarms.
    • Optimize log levels: Correct improper log levels, for example, by downgrading some ERROR logs to WARNING.
    • Shield some logs: For applications whose log levels are difficult to adjust, use keywords to shield frequent log alarms.
    • Enhance warnings: Provide warnings for operations that affect service parties.
    • Enhance emergency warnings: Some hardware faults are recorded in /var/log/messages. Match hardware alarms by keyword so that they can be handled promptly.
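
The two keyword-based suggestions above (shielding frequent log alarms and matching hardware faults) can be sketched as simple line filters. This is a minimal sketch under stated assumptions: the keyword lists, function names, and sample log lines are hypothetical placeholders, and real keywords would need to be chosen from the messages your own hardware and applications actually emit.

```python
from typing import Iterable, List, Sequence

# Hypothetical keyword lists; replace with keywords observed in your own logs.
HARDWARE_KEYWORDS = ("hardware error", "i/o error", "edac")
SHIELDED_KEYWORDS = ("connection reset by peer",)


def matches_any(line: str, keywords: Sequence[str]) -> bool:
    """Case-insensitive check for whether any keyword occurs in the line."""
    lowered = line.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def hardware_alarm_lines(
    lines: Iterable[str], keywords: Sequence[str] = HARDWARE_KEYWORDS
) -> List[str]:
    """Keep only the lines that look like hardware faults."""
    return [line.rstrip("\n") for line in lines if matches_any(line, keywords)]


def drop_shielded(
    lines: Iterable[str], keywords: Sequence[str] = SHIELDED_KEYWORDS
) -> List[str]:
    """Remove lines that match a shielded (known-noisy) keyword."""
    return [line.rstrip("\n") for line in lines if not matches_any(line, keywords)]


# Made-up sample lines for illustration only.
sample = [
    "kernel: EDAC MC0: 1 CE memory read error",
    "app: connection reset by peer",
    "sshd: session opened",
]

print(hardware_alarm_lines(sample))
print(drop_shielded(sample))
```

To apply the hardware filter to the system log, the same function could be fed an open file handle, for example `hardware_alarm_lines(open("/var/log/messages"))`, provided the process has permission to read that file.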
  • Related cloud services and tools

    Application Operations Management (AOM)

    Cloud Operations Center (COC)

    Cloud Eye