Updated on 2024-05-27 GMT+08:00

Alarm Noise Reduction

This section describes how to set alarm noise reduction. Before sending an alarm notification, AOM processes alarms based on noise reduction rules to prevent alarm storms.

Scenario

When analyzing applications, resources, and businesses, e-commerce O&M personnel find that there are too many alarms, many of them identical. They need to detect faults from these alarms and monitor applications comprehensively.

Solution

Use AOM to set alarm rules to monitor the usage of resources (such as hosts and components) in the environment in real time. When AOM or an external service is abnormal, an alarm is triggered immediately. AOM also provides the alarm noise reduction function. Before sending an alarm notification, AOM processes alarms based on noise reduction rules. This helps you identify critical problems and avoid alarm storms.

Alarm noise reduction consists of four parts: grouping, deduplication, suppression, and silence.

  • Grouping: You can filter different subsets of alarms and then group them according to certain conditions. Alarms in the same group are aggregated to trigger one notification.
  • Suppression: By using suppression rules, you can suppress or block notifications related to specific alarms. For example, when a major alarm is generated, less severe alarms can be suppressed. As another example, when a node is faulty, all other alarms of the processes or containers on this node can be suppressed.
  • Silence: You can create a silence rule to shield alarm notifications in a specified period. The rule takes effect immediately after it is created.
  • Deduplication: AOM has built-in deduplication rules. The service backend automatically deduplicates alarms. You do not need to manually create rules.
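
The following is a minimal Python sketch of how deduplication, suppression, and silence could act on a list of alarms. It is only an illustration under stated assumptions: the Alarm fields and the function names are hypothetical and do not reflect the AOM data model or backend implementation.

```python
from dataclasses import dataclass
from datetime import datetime


# Hypothetical alarm model; the field names are illustrative, not the AOM schema.
@dataclass(frozen=True)
class Alarm:
    source: str    # for example, "ELB"
    node: str      # node on which the alarm was generated
    severity: str  # for example, "critical", "major", or "minor"
    name: str


def deduplicate(alarms):
    """Keep the first occurrence of each identical alarm, preserving order."""
    return list(dict.fromkeys(alarms))


def suppress(alarms, source_severity="critical"):
    """Drop other alarms on any node that also has a source_severity alarm."""
    affected_nodes = {a.node for a in alarms if a.severity == source_severity}
    return [
        a for a in alarms
        if a.severity == source_severity or a.node not in affected_nodes
    ]


def silence(alarms, now, start, end):
    """Shield every alarm while `now` falls inside the [start, end) window."""
    return [] if start <= now < end else alarms


node_down = Alarm("ELB", "node-1", "critical", "NodeDown")
slow_1 = Alarm("ELB", "node-1", "minor", "HighLatency")
slow_2 = Alarm("ELB", "node-2", "minor", "HighLatency")

unique = deduplicate([node_down, slow_1, node_down, slow_2])
remaining = suppress(unique)
```

In this sketch, the duplicate NodeDown alarm is removed first, and the minor alarm on node-1 is then suppressed because the same node already has a critical alarm.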

The following uses the monitoring of ELB metrics at the business layer as an example.

Step 1: Create a Grouping Rule

When a critical or major alarm is generated, the Monitor_host action rule is triggered, and alarms are grouped by alarm source. To create a grouping rule, do as follows:

  1. Log in to the AOM 2.0 console.
  2. In the navigation pane, choose Alarm Management > Alarm Noise Reduction.
  3. On the Grouping Rules tab page, click Create and set the rule name and grouping condition.

    Figure 1 Creating a grouping rule
    Table 1 Alarm combination rule

    Combine Notifications

    Combines grouped alarms based on specified fields. Alarms in the same group are aggregated for sending one notification.

    Notifications can be combined:

    • By alarm source: Alarms triggered by the same alarm source are combined into one group for sending notifications.
    • By alarm source + severity: Alarms triggered by the same alarm source and of the same severity are combined into one group for sending notifications.
    • By alarm source + all tags: Alarms triggered by the same alarm source and with the same tag are combined into one group for sending notifications.

    Initial Wait Time

    Interval for sending an alarm notification after alarms are combined for the first time. It is recommended that the time be set to seconds to prevent alarm storms.

    Value range: 0s to 10 minutes. Recommended: 15s.

    Batch Processing Interval

    Waiting time for sending an alarm notification after the combined alarm data changes. It is recommended that the time be set to minutes. If you want to receive alarm notifications as soon as possible, set the time to seconds.

    The change here refers to a new alarm or an alarm status change.

    Value range: 5s to 30 minutes. Recommended: 60s.

    Repeat Interval

    Waiting time for sending an alarm notification after the combined alarm data becomes duplicate. It is recommended that the time be set to hours.

    Duplication means that no new alarm is generated and no alarm status is changed while other attributes (such as titles and content) are changed.

    Value range: 0 minutes to 15 days. Recommended: 1 hour.
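
The three combination modes and three intervals in Table 1 can be pictured with a short Python sketch. It assumes alarms are plain dictionaries with hypothetical source, severity, and tags keys; the function names are illustrative, and the default intervals are the recommended values from the table (15s, 60s, and 1 hour). It is not how AOM implements grouping.

```python
from collections import defaultdict


def group_key(alarm, mode):
    """Build the grouping key for one alarm under a given combination mode."""
    if mode == "source":
        return (alarm["source"],)
    if mode == "source+severity":
        return (alarm["source"], alarm["severity"])
    if mode == "source+tags":
        # Sort the tags so that the key does not depend on dictionary order.
        return (alarm["source"], tuple(sorted(alarm["tags"].items())))
    raise ValueError(f"unknown combination mode: {mode}")


def combine(alarms, mode):
    """Aggregate alarms into groups; each group would trigger one notification."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[group_key(alarm, mode)].append(alarm)
    return dict(groups)


def next_wait_seconds(first_notification, group_changed,
                      initial_wait=15, batch_interval=60, repeat_interval=3600):
    """Pick which of the three intervals applies before the next notification."""
    if first_notification:
        return initial_wait
    # A change is a new alarm or an alarm status change within the group.
    return batch_interval if group_changed else repeat_interval


sample_alarms = [
    {"source": "ELB", "severity": "critical", "tags": {"aom_monitor_level": "business"}},
    {"source": "ELB", "severity": "major", "tags": {"aom_monitor_level": "business"}},
    {"source": "ECS", "severity": "major", "tags": {"aom_monitor_level": "infrastructure"}},
]
by_source = combine(sample_alarms, "source")
by_source_severity = combine(sample_alarms, "source+severity")
```

With these sample alarms, grouping by alarm source yields two groups (ELB and ECS), while grouping by alarm source + severity yields three, because the two ELB alarms differ in severity.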

Step 2: Create a Metric Alarm Rule (Configuration Mode Set to Select from all metrics)

You can set threshold conditions in metric alarm rules for resource metrics. If a metric value meets the threshold condition, a threshold alarm will be generated. If no metric data is reported, an insufficient data event will be generated.
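
As a rough illustration of this logic, the Python sketch below checks the latest metric sample against a threshold condition. The function name, the operator table, and the returned state strings are assumptions made for this example; actual AOM rules also involve settings such as the check interval, which the sketch leaves out.

```python
import operator

# Comparison operators a threshold condition might use (illustrative set).
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def evaluate(samples, op, threshold):
    """Return a hypothetical rule state for a list of metric samples."""
    if not samples:
        # No metric data was reported.
        return "insufficient_data"
    latest = samples[-1]
    return "alarm" if OPERATORS[op](latest, threshold) else "normal"


state_high = evaluate([40.0, 85.5], ">", 80)   # latest value exceeds the threshold
state_empty = evaluate([], ">", 80)            # no data reported
```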

Metric alarm rules can be created in three modes: Select by resource type, Select from all metrics, and PromQL. The following describes how to create an alarm rule for monitoring all metrics at the ELB business layer.

  1. Log in to the AOM 2.0 console.
  2. In the navigation pane, choose Alarm Management > Alarm Rules.
  3. On the Metric/Event Alarm Rules tab page, click Create Alarm Rule.
  4. Set the basic information about the alarm rule, such as the rule name.
  5. Set the detailed information about the alarm rule.

    1. Set Rule Type to Metric alarm rule and Configuration Mode to Select from all metrics.
    2. Set parameters such as the metric, environment, and check interval.
    3. Set alarm tags and annotations to group alarms. They can be associated with alarm noise reduction policies for sending notifications. Because a business-layer metric is selected in substep 2 of this step, set Alarm Tag to aom_monitor_level:business.
      Figure 2 Customizing tag information

      For rules created in Select from all metrics mode, the tag is in "key:value" format. key is generally set to aom_monitor_level, and value depends on the metric layer:

      • Infrastructure metrics: infrastructure
      • Middleware metrics: middleware
      • Application metrics: application
      • Business metrics: business

  6. Set an alarm notification policy. There are two alarm notification modes. In this example, the alarm noise reduction mode is selected.

    Alarm noise reduction: Alarms are sent only after being processed based on noise reduction rules, preventing alarm storms.
    Figure 3 Selecting the alarm noise reduction mode

  7. Click Confirm. Then, click Back to Alarm Rule List to view the created alarm rule.

    As shown in the following figure, a metric alarm rule is created. Click the icon in front of the rule name to expand the rule and view its details.

    Figure 4 Creating a metric alarm rule

    If a metric value in the expanded list meets the configured alarm condition, a metric alarm is generated and displayed on the alarm page. To view the alarm, choose Alarm Management > Alarm List in the navigation pane.

    If the preset notification policy is met, the system sends an alarm notification to the specified personnel by email, SMS, or WeCom.
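
The aom_monitor_level tag values listed in step 5 can be checked before they are entered on the console. The Python sketch below is a small helper written for this example only; the function names are hypothetical and are not part of any AOM API.

```python
# Metric layers listed in step 5 for the aom_monitor_level tag.
MONITOR_LEVELS = {"infrastructure", "middleware", "application", "business"}


def build_monitor_tag(level):
    """Return a "key:value" tag string for the given metric layer."""
    if level not in MONITOR_LEVELS:
        raise ValueError(f"unknown metric layer: {level}")
    return f"aom_monitor_level:{level}"


def parse_tag(tag):
    """Split a "key:value" tag into its key and value."""
    key, separator, value = tag.partition(":")
    if not separator or not key or not value:
        raise ValueError(f"tag is not in key:value format: {tag}")
    return key, value


business_tag = build_monitor_tag("business")
```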