Updated on 2024-08-05 GMT+08:00

Managing Containers

This section describes how to use AOM to quickly manage containers on the Overview page, including container monitoring and alarm rule creation. The procedure is as follows:

  1. Monitoring Containers: AOM is compatible with Kubernetes and automatically collects and reports container information.
  2. Setting an Alarm Rule: Create metric alarm rules to ensure that notifications are sent when containers are abnormal.
  3. Setting an Alarm Action Rule: Configure alarm action rules, for example, so that containers automatically restart when they become abnormal.

Monitoring Containers

  1. Log in to the AOM 2.0 console.
  2. In the navigation pane, choose Overview.
  3. Go to the By Container page.

  4. In the Getting Started area, click Monitor Container. The Workload Monitoring page is displayed.

  5. In the upper right corner of the page, set filter criteria.

    1. Set a time range to view the workloads reported. There are two methods to set a time range:

      Method 1: Use a predefined time label, such as Last hour, Last 6 hours, or Last day. Select one as required.

      Method 2: Specify the start time and end time (max. 30 days).

    2. Set the interval for refreshing information. Click the drop-down list and select a value.

  6. Click any workload tab to view information, such as workload name, status, cluster, and namespace.

    • In the upper part of the workload list, filter workloads by cluster, namespace, or pod name.
    • Click the refresh icon in the upper right corner to obtain the latest workload information.
    • Click the column settings icon in the upper right corner and select or deselect the columns to display.
    • Click the name of a workload to view its details.
      • On the Pods tab page, view all pod conditions of the workload. Click a pod name to view the resource usage and health status of the pod's containers.
      • On the Monitoring Views tab page, view the resource usage of the workload.
      • On the Alarms tab page, view the alarm details of the workload.
      • On the Events tab page, view the event details of the workload.

Setting an Alarm Rule

Metric alarm rules can be created in three modes: Select by resource type, Select from all metrics, and PromQL.

The following uses Select from all metrics as an example.

  1. On the Overview page, switch to By Container.
  2. In the Getting Started area, click Set Alarm Rule. The Alarm Rules page is displayed.
  3. Click Create Alarm Rule.
  4. Set basic information about the alarm rule by referring to Table 1.

    Table 1 Basic information

    Parameter

    Description

    Rule Name

    Name of a rule. Enter a maximum of 256 characters and do not start or end with any special character. Only letters, digits, underscores (_), and hyphens (-) are allowed.

    Enterprise Project

    Enterprise project.

    • If you have selected All for Enterprise Project on the global settings page, select one from the drop-down list here.
    • If you have already selected an enterprise project on the global settings page, this option will be grayed and cannot be changed.
      NOTE:

      To use the enterprise project function, contact engineers.

    Description

    Description of the rule. Enter up to 1,024 characters.
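The Rule Name constraints in Table 1 can be checked before a name is entered on the console. The following is a minimal sketch, not the console's actual validation logic. It assumes that "special character" refers to the underscore and hyphen (the only non-alphanumeric characters allowed) and that "letters" means ASCII letters. The function name is_valid_rule_name is hypothetical.

```python
import re

# Assumption: "special character" means underscore or hyphen, and "letters"
# means ASCII letters. The console may accept a wider character set.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9_-]*[A-Za-z0-9])?")


def is_valid_rule_name(name: str, max_len: int = 256) -> bool:
    """Return True if name satisfies the Rule Name constraints in Table 1."""
    # Length check: at least 1 character and at most max_len characters.
    if not 1 <= len(name) <= max_len:
        return False
    # Character check: allowed characters only, alphanumeric at both ends.
    return _NAME_PATTERN.fullmatch(name) is not None


print(is_valid_rule_name("container-cpu_high"))  # valid name
print(is_valid_rule_name("_container-cpu"))      # starts with an underscore
```

The max_len parameter is included so that the same check can be reused for names with a different length limit.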

  5. Set the detailed information about the alarm rule.

    1. Set Rule Type to Metric alarm rule.
    2. Set Configuration Mode to Select from all metrics.
    3. Select a target Prometheus instance from the drop-down list.
    4. Set alarm rule details. Table 2 describes the parameters.

      After the setting is complete, the monitored metric data is displayed in a line graph above the alarm condition. Click the line icon before each metric data record to hide the metric data in the graph. You can click Add Metric to add metrics and set the statistical period and detection rules for the metrics.

      After moving the cursor to the metric data and the corresponding alarm condition, you can perform the following operations as required:

      • Click the hide icon next to an alarm condition to hide the corresponding metric data record in the graph.
      • Click the conversion icon next to an alarm condition to convert the metric data and alarm condition into a Prometheus command.
      • Click the copy icon next to an alarm condition to quickly copy the metric data and alarm condition and modify them as required.
      • Click the delete icon next to an alarm condition to remove a metric data record from monitoring.
      Table 2 Alarm rule details

      Parameter

      Description

      Multiple Metrics

      The system evaluates the preset alarm conditions one by one. An alarm is triggered when any one of the conditions is met.

      For example, if three alarm conditions are set, the system evaluates each of them separately. If any of the conditions is met, an alarm will be triggered.

      Combined Operations

      The system performs calculation based on the expression you set. If the condition is met, an alarm will be triggered.

      For example, if there is no metric showing the CPU core usage of a host, do as follows:

      • Set the metric of alarm condition "a" to aom_node_cpu_used_core and retain the default values for other parameters. This metric is used to count the number of CPU cores used by a measured object.
      • Set the metric of alarm condition "b" to aom_node_cpu_limit_core and retain the default values for other parameters. This metric is used to count the total number of CPU cores that have been requested for a measured object.
      • Set the expression to "a/b" to obtain the CPU core usage of the host.
      • Set Rule to Max > 0.2.
      • In the trigger condition, set Consecutive Periods to 3.
      • Set Alarm Severity to Critical.

      If the maximum CPU core usage of a host is greater than 0.2 for three consecutive periods, a critical alarm will be generated.
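The Combined Operations example above can be sketched in code. This is an illustration of the described behavior, not AOM's implementation. It assumes that each statistical period provides a list of samples for metrics "a" and "b", and that the Max statistic is taken over the per-sample results of "a/b" within one period. The function name combined_alarm and its parameters are hypothetical.

```python
from typing import Sequence


def combined_alarm(
    used_cores: Sequence[Sequence[float]],
    limit_cores: Sequence[Sequence[float]],
    threshold: float = 0.2,
    consecutive_periods: int = 3,
) -> bool:
    """Return True if max(a/b) exceeds threshold in N consecutive periods.

    Each inner sequence holds the samples of one statistical period.
    """
    streak = 0
    for a_samples, b_samples in zip(used_cores, limit_cores):
        # Expression "a/b": per-sample CPU core usage; skip zero limits.
        ratios = [a / b for a, b in zip(a_samples, b_samples) if b]
        # Rule "Max > 0.2": compare the period's maximum with the threshold.
        if ratios and max(ratios) > threshold:
            streak += 1
            if streak >= consecutive_periods:
                return True
        else:
            # The condition was not met, so the consecutive count restarts.
            streak = 0
    return False


# Three periods of samples for aom_node_cpu_used_core and aom_node_cpu_limit_core.
used = [[1.0, 0.5], [0.9, 0.3], [1.2, 0.1]]
limit = [[4.0, 4.0], [4.0, 4.0], [4.0, 4.0]]
print(combined_alarm(used, limit))
```

In this sample data, the maximum usage exceeds 0.2 in each of the three periods, so the condition for a critical alarm is met.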

      Metric

      Metric to be monitored. When Select from all metrics is selected, enter keywords to search for metrics.

      Click the Metric text box. In the resource tree on the right, you can also select a target metric by resource type.

      Statistical Period

      Metric data is aggregated based on the configured statistical period, which can be 1 minute, 5 minutes, 15 minutes, or 1 hour.

      Condition

      Metric monitoring scope. If this parameter is left blank, all resources are covered.

      Each condition is a key-value pair that consists of a dimension name and a dimension value. Select a dimension name from the drop-down list. How you specify the dimension value depends on the matching mode.

      • =: Select a dimension value from the drop-down list. For example, if Dimension Name is set to Host name and Dimension Value is set to 192.168.16.4, only host 192.168.16.4 will be monitored.
      • !=: Select a dimension value from the drop-down list. For example, if Dimension Name is set to Host name and Dimension Value is set to 192.168.16.4, all hosts excluding host 192.168.16.4 will be monitored.
      • =~: The dimension value is determined based on one or more regular expressions. Separate regular expressions with vertical bars (|). For example, if Dimension Name is set to Host name and Regular Expression is set to 192.*|172.*, only hosts whose names match 192.* or 172.* will be monitored.
      • !~: The dimension value is determined based on one or more regular expressions. Separate regular expressions with vertical bars (|). For example, if Dimension Name is set to Host name and Regular Expression is set to 192.*|172.*, all hosts except those whose names match 192.* or 172.* will be monitored.

      For details about how to enter a regular expression, see Regular Expression Examples.

      You can also click the add icon and select AND or OR to add more conditions for the metric.
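The four matching modes described for Condition can be illustrated with a short sketch. It assumes that a regular expression must match the entire dimension value, which is how Prometheus label matchers behave; whether AOM applies exactly the same rule is an assumption. Python's re module is used here for illustration only, and its syntax differs from Prometheus regular expressions for some patterns. The function name matches is hypothetical.

```python
import re


def matches(mode: str, actual: str, expected: str) -> bool:
    """Evaluate one condition. mode is one of "=", "!=", "=~", "!~"."""
    if mode == "=":
        return actual == expected
    if mode == "!=":
        return actual != expected
    if mode in ("=~", "!~"):
        # Assumption: the pattern must match the whole dimension value.
        hit = re.fullmatch(expected, actual) is not None
        return hit if mode == "=~" else not hit
    raise ValueError(f"unknown matching mode: {mode}")


hosts = ["192.168.16.4", "172.16.0.9", "10.0.0.1"]
# Hosts selected by the =~ example: names matching 192.* or 172.*
print([h for h in hosts if matches("=~", h, "192.*|172.*")])
# Hosts selected by the !~ example: all other hosts
print([h for h in hosts if matches("!~", h, "192.*|172.*")])
```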

      Grouping Condition

      Aggregate metric data by the specified field and calculate the aggregation result. Options: Not grouped, avg by, max by, min by, and sum by. For example, avg by clusterName indicates that metrics are grouped by cluster name, and the average value of the grouped metrics is calculated and displayed in the graph.
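As a rough illustration of what a grouping condition such as avg by clusterName computes, the following sketch groups samples by one field and aggregates each group. It is not AOM's implementation. The sample structure (a label dictionary plus a value) and the function name group_metric are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, List, Tuple

# Aggregation functions for the grouping options listed above.
AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "avg by": mean,
    "max by": max,
    "min by": min,
    "sum by": sum,
}


def group_metric(
    samples: Iterable[Tuple[Dict[str, str], float]],
    mode: str,
    field: str,
) -> Dict[str, float]:
    """Group samples by the given field and aggregate each group."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for labels, value in samples:
        # Samples without the field are collected under an empty key.
        groups[labels.get(field, "")].append(value)
    aggregate = AGGREGATORS[mode]
    return {key: aggregate(values) for key, values in groups.items()}


samples = [
    ({"clusterName": "cluster-a"}, 10.0),
    ({"clusterName": "cluster-a"}, 30.0),
    ({"clusterName": "cluster-b"}, 5.0),
]
print(group_metric(samples, "avg by", "clusterName"))
```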

      Rule

      Detection rule of a metric alarm, which consists of the statistical mode (Avg, Min, Max, Sum, and Samples), determination criterion (≥, ≤, >, and <), and threshold value. For example, if the detection rule is set to Avg > 10, a metric alarm will be generated if the average metric value is greater than 10.

      Trigger Condition

      When the metric value meets the alarm condition for a specified number of consecutive periods, a metric alarm will be generated. Range: 1 to 30.

      For example, if Consecutive Periods is set to 2, a metric alarm will be triggered if the trigger condition is met for two consecutive periods.

      Alarm Severity

      Severity of a metric alarm. Options: Critical, Major, Minor, and Warning.

  6. Click Advanced Settings and set information such as Check Interval and Alarm Clearance. For details about the parameters, see Table 3.

    Table 3 Advanced settings

    Parameter

    Description

    Check Interval

    Interval at which metric query and analysis results are checked.

    • Hourly: Query and analysis results are checked every hour.
    • Daily: Query and analysis results are checked at a fixed time every day.
    • Weekly: Query and analysis results are checked at a fixed time point on a specified day of a week.
    • Custom interval: The query and analysis results are checked at a fixed interval.
    • Cron: A cron expression is used to specify a time interval. Query and analysis results are checked at the specified interval.

      The time specified in the cron expression is accurate to the minute and must use the 24-hour notation. Example: 0/5 * * * *, which indicates that the check starts at minute 0 of each hour and is performed every 5 minutes.
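To make the cron example concrete, the following sketch expands the minute field of an expression such as 0/5 * * * * into the minutes of an hour at which a check would run. It is a simplified illustration that handles only a single value, an asterisk, and the start/step form; ranges and comma-separated lists are not covered. The function name expand_minute_field is hypothetical.

```python
from typing import List


def expand_minute_field(field: str) -> List[int]:
    """Expand a cron minute field such as "0/5", "*/15", "*", or "30"."""
    base, _, step_text = field.partition("/")
    # An asterisk as the base means "start from minute 0".
    start = 0 if base == "*" else int(base)
    if not 0 <= start <= 59:
        raise ValueError(f"minute out of range: {start}")
    if step_text:
        step = int(step_text)
        if step <= 0:
            raise ValueError(f"step must be positive: {step}")
        # "start/step": every step minutes, beginning at start.
        return list(range(start, 60, step))
    # No step: either every minute ("*") or one fixed minute.
    return list(range(60)) if base == "*" else [start]


minute_field = "0/5 * * * *".split()[0]
print(expand_minute_field(minute_field))
```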

    Alarm Clearance

    The alarm will be cleared when the alarm condition is not met for a specified number of consecutive periods. By default, metrics in only one period are monitored. You can set up to 30 consecutive monitoring periods.

    For example, if Consecutive Periods is set to 2, the alarm will be cleared when the alarm condition is not met for two consecutive periods.
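The interaction between the trigger condition and alarm clearance can be sketched as a small state machine. This is an illustration under stated assumptions, not AOM's internal logic: each input value represents whether the alarm condition was met in one period, and the consecutive count restarts whenever the streak is broken. The function name alarm_states is hypothetical.

```python
from typing import Iterable, List


def alarm_states(
    condition_met_per_period: Iterable[bool],
    trigger_periods: int = 2,
    clear_periods: int = 2,
) -> List[str]:
    """Return the state ("normal" or "alarm") after each period."""
    state = "normal"
    streak = 0  # consecutive periods that contradict the current state
    states: List[str] = []
    for condition_met in condition_met_per_period:
        # In "normal", a met condition counts toward triggering.
        # In "alarm", an unmet condition counts toward clearance.
        contradicts = condition_met if state == "normal" else not condition_met
        streak = streak + 1 if contradicts else 0
        needed = trigger_periods if state == "normal" else clear_periods
        if streak >= needed:
            state = "alarm" if state == "normal" else "normal"
            streak = 0
        states.append(state)
    return states


print(alarm_states([True, True, False, True, False, False]))
```

With both values set to 2, the alarm in this sample is triggered after the second period and cleared only after the last two periods, because the single unmet period in between does not form a streak of two.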

    Action Taken for Insufficient Data

    Action to be taken when no metric data is generated or metric data is insufficient within the monitoring period. You can set this option based on your requirements.

    By default, metrics in only one period are monitored. You can set up to five consecutive monitoring periods.

    The system supports the following actions: changing the status to Exceeded and sending an alarm, changing the status to Insufficient data and sending an event, maintaining Previous status, and changing the status to Normal and sending an alarm clearance notification.

    Alarm Tag

    Click the add icon to add an alarm tag, which is an alarm identification attribute in the format of "key:value". Alarm tags are used in alarm noise reduction scenarios.

    For details, see Alarm Tags and Annotations.

    Alarm Annotation

    Click the add icon to add an alarm annotation, which is an alarm non-identification attribute in the format of "key:value". Alarm annotations are used in alarm notification and message template scenarios.

    For details, see Alarm Tags and Annotations.

  7. Set an alarm notification policy. For details, see Table 4.

    Table 4 Parameters for setting an alarm notification policy

    Parameter

    Description

    Notify When

    Set the scenario for sending alarm notifications.

    • Alarm triggered: If the alarm trigger condition is met, the system sends an alarm notification to the specified personnel by email or SMS.
    • Alarm cleared: If the alarm clearance condition is met, the system sends an alarm notification to the specified personnel by email or SMS.

    Alarm Mode

    • Direct alarm reporting: An alarm is directly sent when the alarm condition is met. If you select this mode, set an interval for notification and specify whether to enable an action rule.

      Frequency: frequency for sending alarm notifications. Select a desired value from the drop-down list.

      Action Rule: If you enable an action rule, the system sends notifications based on the associated SMN topic and message template. If the existing alarm action rules cannot meet your requirements, click Create Rule in the drop-down list to create one. For details about how to set alarm action rules, see Setting an Alarm Action Rule.

    • Alarm noise reduction: Alarms are sent only after being processed based on noise reduction rules, preventing alarm storms.

      If you select this mode, the silence rule is enabled by default. You can determine whether to enable Grouping Rule as required. If you enable this function, select a grouping rule from the drop-down list. If the existing grouping rules cannot meet your requirements, click Create Rule in the drop-down list to create one.

  8. Click Confirm. Then click View Rule to view the created alarm rule.

    In the expanded list, if a metric value meets the configured alarm condition, a metric alarm is generated on the alarm page. To view it, choose Alarm Management > Alarm List in the navigation pane. If a metric value meets the preset notification policy, the system sends an alarm notification to the specified personnel by email or SMS.

Setting an Alarm Action Rule

  1. Go to the By Container page.
  2. In the Getting Started area, click Set Alarm Action Rule. The Alarm Action Rules page is displayed.
  3. On the Action Rules tab page, click Create.
  4. Set parameters such as Rule Name and Action Type by referring to Table 5.

    Table 5 Parameters for creating an alarm action rule

    Parameter

    Description

    Rule Name

    Name of an action rule. Enter up to 100 characters and do not start or end with an underscore (_) or hyphen (-). Only digits, letters, underscores, and hyphens are allowed.

    Enterprise Project

    Enterprise project.

    • If you have selected All for Enterprise Project on the global settings page, select one from the drop-down list here.
    • If you have already selected an enterprise project on the global settings page, this option will be grayed and cannot be changed.
      NOTE:

      To use the enterprise project function, contact engineers.

    Description

    Description of the action rule. Enter up to 1,024 characters.

    Action Type

    Type of an alarm action rule. Only Metric/Event is supported.

    Action

    Type of action that is associated with the SMN topic and message template. Select one from the drop-down list. Only Notification is supported.

    Topic

    SMN topic. Select your desired topic from the drop-down list.

    If no suitable topic is available, create one on the SMN console.

    Message Template

    Notification message template. Select your desired template from the drop-down list.

    If no proper message template is available, click Create Template to create a message template.

  5. Click OK.