Updated on 2024-06-26 GMT+08:00

Configuring Custom Alarms on CCE

If the default alarm rules cannot meet your requirements, you can create custom alarm rules on CCE. These rules help you detect abnormal cluster resources promptly.

Adding Metric Alarms

  • To create Prometheus metric threshold-crossing alarm rules and metric alarm rules, you need to enable Monitoring Center. For details, see Enabling Cluster Monitoring.
  • Some metric templates are created based on the problems reported by CCE Node Problem Detector. For details about these metrics, see Table 1. To use the related alarm rules, ensure that CCE Node Problem Detector has been installed and is running normally.
  1. Log in to the CCE console and click the cluster name to access the cluster console.
  2. In the navigation pane on the left, choose Alarm Center. Then, choose Alarm Rules > Custom Alarm Rules, and click Create Alarm Rule.
  3. Configure the alarm rule parameters.

    • Rule Type: Select Metric alarm.
    • Alarm Template: If you select No template, you need to configure the parameters in Rule Details. You can also set this parameter to Use template to quickly define a PromQL-based alarm rule or modify an existing template.
    • Rule Details: Configure the following parameters.

      • Rule Name
        Description: Enter the name of the alarm rule.
        Example value: CoreDNS memory usage higher than 80%

      • (Optional) Description
        Description: Describe the alarm rule.
        Example value: Check whether the memory usage of CoreDNS is higher than 80%.

      • Alarm Rule (PromQL)
        Description: Enter a Prometheus query statement. For details about how to write Prometheus query statements, see Query Examples.
        Example value: The following statement generates an alarm when the memory usage of a CoreDNS container exceeds 80% of its memory limit:
        (sum(container_memory_working_set_bytes{image!="", container!="POD", namespace="kube-system", container="coredns"}) BY (cluster_name, node, container, pod, namespace, cluster) / sum(container_spec_memory_limit_bytes{namespace="kube-system", container="coredns"} > 0) BY (cluster_name, node, container, pod, namespace, cluster) * 100) > 80

      • Severity
        Description: Select Critical, Major, Minor, or Warning.
        Example value: Critical

      • Duration
        Description: Select an alarm duration from the drop-down list. The default value is 1 minute.
        Example value: 1 minute

      • Alarm Content
        Description: Define the content of the alarm notification. Prometheus variables can be referenced in the form ${variable}.
        Example value: Cluster: ${cluster_name}, Namespace: ${namespace}, Pod: ${pod}, Container: ${container} memory usage is higher than 80%. The current value is ${value}%.

      • Contact Group
        Description: Select an existing contact group. You can also click Create Contact Group to create a contact group. For details about the parameters, see Configuring Alarm Notification Recipients.
        Example value: CCEGroup

      In the preceding example, an alarm rule named CoreDNS memory usage higher than 80% is set for CoreDNS in the kube-system namespace, and its severity is Critical. When the memory usage stays above 80% of the memory limit for 1 minute, a notification is sent by SMS or email to all alarm contacts in the CCEGroup contact group. The notification contains the cluster name, namespace, pod name, container name, and current memory usage.

    • (Optional) Advanced Settings
      • Alarm Tag: An attribute for identifying and grouping alarms to reduce noise. In the message template, the tag value is referenced as $event.metadata. A maximum of 10 alarm tags can be added.
      • Alarm Annotation: An attribute that is not used for alarm identification. In the message template, the annotation value is referenced as $event.annotations. A maximum of 10 alarm annotations can be added.

  4. Click OK. Then, go to the Custom Alarm Rules page to confirm that the rule was created.
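
The arithmetic behind the example rule and the ${variable} substitution in Alarm Content can be illustrated offline. The following is a minimal Python sketch, not how CCE evaluates rules: the console and Prometheus do the real evaluation and rendering. The pod names, byte values, and the cluster name demo-cluster are hypothetical sample data, and Python's string.Template is used only because it happens to share the ${name} placeholder syntax.

```python
from string import Template

# Hypothetical sample values standing in for the two metrics used by the rule.
# Keys are (namespace, pod, container); values are bytes.
working_set_bytes = {
    ("kube-system", "coredns-a", "coredns"): 450 * 1024**2,
    ("kube-system", "coredns-b", "coredns"): 120 * 1024**2,
}
memory_limit_bytes = {
    ("kube-system", "coredns-a", "coredns"): 512 * 1024**2,
    ("kube-system", "coredns-b", "coredns"): 512 * 1024**2,
}

THRESHOLD_PERCENT = 80

# Same placeholder syntax as the Alarm Content example: ${label}.
ALARM_CONTENT = Template(
    "Cluster: ${cluster_name}, Namespace: ${namespace}, Pod: ${pod}, "
    "Container: ${container} memory usage is higher than 80%. "
    "The current value is ${value}%."
)


def usage_percent(used, limit):
    """Mirror the rule's arithmetic: used / limit * 100."""
    return used / limit * 100


def firing_alarms(cluster_name):
    """Return one rendered message per series whose usage exceeds the threshold."""
    messages = []
    for (namespace, pod, container), used in working_set_bytes.items():
        limit = memory_limit_bytes.get((namespace, pod, container), 0)
        if limit <= 0:  # the rule's "> 0" filter skips containers without a limit
            continue
        value = usage_percent(used, limit)
        if value > THRESHOLD_PERCENT:
            messages.append(
                ALARM_CONTENT.substitute(
                    cluster_name=cluster_name,
                    namespace=namespace,
                    pod=pod,
                    container=container,
                    value=f"{value:.2f}",
                )
            )
    return messages


for message in firing_alarms("demo-cluster"):
    print(message)
```

With this sample data, only coredns-a (450 MiB used of a 512 MiB limit) crosses the threshold. The sketch ignores the Duration setting, which additionally requires the condition to hold for the selected period before a notification is sent.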

Adding Event Alarms

  • To create event-triggered alarm rules, you need to enable Logging and Kubernetes event collection. For details, see Collecting Container Logs Using Cloud Native Logging.
  • Some metric templates are created based on the problems reported by CCE Node Problem Detector. For details about these metrics, see Table 1. To use the related alarm rules, ensure that CCE Node Problem Detector has been installed and is running normally.
  1. Log in to the CCE console and click the cluster name to access the cluster console.
  2. In the navigation pane on the left, choose Alarm Center. Then, choose Alarm Rules > Custom Alarm Rules, and click Create Alarm Rule.
  3. Configure the alarm rule parameters.

    • Rule Type: Select Event alarm. Common events include Kubernetes events and cloud service events.
    • Rule Details: Configure the following parameters.

      • Rule Name
        Description: Enter the name of the alarm rule.
        Example value: ReplicaSet quantity change

      • (Optional) Description
        Description: Describe the alarm rule.
        Example value: The number of ReplicaSets changes more than three times within 5 minutes.

      • Event Name
        Description: Enter the event name based on the actual Kubernetes event or cloud service event. For details about event names, see CCE Events.
        Example value: ScalingReplicaSet

      • Triggering Mode
        Description:
          • Immediate trigger: An alarm is generated as soon as the event occurs.
          • Accumulative trigger: An alarm is generated only after the event occurs a preset number of times within the monitoring interval.
        Example value: Select Accumulative trigger, and set Monitoring Interval to 5 minutes and Occurrences to > 3.

      • Severity
        Description: Select Critical, Major, Minor, or Warning.
        Example value: Minor

      • Contact Group
        Description: Select an existing contact group. You can also click Create Contact Group to create a contact group. For details about the parameters, see Configuring Alarm Notification Recipients.
        Example value: CCEGroup

      In the preceding example, an alarm rule named ReplicaSet quantity change is set for the ScalingReplicaSet event, and its severity is Minor. When the number of ReplicaSets changes more than three times within 5 minutes, a notification is sent by SMS or email to all alarm contacts in the CCEGroup contact group.

  4. Click OK. Then, go to the Custom Alarm Rules page to confirm that the rule was created.
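
The accumulative trigger in the example (Monitoring Interval of 5 minutes, Occurrences > 3) can be pictured as counting events inside a time window. The following Python sketch is only one plausible reading: it assumes a sliding window and events arriving in chronological order, and the AccumulativeTrigger class is a hypothetical illustration. How CCE actually aligns the monitoring interval is not stated here and may differ.

```python
from collections import deque
from datetime import datetime, timedelta


class AccumulativeTrigger:
    """Fire when more than `occurrences` events fall within `interval`."""

    def __init__(self, interval, occurrences):
        self.interval = interval
        self.occurrences = occurrences
        self._timestamps = deque()

    def observe(self, event_time):
        """Record one event and report whether the alarm condition is met."""
        self._timestamps.append(event_time)
        # Drop events that are older than the monitoring interval.
        cutoff = event_time - self.interval
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.occurrences


trigger = AccumulativeTrigger(interval=timedelta(minutes=5), occurrences=3)
start = datetime(2024, 6, 26, 12, 0, 0)
# Four ScalingReplicaSet events, one per minute: only the fourth exceeds "> 3".
results = [trigger.observe(start + timedelta(minutes=i)) for i in range(4)]
print(results)  # [False, False, False, True]
```

An Immediate trigger corresponds to the degenerate case in which every observed event reports the alarm condition as met.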