Updated on 2024-06-26 GMT+08:00

Configuring Custom Alarms on AOM

CCE interworks with AOM to report alarms and events. By configuring alarm rules on AOM, you can promptly detect abnormal resources in your clusters.

Process

  1. Creating a Topic on SMN
  2. Creating an Action Rule
  3. Adding an Alarm Rule
    1. Event alarms: Generate alarms based on the events reported by clusters to AOM. For details about the events and configurations, see Adding an Event Alarm.
    2. Metric alarms: Generate alarms based on the thresholds of monitoring metrics, such as resource utilization of servers and components. For details about the metric thresholds and configurations, see Adding a Metric Alarm.

Creating a Topic on SMN

Simple Message Notification (SMN) pushes messages to subscribers through emails, SMS messages, and HTTP/HTTPS requests.

A topic is used to publish messages and subscribe to notifications. It serves as a message transmission channel between publishers and subscribers.

You need to create a topic and add a subscription to it. For details, see Creating a Topic and Adding a Subscription to a Topic.

After adding a subscription, confirm it in the email or SMS message you receive. Notifications take effect only after the subscription is confirmed.

Creating an Action Rule

AOM allows you to customize alarm action rules. You can create an alarm action rule to associate an SMN topic with a message template. You can also customize notification content based on a message template.

For details, see Creating an Alarm Action Rule. When creating an action rule, select the topic that you created and subscribed to in Creating a Topic on SMN.
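
The relationship between a topic, its subscriptions, and an action rule can be pictured with a small data model. The following is only an illustrative sketch: the class names, field names, and sample values (Subscription, Topic, ActionRule, cce-alarms, ops@example.com) are hypothetical and are not part of the SMN or AOM APIs. It shows why an unconfirmed subscription receives nothing even though the action rule points at the right topic.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subscription:
    protocol: str            # for example "email" or "sms"
    endpoint: str            # address that receives the message
    confirmed: bool = False  # becomes True once the subscriber confirms


@dataclass
class Topic:
    name: str
    subscriptions: List[Subscription] = field(default_factory=list)


@dataclass
class ActionRule:
    name: str
    topic: Topic           # SMN topic the rule publishes to
    message_template: str  # template used to render the notification


def active_endpoints(rule: ActionRule) -> List[str]:
    """Return the endpoints that would actually receive a notification."""
    return [s.endpoint for s in rule.topic.subscriptions if s.confirmed]


# Hypothetical sample data: one confirmed and one pending subscription.
topic = Topic("cce-alarms", [
    Subscription("email", "ops@example.com", confirmed=True),
    Subscription("sms", "+10000000000"),  # not confirmed yet
])
rule = ActionRule("cce-alarm-action", topic, "default-template")
print(active_endpoints(rule))
```

In this model, only the confirmed email endpoint is returned, which mirrors the requirement to confirm a subscription before notifications take effect.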

Adding an Event Alarm

The following uses NodeNotReady as an example to describe how to add an event alarm. You can add other alarms by referring to Table 1.

Table 1 Event-based alarms

| Event Name | Source | Description | Solution |
|---|---|---|---|
| NodeNotReady | CCE | An alarm is triggered immediately when a node is abnormal. | Log in to the cluster and check the status of the node for which the alarm is generated. Set the node as unschedulable and schedule the service pods to another node. |
| Rebooted | CCE | An alarm is triggered immediately when a node is restarted. | Log in to the cluster to check the status of the node for which the alarm is generated, check whether the node can be started properly, and locate the cause of the restart. |
| KUBELETIsDown | CCE | An alarm is triggered immediately when a node is abnormal. | Log in to the cluster and check the status of the node for which the alarm is generated. Set the node as unschedulable and schedule the service pods to another node. Then, restart kubelet. |
| DOCKERIsDown | CCE | An alarm is triggered immediately when a node is abnormal. | Log in to the cluster and check the status of the node for which the alarm is generated. Set the node as unschedulable and schedule the service pods to another node. Then, restart Docker. |
| KUBEPROXYIsDown | CCE | An alarm is triggered immediately when a node is abnormal. | Log in to the cluster and check the status of the node for which the alarm is generated. Set the node as unschedulable and schedule the service pods to another node. |
| KernelOops | CCE | An alarm is triggered immediately when a node is abnormal. | Log in to the cluster and check the status of the node for which the alarm is generated. Set the node as unschedulable and schedule the service pods to another node. |
| ConntrackFull | CCE | An alarm is triggered immediately when a node is abnormal. | Log in to the cluster and check the status of the node for which the alarm is generated. Set the node as unschedulable and schedule the service pods to another node. |
| NodePoolSoldOut | CCE | An alarm is triggered immediately when node pool resources are sold out. | Set auto node pool switchover or change the node pool specifications. |
| NodeCreateFailed | CCE | An alarm is triggered immediately upon a node creation failure. | Rectify the failure and create the node again. |
| ScaleUpTimedOut | CCE | An alarm is triggered immediately upon node scale-out timeout. | Rectify the failure and try scale-out again. |
| ScaleDownFailed | CCE | An alarm is triggered immediately upon node scale-in timeout. | Rectify the failure and try scale-in again. |
| BackOffPullImage | CCE | Image pull retry failed. | Log in to the cluster, locate the failure cause, and deploy the service workload again. |
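
If alarm notifications are processed by a script, the event names and solutions in Table 1 can be kept as a simple lookup. The following is a minimal sketch that condenses the table wording; the names SOLUTIONS, NODE_FAULT, and suggest are hypothetical helpers, not part of CCE or AOM.

```python
# First response shared by most node-level events in Table 1.
NODE_FAULT = (
    "Check the node status, set the node as unschedulable, "
    "and schedule the service pods to another node."
)

# Event name -> condensed solution from Table 1.
SOLUTIONS = {
    "NodeNotReady": NODE_FAULT,
    "Rebooted": "Check the node status, confirm that the node starts "
                "properly, and locate the cause of the restart.",
    "KUBELETIsDown": NODE_FAULT + " Then restart kubelet.",
    "DOCKERIsDown": NODE_FAULT + " Then restart Docker.",
    "KUBEPROXYIsDown": NODE_FAULT,
    "KernelOops": NODE_FAULT,
    "ConntrackFull": NODE_FAULT,
    "NodePoolSoldOut": "Set auto node pool switchover or change the "
                       "node pool specifications.",
    "NodeCreateFailed": "Rectify the failure and create the node again.",
    "ScaleUpTimedOut": "Rectify the failure and try scale-out again.",
    "ScaleDownFailed": "Rectify the failure and try scale-in again.",
    "BackOffPullImage": "Locate the failure cause and deploy the "
                        "service workload again.",
}


def suggest(event_name: str) -> str:
    """Return the Table 1 solution for an event, or a fallback message."""
    return SOLUTIONS.get(
        event_name, "No entry in Table 1; investigate manually."
    )


print(suggest("KUBELETIsDown"))
```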

  1. Log in to the AOM 2.0 console.
  2. In the navigation pane on the left, choose Alarm Management > Alarm Rules. Then, click Create Alarm Rule.
  3. Enter basic information as prompted and configure other parameters as follows:

    For details about parameters, see Creating an Event Alarm Rule.

    • Rule Type: Select Event alarm rule.
    • Event Type: Select System.
    • Event Source: Select CCE.
    • Monitored Object: Filter monitored objects by notification type, event name, alarm severity, custom attribute, namespace, and cluster name.

      In this example, filter monitored objects by event name, select NodeNotReady, and set Trigger Mode to Immediate Trigger.

    • Alarm Mode: Select Direct alarm reporting.
    • Action Rule: Select the action rule created in Creating an Action Rule.

    Configure other parameters as required.

    In this example, the alarm settings are as follows:

    If a node in the cluster becomes abnormal, CCE reports the NodeNotReady event to AOM. AOM immediately notifies you through SMN based on the action rule.

    Figure 1 Adding an event alarm

  4. Click Confirm.

    After the alarm rule is created, it is displayed in the rule list.
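
The filtering behavior configured in the steps above can be sketched in a few lines. This is a simplified model for illustration only, not AOM's implementation: the EventAlarmRule class and the event dictionary keys (source, event_name, cluster_name) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class EventAlarmRule:
    event_source: str                   # "CCE" in this example
    event_name: str                     # for example "NodeNotReady"
    cluster_name: Optional[str] = None  # None means any cluster

    def matches(self, event: Dict[str, str]) -> bool:
        """Return True if the event passes every configured filter."""
        if event.get("source") != self.event_source:
            return False
        if event.get("event_name") != self.event_name:
            return False
        if self.cluster_name is not None:
            return event.get("cluster_name") == self.cluster_name
        return True


# With immediate triggering, one matching event is enough to raise the alarm.
rule = EventAlarmRule(event_source="CCE", event_name="NodeNotReady")
event = {"source": "CCE", "event_name": "NodeNotReady", "cluster_name": "prod"}
print(rule.matches(event))
```

Adding more filters, such as namespace or alarm severity, would follow the same pattern: each configured filter narrows the set of events that trigger the alarm.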

Adding a Metric Alarm

The following uses a PromQL statement as an example to describe how to add a metric alarm.

  1. Log in to the AOM 2.0 console.
  2. In the navigation pane on the left, choose Alarm Management > Alarm Rules. Then, click Create Alarm Rule.
  3. Configure parameters as follows:

    For details about parameters, see Creating a Metric Alarm Rule.

    • Rule Type: Select Metric alarm rule.
    • Configuration Mode: Select PromQL. You can enter native PromQL statements or use CCE templates.
    • Prometheus Instance: Select the AOM instance to which Cloud Native Cluster Monitoring in the cluster reports metrics.
    • Default Rule:
      • Custom: Enter a PromQL statement to configure the alarm rule. For example:
        kube_persistentvolume_status_phase{phase=~"Failed|Pending",cluster="${cluster_id}"} > 0

        ${cluster_id} indicates the cluster name. If a PV in the cluster is in the Failed or Pending state, an alarm will be generated.

      • CCEFromProm: Select an alarm template provided by CCE.
        Figure 2 Adding a metric alarm
    • Alarm Mode: Select Direct alarm reporting.
    • Action Rule: Select the action rule created in Creating an Action Rule.

    Configure other parameters as required.

  4. Click Confirm.

    After the alarm rule is created, it is displayed in the rule list.
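
To see what the example PromQL statement selects, the following sketch renders the statement for a given cluster and applies the same matching logic to a few sample series. It assumes that the ${cluster_id} placeholder is filled in before the statement is used; the sample data and the helper names (render, phase_matches, firing) are hypothetical. The sample values follow the kube-state-metrics convention in which the series for a PV's current phase has the value 1.

```python
import re
from string import Template

# The example statement; ${cluster_id} is valid string.Template syntax.
RULE = Template(
    'kube_persistentvolume_status_phase'
    '{phase=~"Failed|Pending",cluster="${cluster_id}"} > 0'
)


def render(cluster: str) -> str:
    """Fill the cluster placeholder into the PromQL statement."""
    return RULE.substitute(cluster_id=cluster)


def phase_matches(phase: str) -> bool:
    # PromQL regex matchers are fully anchored, so fullmatch is used.
    return re.fullmatch(r"Failed|Pending", phase) is not None


def firing(samples, cluster):
    """Return the PVs whose series would satisfy the statement."""
    return [
        s["persistentvolume"]
        for s in samples
        if s["cluster"] == cluster
        and phase_matches(s["phase"])
        and s["value"] > 0
    ]


# Hypothetical series for illustration.
samples = [
    {"persistentvolume": "pv-a", "phase": "Bound", "cluster": "my-cluster", "value": 1},
    {"persistentvolume": "pv-b", "phase": "Pending", "cluster": "my-cluster", "value": 1},
    {"persistentvolume": "pv-c", "phase": "Failed", "cluster": "other", "value": 1},
    {"persistentvolume": "pv-d", "phase": "Failed", "cluster": "my-cluster", "value": 0},
]

print(render("my-cluster"))
print(firing(samples, "my-cluster"))
```

Only pv-b satisfies all three conditions: pv-a is Bound, pv-c belongs to another cluster, and pv-d has a value of 0. A cluster name that contains double quotes or backslashes would need escaping before substitution.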