Updated on 2024-04-25 GMT+08:00

Configuring Custom Alarms on AOM

CCE interworks with Application Operations Management (AOM) to report alarms and events. By configuring alarm rules on AOM, you can promptly learn whether the resources in your clusters are running normally.

Process

  1. Creating a Topic on SMN
  2. Creating an Action Policy
  3. Adding an Alarm Rule
    1. Event alarms: Generate alarms based on the events reported by clusters to AOM. For details about the events and configurations, see Adding Event Alarms.
    2. Metric alarms: Generate alarms based on the thresholds of monitoring metrics, such as resource utilization of servers and components. For details about the metric thresholds and configurations, see Adding Metric Alarms.

Creating a Topic on SMN

Simple Message Notification (SMN) pushes messages to subscribers through emails, SMS messages, and HTTP/HTTPS requests.

A topic is used to publish messages and subscribe to notifications. It serves as a message transmission channel between publishers and subscribers.

You need to create a topic and add a subscription to it.

After subscribing to a topic, confirm the subscription in the email or SMS message for the notification to take effect.
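The relationship between a topic, its subscriptions, and the confirmation step can be sketched as a small in-memory model. This is a conceptual illustration only, not the SMN API: the Topic class, its methods, the topic name, and the endpoints are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Topic:
    """Hypothetical in-memory stand-in for a notification topic."""
    name: str
    # Maps each subscribed endpoint to whether it has been confirmed.
    subscriptions: Dict[str, bool] = field(default_factory=dict)

    def subscribe(self, endpoint: str) -> None:
        # A new subscription starts unconfirmed.
        self.subscriptions[endpoint] = False

    def confirm(self, endpoint: str) -> None:
        # Models the subscriber confirming through the email or SMS message.
        if endpoint not in self.subscriptions:
            raise KeyError(f"{endpoint} is not subscribed to {self.name}")
        self.subscriptions[endpoint] = True

    def publish(self, message: str) -> List[str]:
        # Only confirmed subscribers receive the message.
        return [ep for ep, confirmed in self.subscriptions.items() if confirmed]


topic = Topic("cce-alarms")                    # hypothetical topic name
topic.subscribe("ops@example.com")             # hypothetical endpoints
topic.subscribe("oncall@example.com")
topic.confirm("ops@example.com")
receivers = topic.publish("NodeNotReady on node-1")
print(receivers)
```

In this model, the unconfirmed subscriber receives nothing, which mirrors why the confirmation step is required before notifications take effect.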

Creating an Action Policy

AOM allows you to customize alarm action policies. You can create an alarm action policy to associate an SMN topic and a message template. You can also customize notification content by using a message template.

For details, see Creating Alarm Action Policies. When creating an action policy, select the topic that is created and subscribed to in Creating a Topic on SMN.

Adding Event Alarms

The following uses the NodeNotReady alarm as an example to describe how to add an event alarm.

This function is provided by AOM. For details about parameters, see Creating an Event Alarm Rule.

Table 1 Event-based alarms

| Event Name | Source | Description | Solution |
| --- | --- | --- | --- |
| NodeNotReady | CCE | An alarm is triggered immediately when a node is abnormal. | Log in to the cluster and check the status of the node for which the alarm is generated. Set the node as unschedulable and schedule the service pods to another node. |
| Rebooted | CCE | An alarm is triggered immediately when a node is restarted. | Log in to the cluster, check the status of the node for which the alarm is generated, check whether the node can be started properly, and locate the cause of the restart. |
| KUBELETIsDown | CCE | An alarm is triggered immediately when kubelet on a node is abnormal. | Log in to the cluster and check the status of the node for which the alarm is generated. Set the node as unschedulable and schedule the service pods to another node. Then, restart kubelet. |
| DOCKERIsDown | CCE | An alarm is triggered immediately when the container engine (Docker) on a node is abnormal. | Log in to the cluster and check the status of the node for which the alarm is generated. Set the node as unschedulable and schedule the service pods to another node. Then, restart Docker. |
| KUBEPROXYIsDown | CCE | An alarm is triggered immediately when kube-proxy on a node is abnormal. | Log in to the cluster and check the status of the node for which the alarm is generated. Set the node as unschedulable and schedule the service pods to another node. |
| KernelOops | CCE | An alarm is triggered immediately when the OS kernel of a node is faulty. | Log in to the cluster and check the status of the node for which the alarm is generated. Set the node as unschedulable and schedule the service pods to another node. |
| ConntrackFull | CCE | An alarm is triggered immediately when the conntrack table of a node is full. | Log in to the cluster and check the status of the node for which the alarm is generated. Set the node as unschedulable and schedule the service pods to another node. |
| NodePoolSoldOut | CCE | An alarm is triggered immediately when node pool resources are sold out. | Set auto node pool switchover or change the node pool specifications. |
| NodeCreateFailed | CCE | An alarm is triggered immediately upon a node creation failure. | Rectify the failure and create the node again. |
| ScaleUpTimedOut | CCE | An alarm is triggered immediately upon a node scale-out timeout. | Rectify the failure and try the scale-out again. |
| ScaleDownFailed | CCE | An alarm is triggered immediately upon a node scale-in failure. | Rectify the failure and try the scale-in again. |
| BackOffPullImage | CCE | An alarm is triggered immediately when an image pull retry fails. | Log in to the cluster, locate the failure cause, and deploy the service workload again. |
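Several solutions in Table 1 ask you to set the node as unschedulable and move the service pods to another node. With kubectl, this typically corresponds to cordoning and then draining the node. The following is a minimal sketch that only builds the command lines and does not run them. The helper name and the node name are hypothetical, and the drain flags are an assumption about what suits your workloads (--delete-emptydir-data discards emptyDir data on the drained node).

```python
import shlex
from typing import List


def isolate_node_commands(node_name: str) -> List[List[str]]:
    """Return kubectl argument lists that cordon a node and then drain it."""
    if not node_name:
        raise ValueError("node_name must not be empty")
    return [
        # Mark the node as unschedulable so that no new pods land on it.
        ["kubectl", "cordon", node_name],
        # Evict the pods so that their controllers recreate them elsewhere.
        # DaemonSet-managed pods are skipped; emptyDir data is deleted.
        ["kubectl", "drain", node_name,
         "--ignore-daemonsets", "--delete-emptydir-data"],
    ]


if __name__ == "__main__":
    # "192.168.0.10" is a placeholder node name.
    for argv in isolate_node_commands("192.168.0.10"):
        print(shlex.join(argv))
```

Review the printed commands before running them, for example by passing each list to subprocess.run on a host where kubectl is configured for the cluster.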

  1. Log in to the AOM console.
  2. In the navigation pane, choose Alarm Center > Alarm Rules and click Add Alarm.
  3. Configure an alarm rule.

    • Rule Type: Select Event alarm.
    • Alarm Source: Select CCE.
    • Select Object: Select Event Name and then NodeNotReady. You can filter trigger objects by notification type, event name, alarm severity, custom attribute, namespace, and cluster name.
    • Triggering Policy: Select Immediate Triggering.
    • Alarm Mode: Select Direct Alarm Reporting.
    • Action Policy: Select the action policy created in Creating an Action Policy.

    This alarm rule works as follows:

    If a node in the cluster becomes abnormal, CCE reports the NodeNotReady event to AOM. AOM immediately notifies you through SMN based on the action policy.

    Figure 1 Creating an event alarm

  4. Click Create Now.

    If the new rule is displayed in the rule list, the rule is created successfully.
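The behavior of the rule configured above (a CCE event named NodeNotReady triggers an immediate notification) can be sketched with a small model. This is a conceptual illustration, not the AOM API: the Event and EventAlarmRule classes, the cluster name, and the node name are hypothetical, and the notify callback stands in for the action policy that forwards the message to SMN.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Event:
    source: str    # Alarm source, for example "CCE"
    name: str      # Event name, for example "NodeNotReady"
    cluster: str
    resource: str


@dataclass
class EventAlarmRule:
    alarm_source: str
    event_name: str
    notify: Callable[[str], None]  # Stand-in for the action policy

    def handle(self, event: Event) -> bool:
        """Notify immediately if the event matches; return whether it matched."""
        if event.source != self.alarm_source or event.name != self.event_name:
            return False
        self.notify(f"[{event.name}] {event.resource} in cluster {event.cluster}")
        return True


sent: List[str] = []
rule = EventAlarmRule("CCE", "NodeNotReady", sent.append)
rule.handle(Event("CCE", "NodeNotReady", "cluster-demo", "node-1"))  # matches
rule.handle(Event("CCE", "Rebooted", "cluster-demo", "node-1"))      # ignored
print(sent)
```

Only the first event produces a notification, because the rule filters on both the alarm source and the event name.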

CCE Events

Event alarms are generated based on the events that CCE reports to AOM. You can view the specific events in the Alarm Rule Settings area and add event alarms as required.

Figure 2 Events reported by CCE

Data plane events and control plane events can be reported for CCE clusters.

Table 2 Data plane events

| Type | Event Name | Severity | Remarks |
| --- | --- | --- | --- |
| Pod | PodOOMKilling | Major | Check whether the pod exits due to OOM. This event is reported by CCE Node Problem Detector (1.18.41 or later) and Cloud Native Logging (1.3.2 or later). |
| Pod | FailedStart | Major | Check whether the pod is started. |
| Pod | FailedPullImage | Major | Check whether the pod has pulled an image. |
| Pod | BackOffStart | Major | Check whether the pod fails to be restarted. |
| Pod | FailedScheduling | Major | Check whether the pod is scheduled. |
| Pod | BackOffPullImage | Major | Check whether the pod has pulled an image after a retry. |
| Pod | FailedCreate | Major | Check whether a pod is created. |
| Pod | Unhealthy | Minor | Check whether the pod is running normally. |
| Pod | FailedDelete | Minor | Check whether the workload is deleted. |
| Pod | ErrImageNeverPull | Minor | Check whether the workload has pulled an image. |
| Pod | FailedScaleOut | Minor | Check whether workload copies are scaled out. |
| Pod | FailedStandBy | Minor | Check whether the pod enters the standby state. |
| Pod | FailedReconfig | Minor | Check whether the pod configuration is updated. |
| Pod | FailedActive | Minor | Check whether the pod is activated. |
| Pod | FailedRollback | Minor | Check whether the pod is rolled back. |
| Pod | FailedUpdate | Minor | Check whether the pod is updated. |
| Pod | FailedScaleIn | Minor | Check whether a pod scale-in failed. |
| Pod | FailedRestart | Minor | Check whether the pod is restarted. |
| Deployment | SelectorOverlap | Minor | Check whether label selectors in the cluster conflict. |
| Deployment | ReplicaSetCreateError | Minor | Check whether a workload ReplicaSet can be created. |
| Deployment | DeploymentRollbackRevisionNotFound | Minor | Check whether the Deployment rollback version is available. |
| DaemonSet | SelectingAll | Minor | Check whether the workload label selector is correctly configured. |
| Job | TooManyActivePods | Minor | Check whether there are still active pods after the number of pods in a job reaches the preset value. |
| Job | TooManySucceededPods | Minor | Check whether there are extra running pods after the number of pods in a job reaches the preset value. |
| CronJob | FailedGet | Minor | Check whether CronJobs can be obtained. |
| CronJob | FailedList | Minor | Check whether pods can be obtained. |
| CronJob | UnexpectedJob | Minor | Check whether there are any unknown CronJobs. |
| Service | CreatingLoadBalancerFailed | Minor | Check whether a load balancer is created. |
| Service | DeletingLoadBalancerFailed | Minor | Check whether the load balancer is deleted. |
| Service | UpdateLoadBalancerFailed | Minor | Check whether the load balancer is updated. |
| Namespace | DeleteNodeWithNoServer | Minor | Check whether discarded nodes are cleared. |
| PV | DetachVolumeFailed | Minor | Check whether the block storage is detached. |
| PV | VolumeUnknownReclaimPolicy | Minor | Check whether a volume reclamation policy is specified. |
| PV | SetUpAtVolumeFailed | Minor | Check whether the data volume is mounted. |
| PV | VolumeFailedRecycle | Minor | Check whether the data volume is reclaimed. |
| PV | WaitForAttachVolumeFailed | Minor | Check whether block storage is attached to the node. |
| PV | VolumeFailedDelete | Minor | Check whether the data volume is deleted. |
| PV | MountDeviceFailed | Minor | Check whether the data volume is mounted. |
| PV | TearDownAtVolumeFailed | Minor | Check whether the data volume is unmounted. |
| PV | UnmountDeviceFailed | Minor | Check whether the drive letter of the data volume is unmounted. |
| PV | AttachVolumeFailed | Minor | Check whether block storage is attached to the node. |
| PVC | VolumeResizeFailed | Minor | Check whether the capacity of the data volume is expanded. |
| PVC | ClaimLost | Minor | Check whether the PVC volume is working properly. |
| PVC | ProvisioningFailed | Minor | Check whether the data volume is created. |
| PVC | ProvisioningCleanupFailed | Minor | Check whether the data volume has been cleared. |
| PVC | ClaimMisbound | Minor | Check whether the PVC is bound to an incorrect volume. |
| Node | Rebooted | Major | Check whether the node is restarted. |
| Node | NodeNotSchedulable | Major | Check whether the node is schedulable. |
| Node | NodeNotReady | Major | Check whether the node is running normally. |
| Node | NodeCreateFailed | Major | Check whether the node is created. |
| Node | KUBELETIsDown | Minor | Check the kubelet status on the node. |
| Node | NodeHasInsufficientMemory | Minor | Check whether the available memory of the node is sufficient. |
| Node | UnregisterNetDevice | Minor | Check whether the node is associated with any unregistered network device. |
| Node | NetworkCardNotFound | Minor | Check the node ENI status. |
| Node | KUBEPROXYIsDown | Minor | Check whether kube-proxy is running normally on the node. |
| Node | NodeOutOfDisk | Minor | Check whether the node disk space is sufficient. |
| Node | TaskHung | Minor | Check whether there are any suspended tasks on the node. |
| Node | CIDRNotAvailable | Minor | Check whether the node CIDR block is available. |
| Node | ConntrackFull | Minor | Check whether the node conntrack table is full. |
| Node | NodeHasDiskPressure | Minor | Check whether the node disk space is sufficient. |
| Node | NodeInstallFailed | Minor | Check whether nodes are managed in the cluster. |
| Node | KernelOops | Minor | Check whether the OS kernel of the node is faulty. |
| Node | OOMKilling | Minor | Check whether OOM occurs on the node. |
| Node | DOCKERIsDown | Minor | Check whether the container engine of the node is working properly. |
| Node | CIDRAssignmentFailed | Minor | Check whether a CIDR block is allocated for the node. |
| Node | DockerHung | Minor | Check whether the Docker process on the node is suspended. |
| Node | FilesystemIsReadOnly | Minor | Check whether the file system of the node is read-only. |
| Node | NTPIsDown | Minor | Check whether NTP is running normally on the node. |
| Node | NodeUninstallFailed | Minor | Check whether the node is uninstalled. |
| Node | AUFSUmountHung | Minor | Check whether detaching the node disk is suspended. |
| Node | CNIIsDown | Minor | Check whether the CNI add-on on the node is faulty. |
| Autoscaler | ScaleUpTimedOut | Major | Check whether adding nodes to the node pool timed out. |
| Autoscaler | NodePoolAvailable | Major | Check whether the node pool resources are sufficient. |
| Autoscaler | ScaleDown | Major | Nodes are being deleted from the cluster. |
| Autoscaler | NotTriggerScaleUp | Major | Check whether a node scale-out is triggered. |
| Autoscaler | DeleteUnregistered | Major | Check whether unregistered nodes are deleted. |
| Autoscaler | ScaleDownEmpty | Major | Check whether idle nodes are scaled in. |
| Autoscaler | ScaleDownFailed | Major | Check whether nodes are scaled in. |
| Autoscaler | FailedToScaleUpGroup | Major | Check whether an error occurred during a node pool scale-out. |
| Autoscaler | ScaledUpGroup | Major | Check whether the node pool is scaled out. |
| Autoscaler | ScaleUpFailed | Major | Check whether the node is scaled out. |
| Autoscaler | FixNodeGroupSizeDone | Major | Check whether the number of nodes in the node pool is restored. |
| Autoscaler | NodeGroupInBackOff | Major | Check whether there are any rollback retries during node pool scaling. |
| Autoscaler | FixNodeGroupSizeError | Major | Check whether the number of nodes in the node pool is restored. |
| Autoscaler | NodePoolSoldOut | Major | Check whether the node pool resources are sufficient. |
| Autoscaler | TriggeredScaleUp | Major | Check whether a node scale-out is triggered. |
| Autoscaler | StartScaledUpGroup | Major | Check whether a node pool scale-out is started. |
| Autoscaler | DeleteUnregisteredFailed | Major | Check whether unregistered nodes are deleted. |

Table 3 Control plane events

| Event ID | Severity | Description |
| --- | --- | --- |
| Internal error | Major | Check whether an internal error occurs in the cluster. |
| External dependency error | Major | Check whether an error occurs in cluster external dependencies. |
| Failed to initialize process thread | Major | Check whether a cluster initialization thread is executed. |
| Failed to update database | Major | Check whether the database for the cluster is updated. |
| Failed to create node by nodepool | Major | Check whether nodes are created in the node pool. |
| Failed to delete node by nodepool | Major | Check whether nodes are deleted from the node pool. |
| Failed to create yearly/monthly subscription node | Major | Check whether the yearly/monthly node is created in the cluster. |
| Failed to cancel the authorization of accessing the image of the master. | Major | When creating a cluster, check whether the authorization for the resource tenant to access the master node image is canceled. |
| Failed to create the virtual IP for the master | Major | When creating a cluster, check whether the virtual IP address is allocated. |
| Failed to delete the node VM | Major | Check whether the node (VM) is deleted from the cluster. |
| Failed to delete the security group of node | Major | Check whether the security group of the node is deleted from the cluster. |
| Failed to delete the security group of master | Major | Check whether the security group of the master node is deleted from the cluster. |
| Failed to delete the security group of port | Major | Check whether the ENI security group of the master node is deleted from the cluster. |
| Failed to delete the security group of eni or subeni | Major | Check whether the ENI or sub-ENI security group is deleted from the cluster. |
| Failed to detach the port of master | Major | Check whether the ENI of the master node is unbound from the cluster. |
| Failed to delete the port of master | Major | Check whether the ENI of the master node is deleted from the cluster. |
| Failed to delete the master VM | Major | Check whether the master node (VM) is deleted from the cluster. |
| Failed to delete the key pair of master | Major | Check whether the key pair of the master node is deleted from the cluster. |
| Failed to delete the subnet of master | Major | Check whether the subnet of the master node is deleted from the cluster. |
| Failed to delete the VPC of master | Major | Check whether the VPC of the master node is deleted from the cluster. |
| Failed to delete certificate of cluster | Major | Check whether the certificate is deleted from the cluster. |
| Failed to delete the server group of master | Major | Check whether the server group of the master node is deleted from the cluster. |
| Failed to delete the virtual IP for the master | Major | Check whether the virtual IP address is deleted from the cluster. |
| Failed to get floating IP of the master | Major | Check whether the floating IP address of the master node is obtained. |
| Failed to get cluster flavor | Major | Check whether the cluster flavor is obtained. |
| Failed to get cluster endpoint | Major | Check whether the cluster endpoint is obtained. |
| Failed to get Kubernetes connection | Major | Check whether the Kubernetes cluster connections are obtained. |
| Failed to update secret | Major | Check whether the cluster Secret is updated. |
| Operation timed out | Major | Check whether the user operation timed out. |
| Connecting to Kubernetes cluster timed out | Major | Check whether accessing the Kubernetes cluster timed out. |
| Failed to check component status or components are abnormal | Major | Check whether the statuses of cluster components can be obtained or whether the components malfunction. |
| The node is not found in kubernetes cluster | Major | Check whether the node can be found in the Kubernetes cluster. |
| The status of node is not ready in kubernetes cluster | Major | Check whether the node is running properly in the Kubernetes cluster. |
| Can't find corresponding vm of this node in ECS | Major | Check whether the node can be found on the ECS console. |
| Failed to upgrade the master | Major | Check whether the master node has been upgraded. |
| Failed to upgrade the node | Major | Check whether the node has been upgraded. |
| Failed to change flavor of the master | Major | Check whether the master node flavor has been changed. |
| Change flavor of the master timeout | Major | Check whether changing the master node flavor timed out. |
| Failed to pass verification while creating yearly/monthly subscription node | Major | Check whether creating a yearly/monthly node has been verified. |
| Failed to install the node | Major | Check whether the node is installed in the cluster. |
| Failed to clean routes of cluster container network in VPC | Major | Check whether the routes of cluster container VPCs are cleaned. |
| Cluster status is Unavailable | Major | Check whether the cluster is available. |
| Cluster status is Error | Major | Check whether the cluster is faulty. |
| Cluster status is not updated for a long time | Major | Check whether the cluster remains in the same state for a long time. |
| Failed to update master status after upgrading cluster timeout | Major | Check whether the status of the master node is updated after the cluster upgrade timed out. |
| Failed to update running jobs after upgrading cluster timeout | Major | Check whether running tasks are updated after the cluster upgrade timed out. |
| Failed to update cluster status | Major | Check whether the cluster status is updated. |
| Failed to update node status | Major | Check whether the node status is updated. |
| Failed to remove the static node from database | Major | Check whether nodes are removed from the database after managing nodes timed out. |
| Failed to update node status to abnormal after node processing timeout | Major | Check whether the node status is updated to abnormal after processing the node timed out. |
| Failed to update the cluster endpoint | Major | Check whether the cluster endpoint is updated. |
| Failed to delete the unavailable connection of the Kubernetes cluster | Major | Check whether unavailable Kubernetes connections are deleted. |
| Failed to sync the cluster cert | Major | Check whether the cluster certificate is synchronized. |

Adding Metric Alarms

The following uses the PromQL statement kube_persistentvolume_status_phase{phase=~"Failed|Pending"} > 0 as an example to describe how to add a metric alarm.

This function is provided by AOM. For details, see Creating a Metric Alarm Rule.

The pod CPU usage, physical memory usage, and file system usage alarms must be configured for the everest-csi-controller, everest-csi-driver, coredns, autoscaler, and Yangtse components. Upgrade the specifications in the case of high resource usage to prevent system failures.

  1. Log in to the AOM 2.0 console.
  2. In the navigation pane, choose Alarm Management > Alarm Rules. Then click Create Alarm Rule.
  3. Configure parameters as follows:

    • Rule Type: Select Metric alarm rule.
    • Configuration Mode: Select PromQL. You are advised to specify the cluster for which alarms are generated. Example:

      kube_persistentvolume_status_phase{phase=~"Failed|Pending",cluster="${cluster_id}"} > 0

    • Prometheus Instance: Select the AOM instance interconnected with the cloud native cluster monitoring add-on.
    • Alarm Mode: Select Direct alarm reporting.
    • Action Rule: Select the action rule created for the cluster when Alarm Assistant is enabled. The rule name can be auto-cluster-${cluster_id}.

    This alarm rule works as follows:

    When a PromQL rule is triggered, AOM immediately notifies you through SMN based on the action policy.

    Figure 3 Custom metric alarms

  4. Click Confirm.

    If the following information is displayed in the rule list, the rule is created successfully.

    Figure 4 Alarm rule list
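The PromQL statement and the action rule name used in step 3 both contain a ${cluster_id} placeholder that must be replaced with the ID of your cluster. The following is a minimal sketch of that substitution using Python's string.Template, which uses the same ${name} placeholder syntax. The function name and the sample cluster ID are hypothetical, and the sketch only produces the strings to paste into the console; it does not call AOM.

```python
from string import Template

# Statement and rule name as given in this section, with the placeholder kept.
PROMQL_TEMPLATE = Template(
    'kube_persistentvolume_status_phase'
    '{phase=~"Failed|Pending",cluster="${cluster_id}"} > 0'
)
ACTION_RULE_TEMPLATE = Template("auto-cluster-${cluster_id}")


def render_alarm_settings(cluster_id: str) -> dict:
    """Fill the cluster ID into the PromQL statement and the action rule name."""
    # Reject values that would break the quoted PromQL label matcher.
    if not cluster_id or '"' in cluster_id or "\\" in cluster_id:
        raise ValueError("cluster_id must be non-empty and contain no quotes")
    return {
        "promql": PROMQL_TEMPLATE.substitute(cluster_id=cluster_id),
        "action_rule": ACTION_RULE_TEMPLATE.substitute(cluster_id=cluster_id),
    }


if __name__ == "__main__":
    # "abc-123" is a placeholder, not a real cluster ID.
    settings = render_alarm_settings("abc-123")
    print(settings["promql"])
    print(settings["action_rule"])
```

Scoping the statement to one cluster in this way keeps the alarm from firing for persistent volumes in other clusters that report to the same Prometheus instance.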