Configuring Custom Alarms on AOM
CCE interworks with AOM to report alarms and events. By configuring alarm rules on AOM, you can learn about abnormal cluster resources in a timely manner.
Process
- Creating a Topic on SMN
- Creating an Action Policy
- Adding an Alarm Rule
  - Event alarms: Generate alarms based on the events reported by clusters to AOM. For details about the events and configurations, see Adding Event Alarms.
  - Metric alarms: Generate alarms based on the thresholds of monitoring metrics, such as the resource utilization of servers and components. For details about the metric thresholds and configurations, see Adding Metric Alarms.
Creating a Topic on SMN
Simple Message Notification (SMN) pushes messages to subscribers through emails, SMS messages, and HTTP/HTTPS requests.
A topic is used to publish messages and subscribe to notifications. It serves as a message transmission channel between publishers and subscribers.
You need to create a topic and add a subscription to it.
After subscribing to a topic, confirm the subscription in the email or SMS message for the notification to take effect.
Creating an Action Policy
AOM allows you to customize alarm action policies. You can create an alarm action policy to associate an SMN topic and a message template. You can also customize notification content by using a message template.
For details, see Creating Alarm Action Policies. When creating an action policy, select the topic that is created and subscribed to in Creating a Topic on SMN.
Adding Event Alarms
The following uses the NodeNotReady alarm as an example to describe how to add an event alarm.
This function is provided by AOM. For details about parameters, see Creating an Event Alarm Rule.
| Event Name | Source | Description | Solution |
|---|---|---|---|
| NodeNotReady | CCE | An alarm is triggered immediately when a node is abnormal. | Log in to the cluster and check the status of the node for which the alarm is generated. Mark the node as unschedulable and schedule its service pods to another node. |
| Rebooted | CCE | An alarm is triggered immediately when a node is restarted. | Log in to the cluster, check the status of the node for which the alarm is generated, verify that the node starts properly, and locate the cause of the restart. |
| KUBELETIsDown | CCE | An alarm is triggered immediately when a node is abnormal. | Log in to the cluster and check the status of the node for which the alarm is generated. Mark the node as unschedulable and schedule its service pods to another node. Then restart kubelet. |
| DOCKERIsDown | CCE | An alarm is triggered immediately when a node is abnormal. | Log in to the cluster and check the status of the node for which the alarm is generated. Mark the node as unschedulable and schedule its service pods to another node. Then restart Docker. |
| KUBEPROXYIsDown | CCE | An alarm is triggered immediately when a node is abnormal. | Log in to the cluster and check the status of the node for which the alarm is generated. Mark the node as unschedulable and schedule its service pods to another node. |
| KernelOops | CCE | An alarm is triggered immediately when a node is abnormal. | Log in to the cluster and check the status of the node for which the alarm is generated. Mark the node as unschedulable and schedule its service pods to another node. |
| ConntrackFull | CCE | An alarm is triggered immediately when a node is abnormal. | Log in to the cluster and check the status of the node for which the alarm is generated. Mark the node as unschedulable and schedule its service pods to another node. |
| NodePoolSoldOut | CCE | An alarm is triggered immediately when node pool resources are sold out. | Configure automatic node pool switchover or change the node pool specifications. |
| NodeCreateFailed | CCE | An alarm is triggered immediately upon a node creation failure. | Rectify the failure and create the node again. |
| ScaleUpTimedOut | CCE | An alarm is triggered immediately upon a node scale-out timeout. | Rectify the failure and try the scale-out again. |
| ScaleDownFailed | CCE | An alarm is triggered immediately upon a node scale-in failure. | Rectify the failure and try the scale-in again. |
| BackOffPullImage | CCE | An image pull retry failed. | Log in to the cluster, locate the failure cause, and deploy the service workload again. |
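Several of the solutions above tell you to mark the faulty node as unschedulable and move its service pods to another node. A minimal kubectl sketch of those steps follows; the node name is a placeholder, and the drain flags may need adjusting for your workloads:

```shell
# Inspect the node that triggered the alarm (node name is a placeholder).
kubectl get node cce-node-01 -o wide
kubectl describe node cce-node-01

# Mark the node unschedulable (cordon), then evict its pods so they are
# rescheduled onto other nodes.
kubectl cordon cce-node-01
kubectl drain cce-node-01 --ignore-daemonsets --delete-emptydir-data

# For KUBELETIsDown or DOCKERIsDown, restart the affected service.
# Run this on the node itself; the service name depends on the node OS
# and container runtime.
systemctl restart kubelet
```

After the node recovers, `kubectl uncordon cce-node-01` makes it schedulable again.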
- Log in to the AOM console.
- In the navigation pane, choose Alarm Center > Alarm Rules and click Add Alarm.
- Configure an alarm rule.
- Rule Type: Select Event alarm.
- Alarm Source: Select CCE.
- Select Object: Select Event Name and then NodeNotReady. You can filter trigger objects by notification type, event name, alarm severity, custom attribute, namespace, and cluster name.
- Triggering Policy: Select Immediate Triggering.
- Alarm Mode: Select Direct Alarm Reporting.
- Action Policy: Select the action policy created in Creating an Action Policy.
This alarm rule works as follows:
If a node in the cluster becomes abnormal, CCE reports the NodeNotReady event to AOM. AOM immediately notifies you through SMN based on the action policy.
Figure 1 Creating an event alarm
- Click Create Now.
If the following information is displayed in the rule list, the rule is created successfully.
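The alarm rule above fires on events that CCE reports to AOM. If you want to inspect the underlying cluster events directly, a sketch with kubectl (requires access to the cluster):

```shell
# List recent warning events across all namespaces, newest last.
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp

# Filter for the specific reason an alarm rule watches, e.g. NodeNotReady.
kubectl get events -A --field-selector reason=NodeNotReady
```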
CCE Events
Event alarms are generated based on the events that CCE reports to AOM. You can view the specific events in the Alarm Rule Settings area and add event alarms as required.
Data plane events and control plane events can be reported for CCE clusters.
Data plane events:

| Type | Event Name | Severity | Remarks |
|---|---|---|---|
| Pod | PodOOMKilling | Major | Check whether the pod exits due to OOM. This event is reported by CCE Node Problem Detector (1.18.41 or later) and Cloud Native Logging (1.3.2 or later). |
| Pod | FailedStart | Major | Check whether the pod is started. |
| Pod | FailedPullImage | Major | Check whether the pod has pulled an image. |
| Pod | BackOffStart | Major | Check whether the pod fails to be restarted. |
| Pod | FailedScheduling | Major | Check whether the pod is scheduled. |
| Pod | BackOffPullImage | Major | Check whether the pod has pulled an image after a retry. |
| Pod | FailedCreate | Major | Check whether a pod is created. |
| Pod | Unhealthy | Minor | Check whether the pod is running normally. |
| Pod | FailedDelete | Minor | Check whether the workload is deleted. |
| Pod | ErrImageNeverPull | Minor | Check whether the workload has pulled an image. |
| Pod | FailedScaleOut | Minor | Check whether workload replicas are scaled out. |
| Pod | FailedStandBy | Minor | Check whether the pod enters the standby state. |
| Pod | FailedReconfig | Minor | Check whether the pod configuration is updated. |
| Pod | FailedActive | Minor | Check whether the pod is activated. |
| Pod | FailedRollback | Minor | Check whether the pod is rolled back. |
| Pod | FailedUpdate | Minor | Check whether the pod is updated. |
| Pod | FailedScaleIn | Minor | Check whether a pod scale-in failed. |
| Pod | FailedRestart | Minor | Check whether the pod is restarted. |
| Deployment | SelectorOverlap | Minor | Check whether label selectors in the cluster conflict. |
| Deployment | ReplicaSetCreateError | Minor | Check whether a workload ReplicaSet can be created. |
| Deployment | DeploymentRollbackRevisionNotFound | Minor | Check whether the Deployment rollback version is available. |
| DaemonSet | SelectingAll | Minor | Check whether the workload label selector is correctly configured. |
| Job | TooManyActivePods | Minor | Check whether there are still active pods after the number of pods in a job reaches the preset value. |
| Job | TooManySucceededPods | Minor | Check whether there are extra running pods after the number of pods in a job reaches the preset value. |
| CronJob | FailedGet | Minor | Check whether CronJobs can be obtained. |
| CronJob | FailedList | Minor | Check whether pods can be obtained. |
| CronJob | UnexpectedJob | Minor | Check whether there are any unknown CronJobs. |
| Service | CreatingLoadBalancerFailed | Minor | Check whether a load balancer is created. |
| Service | DeletingLoadBalancerFailed | Minor | Check whether the load balancer is deleted. |
| Service | UpdateLoadBalancerFailed | Minor | Check whether the load balancer is updated. |
| Namespace | DeleteNodeWithNoServer | Minor | Check whether discarded nodes are cleared. |
| PV | DetachVolumeFailed | Minor | Check whether the block storage is detached. |
| PV | VolumeUnknownReclaimPolicy | Minor | Check whether a volume reclamation policy is specified. |
| PV | SetUpAtVolumeFailed | Minor | Check whether the data volume is mounted. |
| PV | VolumeFailedRecycle | Minor | Check whether the data volume is reclaimed. |
| PV | WaitForAttachVolumeFailed | Minor | Check whether block storage is attached to the node. |
| PV | VolumeFailedDelete | Minor | Check whether the data volume is deleted. |
| PV | MountDeviceFailed | Minor | Check whether the data volume is mounted. |
| PV | TearDownAtVolumeFailed | Minor | Check whether the data volume is unmounted. |
| PV | UnmountDeviceFailed | Minor | Check whether the drive letter of the data volume is unmounted. |
| PV | AttachVolumeFailed | Minor | Check whether block storage is attached to the node. |
| PVC | VolumeResizeFailed | Minor | Check whether the capacity of the data volume is expanded. |
| PVC | ClaimLost | Minor | Check whether the PVC volume is working properly. |
| PVC | ProvisioningFailed | Minor | Check whether the data volume is created. |
| PVC | ProvisioningCleanupFailed | Minor | Check whether the data volume has been cleared. |
| PVC | ClaimMisbound | Minor | Check whether the PVC is bound to an incorrect volume. |
| Node | Rebooted | Major | Check whether the node is restarted. |
| Node | NodeNotSchedulable | Major | Check whether the node is schedulable. |
| Node | NodeNotReady | Major | Check whether the node is running normally. |
| Node | NodeCreateFailed | Major | Check whether the node is created. |
| Node | KUBELETIsDown | Minor | Check the kubelet status on the node. |
| Node | NodeHasInsufficientMemory | Minor | Check whether the available memory of the node is sufficient. |
| Node | UnregisterNetDevice | Minor | Check whether the node is associated with any unregistered network device. |
| Node | NetworkCardNotFound | Minor | Check the node ENI status. |
| Node | KUBEPROXYIsDown | Minor | Check whether kube-proxy is running normally on the node. |
| Node | NodeOutOfDisk | Minor | Check whether the node disk space is sufficient. |
| Node | TaskHung | Minor | Check whether there are any suspended tasks on the node. |
| Node | CIDRNotAvailable | Minor | Check whether the node CIDR block is available. |
| Node | ConntrackFull | Minor | Check whether the node conntrack table is full. |
| Node | NodeHasDiskPressure | Minor | Check whether the node disk space is sufficient. |
| Node | NodeInstallFailed | Minor | Check whether nodes are managed in the cluster. |
| Node | KernelOops | Minor | Check whether the OS kernel of the node is faulty. |
| Node | OOMKilling | Minor | Check whether OOM occurs on the node. |
| Node | DOCKERIsDown | Minor | Check whether the container engine of the node is working properly. |
| Node | CIDRAssignmentFailed | Minor | Check whether a CIDR block is allocated for the node. |
| Node | DockerHung | Minor | Check whether the Docker process on the node is suspended. |
| Node | FilesystemIsReadOnly | Minor | Check whether the file system of the node is read-only. |
| Node | NTPIsDown | Minor | Check whether NTP is running normally on the node. |
| Node | NodeUninstallFailed | Minor | Check whether the node is uninstalled. |
| Node | AUFSUmountHung | Minor | Check whether detaching the node disk is suspended. |
| Node | CNIIsDown | Minor | Check whether the CNI add-on on the node is faulty. |
| Autoscaler | ScaleUpTimedOut | Major | Check whether adding nodes to the node pool timed out. |
| Autoscaler | NodePoolAvailable | Major | Check whether the node pool resources are sufficient. |
| Autoscaler | ScaleDown | Major | Nodes are being deleted from the cluster. |
| Autoscaler | NotTriggerScaleUp | Major | Check whether a node scale-out is triggered. |
| Autoscaler | DeleteUnregistered | Major | Check whether unregistered nodes are deleted. |
| Autoscaler | ScaleDownEmpty | Major | Check whether idle nodes are scaled in. |
| Autoscaler | ScaleDownFailed | Major | Check whether nodes are scaled in. |
| Autoscaler | FailedToScaleUpGroup | Major | Check whether an error occurred during a node pool scale-out. |
| Autoscaler | ScaledUpGroup | Major | Check whether the node pool is scaled out. |
| Autoscaler | ScaleUpFailed | Major | Check whether the node is scaled out. |
| Autoscaler | FixNodeGroupSizeDone | Major | Check whether the number of nodes in the node pool is restored. |
| Autoscaler | NodeGroupInBackOff | Major | Check whether there are any rollback retries during node pool scaling. |
| Autoscaler | FixNodeGroupSizeError | Major | Check whether the number of nodes in the node pool is restored. |
| Autoscaler | NodePoolSoldOut | Major | Check whether the node pool resources are sufficient. |
| Autoscaler | TriggeredScaleUp | Major | Check whether a node scale-out is triggered. |
| Autoscaler | StartScaledUpGroup | Major | Check whether a node pool scale-out is started. |
| Autoscaler | DeleteUnregisteredFailed | Major | Check whether unregistered nodes are deleted. |
Control plane events:

| Event ID | Severity | Description |
|---|---|---|
| Internal error | Major | Check whether an internal error occurs in the cluster. |
| External dependency error | Major | Check whether an error occurs in cluster external dependencies. |
| Failed to initialize process thread | Major | Check whether a cluster initialization thread is executed. |
| Failed to update database | Major | Check whether the database for the cluster is updated. |
| Failed to create node by nodepool | Major | Check whether nodes are created in the node pool. |
| Failed to delete node by nodepool | Major | Check whether nodes are deleted from the node pool. |
| Failed to create yearly/monthly subscription node | Major | Check whether the yearly/monthly node is created in the cluster. |
| Failed to cancel the authorization of accessing the image of the master | Major | When creating a cluster, check whether the authorization for the resource tenant to access the master node image is canceled. |
| Failed to create the virtual IP for the master | Major | When creating a cluster, check whether the virtual IP address is allocated. |
| Failed to delete the node VM | Major | Check whether the node (VM) is deleted from the cluster. |
| Failed to delete the security group of node | Major | Check whether the security group of the node is deleted from the cluster. |
| Failed to delete the security group of master | Major | Check whether the security group of the master node is deleted from the cluster. |
| Failed to delete the security group of port | Major | Check whether the ENI security group of the master node is deleted from the cluster. |
| Failed to delete the security group of eni or subeni | Major | Check whether the ENI or sub-ENI security group is deleted from the cluster. |
| Failed to detach the port of master | Major | Check whether the ENI of the master node is unbound from the cluster. |
| Failed to delete the port of master | Major | Check whether the ENI of the master node is deleted from the cluster. |
| Failed to delete the master VM | Major | Check whether the master node (VM) is deleted from the cluster. |
| Failed to delete the key pair of master | Major | Check whether the key pair of the master node is deleted from the cluster. |
| Failed to delete the subnet of master | Major | Check whether the subnet of the master node is deleted from the cluster. |
| Failed to delete the VPC of master | Major | Check whether the VPC of the master node is deleted from the cluster. |
| Failed to delete certificate of cluster | Major | Check whether the certificate is deleted from the cluster. |
| Failed to delete the server group of master | Major | Check whether the master node (ECS) is deleted from the cluster. |
| Failed to delete the virtual IP for the master | Major | Check whether the virtual IP address is deleted from the cluster. |
| Failed to get floating IP of the master | Major | Check whether the floating IP address of the master node is obtained. |
| Failed to get cluster flavor | Major | Check whether the cluster flavor is obtained. |
| Failed to get cluster endpoint | Major | Check whether the cluster endpoint is obtained. |
| Failed to get Kubernetes connection | Major | Check whether the Kubernetes cluster connections are obtained. |
| Failed to update secret | Major | Check whether the cluster Secret is updated. |
| Operation timed out | Major | Check whether the user operation timed out. |
| Connecting to Kubernetes cluster timed out | Major | Check whether accessing the Kubernetes cluster timed out. |
| Failed to check component status or components are abnormal | Major | Check whether the statuses of cluster components can be obtained or whether the components malfunction. |
| The node is not found in kubernetes cluster | Major | Check whether the node can be found in the Kubernetes cluster. |
| The status of node is not ready in kubernetes cluster | Major | Check whether the node is running properly in the Kubernetes cluster. |
| Can't find corresponding vm of this node in ECS | Major | Check whether the node can be found on the ECS console. |
| Failed to upgrade the master | Major | Check whether the master node has been upgraded. |
| Failed to upgrade the node | Major | Check whether the node has been upgraded. |
| Failed to change flavor of the master | Major | Check whether the master node flavor has been changed. |
| Change flavor of the master timeout | Major | Check whether changing the master node flavor timed out. |
| Failed to pass verification while creating yearly/monthly subscription node | Major | Check whether creating a yearly/monthly node has been verified. |
| Failed to install the node | Major | Check whether the node is installed in the cluster. |
| Failed to clean routes of cluster container network in VPC | Major | Check whether the routes of the cluster container network are cleaned from the VPC. |
| Cluster status is Unavailable | Major | Check whether the cluster is available. |
| Cluster status is Error | Major | Check whether the cluster is faulty. |
| Cluster status is not updated for a long time | Major | Check whether the cluster remains in one state for a long time. |
| Failed to update master status after upgrading cluster timeout | Major | Check whether the status of the master node is updated after the cluster upgrade timed out. |
| Failed to update running jobs after upgrading cluster timeout | Major | Check whether running tasks are updated after the cluster upgrade timed out. |
| Failed to update cluster status | Major | Check whether the cluster status is updated. |
| Failed to update node status | Major | Check whether the node status is updated. |
| Failed to remove the static node from database | Major | Check whether nodes are removed from the database after managing nodes timed out. |
| Failed to update node status to abnormal after node processing timeout | Major | Check whether the node status is updated to abnormal after processing the node timed out. |
| Failed to update the cluster endpoint | Major | Check whether the cluster endpoint is updated. |
| Failed to delete the unavailable connection of the Kubernetes cluster | Major | Check whether unavailable Kubernetes connections are deleted. |
| Failed to sync the cluster cert | Major | Check whether the cluster certificate is synchronized. |
Adding Metric Alarms
The following uses the PromQL statement `kube_persistentvolume_status_phase{phase=~"Failed|Pending"} > 0` as an example to describe how to add a metric alarm.
This function is provided by AOM. For details, see Creating a Metric Alarm Rule.
Configure pod CPU usage, physical memory usage, and file system usage alarms for the everest-csi-controller, everest-csi-driver, coredns, autoscaler, and Yangtse components. If resource usage stays high, upgrade the component specifications to prevent system failures.
- Log in to the AOM 2.0 console.
- In the navigation pane, choose Alarm Management > Alarm Rules. Then click Create Alarm Rule.
- Configure parameters as follows:
- Rule Type: Select Metric alarm rule.
- Configuration Mode: Select PromQL. You are advised to specify the cluster for which alarms are generated. Example:
`kube_persistentvolume_status_phase{phase=~"Failed|Pending",cluster="${cluster_id}"} > 0`
- Prometheus Instance: Select the AOM instance interconnected with the cloud native cluster monitoring add-on.
- Alarm Mode: Select Direct alarm reporting.
- Action Rule: Select the action rule created for the cluster when Alarm Assistant is enabled. The rule name can be `auto-cluster-${cluster_id}`.
This alarm rule works as follows:
When a PromQL rule is triggered, AOM immediately notifies you through SMN based on the action policy.
Figure 3 Custom metric alarms
- Click Confirm.
If the following information is displayed in the rule list, the rule is created successfully.
Figure 4 Alarm rule list
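Before saving a metric alarm rule, you can check that the PromQL expression returns what you expect by evaluating it once against your Prometheus-compatible instance. A sketch using the standard Prometheus HTTP API; the endpoint and cluster ID are placeholders for your own values:

```shell
# Hypothetical endpoint; substitute the address of your Prometheus-compatible
# monitoring instance. The query is the cluster-scoped expression used above.
PROM_URL='https://prometheus.example.com'
QUERY='kube_persistentvolume_status_phase{phase=~"Failed|Pending",cluster="my-cluster-id"} > 0'

# /api/v1/query evaluates the expression once; a non-empty "result" array in
# the JSON response means the alarm condition currently holds.
curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=${QUERY}"
```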