CCE Autopilot Cluster Events

CCE Autopilot can report a range of events to AOM when a cluster is running. You can add event alarms as needed to monitor the health of cluster data plane and control plane components. This helps you quickly identify and resolve problems, ensuring cluster stability and reliability.

Data Plane Events: user operation events, such as workload, network, storage, and auto scaling events.
Control Plane Events: master node events, which are usually caused by faults or upgrades of control plane components.

Data Plane Events

**Table 1** Workload events
Object	Event Name	Severity	Description
Pod	PodOOMKilling	Major	Check whether the pod exits due to OOM. This event is reported by CCE Node Problem Detector (1.18.41 or later) and Cloud Native Log Collection (1.3.2 or later).
Pod	FailedStart	Major	Check whether the pod has started.
Pod	FailedPullImage	Major	Check whether the pod has pulled an image.
Pod	BackOffStart	Major	Check whether the pod fails to restart.
Pod	FailedScheduling	Major	Check whether the pod has been scheduled.
Pod	BackOffPullImage	Major	Check whether the pod has pulled an image after a retry.
Pod	FailedCreate	Major	Check whether the pod has been created.
Pod	Unhealthy	Minor	Check whether the pod health check is successful.
Pod	FailedDelete	Minor	Check whether the workload has been deleted.
Pod	ErrImageNeverPull	Minor	Check whether the workload has pulled an image.
Pod	FailedScaleOut	Minor	Check whether replicas can be added to scale the workload.
Pod	FailedReconfig	Minor	Check whether the pod configuration has been updated.
Pod	FailedActive	Minor	Check whether the pod is activated.
Pod	FailedRollback	Minor	Check whether the pod is rolled back.
Pod	FailedUpdate	Minor	Check whether the pod is updated.
Pod	FailedScaleIn	Minor	Check whether the pod scale-in failed.
Pod	FailedRestart	Minor	Check whether the pod is restarted.
Deployment	SelectorOverlap	Minor	Check whether label selectors in the cluster conflict.
Deployment	ReplicaSetCreateError	Minor	Check whether a workload ReplicaSet can be created.
Deployment	DeploymentRollbackRevisionNotFound	Minor	Check whether the Deployment rollback version is available.
Job	TooManyActivePods	Minor	Check whether there are still active pods after the number of pods in a job reaches the preset value.
Job	TooManySucceededPods	Minor	Check whether there are extra running pods after the number of pods in a job reaches the preset value.
CronJob	FailedGet	Minor	Check whether the CronJob can be queried.
CronJob	FailedList	Minor	Check whether the list of pods can be obtained.
CronJob	UnexpectedJob	Minor	Check whether there are any unknown CronJobs.

**Table 2** Network events
Object	Event Name	Severity	Description
Service	CreatingLoadBalancerFailed	Minor	Check whether a load balancer has been created.
Service	DeletingLoadBalancerFailed	Minor	Check whether the load balancer has been deleted.
Service	UpdateLoadBalancerFailed	Minor	Check whether the load balancer has been updated.

**Table 3** Storage events
Object	Event Name	Severity	Description
PV	DetachVolumeFailed	Minor	Check whether the block storage has been detached from the node.
PV	VolumeUnknownReclaimPolicy	Minor	Check whether a volume reclaim policy is specified.
PV	SetUpAtVolumeFailed	Minor	Check whether the volume is mounted.
PV	VolumeFailedRecycle	Minor	Check whether the volume is reclaimed.
PV	WaitForAttachVolumeFailed	Minor	Check whether the block storage is mounted to the node.
PV	VolumeFailedDelete	Minor	Check whether the volume is deleted.
PV	MountDeviceFailed	Minor	Check whether the device is mounted.
PV	TearDownAtVolumeFailed	Minor	Check whether the volume is unmounted.
PV	UnmountDeviceFailed	Minor	Check whether the device is unmounted.
PV	AttachVolumeFailed	Minor	Check whether the block storage has been attached to the node.
PVC	VolumeResizeFailed	Minor	Check whether the volume capacity is expanded.
PVC	ClaimLost	Minor	Check whether the PVC is normal.
PVC	ProvisioningFailed	Minor	Check whether the volume is created.
PVC	ProvisioningCleanupFailed	Minor	Check whether the volume has been cleared.
PVC	ClaimMisbound	Minor	Check whether the PVC is bound to an incorrect volume.

**Table 4** Auto scaling events
Object	Event Name	Severity	Description
HPA	InvalidTargetRange	Major	An invalid extendedhpa.metrics value is configured in the HPA annotations, or the metric type in the HPA spec is incorrect.
HPA	FailedGetScale	Major	HPA failed to obtain the resource object to be scaled.
HPA	FailedComputeMetricsReplicas	Major	An error occurred when the desired number of replicas was calculated. For example, metrics-server is unavailable, resource metric collection fails, or the CPU usage is incorrectly set. To view details, run the following command: kubectl describe horizontalpodautoscaler <hpa-name>
HPA	FailedGetObjectMetric	Major	Failed to obtain the metrics of the specified object (such as PVC and ConfigMap).
HPA	FailedGetPodsMetric	Major	Failed to obtain pod resource metrics (resource usages of a pod).
HPA	FailedGetResourceMetric	Major	Failed to obtain cluster resource metrics (resource usages of a cluster).
HPA	FailedGetContainerResourceMetric	Major	Failed to obtain the resource metrics of a container.
HPA	FailedGetExternalMetric	Major	Failed to obtain external metrics.
HPA	FailedRescale	Major	Failed to update the desired number of replicas of the resource object to be scaled.
HPA	SuccessfulRescale	Minor	The desired number of replicas of the resource object to be scaled is updated.
CronHPA	ScaleFailed	Major	CronHPA failed to update the desired number of replicas of the resource object to be scaled.
CronHPA	FailedGetHorizontalPodAutoscaler	Major	CronHPA failed to query the associated HPA object. (This usually means that kube-apiserver is not responding.)
CronHPA	FailedGetHpaScale	Major	CronHPA failed to obtain the resource object to be scaled.
CronHPA	UpdateHPAFailed	Major	CronHPA failed to update the associated HPA object.
CronHPA	UpdateHPASuccess	Minor	CronHPA successfully updates the associated HPA object.
CronHPA	SkipUpdateHPA	Minor	CronHPA skips updating the associated HPA object.
CronHPA	SkipUpdateTarget	Minor	CronHPA skips updating the number of replicas of the resource object to be scaled.
CronHPA	UpdateTargetSuccess	Minor	CronHPA successfully updates the number of replicas of the resource object to be scaled.
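
The HPA events in Table 4 can also be picked out of a cluster's event list for a quick check outside AOM. The following is a minimal sketch and not part of CCE or AOM. It assumes that the event names in Table 4 appear as the reason field of standard Kubernetes events, and that the input has the structure produced by kubectl get events -A -o json. The names find_hpa_events and HPA_EVENT_SEVERITY are hypothetical and introduced only for illustration; the severities are copied from Table 4.

```python
# Severities copied from Table 4 for the HPA object.
HPA_EVENT_SEVERITY = {
    "InvalidTargetRange": "Major",
    "FailedGetScale": "Major",
    "FailedComputeMetricsReplicas": "Major",
    "FailedGetObjectMetric": "Major",
    "FailedGetPodsMetric": "Major",
    "FailedGetResourceMetric": "Major",
    "FailedGetContainerResourceMetric": "Major",
    "FailedGetExternalMetric": "Major",
    "FailedRescale": "Major",
    "SuccessfulRescale": "Minor",
}


def find_hpa_events(event_list, include_minor=False):
    """Return (severity, namespace, name, reason, message) tuples for HPA events.

    event_list is a dict parsed from `kubectl get events -A -o json`.
    Assumption: the Table 4 event name equals the Kubernetes event `reason`.
    """
    wanted = {"Major", "Minor"} if include_minor else {"Major"}
    matches = []
    for item in event_list.get("items", []):
        obj = item.get("involvedObject", {})
        # Skip events that belong to other object kinds (pods, PVCs, and so on).
        if obj.get("kind") != "HorizontalPodAutoscaler":
            continue
        severity = HPA_EVENT_SEVERITY.get(item.get("reason"))
        if severity in wanted:
            matches.append(
                (
                    severity,
                    obj.get("namespace"),
                    obj.get("name"),
                    item.get("reason"),
                    item.get("message", ""),
                )
            )
    return matches
```

For example, save the output of kubectl get events -A -o json to a file, parse it with json.load, and pass the result to find_hpa_events to list the Major HPA events. Pass include_minor=True to also include SuccessfulRescale.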

Control Plane Events

**Table 5** Control plane events
Event ID	Severity	Description
Internal error	Major	Check whether there is an internal error in the cluster.
Failed to check component status or components are abnormal	Major	Check whether the statuses of cluster components can be obtained and whether the components are functioning properly.
Cluster status is Unavailable	Major	Check whether the cluster is available.
Cluster status is Error	Major	Check whether the cluster is faulty.
Cluster status is not updated for a long time	Major	Check whether the cluster is stuck in a state for a long period.
Failed to update cluster status	Major	Check whether the cluster status is updated.
Failed to delete the unavailable connection of the Kubernetes cluster	Major	Check whether unavailable Kubernetes connections have been deleted.
Failed to sync the cluster cert	Major	Check whether the cluster certificates have been synchronized.