Updated on 2024-09-30 GMT+08:00

CCE Events

CCE can report a range of events in a running cluster to AOM. You can add event alarms as required to monitor the health of cluster data plane and control plane components. This helps you quickly identify and resolve problems, ensuring cluster stability and reliability.

Data Plane Events

Table 1 Workload events

Category

Event Name

Severity

Description

Pod

PodOOMKilling

Major

Check whether the pod exits due to OOM.

This event is reported by CCE Node Problem Detector (1.18.41 or later) and Cloud Native Logging (1.3.2 or later).

Pod

FailedStart

Major

Check whether the pod is started.

Pod

FailedPullImage

Major

Check whether the pod has pulled an image.

Pod

BackOffStart

Major

Check whether the pod fails to be restarted.

Pod

FailedScheduling

Major

Check whether the pod is scheduled.

Pod

BackOffPullImage

Major

Check whether the pod has pulled an image after a retry.

Pod

FailedCreate

Major

Check whether the pod is created.

Pod

Unhealthy

Minor

Check whether the pod health check is successful.

Pod

FailedDelete

Minor

Check whether the workload is deleted.

Pod

ErrImageNeverPull

Minor

Check whether the workload has pulled an image.

Pod

FailedScaleOut

Minor

Check whether workload copies are scaled out.

Pod

FailedStandBy

Minor

Check whether the pod enters the standby state.

Pod

FailedReconfig

Minor

Check whether the pod configuration is updated.

Pod

FailedActive

Minor

Check whether the pod is activated.

Pod

FailedRollback

Minor

Check whether the pod is rolled back.

Pod

FailedUpdate

Minor

Check whether the pod is updated.

Pod

FailedScaleIn

Minor

Check whether a pod scale-in failed.

Pod

FailedRestart

Minor

Check whether the pod is restarted.

Deployment

SelectorOverlap

Minor

Check whether label selectors in the cluster conflict.

Deployment

ReplicaSetCreateError

Minor

Check whether a workload ReplicaSet can be created.

Deployment

DeploymentRollbackRevisionNotFound

Minor

Check whether the Deployment rollback version is available.

DaemonSet

SelectingAll

Minor

Check whether the workload label selector is correctly configured.

Job

TooManyActivePods

Minor

Check whether there are still active pods after the number of pods in a job reaches the preset value.

Job

TooManySucceededPods

Minor

Check whether there are extra running pods after the number of pods in a job reaches the preset value.

CronJob

FailedGet

Minor

Check whether CronJobs can be obtained.

CronJob

FailedList

Minor

Check whether pods can be obtained.

CronJob

UnexpectedJob

Minor

Check whether there are any unknown CronJobs.
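
As a rough illustration of how Table 1 might be used when deciding which events deserve an alarm, the following sketch encodes a few rows of the table as a lookup and picks out the Major events from a list of observed event names. The severities are copied from Table 1. The function name major_events and the sample input are hypothetical, and the sketch does not call any AOM or CCE API.

```python
# Severity of selected workload events, copied from Table 1.
# Only a subset of the rows is listed here to keep the sketch short.
WORKLOAD_EVENT_SEVERITY = {
    "PodOOMKilling": "Major",
    "FailedStart": "Major",
    "FailedPullImage": "Major",
    "BackOffStart": "Major",
    "FailedScheduling": "Major",
    "BackOffPullImage": "Major",
    "FailedCreate": "Major",
    "Unhealthy": "Minor",
    "FailedDelete": "Minor",
    "ErrImageNeverPull": "Minor",
}


def major_events(observed):
    """Return the observed event names rated Major in the lookup, in input order.

    Names that are not in the lookup are ignored (hypothetical helper).
    """
    return [
        name for name in observed
        if WORKLOAD_EVENT_SEVERITY.get(name) == "Major"
    ]


if __name__ == "__main__":
    # Hypothetical sample of event names seen in a cluster.
    sample = ["Unhealthy", "FailedScheduling", "PodOOMKilling", "SomethingElse"]
    print(major_events(sample))
```

A lookup of this kind could be extended with the remaining rows, or with the other tables, if the same filtering is wanted for node, storage, or auto scaling events.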

Table 2 Network events

Category

Event Name

Severity

Description

Service

CreatingLoadBalancerFailed

Minor

Check whether the load balancer is created.

Service

DeletingLoadBalancerFailed

Minor

Check whether the load balancer is deleted.

Service

UpdateLoadBalancerFailed

Minor

Check whether the load balancer is updated.

Table 3 Node events

Category

Event Name

Severity

Description

Node

Rebooted

Major

Check whether the node is restarted.

Node

NodeNotSchedulable

Major

Check whether the node is schedulable.

Node

NodeNotReady

Major

Check whether the node is running normally.

Node

NodeCreateFailed

Major

Check whether the node is created.

Node

KUBELETIsDown

Minor

Check whether kubelet is running normally on the node.

Node

NodeHasInsufficientMemory

Minor

Check whether the available memory of the node is sufficient.

Node

UnregisterNetDevice

Minor

Check whether the node is associated with any unregistered network device.

Node

NetworkCardNotFound

Minor

Check the node ENI status.

Node

KUBEPROXYIsDown

Minor

Check whether kube-proxy is running normally on the node.

Node

NodeOutOfDisk

Minor

Check whether the node disk space is sufficient.

Node

TaskHung

Minor

Check whether there are any suspended tasks on the node.

Node

CIDRNotAvailable

Minor

Check whether the node CIDR block is available.

Node

ConntrackFull

Minor

Check whether the connection tracking table on the node is full.

Node

NodeHasDiskPressure

Minor

Check whether the node disk space is sufficient.

Node

NodeInstallFailed

Minor

Check whether nodes are managed in the cluster.

Node

KernelOops

Minor

Check whether the OS kernel of the node is faulty.

Node

OOMKilling

Minor

  • The memory used by pods on the node exceeds the limit. As a result, the process is terminated.
  • The memory used by pods on the node does not exceed the limit, but the available memory of the node is insufficient. As a result, OOM occurs.

Node

DOCKERIsDown

Minor

Check whether the container engine of the node is running normally.

Node

CIDRAssignmentFailed

Minor

Check whether a CIDR block is allocated for the node.

Node

DockerHung

Minor

Check whether the Docker process on the node is suspended.

Node

FilesystemIsReadOnly

Minor

Check whether the file system of the node is read-only.

Node

NTPIsDown

Minor

Check whether NTP is running normally on the node.

Node

NodeUninstallFailed

Minor

Check whether the node is uninstalled.

Node

AUFSUmountHung

Minor

Check whether detaching the node disk is suspended.

Node

CNIIsDown

Minor

Check whether the CNI add-on on the node is faulty.

Namespace

DeleteNodeWithNoServer

Minor

Check whether discarded nodes are cleared.

Table 4 Storage events

Category

Event Name

Severity

Description

PV

DetachVolumeFailed

Minor

Check whether the block storage is detached.

PV

VolumeUnknownReclaimPolicy

Minor

Check whether a volume reclamation policy is specified.

PV

SetUpAtVolumeFailed

Minor

Check whether the data volume is mounted.

PV

VolumeFailedRecycle

Minor

Check whether the data volume is reclaimed.

PV

WaitForAttachVolumeFailed

Minor

Check whether block storage is attached to the node.

PV

VolumeFailedDelete

Minor

Check whether the data volume is deleted.

PV

MountDeviceFailed

Minor

Check whether the data volume is mounted.

PV

TearDownAtVolumeFailed

Minor

Check whether the data volume is detached.

PV

UnmountDeviceFailed

Minor

Check whether the device of the data volume is unmounted.

PV

AttachVolumeFailed

Minor

Check whether block storage is attached to the node.

PVC

VolumeResizeFailed

Minor

Check whether the capacity of the data volume is expanded.

PVC

ClaimLost

Minor

Check whether the PVC volume is normal.

PVC

ProvisioningFailed

Minor

Check whether the data volume is created.

PVC

ProvisioningCleanupFailed

Minor

Check whether the data volume is cleared.

PVC

ClaimMisbound

Minor

Check whether the PVC is bound to an incorrect volume.

Table 5 Auto scaling events

Category

Event Name

Severity

Description

Autoscaler

ScaleUpTimedOut

Major

Check whether adding nodes to the node pool timed out.

Autoscaler

NodePoolAvailable

Major

Check whether the node pool resources are sufficient.

Autoscaler

ScaleDown

Major

Nodes are being deleted from the cluster.

Autoscaler

NotTriggerScaleUp

Major

Check whether a node scale-out is triggered.

Autoscaler

DeleteUnregistered

Major

Check whether unregistered nodes are deleted.

Autoscaler

ScaleDownEmpty

Major

Check whether idle nodes are scaled in.

Autoscaler

ScaleDownFailed

Major

Check whether nodes are scaled in.

Autoscaler

FailedToScaleUpGroup

Major

Check whether an error occurred during a node pool scale-out.

Autoscaler

ScaledUpGroup

Major

Check whether the node pool is scaled out.

Autoscaler

ScaleUpFailed

Major

Check whether the node is scaled out.

Autoscaler

FixNodeGroupSizeDone

Major

Check whether the number of nodes in the node pool is restored.

Autoscaler

NodeGroupInBackOff

Major

Check whether there are any backoff retries during node pool scaling.

Autoscaler

FixNodeGroupSizeError

Major

Check whether the number of nodes in the node pool is restored.

Autoscaler

NodePoolSoldOut

Major

Check whether the node pool resources are sufficient.

Autoscaler

TriggeredScaleUp

Major

Check whether a node scale-out is triggered.

Autoscaler

StartScaledUpGroup

Major

Check whether a node pool scale-out is started.

Autoscaler

DeleteUnregisteredFailed

Major

Check whether unregistered nodes are deleted.

HPA

InvalidTargetRange

Major

  • An invalid extendedhpa.metrics value is configured in the annotations of the HPA.
  • The metric type in the spec of the HPA is incorrect.

HPA

FailedGetScale

Major

HPA failed to obtain the resource object to be scaled.

HPA

FailedComputeMetricsReplicas

Major

An error occurred while calculating the desired number of copies for the resource. For example, metrics-server is unavailable, resource metric collection fails, or the CPU usage is incorrectly set.

You can run the following command to view details:

kubectl describe horizontalpodautoscaler <hpa-name>

HPA

FailedGetObjectMetric

Major

Failed to obtain the metrics of the specified object (such as PVCs and ConfigMaps).

HPA

FailedGetPodsMetric

Major

Failed to obtain the pod resource metric (resource usage of a pod).

HPA

FailedGetResourceMetric

Major

Failed to obtain the cluster resource metric (resource usage of a cluster).

HPA

FailedGetContainerResourceMetric

Major

Failed to obtain the resource metrics of a container.

HPA

FailedGetExternalMetric

Major

Failed to obtain external metrics.

HPA

FailedRescale

Major

Failed to update the desired number of copies of the resource object to be scaled.

HPA

SuccessfulRescale

Minor

The desired number of copies of the resource object to be scaled is updated.

CronHPA

ScaleFailed

Major

CronHPA failed to update the desired number of copies of the resource object to be scaled.

CronHPA

FailedGetHorizontalPodAutoscaler

Major

CronHPA failed to query the associated HPA object. (Generally, kube-apiserver cannot respond.)

CronHPA

FailedGetHpaScale

Major

CronHPA failed to obtain the resource object to be scaled.

CronHPA

UpdateHPAFailed

Major

CronHPA failed to update the associated HPA object.

CronHPA

UpdateHPASuccess

Minor

CronHPA successfully updates the associated HPA object.

CronHPA

SkipUpdateHPA

Minor

CronHPA skips updating the associated HPA object.

CronHPA

SkipUpdateTarget

Minor

CronHPA skips updating the number of copies of the resource object to be scaled.

CronHPA

UpdateTargetSuccess

Minor

CronHPA successfully updates the number of copies of the resource object to be scaled.

CustomedHPA

FailedSetPolicySettings

Major

Failed to parse the cooldown period of CustomedHPA.

CustomedHPA

FailedSubmitRule

Major

CustomedHPA failed to process schedule rules or metric rules.

CustomedHPA

FailedComputeReplicas

Major

CustomedHPA failed to trigger resource scaling based on the compute metrics.

CustomedHPA

FailedScale

Major

CustomedHPA failed to update the desired number of copies of the resource object to be scaled. (Generally, kube-apiserver cannot respond.)

CustomedHPA

MetricScaleSuccess

Minor

CustomedHPA triggers resource scaling based on the metric rule.

CustomedHPA

CronScaleSuccess

Minor

CustomedHPA triggers resource scaling based on the periodic rule.
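
The FailedComputeMetricsReplicas row in Table 5 suggests running kubectl describe horizontalpodautoscaler <hpa-name> to view details. The sketch below shows one way such a check could be scripted: it only builds the kubectl argument lists and does not run them. The helper names, the HPA name my-hpa, and the namespace default are hypothetical. The second helper assumes that the HPA event names in Table 5 match the reason field of the Kubernetes events in the cluster, which should be verified before relying on it.

```python
import shlex


def describe_hpa_command(hpa_name, namespace=None):
    """Build the argument list for `kubectl describe horizontalpodautoscaler <hpa-name>`."""
    cmd = ["kubectl", "describe", "horizontalpodautoscaler", hpa_name]
    if namespace:
        # -n selects the namespace; without it kubectl uses the current context's namespace.
        cmd += ["-n", namespace]
    return cmd


def hpa_events_command(reason, namespace=None):
    """Build a `kubectl get events` call filtered by HPA kind and event reason.

    Assumption: the event name from Table 5 is used as the Kubernetes event reason.
    """
    selector = f"involvedObject.kind=HorizontalPodAutoscaler,reason={reason}"
    cmd = ["kubectl", "get", "events", "--field-selector", selector]
    if namespace:
        cmd += ["-n", namespace]
    return cmd


if __name__ == "__main__":
    # Print the commands instead of executing them, so the sketch runs without a cluster.
    print(shlex.join(describe_hpa_command("my-hpa", "default")))
    print(shlex.join(hpa_events_command("FailedComputeMetricsReplicas", "default")))
```

To actually run a command, the returned list could be passed to subprocess.run on a machine where kubectl is configured for the cluster.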

Control Plane Events

Table 6 Control plane events

Event ID

Severity

Description

Internal error

Major

Check whether an internal error occurs in the cluster.

External dependency error

Major

Check whether an error occurs in cluster external dependencies.

Failed to initialize process thread

Major

Check whether a cluster initialization thread is executed.

Failed to update database

Major

Check whether the database for the cluster is updated.

Failed to create node by nodepool

Major

Check whether nodes are created in the node pool.

Failed to delete node by nodepool

Major

Check whether nodes are deleted from the node pool.

Failed to create yearly/monthly subscription node

Major

Check whether the yearly/monthly node is created in the cluster.

Failed to cancel the authorization of accessing the image of the master.

Major

When creating a cluster, check whether the authorization for the resource tenant to access the master node image is canceled.

Failed to create the virtual IP for the master

Major

When creating a cluster, check whether the virtual IP address is allocated.

Failed to delete the node VM

Major

Check whether the node (VM) is deleted from the cluster.

Failed to delete the security group of node

Major

Check whether the security group of the node is deleted from the cluster.

Failed to delete the security group of master

Major

Check whether the security group of the master node is deleted from the cluster.

Failed to delete the security group of port

Major

Check whether the ENI security group of the master node is deleted from the cluster.

Failed to delete the security group of eni or subeni

Major

Check whether the ENI or sub-ENI security group is deleted from the cluster.

Failed to detach the port of master

Major

Check whether the ENI of the master node is detached from the cluster.

Failed to delete the port of master

Major

Check whether the ENI of the master node is deleted from the cluster.

Failed to delete the master VM

Major

Check whether the master node (VM) is deleted from the cluster.

Failed to delete the key pair of master

Major

Check whether the key pair of the master node is deleted from the cluster.

Failed to delete the subnet of master

Major

Check whether the subnet of the master node is deleted from the cluster.

Failed to delete the VPC of master

Major

Check whether the VPC of the master node is deleted from the cluster.

Failed to delete certificate of cluster

Major

Check whether the certificate is deleted from the cluster.

Failed to delete the server group of master

Major

Check whether the server group of the master node is deleted from the cluster.

Failed to delete the virtual IP for the master

Major

Check whether the virtual IP address is deleted from the cluster.

Failed to get floating IP of the master

Major

Check whether the floating IP address of the master node is obtained.

Failed to get cluster flavor

Major

Check whether the cluster flavor is obtained.

Failed to get cluster endpoint

Major

Check whether the cluster endpoint is obtained.

Failed to get kubernetes connection

Major

Check whether the Kubernetes cluster connections are obtained.

Failed to update secret

Major

Check whether the cluster Secret is updated.

Operation timed out

Major

Check whether the user operation timed out.

Connecting to Kubernetes cluster timed out

Major

Check whether accessing the Kubernetes cluster timed out.

Failed to check component status or components are abnormal

Major

Check whether the statuses of cluster components can be obtained or whether the components malfunction.

The node is not found in kubernetes cluster

Major

Check whether the node can be found in the Kubernetes cluster.

The status of node is not ready in kubernetes cluster

Major

Check whether the node is running normally in the Kubernetes cluster.

Can't find corresponding vm of this node in ECS

Major

Check whether the node can be found on the ECS console.

Failed to upgrade the master

Major

Check whether the master node has been upgraded.

Failed to upgrade the node

Major

Check whether the node has been upgraded.

Failed to change flavor of the master

Major

Check whether the master node flavor has been changed.

Change flavor of the master timeout

Major

Check whether changing the master node flavor timed out.

Failed to pass verification while creating yearly/monthly subscription node

Major

Check whether creating a yearly/monthly node has been verified.

Failed to install the node

Major

Check whether the node is installed in the cluster.

Failed to clean routes of cluster container network in VPC

Major

Check whether the VPC routes of the cluster container network are cleaned.

Cluster status is Unavailable

Major

Check whether the cluster is available.

Cluster status is Error

Major

Check whether the cluster is faulty.

Cluster status is not updated for a long time

Major

Check whether the cluster remains in the same state for a long time.

Failed to update master status after upgrading cluster timeout

Major

Check whether the status of the master node is updated after the cluster upgrade timed out.

Failed to update running jobs after upgrading cluster timeout

Major

Check whether running tasks are updated after the cluster upgrade timed out.

Failed to update cluster status

Major

Check whether the cluster status is updated.

Failed to update node status

Major

Check whether the node status is updated.

Failed to remove the static node from database

Major

Check whether nodes are removed from the database after managing nodes timed out.

Failed to update node status to abnormal after node processing timeout

Major

Check whether the node status is updated to abnormal after processing the node timed out.

Failed to update the cluster endpoint

Major

Check whether the cluster endpoint is updated.

Failed to delete the unavailable connection of the Kubernetes cluster

Major

Check whether unavailable Kubernetes connections are deleted.

Failed to sync the cluster cert

Major

Check whether the cluster certificate is synchronized.