Updated on 2025-09-05 GMT+08:00

Component OOM of the Cloud Native Cluster Monitoring Add-on

Components of the Cloud Native Cluster Monitoring add-on may run into out-of-memory (OOM) errors while the cluster is running. This section describes how to handle these errors.

If a component of this add-on has an OOM error, see the section for that component:

prometheus-lightweight OOM
prometheus-server OOM
thanos-query and thanos-sidecar OOM
kube-state-metrics OOM

Preparations

Ensure that the node where the workloads are deployed has sufficient memory, so that OOM errors are not caused by insufficient node memory.

prometheus-lightweight OOM

prometheus-lightweight is deployed only when local storage is disabled.

prometheus-lightweight is a lightweight collection component and does not normally run out of memory. If it does, the remote storage (AOM or third-party storage) may be abnormal. Check the logs and dashboard of prometheus-lightweight as follows:

To check the logs, go to the StatefulSets tab (Workloads > StatefulSets), select the monitoring namespace, locate prometheus-lightweight, and click View Log in the Operation column.
To check the dashboard, go to the Dashboard tab (Monitoring Center > Dashboard), select Prometheus-Agent View, and check Bytes Pending for Remote Write. The value of this metric should stay below 100 MiB over time.
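
The 100 MiB check on Bytes Pending for Remote Write can be sketched in Python. This is a minimal illustration and not part of the add-on: the function names are hypothetical, and the samples are assumed to be byte values that you have already collected (for example, read from the dashboard at regular intervals).

```python
MIB = 1024 * 1024
THRESHOLD_BYTES = 100 * MIB  # the 100 MiB limit stated in this section


def backlog_is_healthy(samples_bytes, threshold=THRESHOLD_BYTES):
    """Return True if every sampled value is below the threshold."""
    return all(value < threshold for value in samples_bytes)


def longest_breach(samples_bytes, threshold=THRESHOLD_BYTES):
    """Return the longest run of consecutive samples at or above the threshold."""
    longest = current = 0
    for value in samples_bytes:
        # Extend the current run on a breach, otherwise reset it.
        current = current + 1 if value >= threshold else 0
        longest = max(longest, current)
    return longest
```

A single brief spike may be harmless, while a long run of consecutive breaches suggests that the remote storage is not accepting data fast enough. That is why the sketch reports the run length in addition to a pass/fail result.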

prometheus-server OOM

prometheus-server is deployed only when local storage is enabled.

prometheus-server keeps the metrics of the last two hours in memory for fast queries, so its memory usage grows with the number of metrics.

An OOM error in prometheus-server may recur after a restart. Each time prometheus-server starts, it loads the full metrics of the last two hours from the write-ahead log (WAL) into memory, and this process usually consumes more memory than normal operation.

On the add-ons page, increase the memory limit of prometheus-server by 50% to 100%, and then observe whether prometheus-server starts and runs normally. (Startup may take 5 to 30 minutes, depending on the total number of metrics and disk performance.)
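
The 50% to 100% increase is simple arithmetic, sketched below in Python. The function name is hypothetical, and the limit is assumed to be expressed as a whole number of MiB. The add-ons page may present the value differently.

```python
def suggested_memory_limits_mib(current_limit_mib):
    """Return (low, high): the current limit raised by 50% and by 100%."""
    if current_limit_mib <= 0:
        raise ValueError("current limit must be positive")
    low = current_limit_mib * 3 // 2   # +50%, rounded down to a whole MiB
    high = current_limit_mib * 2       # +100%
    return low, high
```

For example, `suggested_memory_limits_mib(8192)` returns `(12288, 16384)`, which means a limit of 8 GiB would be raised to between 12 GiB and 16 GiB.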

thanos-query and thanos-sidecar OOM

thanos-query and thanos-sidecar are deployed only when both local storage and high availability are enabled. thanos-query deduplicates data from high-availability pairs.

thanos-query and thanos-sidecar may have OOM errors when many queries over long time ranges are executed or when many queries run concurrently.

On the add-ons page, increase the CPU and memory limits of thanos-query and thanos-sidecar by 100%.
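
If the current limits are written as Kubernetes resource quantities (such as `500m` for CPU or `2Gi` for memory), doubling them can be sketched as follows. The helper names are hypothetical, and the sketch assumes integer values with only the `m`, `Ki`, `Mi`, and `Gi` suffixes. It does not cover every quantity format that Kubernetes accepts.

```python
def double_cpu(quantity):
    """Double a CPU quantity given as millicores ('500m') or whole cores ('2')."""
    if quantity.endswith("m"):
        return f"{int(quantity[:-1]) * 2}m"
    return str(int(quantity) * 2)


def double_memory(quantity):
    """Double a memory quantity with a Ki, Mi, or Gi suffix, keeping the suffix."""
    for suffix in ("Gi", "Mi", "Ki"):
        if quantity.endswith(suffix):
            return f"{int(quantity[:-len(suffix)]) * 2}{suffix}"
    raise ValueError(f"unsupported memory quantity: {quantity!r}")
```

For example, limits of `500m` CPU and `2Gi` memory become `1000m` and `4Gi`.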