
kube-prometheus-stack

Introduction

The Cloud Native Cluster Monitoring add-on (formerly kube-prometheus-stack) is built on Prometheus Operator and Prometheus and provides easy-to-use, end-to-end Kubernetes cluster monitoring.

This add-on reports monitoring data to the monitoring center, so that you can view the data and configure alarms on the console.

Open-source community: https://github.com/prometheus/prometheus

Scenario

  • If you need to report resource pool metrics to a third-party platform, install this plug-in. If this plug-in is not installed, metrics are reported to Huawei Cloud AOM instead.

If this scenario does not apply, you do not need to install the plug-in.

Constraints

By default, the kube-state-metrics component of the plug-in does not collect labels or annotations of Kubernetes resources.
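
In the open-source kube-state-metrics, label and annotation collection is controlled by allowlist flags. The following is a minimal sketch of those upstream flags only; whether and how this plug-in exposes them for configuration is not stated here, so treat the snippet as illustrative:

    # Fragment of a kube-state-metrics container spec (illustrative only).
    # The allowlist flags are upstream kube-state-metrics options that
    # enable exporting pod labels and annotations as metrics.
    containers:
      - name: kube-state-metrics
        image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.12.0
        args:
          - --metric-labels-allowlist=pods=[*]        # export all pod labels
          - --metric-annotations-allowlist=pods=[*]   # export all pod annotations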

Permissions

The node-exporter component of this plug-in needs to read the Docker info data from the /var/run/docker.sock socket file on the host to monitor the Docker disk space.

The CAP_DAC_OVERRIDE capability is required so that node-exporter can read the Docker info data.
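
For illustration, the following minimal DaemonSet sketch (not this plug-in's actual manifest) shows how a node-exporter pod can mount the Docker socket and be granted the capability:

    # Illustrative sketch only: mount docker.sock read-only and add the
    # DAC_OVERRIDE capability so the process can read the socket even when
    # file permissions would otherwise deny access.
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: node-exporter
    spec:
      selector:
        matchLabels: {app: node-exporter}
      template:
        metadata:
          labels: {app: node-exporter}
        spec:
          containers:
            - name: node-exporter
              image: quay.io/prometheus/node-exporter:v1.8.1
              securityContext:
                capabilities:
                  add: ["DAC_OVERRIDE"]   # bypass file permission checks on docker.sock
              volumeMounts:
                - name: docker-sock
                  mountPath: /var/run/docker.sock
                  readOnly: true
          volumes:
            - name: docker-sock
              hostPath:
                path: /var/run/docker.sock
                type: Socket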

Installing a Plug-in

  1. Log in to the ModelArts console. In the navigation pane on the left, choose Standard Cluster under Resource Management.
  2. Click the resource pool name to access its details page.
  3. On the resource pool details page, click the Plug-ins tab.
  4. Locate the plug-in to be installed in the list and click Install.
    Figure 1 Installing a plug-in
    Table 1 Parameters

    Parameter: Data Storage Configuration (Select at Least One Item)

    Sub-Parameter: Report Monitoring Data to a Third-Party Platform
    Description: To report Prometheus data to a third-party monitoring system, enter the address and token of the third-party monitoring system and choose whether to skip certificate authentication. (For the equivalent Prometheus configuration, see the remote-write sketch after the installation steps.)
    • Data Reporting Address: Enter the complete remote-write address of a third-party platform or Huawei Cloud Prometheus instance.
    • Authentication Mode: Select the authentication mode used for reporting monitoring data and enter the credentials that the selected mode requires, such as the username and password of the third-party platform.

    Sub-Parameter: Local Data Storage
    Description:
    • Local data storage: Monitoring data is stored within the cluster for local metric-based queries. This consumes significant CPU, memory, and disk resources, in direct proportion to the cluster size and the number of custom metrics in use.
    • No local data storage: Monitoring data is stored outside the cluster, in either AOM or a third-party monitoring system, to save cluster resources.
    If this function is enabled, monitoring data is stored in attached EVS or DSS disks so that functions relying on local monitoring data (for example, HPA using custom metrics) can run properly. To use HPA, enable this option. If you uninstall the plug-in, data on the EVS disk is automatically deleted. Created storage volumes are billed and counted toward the storage quota.
    Once enabled, you must also set the EVS disk type and capacity.

    Parameter: Specifications

    Sub-Parameter: Plug-in Version
    Description: Specify the version of the plug-in to be deployed.

    Sub-Parameter: Plug-in Specifications
    Description: Preset:
    • Demo: for clusters with fewer than 100 Pods
    • Small: for clusters with fewer than 2,000 Pods
    • Medium: for clusters with fewer than 5,000 Pods
    • Large: for clusters with more than 5,000 Pods
    Custom: Set the CPU and memory quotas as required. Ensure the cluster has sufficient node resources; otherwise, plug-in instances will fail to be scheduled.

    Sub-Parameter: High Availability
    Description: If Local data storage is selected, prometheusOperator, prometheus-server, alertmanager, thanosSidecar, thanosQuery, adapter, and kubeStateMetrics are deployed in multi-instance mode. If Local data storage is not selected, prometheusOperator, prometheus-lightweight, and kubeStateMetrics are deployed in multi-instance mode. The supported deployment modes vary by component; for details, see Components.

    Sub-Parameter: Configuration List
    Description: Detailed configuration of the selected specifications.

    Parameter: Parameter Settings

    Sub-Parameter: Collection Period (s)
    Description: Interval, in seconds, at which metrics are collected. If the cluster is large (≥ 200 nodes or ≥ 10,000 Pods), set this parameter to 30 or 60.

    Sub-Parameter: Data Retention Period
    Description: Set how long monitoring data is retained.

    Sub-Parameter: Advanced Settings
    Description:
    • Node-Exporter Listening Port: This port uses the host network so that Prometheus can collect metrics from the node where the port is located. If the port conflicts with another application port, you can change it.
    • Scheduling Policy: A component runs only on nodes that tolerate the configured taint. (For the underlying pod-spec semantics, see the scheduling sketch after the installation steps.)

      After Prometheus is upgraded, you can configure node affinity and taint tolerations for the Prometheus plug-in components on this page. Currently, only taint key-level tolerance is supported, which allows components to run on nodes whose taints match the configured key.

      You can add multiple scheduling policies. A policy that names a specific component takes priority over the global policy. By default, a scheduling policy is disabled if its affinity node key or toleration taint key is not configured.

  5. Read "Usage Notes" and select I have read and understand the preceding information.
  6. Click OK.
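
For reference, the Report Monitoring Data to a Third-Party Platform settings in Table 1 correspond conceptually to a Prometheus remote-write configuration. The following is a minimal sketch under that assumption; the endpoint, credentials, and the exact mapping the console applies are placeholders, not confirmed values:

    # Illustrative Prometheus remote_write sketch (all values are placeholders).
    remote_write:
      - url: "https://third-party.example.com/api/v1/write"   # Data Reporting Address
        basic_auth:                 # one possible Authentication Mode
          username: "monitor-user"
          password: "monitor-pass"
        # For token-based platforms, Prometheus also supports:
        # authorization:
        #   credentials: "<token>"
        tls_config:
          insecure_skip_verify: true   # corresponds to skipping certificate authentication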
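
Similarly, the Scheduling Policy settings under Advanced Settings map to standard Kubernetes pod scheduling fields. The following sketch shows a key-level toleration plus node affinity; the taint key and node label key are hypothetical, and the console may generate a different but equivalent spec:

    # Key-level toleration: "operator: Exists" with only a key tolerates any
    # value and any effect of that taint key, matching the taint key-level
    # tolerance described above.
    tolerations:
      - key: "dedicated"              # hypothetical taint key
        operator: "Exists"
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: "monitoring-pool"   # hypothetical node label key
                  operator: Exists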

Components

Component: prometheusOperator (workload name: prometheus-operator)
Description: Deploys and manages the Prometheus Server based on custom resource definitions (CRDs), and watches and handles the events related to these CRDs. It is the control center of the entire system.
Deployment Mode: All
Resource Type: Deployment

Component: prometheus (workload name: prometheus-server)
Description: The Prometheus Server cluster that the operator deploys, as a StatefulSet, based on the Prometheus CRD.
Deployment Mode: All
Resource Type: StatefulSet

Component: alertmanager (workload name: alertmanager-alertmanager)
Description: Alarm center of the plug-in. It receives alarms sent by Prometheus and manages them through deduplication, grouping, and distribution.
Deployment Mode: Local data storage enabled
Resource Type: StatefulSet

Component: thanosSidecar
Description: Available only in HA mode. Runs in the same pod as prometheus-server to persist Prometheus metric data.
Deployment Mode: Local data storage enabled
Resource Type: Container

Component: thanosQuery
Description: Entry point for PromQL queries when Prometheus runs in HA mode. It deduplicates metrics returned by Store or Prometheus.
Deployment Mode: Local data storage enabled
Resource Type: Deployment

Component: adapter (workload name: custom-metrics-apiserver)
Description: Aggregates custom metrics into the native Kubernetes API server, as illustrated by the HPA sketch after this table.
Deployment Mode: Local data storage enabled
Resource Type: Deployment

Component: kubeStateMetrics (workload name: kube-state-metrics)
Description: Converts Prometheus metric data into a format that Kubernetes APIs can identify. By default, kube-state-metrics does not collect all labels or annotations of Kubernetes resources. To collect such information, see Collecting All Labels and Annotations of a Pod. If the component runs in multiple pods, only one pod provides metrics.
Deployment Mode: All
Resource Type: Deployment

Component: nodeExporter (workload name: node-exporter)
Description: Deployed on each node to collect node monitoring data.
Deployment Mode: All
Resource Type: DaemonSet
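
The adapter's role can be made concrete with an autoscaling example: once custom metrics are aggregated into the Kubernetes custom metrics API, an HPA can scale workloads on them. The following is a minimal sketch, assuming local data storage is enabled; the workload name and metric name are hypothetical:

    # Illustrative HPA consuming a custom Pods metric served through the
    # custom metrics API (backed by custom-metrics-apiserver).
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: demo-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: demo-app                         # hypothetical workload
      minReplicas: 1
      maxReplicas: 10
      metrics:
        - type: Pods
          pods:
            metric:
              name: http_requests_per_second   # hypothetical custom metric
            target:
              type: AverageValue
              averageValue: "100"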