Updated on 2026-03-10 GMT+08:00

Customizing an AOM Monitoring Panel for Inference Metrics

This section describes how to create a custom AOM monitoring panel for an inference service. The monitoring system automatically discovers workloads in Kubernetes, collects inference performance metrics in Prometheus format (such as QPS, inference latency, and resource usage), and provides flexible visualization options for the key monitoring requirements of AI inference scenarios.

Prerequisites

  • A cluster of v1.34 or later is available, and the cluster contains GPU nodes and GPU-related services.
  • You have installed the CCE AI Suite (NVIDIA GPU) and Cloud Native Cluster Monitoring add-ons.

    When installing Cloud Native Cluster Monitoring, you must enable Report Monitoring Data to AOM so that Prometheus data is reported to AOM.

Procedure

  1. Prepare a local LLM.

    1. Install the Hugging Face Python client.
      pip install --trusted-host mirrors.tools.huawei.com \
          -i https://mirrors.tools.huawei.com/pypi/simple \
          -U huggingface_hub
    2. Set an environment variable to prevent download timeouts.
      export HF_HUB_DOWNLOAD_TIMEOUT=10000
    3. Create the following Python file and run it to download the required model.
      import os

      from huggingface_hub import HfApi

      # Mirror endpoint. Set HF_HUB_ENDPOINT to override the default.
      HF_ENDPOINT = os.getenv('HF_HUB_ENDPOINT', 'http://mirrors.tools.huawei.com/huggingface')
      api = HfApi(endpoint=HF_ENDPOINT)

      # Download into the directory that the Deployment mounts through hostPath.
      api.snapshot_download(
          repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
          repo_type="model",
          revision="main",
          local_dir="/data/huggingface-cache/DeepSeek-R1-Distill-Qwen-1.5B",
          etag_timeout=10000
      )

      print("Model downloaded.")
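
Before deploying, it can help to confirm that the download actually produced a usable model directory. The following is a minimal sketch, not part of the original procedure: the helper name missing_model_files and the file patterns are assumptions based on what a typical Hugging Face model snapshot contains, and the path is the hostPath directory used by the Deployment in the next step.

```python
from pathlib import Path

# Assumed patterns: a typical Hugging Face snapshot holds a config file,
# tokenizer files, and at least one weight file. Adjust them for your model.
REQUIRED_PATTERNS = ("config.json", "tokenizer*", "*.safetensors")


def missing_model_files(model_dir, patterns=REQUIRED_PATTERNS):
    """Return the patterns that match no file directly under model_dir."""
    root = Path(model_dir)
    return [p for p in patterns if not any(root.glob(p))]


if __name__ == "__main__":
    # hostPath directory referenced by the Deployment in this section.
    model_dir = "/data/huggingface-cache/DeepSeek-R1-Distill-Qwen-1.5B"
    missing = missing_model_files(model_dir)
    if missing:
        print(f"Missing files matching: {missing}")
    else:
        print("Model directory looks complete.")
```

If the script reports missing patterns, rerun the download before creating the Deployment, because the hostPath volume of type Directory requires the directory to exist on the node.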

  2. Deploy the vLLM service and configure monitoring.

    1. Create a vLLM service configuration file.

      Use the following YAML to create a Deployment manifest (for example, vllm-deepseek.yaml) for the vLLM service that serves the DeepSeek-R1-Distill-Qwen-1.5B model.

      An open-source image is used. If the image fails to be pulled due to network problems, check your network or proxy settings.

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: deepseek-r1-qwen
        namespace: default
        labels:
          app: deepseek-r1-qwen
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: deepseek-r1-qwen
        template:
          metadata:
            labels:
              app: deepseek-r1-qwen
          spec:
            volumes:
            - name: model-volume
              hostPath:
                path: /data/huggingface-cache/DeepSeek-R1-Distill-Qwen-1.5B
                type: Directory
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "2Gi"
            containers:
            - name: deepseek-r1-qwen
              image: vllm/vllm-openai:nightly
              command: ["/bin/sh", "-c"]
              args: [
                "vllm serve /models/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --max-model-len 4096"
              ]
              ports:
              - containerPort: 8000
                name: http
              resources:
                limits:
                  cpu: "8"
                  memory: 12G
                  nvidia.com/gpu: "1"
                requests:
                  cpu: "2"
                  memory: 4G
                  nvidia.com/gpu: "1"
              volumeMounts:
              - mountPath: /models/DeepSeek-R1-Distill-Qwen-1.5B
                name: model-volume
                readOnly: true
              - name: shm
                mountPath: /dev/shm
              livenessProbe:
                httpGet:
                  path: /health
                  port: 8000
                initialDelaySeconds: 120
                periodSeconds: 10
                timeoutSeconds: 5
                failureThreshold: 3
              readinessProbe:
                httpGet:
                  path: /health
                  port: 8000
                initialDelaySeconds: 120
                periodSeconds: 5
                timeoutSeconds: 5
                failureThreshold: 3
    2. Deploy and verify the service status.
      kubectl create -f vllm-deepseek.yaml
      kubectl get pods -w

      An example of the returned result is shown below. Ensure that the pod status is Running.

      NAME                              READY   STATUS                     RESTARTS   AGE
      deepseek-r1-qwen-84b95998-qtwgn   1/1     Running                    0          78m
    3. Configure Prometheus monitoring (a PodMonitor).
      Use the following YAML to create a PodMonitor so that Prometheus can scrape the metrics of the vLLM service. The port value http refers to the named container port defined in the Deployment.
      apiVersion: monitoring.coreos.com/v1
      kind: PodMonitor
      metadata:
        name: vllm-deepseek-metrics
        namespace: default
        labels:
          app: deepseek-r1-qwen
      spec:
        selector:
          matchLabels:
            app: deepseek-r1-qwen
        podMetricsEndpoints:
        - port: http
          path: /metrics
          interval: 30s
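
The PodMonitor only works if the pod really serves Prometheus-format text on /metrics. As a rough illustration of what is scraped, the sketch below parses that text format and lists the metric names that carry the vllm: prefix. The helper names parse_samples and vllm_metric_names are hypothetical, the parser assumes sample lines without a trailing timestamp, and the localhost:8000 address assumes that the pod is reachable locally, for example through the port forwarding set up in the next step.

```python
import urllib.request


def parse_samples(text):
    """Parse Prometheus text exposition lines into (name, labels, value) tuples.

    Assumes each sample line ends with the value (no trailing timestamp).
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, HELP, and TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        samples.append((name, labels.rstrip("}"), float(value)))
    return samples


def vllm_metric_names(text, prefix="vllm:"):
    """Return the sorted, de-duplicated metric names that start with prefix."""
    return sorted({name for name, _, _ in parse_samples(text) if name.startswith(prefix)})


if __name__ == "__main__":
    # Assumes the vLLM pod is forwarded to localhost:8000.
    with urllib.request.urlopen("http://localhost:8000/metrics", timeout=10) as resp:
        body = resp.read().decode("utf-8")
    for name in vllm_metric_names(body):
        print(name)
```

An empty list suggests that the endpoint path or port in the PodMonitor does not match what the container exposes.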

  3. Check whether the LLM service is running properly.

    1. Expose the deployed vLLM service to a local port through port forwarding.
      kubectl port-forward deployment/deepseek-r1-qwen 8000:8000

      The following is an example of the returned result:

      Forwarding from 127.0.0.1:8000 -> 8000
      Forwarding from [::1]:8000 -> 8000

      Keep the terminal running. Subsequent requests will be sent through localhost:8000.

    2. Send a test request from a new terminal.

      Open a new terminal window and run the following command to send a test request to the model:

      curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "/models/DeepSeek-R1-Distill-Qwen-1.5B",
          "messages": [{"role": "user", "content": "Hello, how are you?"}],
          "max_tokens": 50
        }'
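    3. Check the port forwarding output in the first terminal. A Handling connection line indicates that the request was forwarded to the pod.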
      Forwarding from 127.0.0.1:8000 -> 8000
      Forwarding from [::1]:8000 -> 8000
      Handling connection for 8000
    4. Check the model output.

      If the service is running properly, a JSON response similar to the following is returned. A finish_reason of length means that the output was cut off at the max_tokens limit.

      {
        "id": "chatcmpl-10321d932eaf4f8f820241c6308e****",
        "object": "chat.completion",
        "created": 1766490573,
        "model": "/models/DeepSeek-R1-Distill-Qwen-1.5B",
        "choices": [
          {
            "index": 0,
            "message": {
              "role": "assistant",
              "content": "Alright, the user greeted me with \"Hello, how are you?\" which is a common and friendly way to start a conversation. I should respond in a similar tone to keep the conversation going smoothly.\n\nI want to make sure I acknowledge their greeting and",
              "refusal": null,
              "annotations": null,
              "audio": null,
              "function_call": null,
              "tool_calls": [],
              "reasoning_content": null
            },
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null,
            "token_ids": null
          }
        ],
        "service_tier": null,
        "system_fingerprint": null,
        "usage": {
          "prompt_tokens": 11,
          "total_tokens": 61,
          "completion_tokens": 50
        },
        "prompt_logprobs": null,
        "prompt_token_ids": null,
        "kv_transfer_params": null
      }
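
The curl request above can also be sent from Python, which is convenient for generating a few requests so that the metrics have data to show. This is a sketch under the same assumptions as the curl example (port forwarding active on localhost:8000, the model served under its mount path). The function names build_payload, send_chat, and summarize are illustrative and not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port-forwarded address from this step
MODEL = "/models/DeepSeek-R1-Distill-Qwen-1.5B"


def build_payload(prompt, max_tokens=50):
    """Build the same request body as the curl example."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_chat(prompt, max_tokens=50, timeout=60):
    """POST one chat completion request and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_payload(prompt, max_tokens)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def summarize(response):
    """Extract the fields that are most useful for a quick health check."""
    choice = response["choices"][0]
    usage = response["usage"]
    return {
        "truncated": choice["finish_reason"] == "length",
        "tokens_consistent": usage["prompt_tokens"] + usage["completion_tokens"]
        == usage["total_tokens"],
        "text": choice["message"]["content"],
    }


if __name__ == "__main__":
    # Send a handful of requests so that request and token counters increase.
    for _ in range(5):
        print(summarize(send_chat("Hello, how are you?")))
```

For the sample response shown above, the summary would report truncated as true, because finish_reason is length, and the token counts as consistent, because 11 + 50 = 61.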

  4. Configure a collection policy, verify metric collection, and query the metrics.

    1. Configure a collection policy.
      1. Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose Settings. In the right pane, click the Monitoring tab.
      2. Click Manage under PodMonitor Policies.
      3. On the PodMonitor Policies tab, enable the PodMonitor collection policy created for the vLLM pod (vllm-deepseek-metrics in this example).

    2. Check whether the collection target is working.
      1. On the Monitoring tab, click View Details under Targets.
      2. In the window that slides out from the right, check that the Prometheus collection target for the vLLM pod is in the normal state.

    3. Log in to the AOM console. In the navigation pane, choose Metric Browsing. In the right pane, click the Metric Sources tab, select the corresponding Prometheus instance, and query the vLLM metrics.
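
If you query the metrics with Prometheus statements, the QPS and latency mentioned at the start of this section can be derived from counters and histograms. The sketch below only builds example PromQL strings. The metric names (vllm:request_success_total, vllm:e2e_request_latency_seconds) and the namespace label are assumptions that depend on the vLLM version and on how the targets are labeled, so check them against the /metrics output of your pod before use. The helper names are hypothetical.

```python
def rate_query(counter, selector, window="5m"):
    """PromQL for the per-second rate of a counter, summed over all series."""
    return f"sum(rate({counter}{{{selector}}}[{window}]))"


def quantile_query(histogram, selector, q=0.95, window="5m"):
    """PromQL for a latency quantile computed from histogram buckets."""
    return (
        f"histogram_quantile({q}, "
        f"sum by (le) (rate({histogram}_bucket{{{selector}}}[{window}])))"
    )


if __name__ == "__main__":
    selector = 'namespace="default"'  # assumed label; adjust to your targets
    # Assumed metric names; verify them in the /metrics output of the pod.
    print("QPS:        ", rate_query("vllm:request_success_total", selector))
    print("P95 latency:", quantile_query("vllm:e2e_request_latency_seconds", selector))
```

The printed statements can be pasted into a query box that accepts Prometheus statements and then added to a custom panel.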