Building a Cloud-Native Large Model Monitoring Dashboard Based on Mainstream Inference Engines Such as vLLM and SGLang

As large language models (LLMs) are increasingly deployed at scale in production environments, building high-concurrency, low-latency inference services has become a core requirement. Thanks to technologies such as PagedAttention, mainstream inference engines such as vLLM and SGLang have become the preferred foundation for enterprise LLM deployments. However, LLM inference is a compute-intensive task that largely operates as a black box. Enterprises typically face the following operations and maintenance challenges in production:

Delayed awareness of throughput and queue status: The internal state of request queues, including the number of waiting and running requests, cannot be detected in real time, making it difficult to determine whether the system is overloaded or congested.
Token-level performance cannot be quantified: LLM-specific performance metrics such as Time to First Token (TTFT) and Inter-Token Latency (ITL) are not monitored.
Disconnection between GPU memory fragmentation and compute utilization: KV Cache memory usage cannot be analyzed alongside actual GPU compute utilization. This makes it difficult to optimize resource scheduling and deployment policies.
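
To make the two token-level metrics concrete, the following minimal sketch computes TTFT and the mean ITL for a single streamed response from client-side timestamps. The function name and the sample timestamps are hypothetical and only illustrate the definitions. The inference engine measures these values internally and may define them slightly differently.

```python
from statistics import mean
from typing import List, Tuple


def ttft_and_itl(request_sent: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (TTFT, mean ITL) in seconds for one streamed response.

    request_sent: time at which the request was sent.
    token_times: arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times[0] - request_sent
    # ITL: gap between consecutive output tokens, averaged over the response.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0
    return ttft, itl


# Hypothetical timestamps (seconds) for a four-token response.
ttft, itl = ttft_and_itl(0.0, [0.5, 0.75, 1.0, 1.25])
print(f"TTFT={ttft:.3f}s, mean ITL={itl:.3f}s")
```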

Huawei Cloud CCE is fully compatible with mainstream high-performance open-source inference engines such as vLLM and SGLang. The following sections use vLLM as an example to describe how to use CCE Cloud Native Cluster Monitoring (Prometheus) to automatically detect and seamlessly scrape the Prometheus metrics exposed by the foundation model inference engine through declarative configuration.

Prerequisites

A cluster of v1.34 or later is available. The cluster contains GPU nodes and GPU-related services.
You have installed the CCE AI Suite (NVIDIA GPU) and Cloud Native Cluster Monitoring add-ons.
When installing Cloud Native Cluster Monitoring, you must enable Report Monitoring Data to AOM to report Prometheus data to AOM.

Procedure

Prepare a local LLM.

This example requires Python 3.8 or later.

Install the Hugging Face Python client.

pip install --trusted-host mirrors.tools.huawei.com \
    -i https://mirrors.tools.huawei.com/pypi/simple \
    -U huggingface_hub

Set the following environment variable to prevent download timeouts.
```
export HF_HUB_DOWNLOAD_TIMEOUT=10000
```

Create a target directory.

mkdir -p /data/huggingface-cache/DeepSeek-R1-Distill-Qwen-1.5B

Create the following Python file and run it to download the required model.

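from huggingface_hub import HfApi

# Mirror endpoint used in place of the default Hugging Face Hub endpoint.
HF_ENDPOINT = 'http://mirrors.tools.huawei.com/huggingface'
api = HfApi(endpoint=HF_ENDPOINT)

# Download a snapshot of the model repository into the target directory created earlier.
api.snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    repo_type="model",
    revision="main",
    local_dir="/data/huggingface-cache/DeepSeek-R1-Distill-Qwen-1.5B",
    etag_timeout=10000  # Seconds to wait when fetching file metadata (ETag).
)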
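
After the download finishes, a quick check of the target directory can catch an incomplete snapshot before the workload is deployed. The sketch below is illustrative only. The helper name is hypothetical, and the assumption that the snapshot contains a config.json file comes from the usual layout of Hugging Face model repositories, not from this procedure. The demonstration runs against a temporary directory so that it does not depend on the real model path.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def missing_model_files(model_dir: str, required: Iterable[str] = ("config.json",)) -> List[str]:
    """Return the names in `required` that are not regular files in model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


# Demonstration on a temporary directory instead of the real model path.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_model_files(tmp)           # nothing downloaded yet
    (Path(tmp) / "config.json").write_text("{}")
    after = missing_model_files(tmp)            # required file now present
print(before, after)
```

For the real download, the same helper would be called with /data/huggingface-cache/DeepSeek-R1-Distill-Qwen-1.5B, and an empty list would indicate that the assumed files are present.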

Deploy the vLLM service and configure monitoring.

Use the following YAML to create a vLLM workload:

The model is mounted through a hostPath volume, so the workload must run on a node where the model has been downloaded.
The container uses an open-source image. If the image pull fails, for example because of network problems, check your network or proxy settings.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1-qwen
  namespace: default
  labels:
    app: deepseek-r1-qwen
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1-qwen
  template:
    metadata:
      labels:
        app: deepseek-r1-qwen
    spec:
      volumes:
      - name: model-volume
        hostPath:
          path: /data/huggingface-cache/DeepSeek-R1-Distill-Qwen-1.5B
          type: Directory
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: deepseek-r1-qwen
        image: vllm/vllm-openai:nightly
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve /models/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --max-model-len 4096"
        ]
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            cpu: "8"
            memory: 12G
            nvidia.com/gpu: "1"
          requests:
            cpu: "2"
            memory: 4G
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /models/DeepSeek-R1-Distill-Qwen-1.5B
          name: model-volume
          readOnly: true
        - name: shm
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 5
          timeoutSeconds: 5
          failureThreshold: 3

Check the pod status.

kubectl get pods -w

Information similar to the following is displayed. The pod status is Running.

NAME                              READY   STATUS                     RESTARTS   AGE
deepseek-r1-qwen-84b95998-qtwgn   1/1     Running                    0          78m

Use the following YAML to create a PodMonitor:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-deepseek-metrics
  namespace: default
  labels:
    app: deepseek-r1-qwen
spec:
  selector:
    matchLabels:
      app: deepseek-r1-qwen
  podMetricsEndpoints:
  - port: http
    path: /metrics
    interval: 30s
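
The PodMonitor scrapes the /metrics path on the port named http, which returns metrics in the Prometheus text format. As a rough illustration of what the scraper reads, the sketch below filters the series whose names start with vllm: from such text. The sample text is hand-written for illustration and is not real output from this deployment. The metric names vllm:num_requests_running and vllm:num_requests_waiting are assumed to be exposed by the deployed vLLM version. The parser is deliberately simplified and assumes that sample lines carry no trailing timestamp.

```python
from typing import List, Tuple

# Hand-written sample in the Prometheus text format (illustrative, not real output).
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="/models/DeepSeek-R1-Distill-Qwen-1.5B"} 1.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="/models/DeepSeek-R1-Distill-Qwen-1.5B"} 0.0
process_cpu_seconds_total 12.5
"""


def parse_samples(text: str, prefix: str = "vllm:") -> List[Tuple[str, float]]:
    """Return (metric name, value) pairs for series whose name starts with prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Split at the last space so that spaces inside label values are tolerated.
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if name.startswith(prefix):
            samples.append((name, float(value)))
    return samples


for name, value in parse_samples(SAMPLE):
    print(name, value)
```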

Check whether the LLM service is running properly.

Expose the deployed vLLM service to a local port through port forwarding.
```
kubectl port-forward deployment/deepseek-r1-qwen 8000:8000
```
The following is an example of the returned result:
```
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
```
Keep the terminal running. Subsequent requests will be sent through localhost:8000.
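
Before sending requests, it can help to confirm that the /health endpoint used by the probes responds through the forwarded port. The following is a minimal sketch, not part of the official procedure. The helper names, attempt count, and delay are hypothetical choices.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def wait_until_healthy(check: Callable[[], bool], attempts: int = 30, delay: float = 2.0) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

With port forwarding active, a call such as wait_until_healthy(lambda: http_ok("http://localhost:8000/health")) would return True once the service answers.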

Send a test request in a new terminal.

Open a new terminal window and run the following command to send a test request to the model:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 50
  }'

Observe the port forwarding output.

Normally, information similar to the following is displayed:

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
Handling connection for 8000

Observe the model output.

In normal cases, you will receive a JSON response similar to the following:

{"id":"chatcmpl-10321d932eaf4f8f820241c630******","object":"chat.completion","created":1766490573,"model":"/models/DeepSeek-R1-Distill-Qwen-1.5B","choices":[{"index":0,"message":{"role":"assistant","content":"Alright, the user greeted me with \"Hello, how are you?\" which is a common and friendly way to start a conversation. I should respond in a similar tone to keep the conversation going smoothly.\n\nI want to make sure I acknowledge their greeting and","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":11,"total_tokens":61,"completion_tokens":50,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
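
The response fields most useful for a quick sanity check are the generated content, finish_reason, and usage. In the example above, finish_reason is length because generation stopped at the max_tokens limit of 50, which matches completion_tokens. The sketch below extracts these fields from a trimmed copy of that response. The content string is shortened for readability and the helper name is hypothetical.

```python
import json

# Trimmed copy of the example response (content shortened, other fields omitted).
RESPONSE = (
    '{"choices":[{"index":0,"message":{"role":"assistant",'
    '"content":"Alright, the user greeted me"},"finish_reason":"length"}],'
    '"usage":{"prompt_tokens":11,"total_tokens":61,"completion_tokens":50}}'
)


def summarize(body: str) -> dict:
    """Extract the reply text, truncation flag, and token counts from a response body."""
    data = json.loads(body)
    choice = data["choices"][0]
    usage = data["usage"]
    return {
        "content": choice["message"]["content"],
        # "length" means generation stopped at a token limit, not at a natural end.
        "truncated": choice["finish_reason"] == "length",
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
    }


summary = summarize(RESPONSE)
print(summary)
```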

Configure a collection policy and metric reporting, and query the metrics.
1. Configure a collection policy.
  1. Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose Settings. In the right pane, click the Monitoring tab.
  2. Click Manage under PodMonitor Policies.
  3. On the PodMonitor Policies tab, enable the PodMonitor collection policy for the pod.
2. Check whether the collection endpoint has been enabled.
  1. On the Monitoring tab, click View Details under Targets.
  2. In the window that slides out from the right, check whether the status of the Prometheus collection endpoint is normal.
3. Log in to the AOM console. In the navigation pane, choose Metric Browsing. In the right pane, click the Metric Sources tab, select the corresponding Prometheus instance, and query the vLLM metrics.
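
When querying latency metrics, note that Prometheus histograms expose cumulative _sum and _count series, so an average over a time window is derived from the difference between two scrapes. The sketch below shows that arithmetic for TTFT, assuming the deployed vLLM version exposes a histogram named vllm:time_to_first_token_seconds. The function name and the numbers are hypothetical. In PromQL, the same idea is commonly written as rate(vllm:time_to_first_token_seconds_sum[5m]) / rate(vllm:time_to_first_token_seconds_count[5m]).

```python
from typing import Optional


def mean_over_window(sum_start: float, count_start: float,
                     sum_end: float, count_end: float) -> Optional[float]:
    """Average observation between two scrapes of a histogram's _sum and _count.

    Returns None when no new observations were recorded or the counter went
    backwards (for example, after a pod restart).
    """
    observed = count_end - count_start
    if observed <= 0:
        return None
    return (sum_end - sum_start) / observed


# Hypothetical scrapes: 30 new requests added 6.0 seconds of total TTFT.
mean_ttft = mean_over_window(sum_start=12.0, count_start=100, sum_end=18.0, count_end=130)
print(f"mean TTFT over window: {mean_ttft:.3f}s")
```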