Customizing an AOM Monitoring Panel for Inference Metrics
This section describes how to create a custom AOM monitoring panel for an inference service. The monitoring system automatically discovers workloads in Kubernetes, collects inference performance metrics in Prometheus format (such as QPS, inference latency, and resource usage), and provides flexible visualization options for the key monitoring requirements of AI inference scenarios.
Prerequisites
- A cluster of v1.34 or later is available, and it contains GPU nodes and GPU-related services.
- You have installed the CCE AI Suite (NVIDIA GPU) and Cloud Native Cluster Monitoring add-ons.
When installing Cloud Native Cluster Monitoring, you must enable Report Monitoring Data to AOM to report Prometheus data to AOM.
Procedure
- Prepare a local LLM.
- Install the Hugging Face Python client.
    pip install --trusted-host mirrors.tools.huawei.com \
        -i https://mirrors.tools.huawei.com/pypi/simple \
        -U huggingface_hub
- Configure environment variables to prevent the download timeout.
export HF_HUB_DOWNLOAD_TIMEOUT=10000
- Create the following Python file and run it to download the required model.
    import os

    from huggingface_hub import HfApi

    # Mirror endpoint; override it by setting HF_HUB_ENDPOINT.
    HF_ENDPOINT = os.getenv('HF_HUB_ENDPOINT', 'http://mirrors.tools.huawei.com/huggingface')

    api = HfApi(endpoint=HF_ENDPOINT)
    # Download into the directory that the Deployment mounts as a hostPath volume.
    api.snapshot_download(
        repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        repo_type="model",
        revision="main",
        local_dir="/data/huggingface-cache/DeepSeek-R1-Distill-Qwen-1.5B",
        etag_timeout=10000
    )
    print("Model downloaded.")
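Before deploying, it can help to confirm that the download actually landed in the directory the Deployment mounts. The following is a minimal sketch of such a check, not part of the official procedure. The helper name `missing_model_files` and the list of expected file names are assumptions for illustration; adjust them to the files the model repository actually contains.

```python
from pathlib import Path

# Assumed file names (hypothetical list); adjust to the actual repository contents.
EXPECTED_FILES = ("config.json", "tokenizer.json")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return a list of items that are missing from model_dir (empty if none)."""
    root = Path(model_dir)
    if not root.is_dir():
        # The directory itself is missing, so report it and stop.
        return [str(root)]
    missing = [name for name in expected if not (root / name).is_file()]
    # Assumption: weights are stored as one or more *.safetensors files.
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing
```

For example, `missing_model_files("/data/huggingface-cache/DeepSeek-R1-Distill-Qwen-1.5B")` returns an empty list when the directory and all expected files are present. Run it on the node that hosts the hostPath directory.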
- Deploy the vLLM service and configure monitoring.
- Create a vLLM service configuration file.
Create a Deployment manifest (for example, vllm-deepseek.yaml) with the following content to deploy the DeepSeek-R1-Distill-Qwen-1.5B model with vLLM.
An open-source image is used. If the pull fails due to network problems, check your network or proxy settings.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: deepseek-r1-qwen
      namespace: default
      labels:
        app: deepseek-r1-qwen
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: deepseek-r1-qwen
      template:
        metadata:
          labels:
            app: deepseek-r1-qwen
        spec:
          volumes:
          - name: model-volume
            hostPath:
              path: /data/huggingface-cache/DeepSeek-R1-Distill-Qwen-1.5B
              type: Directory
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "2Gi"
          containers:
          - name: deepseek-r1-qwen
            image: vllm/vllm-openai:nightly
            command: ["/bin/sh", "-c"]
            args: [
              "vllm serve /models/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --max-model-len 4096"
            ]
            ports:
            - containerPort: 8000
              name: http
            resources:
              limits:
                cpu: "8"
                memory: 12G
                nvidia.com/gpu: "1"
              requests:
                cpu: "2"
                memory: 4G
                nvidia.com/gpu: "1"
            volumeMounts:
            - mountPath: /models/DeepSeek-R1-Distill-Qwen-1.5B
              name: model-volume
              readOnly: true
            - name: shm
              mountPath: /dev/shm
            livenessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 120
              periodSeconds: 10
              timeoutSeconds: 5
              failureThreshold: 3
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 120
              periodSeconds: 5
              timeoutSeconds: 5
              failureThreshold: 3
- Deploy and verify the service status.
    kubectl create -f vllm-deepseek.yaml
    kubectl get pods -w
An example of the returned result is shown below. Ensure that the pod status is Running.
    NAME                              READY   STATUS    RESTARTS   AGE
    deepseek-r1-qwen-84b95998-qtwgn   1/1     Running   0          78m
- Configure Prometheus monitoring (a PodMonitor).
Use the YAML file below to create a PodMonitor so that Prometheus can capture metrics of the vLLM service.
    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: vllm-deepseek-metrics
      namespace: default
      labels:
        app: deepseek-r1-qwen
    spec:
      selector:
        matchLabels:
          app: deepseek-r1-qwen
      podMetricsEndpoints:
      - port: http
        path: /metrics
        interval: 30s
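A PodMonitor selects pods by label and scrapes the container port whose name matches `port`, so a mismatch in either field leaves the target undiscovered. The sketch below illustrates that matching logic on plain dictionaries that mirror the two manifests in this section. The helper name `podmonitor_issues` is hypothetical, and the check is a simplification that covers only `matchLabels` and named ports.

```python
def podmonitor_issues(pod_labels, container_port_names, monitor_spec):
    """Return human-readable mismatches between a PodMonitor spec and a pod."""
    issues = []
    # Every label in selector.matchLabels must be present on the pod.
    for key, value in monitor_spec["selector"]["matchLabels"].items():
        if pod_labels.get(key) != value:
            issues.append(f"label {key}={value} not found on pod")
    # Each endpoint refers to a container port by name, not by number.
    for endpoint in monitor_spec["podMetricsEndpoints"]:
        if endpoint["port"] not in container_port_names:
            issues.append(f"port {endpoint['port']!r} is not a named container port")
    return issues


# Values copied from the Deployment and PodMonitor in this section.
pod_labels = {"app": "deepseek-r1-qwen"}
container_port_names = {"http"}
monitor_spec = {
    "selector": {"matchLabels": {"app": "deepseek-r1-qwen"}},
    "podMetricsEndpoints": [{"port": "http", "path": "/metrics", "interval": "30s"}],
}
print(podmonitor_issues(pod_labels, container_port_names, monitor_spec))
```

With the manifests as written, the list is empty. Renaming the container port without updating the PodMonitor would produce a port mismatch entry.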
- Check whether the LLM service is running properly.
- Expose the deployed vLLM service to a local port through port forwarding.
kubectl port-forward deployment/deepseek-r1-qwen 8000:8000
The following is an example of the returned result:
    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
Keep the terminal running. Subsequent requests will be sent through localhost:8000.
- Send a test request in a new terminal.
Open a new terminal window and run the following command to send a test request to the model:
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "/models/DeepSeek-R1-Distill-Qwen-1.5B",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "max_tokens": 50
      }'
- Observe the port output.
    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
    Handling connection for 8000
- Observe the model output.
In normal cases, you will receive a JSON response similar to the following:
    {
      "id": "chatcmpl-10321d932eaf4f8f820241c6308e****",
      "object": "chat.completion",
      "created": 1766490573,
      "model": "/models/DeepSeek-R1-Distill-Qwen-1.5B",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "Alright, the user greeted me with \"Hello, how are you?\" which is a common and friendly way to start a conversation. I should respond in a similar tone to keep the conversation going smoothly.\n\nI want to make sure I acknowledge their greeting and",
            "refusal": null,
            "annotations": null,
            "audio": null,
            "function_call": null,
            "tool_calls": [],
            "reasoning_content": null
          },
          "logprobs": null,
          "finish_reason": "length",
          "stop_reason": null,
          "token_ids": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 11,
        "total_tokens": 61,
        "completion_tokens": 50
      },
      "prompt_logprobs": null,
      "prompt_token_ids": null,
      "kv_transfer_params": null
    }
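The same test request can be sent from Python, which makes it easier to generate repeated traffic so that metrics accumulate. The following is a minimal sketch using only the standard library. It assumes the port forwarding above is still running; the helper names `build_chat_request`, `summarize_completion`, and `send_chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="/models/DeepSeek-R1-Distill-Qwen-1.5B", max_tokens=50):
    """Build the POST request that mirrors the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_completion(response):
    """Extract the fields of interest from a chat completion response dict."""
    choice = response["choices"][0]
    usage = response["usage"]
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
    }


def send_chat(prompt, timeout=60):
    """Send one request through the forwarded port and summarize the reply."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return summarize_completion(json.load(resp))
```

Calling `send_chat("Hello, how are you?")` returns a small dictionary. In the sample response above, `finish_reason` is `length` because the reply was cut off at the 50-token limit set by `max_tokens`.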
- Configure a collection policy, metric reporting, and query.
- Configure a collection policy.
- Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose Settings. In the right pane, click the Monitoring tab.
- Click Manage under PodMonitor Policies.
- On the PodMonitor Policies tab, enable the PodMonitor collection policy for the pod.

- Check whether the collection endpoint has been enabled.
- Log in to the AOM console. In the navigation pane, choose Metric Browsing. In the right pane, click the Metric Sources tab, select the corresponding Prometheus instance, and query the vLLM metrics.
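To know which metric names to search for in AOM, it can be useful to list what the pod exposes on /metrics. The sketch below parses Prometheus text exposition output and returns the metric names that start with a given prefix. It assumes the earlier port forwarding to localhost:8000 is still active and that vLLM metric names begin with `vllm:`. The helper names `vllm_metric_names` and `fetch_metrics` are hypothetical.

```python
import urllib.request


def vllm_metric_names(exposition_text, prefix="vllm:"):
    """Return the sorted, de-duplicated metric names that start with prefix."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is "name{labels} value" or "name value".
        name = line.split("{", 1)[0].split(None, 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


def fetch_metrics(url="http://localhost:8000/metrics", timeout=10):
    """Read the raw exposition text through the forwarded port (assumed URL)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `print("\n".join(vllm_metric_names(fetch_metrics())))` prints one name per line. Histogram metrics appear with `_bucket`, `_sum`, and `_count` suffixes.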
