Custom AOM Inference Metrics Dashboard
In this tutorial, you will build a custom monitoring dashboard for AOM inference services. The setup automatically discovers workloads in Kubernetes, collects inference performance metrics in Prometheus format (for example, QPS, inference latency, and resource utilization), and provides flexible visualization options, covering the key monitoring needs of AI inference scenarios.
Prerequisites
- A cluster of v1.34 or later has been created, the cluster contains GPU nodes, and GPU workloads are running in it.
- The CCE AI Suite (NVIDIA GPU) add-on and the Cloud Native Cluster Monitoring add-on have been installed.
When installing the Cloud Native Cluster Monitoring add-on, enable "Report Monitoring Data to AOM" so that Prometheus data is reported to the AOM service.
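Optionally, you can confirm that the GPU nodes expose schedulable nvidia.com/gpu resources before starting. This quick sanity check is not part of the original procedure:
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
Nodes with working GPUs should show a non-empty value (for example, 1) in the GPU column; <none> indicates the device plugin is not reporting GPUs on that node.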
Procedure
- Prepare the large language model locally.
This example requires Python 3.8 or later.
- Install the HuggingFace Python client.
pip install --trusted-host mirrors.tools.huawei.com \
  -i https://mirrors.tools.huawei.com/pypi/simple \
  -U huggingface_hub
- Set an environment variable to prevent download timeouts.
export HF_HUB_DOWNLOAD_TIMEOUT=10000
- Create the target directory.
mkdir -p /data/huggingface-cache/DeepSeek-R1-Distill-Qwen-1.5B
- Create the following Python file and run it to download the required model.
from huggingface_hub import HfApi

HF_ENDPOINT = 'http://mirrors.tools.huawei.com/huggingface'

api = HfApi(endpoint=HF_ENDPOINT)
api.snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    repo_type="model",
    revision="main",
    local_dir="/data/huggingface-cache/DeepSeek-R1-Distill-Qwen-1.5B",
    etag_timeout=10000
)
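For example, assuming you save the script as download_model.py (the file name is arbitrary), run:
python3 download_model.py
When the download completes, the model files are available under /data/huggingface-cache/DeepSeek-R1-Distill-Qwen-1.5B.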
- Deploy the vLLM service and configure monitoring.
- Create a vLLM workload using the following YAML.
- The workload must run on the node where the model has been downloaded (a sketch of one way to pin the Pod to that node appears after the status check below).
- The container image is an open-source image. If the image pull fails due to network or similar issues, check your network or proxy settings.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1-qwen
  namespace: default
  labels:
    app: deepseek-r1-qwen
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1-qwen
  template:
    metadata:
      labels:
        app: deepseek-r1-qwen
    spec:
      volumes:
        - name: model-volume
          hostPath:
            path: /data/huggingface-cache/DeepSeek-R1-Distill-Qwen-1.5B
            type: Directory
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"
      containers:
        - name: deepseek-r1-qwen
          image: vllm/vllm-openai:nightly
          command: ["/bin/sh", "-c"]
          args: ["vllm serve /models/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --max-model-len 4096"]
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              cpu: "8"
              memory: 12G
              nvidia.com/gpu: "1"
            requests:
              cpu: "2"
              memory: 4G
              nvidia.com/gpu: "1"
          volumeMounts:
            - mountPath: /models/DeepSeek-R1-Distill-Qwen-1.5B
              name: model-volume
              readOnly: true
            - name: shm
              mountPath: /dev/shm
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 5
            timeoutSeconds: 5
            failureThreshold: 3
- Observe the running status of the workload.
kubectl get pods -w
The following example output shows that the Pod is in the Running state.
NAME                              READY   STATUS    RESTARTS   AGE
deepseek-r1-qwen-84b95998-qtwgn   1/1     Running   0          78m
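If the Pod cannot start because the hostPath directory is missing, it was likely scheduled to a node that does not hold the downloaded model. A minimal way to pin the Pod, assuming the target node is named gpu-node-1 (a placeholder; substitute your own node name), is to add a nodeSelector under spec.template.spec of the Deployment:
      nodeSelector:
        kubernetes.io/hostname: gpu-node-1
Alternatively, download the model to the same path on every GPU node so that Pod placement does not matter.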
- Create a PodMonitor using the following YAML.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-deepseek-metrics
  namespace: default
  labels:
    app: deepseek-r1-qwen
spec:
  selector:
    matchLabels:
      app: deepseek-r1-qwen
  podMetricsEndpoints:
    - port: http
      path: /metrics
      interval: 30s
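To confirm that the endpoint the PodMonitor scrapes is actually serving data, you can query the /metrics path directly, using the same port-forwarding technique described in the next step. This is an optional check; the exact metrics listed depend on your vLLM version:
kubectl port-forward deployment/deepseek-r1-qwen 8000:8000
# In another terminal:
curl -s http://localhost:8000/metrics | head -n 20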
- Verify that the large language model service is running properly.
- Use port forwarding to expose the deployed vLLM service on a local port.
kubectl port-forward deployment/deepseek-r1-qwen 8000:8000
The following is an example of the output.
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
Keep this terminal running. Subsequent requests will be sent through localhost:8000.
- Send a test request from a new terminal.
Open a new terminal window and run the following `curl` command to send a test request to the model.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 50
  }'
- Observe the output in the port-forwarding terminal.
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
Handling connection for 8000
- Observe the model output.
{"id":"chatcmpl-10321d932eaf4f8f820241c630******","object":"chat.completion","created":1766490573,"model":"/models/DeepSeek-R1-Distill-Qwen-1.5B","choices":[{"index":0,"message":{"role":"assistant","content":"Alright, the user greeted me with \"Hello, how are you?\" which is a common and friendly way to start a conversation. I should respond in a similar tone to keep the conversation going smoothly.\n\nI want to make sure I acknowledge their greeting and","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":11,"total_tokens":61,"completion_tokens":50,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
- Configure metric collection, reporting, and query.
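As a starting point for dashboard panels, the following PromQL expressions query metrics from vLLM's built-in Prometheus exporter. The metric names used here (vllm:num_requests_running, vllm:generation_tokens_total, vllm:e2e_request_latency_seconds_bucket) match recent vLLM releases but can vary between versions, so verify them against your /metrics output before use:
# Requests currently being processed by the vLLM engine
vllm:num_requests_running
# Generation throughput in tokens per second, averaged over 5 minutes
rate(vllm:generation_tokens_total[5m])
# P95 end-to-end request latency over the last 5 minutes
histogram_quantile(0.95, sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le))
Once the data is reported to AOM, expressions like these can typically be used when adding charts to a dashboard bound to the Prometheus instance that receives the data.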


