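ModelServing with Mooncake Deployment Guide
This document describes best practices for deploying the DeepSeek-R1-Distill-Qwen-1.5B model in a 1P1D (1 Prefill + 1 Decode) disaggregated architecture on an Ascend A3 cluster (single supernode), using vLLM-ascend and the Kthena inference platform. The architecture physically isolates the Prefill and Decode stages and uses the Mooncake Connector to transfer the KV Cache between them, with the aim of improving resource utilization and inference performance.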
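Background
PD-disaggregated deployment (Mooncake + KV Cache) has the following advantages over traditional deployment:
The core components are described below.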
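| Component | Role |
|---|---|
| vLLM-ascend | The version of vLLM optimized for Ascend NPUs. |
| Kthena | Huawei Cloud's model serving orchestration platform, used to manage ModelServing instances in a unified way. |
| Mooncake Connector | KV Cache disaggregated transfer protocol that lets Prefill and Decode nodes share the KV Cache efficiently. |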
Flowchart

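Prerequisites
- The Volcano scheduler add-on, v1.20.15 or later, is installed, and Volcano is set as the default scheduler.
- Before deploying, confirm that the network between nodes is working properly. See Verification Process for how to verify this.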
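Constraints
This verification process depends on bare metal servers (BMS). On virtual machines (VMs), NPU network communication is not guaranteed to work, and you must resolve any related problems yourself.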
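Procedure
- Prepare the model.
  - Download the model locally yourself, or obtain it from the Huawei open-source mirror site, and place it in the /models/DeepSeek-R1-Distill-Qwen-1.5B directory on the node.
  - Extract the model to the target path.
        unzip <downloaded-model-file> -d /models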
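- Create the ConfigMap.
  - Create a config.yaml file that defines the Prefill and Decode startup scripts. Adjust the script parameters for the model you choose; see the official vLLM Ascend documentation for details.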
        kind: ConfigMap
        apiVersion: v1
        metadata:
          name: deepseek-pd-cm
        data:
          prefill.sh: |
            nic_name="enp23s0f3"  # network card name
            local_ip=$POD_IP
            export HCCL_IF_IP=$local_ip
            export GLOO_SOCKET_IFNAME=$nic_name
            export TP_SOCKET_IFNAME=$nic_name
            export HCCL_SOCKET_IFNAME=$nic_name
            export OMP_PROC_BIND=false
            export OMP_NUM_THREADS=10
            export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
            export HCCL_BUFFSIZE=256
            export TASK_QUEUE_ENABLE=1
            export HCCL_OP_EXPANSION_MODE="AIV"
            export VLLM_USE_V1=1
            export MOONCAKE_ENGINE_ID="${GROUP_NAME}_${ROLE_ID}"
            vllm serve $MODEL_LOCATION \
              --host $POD_IP \
              --port "7100" \
              --data-parallel-size 4 \
              --data-parallel-size-local 4 \
              --data-parallel-address $POD_IP \
              --data-parallel-rpc-port 12321 \
              --tensor-parallel-size 2 \
              --seed 1024 \
              --served-model-name ds_r1 \
              --max-model-len 40000 \
              --max-num-batched-tokens 16384 \
              --max-num-seqs 8 \
              --enforce-eager \
              --trust-remote-code \
              --gpu-memory-utilization 0.9 \
              --no-enable-prefix-caching \
              --additional-config '{"recompute_scheduler_enable":true}' \
              --kv-transfer-config \
              '{"kv_connector": "MooncakeConnectorV1",
                "kv_role": "kv_producer",
                "kv_port": "28000",
                "engine_id": "'"${MOONCAKE_ENGINE_ID}"'",
                "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
                "kv_connector_extra_config": {
                  "use_ascend_direct": true,
                  "prefill": { "dp_size": 4, "tp_size": 2 },
                  "decode": { "dp_size": 4, "tp_size": 2 }
                }
              }'
          decode.sh: |
            nic_name="enp23s0f3"  # network card name
            local_ip=$POD_IP
            export HCCL_IF_IP=$local_ip
            export GLOO_SOCKET_IFNAME=$nic_name
            export TP_SOCKET_IFNAME=$nic_name
            export HCCL_SOCKET_IFNAME=$nic_name
            export OMP_PROC_BIND=false
            export OMP_NUM_THREADS=10
            export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
            export HCCL_BUFFSIZE=600
            export TASK_QUEUE_ENABLE=1
            export HCCL_OP_EXPANSION_MODE="AIV"
            export VLLM_USE_V1=1
            export MOONCAKE_ENGINE_ID="${GROUP_NAME}_${ROLE_ID}"
            vllm serve $MODEL_LOCATION \
              --host $POD_IP \
              --port "7101" \
              --data-parallel-size 4 \
              --data-parallel-address $POD_IP \
              --data-parallel-rpc-port 12322 \
              --tensor-parallel-size 2 \
              --seed 1024 \
              --served-model-name ds_r1 \
              --max-model-len 40000 \
              --max-num-batched-tokens 256 \
              --max-num-seqs 40 \
              --trust-remote-code \
              --gpu-memory-utilization 0.94 \
              --no-enable-prefix-caching \
              --additional-config '{"recompute_scheduler_enable":true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":2}}' \
              --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
              --kv-transfer-config \
              '{"kv_connector": "MooncakeConnectorV1",
                "kv_role": "kv_consumer",
                "kv_port": "28100",
                "engine_id": "'"${MOONCAKE_ENGINE_ID}"'",
                "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
                "kv_connector_extra_config": {
                  "use_ascend_direct": true,
                  "prefill": { "dp_size": 4, "tp_size": 2 },
                  "decode": { "dp_size": 4, "tp_size": 2 }
                }
              }'
    The ConfigMap contains the following two key scripts:
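    - prefill.sh (startup script for the Prefill stage). Key settings:
      - Network interface (nic_name): enp23s0f3 (run ip route | grep default on the node to find it)
      - Service port (port): 7100
      - KV port (kv_port): 28000
      - Data parallel size (data-parallel-size): 4
      - Tensor parallel size (tensor-parallel-size): 2
    - decode.sh (startup script for the Decode stage). Key settings:
      - Network interface (nic_name): enp23s0f3
      - Service port (port): 7101
      - KV port (kv_port): 28100
      - Maximum concurrent sequences (max-num-seqs): 40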
  - Run the following command to create the ConfigMap.
        kubectl apply -f config.yaml
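Both scripts set --data-parallel-size 4 and --tensor-parallel-size 2, and the ModelServing manifest in the next step requests 8 huawei.com/ascend-1980 devices per Pod. The sketch below shows that arithmetic, assuming each data-parallel replica occupies tensor-parallel-size NPUs. The helper name required_npus is hypothetical and is not part of vLLM or Kthena.

```python
def required_npus(dp_size: int, tp_size: int) -> int:
    """NPUs needed by one role, assuming dp_size replicas of tp_size ranks each."""
    if dp_size < 1 or tp_size < 1:
        raise ValueError("dp_size and tp_size must be positive integers")
    return dp_size * tp_size


if __name__ == "__main__":
    # Values taken from prefill.sh and decode.sh in this guide.
    print(required_npus(dp_size=4, tp_size=2))  # prints 8
```

If you change either parallelism flag, it is worth re-running this check against the device count requested in the manifest and against the dp_size and tp_size values inside kv_connector_extra_config.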
- Deploy ModelServing.
  - Create a deepseek-serv.yaml file that defines the Prefill and Decode service instances.
        apiVersion: workload.serving.volcano.sh/v1alpha1
        kind: ModelServing
        metadata:
          name: deepseek-pd
          namespace: default
        spec:
          schedulerName: volcano
          replicas: 1
          recoveryPolicy: ServingGroupRecreate
          template:
            restartGracePeriodSeconds: 60
            roles:
              - name: prefill
                replicas: 1
                workerReplicas: 0
                entryTemplate:
                  spec:
                    hostNetwork: true
                    containers:
                      - name: prefill
                        image: quay.io/ascend/vllm-ascend:v0.13.0-a3
                        command:
                          - /bin/bash
                        args:
                          - '-c'
                          - cd /workspace && ./prefill.sh
                        env:
                          - name: ROLE
                            value: "prefill"
                          - name: GROUP_NAME
                            valueFrom:
                              fieldRef:
                                fieldPath: metadata.labels['modelserving.volcano.sh/group-name']
                          - name: ROLE_ID
                            valueFrom:
                              fieldRef:
                                fieldPath: metadata.labels['modelserving.volcano.sh/role-id']
                          - name: POD_IP
                            valueFrom:
                              fieldRef:
                                fieldPath: status.podIP
                          - name: NODE_IP
                            valueFrom:
                              fieldRef:
                                fieldPath: status.hostIP
                          - name: MODEL_LOCATION
                            value: /models/DeepSeek-R1-Distill-Qwen-1.5B
                          - name: TP_SIZE
                            value: "2"
                          - name: DP_SIZE
                            value: "4"
                        readinessProbe:
                          httpGet:
                            path: /health
                            port: 7100
                            scheme: HTTP
                          initialDelaySeconds: 60
                          periodSeconds: 10
                          timeoutSeconds: 2
                          failureThreshold: 3
                        resources:
                          limits:
                            cpu: '94'
                            huawei.com/ascend-1980: '8'
                            memory: 900Gi
                          requests:
                            cpu: '32'
                            huawei.com/ascend-1980: '8'
                            memory: 350Gi
                        ports:
                          - containerPort: 7100
                            name: server
                        volumeMounts:
                          - name: model
                            mountPath: /models
                          - name: dshm
                            mountPath: /dev/shm
                          - name: hccn-conf
                            mountPath: /etc/hccn.conf
                          - name: hccn-tool
                            mountPath: /usr/local/Ascend/driver/tools/hccn_tool
                          - name: ascend-install-info
                            mountPath: /etc/ascend_install.info
                          - name: config
                            mountPath: /workspace/prefill.sh
                            subPath: prefill.sh
                    volumes:
                      - name: model
                        hostPath:
                          path: /models
                          type: Directory
                      - name: dshm
                        emptyDir:
                          medium: Memory
                      - name: hccn-conf
                        hostPath:
                          path: /etc/hccn.conf
                      - name: hccn-tool
                        hostPath:
                          path: /usr/local/Ascend/driver/tools/hccn_tool
                      - name: ascend-install-info
                        hostPath:
                          path: /etc/ascend_install.info
                      - name: config
                        configMap:
                          name: deepseek-pd-cm
                          defaultMode: 0777
              - name: decode
                replicas: 1
                workerReplicas: 0
                entryTemplate:
                  spec:
                    hostNetwork: true
                    containers:
                      - name: decode
                        image: quay.io/ascend/vllm-ascend:v0.13.0-a3
                        command:
                          - /bin/bash
                        args:
                          - '-c'
                          - cd /workspace && ./decode.sh
                        env:
                          - name: ROLE
                            value: "decode"
                          - name: ENGINE_ID
                            valueFrom:
                              fieldRef:
                                fieldPath: metadata.name
                          - name: POD_IP
                            valueFrom:
                              fieldRef:
                                fieldPath: status.podIP
                          - name: NODE_IP
                            valueFrom:
                              fieldRef:
                                fieldPath: status.hostIP
                          - name: GROUP_NAME
                            valueFrom:
                              fieldRef:
                                fieldPath: metadata.labels['modelserving.volcano.sh/group-name']
                          - name: ROLE_ID
                            valueFrom:
                              fieldRef:
                                fieldPath: metadata.labels['modelserving.volcano.sh/role-id']
                          - name: MODEL_LOCATION
                            value: /models/DeepSeek-R1-Distill-Qwen-1.5B
                          - name: TP_SIZE
                            value: "2"
                          - name: DP_SIZE
                            value: "4"
                        readinessProbe:
                          httpGet:
                            path: /health
                            port: 7101
                            scheme: HTTP
                          initialDelaySeconds: 60
                          periodSeconds: 10
                          timeoutSeconds: 2
                          failureThreshold: 3
                        ports:
                          - containerPort: 7101
                            name: server
                        resources:
                          limits:
                            cpu: '94'
                            huawei.com/ascend-1980: '8'
                            memory: 900Gi
                          requests:
                            cpu: '32'
                            huawei.com/ascend-1980: '8'
                            memory: 350Gi
                        volumeMounts:
                          - name: model
                            mountPath: /models
                          - name: dshm
                            mountPath: /dev/shm
                          - name: hccn-conf
                            mountPath: /etc/hccn.conf
                          - name: hccn-tool
                            mountPath: /usr/local/Ascend/driver/tools/hccn_tool
                          - name: ascend-install-info
                            mountPath: /etc/ascend_install.info
                          - name: config
                            mountPath: /workspace/decode.sh
                            subPath: decode.sh
                    volumes:
                      - name: model
                        hostPath:
                          path: /models
                          type: Directory
                      - name: dshm
                        emptyDir:
                          medium: Memory
                      - name: hccn-conf
                        hostPath:
                          path: /etc/hccn.conf
                      - name: hccn-tool
                        hostPath:
                          path: /usr/local/Ascend/driver/tools/hccn_tool
                      - name: ascend-install-info
                        hostPath:
                          path: /etc/ascend_install.info
                      - name: config
                        configMap:
                          name: deepseek-pd-cm
                          defaultMode: 0777
  - Run the following command to deploy ModelServing.
        kubectl apply -f deepseek-serv.yaml
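    The key mount points are listed in the following table.
| Mount path | Description |
|---|---|
| /models | Model file directory. |
| /dev/shm | Shared memory, used for inter-process communication. |
| /etc/hccn.conf | NPU network configuration file. |
| /workspace/prefill.sh or /workspace/decode.sh | Startup script path. |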
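The readiness probes in the manifest poll GET /health on port 7100 (Prefill) and port 7101 (Decode). The same check can be run by hand once the Pods are up. The following is a minimal sketch using only the Python standard library; the helper names health_url and is_healthy are hypothetical, and the host value is whatever Pod IP your cluster reports.

```python
import urllib.error
import urllib.request


def health_url(host: str, port: int) -> str:
    """Build the URL that the readinessProbe in the manifest targets."""
    return f"http://{host}:{port}/health"


def is_healthy(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(health_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as not ready.
        return False
```

For example, is_healthy("192.168.0.25", 7100) checks the Prefill instance and is_healthy("192.168.0.25", 7101) checks the Decode instance, using the sample Pod IP from this guide.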
- Configure the load-balancing proxy.
  - Run the following command to download the proxy server script.
        wget https://raw.githubusercontent.com/vllm-project/vllm-ascend/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py
  - Run the following command to get the Pod IP addresses.
        kubectl get pods -owide
    Example output:
        NAME                        READY   STATUS    RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
        deepseek-pd-0-decode-0-0    1/1     Running   0          20h   192.168.0.25   192.168.0.25   <none>           <none>
        deepseek-pd-0-prefill-0-0   1/1     Running   0          20h   192.168.0.25   192.168.0.25   <none>           <none>
  - Start the proxy server. Change the ports and IP addresses to match your deployment environment.
        python3 load_balance_proxy_server_example.py \
          --port 8080 \
          --host 0.0.0.0 \
          --prefiller-hosts 192.168.0.25 \
          --prefiller-ports 7100 \
          --decoder-hosts 192.168.0.25 \
          --decoder-ports 7101
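The host and port values for the proxy come from the Pod list and from the two startup scripts. As a rough sketch, the command line can be assembled from the output of kubectl get pods -o json instead of being copied by hand. The function names below are hypothetical. The role is inferred from the Pod name (for example deepseek-pd-0-prefill-0-0), which is an assumption based on the sample output in this guide, and repeating a flag value once per host is only exercised here for the single-host 1P1D case.

```python
import json
import shlex


def pod_ips_by_role(pods_json: str) -> dict:
    """Group Pod IPs by role, inferring the role from the Pod name."""
    ips = {"prefill": [], "decode": []}
    for item in json.loads(pods_json).get("items", []):
        name = item["metadata"]["name"]
        ip = item.get("status", {}).get("podIP")
        for role in ips:
            if ip and f"-{role}-" in name:
                ips[role].append(ip)
    return ips


def proxy_command(ips: dict, prefill_port: int = 7100,
                  decode_port: int = 7101, proxy_port: int = 8080) -> str:
    """Assemble the proxy start command used in this guide as a shell string."""
    prefill, decode = ips["prefill"], ips["decode"]
    if not prefill or not decode:
        raise ValueError("need at least one prefill and one decode Pod IP")
    args = [
        "python3", "load_balance_proxy_server_example.py",
        "--port", str(proxy_port),
        "--host", "0.0.0.0",
        "--prefiller-hosts", *prefill,
        "--prefiller-ports", *([str(prefill_port)] * len(prefill)),
        "--decoder-hosts", *decode,
        "--decoder-ports", *([str(decode_port)] * len(decode)),
    ]
    return shlex.join(args)
```

A typical use is to save the output of kubectl get pods -o json to a file, pass its contents to pod_ips_by_role, and print the result of proxy_command for review before running it.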
- Verify and test. Send requests to the IP address and port of the proxy server, adjusting both to match your environment.
  - Send a test request.
        curl -X POST http://192.168.0.25:8080/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
            "model": "ds_r1",
            "messages": [
              { "role": "user", "content": "Hello, how are you?" }
            ],
            "max_tokens": 100
          }'
    A response similar to the following is returned.
        {
          "id": "chatcmpl-53cf0580-0e68-4623-80aa-1cf0fd923034",
          "object": "chat.completion",
          "created": 1776425897,
          "model": "ds_r1",
          "choices": [
            {
              "index": 0,
              "message": {
                "role": "assistant",
                "content": "Okay, so I just received a message from someone asking, \"Hello, how are you?\" I need to respond appropriately. Let me think about the best way to handle this.\n\nFirst, I should consider the context. The user is greeting me, which is friendly. They're probably new or just reaching out for the first time. I should keep it warm and open-ended to encourage them to share more.\n\nI should acknowledge their greeting and express my greeting in a friendly manner. Maybe something like,",
                "refusal": null,
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": [],
                "reasoning": null,
                "reasoning_content": null
              },
              "logprobs": null,
              "finish_reason": "length",
              "stop_reason": null,
              "token_ids": null
            }
          ],
          "service_tier": null,
          "system_fingerprint": null,
          "usage": {
            "prompt_tokens": 11,
            "total_tokens": 111,
            "completion_tokens": 100,
            "prompt_tokens_details": null
          },
          "prompt_logprobs": null,
          "prompt_token_ids": null,
          "kv_transfer_params": null
        }
  - View the log output.
    - Proxy logs
          INFO: Started server process [417174]
          INFO: Waiting for application startup.
          Initialized 1 prefill clients and 1 decode clients.
          INFO: Application startup complete.
          INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
          INFO: 192.168.0.25:53050 - "POST /v1/completions HTTP/1.1" 200 OK
          INFO: 192.168.0.25:43946 - "POST /v1/chat/completions HTTP/1.1" 200 OK
          INFO: 192.168.0.25:56470 - "POST /v1/completions HTTP/1.1" 200 OK
          INFO: 192.168.0.25:36910 - "POST /v1/chat/completions HTTP/1.1" 200 OK
          INFO: 192.168.0.25:51070 - "POST /v1/chat/completions HTTP/1.1" 200 OK
      An HTTP status of "200 OK" in the output indicates that the request was processed successfully.
    - Prefill Pod logs
          kubectl logs deepseek-pd-0-prefill-0-0
      The key output is shown below. Check for log lines that mention Engine 000 and Delaying free. If both are present, the prefill computation is working, and the KV Cache has been generated and is ready for transfer.
          (EngineCore_DP0 pid=142) INFO 04-17 11:38:17 [mooncake_connector.py:1062] Delaying free of 1 blocks for request chatcmpl-53cf0580-0e68-4623-80aa-1cf0fd923034
          (APIServer pid=7) INFO: 192.168.0.25:52240 - "POST /v1/chat/completions HTTP/1.1" 200 OK
          (APIServer pid=7) INFO: 192.168.1.191:34968 - "GET /metrics HTTP/1.1" 200 OK
          (APIServer pid=7) INFO 04-17 11:38:24 [loggers.py:248] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 0.0%
    - Decode Pod logs
          kubectl logs deepseek-pd-0-decode-0-0
      The key output is shown below. Check Avg generation throughput; a value greater than 0 indicates that the model is running normally.
          (APIServer pid=8) INFO: 192.168.1.191:46710 - "GET /metrics HTTP/1.1" 200 OK
          I0417 11:38:17.681255 1302 ascend_direct_transport.cpp:605] Transfer to:192.168.0.25:20294, cost: 5289 us
          (Worker_DP0_TP0 pid=304) INFO 04-17 11:38:17 [mooncake_connector.py:561] KV cache transfer for request chatcmpl-53cf0580-0e68-4623-80aa-1cf0fd923034 took 5.85 ms (1 groups, 1 blocks). local_ip 192.168.0.25 local_device_id 0 remote_session_id 192.168.0.25:15910
          I0417 11:38:17.700887 1272 ascend_direct_transport.cpp:605] Transfer to:192.168.0.25:21685, cost: 26734 us
          (Worker_DP0_TP1 pid=307) INFO 04-17 11:38:17 [mooncake_connector.py:561] KV cache transfer for request chatcmpl-53cf0580-0e68-4623-80aa-1cf0fd923034 took 27.27 ms (1 groups, 1 blocks). local_ip 192.168.0.25 local_device_id 1 remote_session_id 192.168.0.25:16045
          (APIServer pid=8) INFO: 192.168.0.25:47876 - "POST /v1/chat/completions HTTP/1.1" 200 OK
          (APIServer pid=8) INFO: 192.168.0.25:47878 - "GET /health HTTP/1.1" 200 OK
          (APIServer pid=8) INFO 04-17 11:38:25 [loggers.py:248] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 10.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 100.0%
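The two log checks described above (Engine 000 plus Delaying free on the Prefill side, and a non-zero Avg generation throughput on the Decode side) can be scripted against saved kubectl logs output. This is a minimal sketch that matches only the log wording shown in this guide; the function names are hypothetical, and the log format may differ in other vLLM-ascend versions.

```python
import re

# Matches the throughput field in lines such as:
# "Engine 000: ... Avg generation throughput: 10.0 tokens/s, ..."
THROUGHPUT_RE = re.compile(r"Avg generation throughput: ([0-9]+(?:\.[0-9]+)?) tokens/s")


def prefill_looks_ok(log_text: str) -> bool:
    """True if the Prefill log mentions both Engine 000 and Delaying free."""
    return "Engine 000" in log_text and "Delaying free" in log_text


def max_generation_throughput(log_text: str) -> float:
    """Highest Avg generation throughput reported in the log, or 0.0 if none."""
    values = [float(v) for v in THROUGHPUT_RE.findall(log_text)]
    return max(values, default=0.0)


def decode_looks_ok(log_text: str) -> bool:
    """True if the Decode log reports a generation throughput greater than 0."""
    return max_generation_throughput(log_text) > 0
```

One way to use it is to redirect kubectl logs for each Pod into a file, read the file as text, and pass it to prefill_looks_ok or decode_looks_ok.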
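FAQ
- Error: AttributeError: 'Qwen2Config' object has no attribute 'head_dim'
  An error such as AttributeError: 'Qwen2Config' object has no attribute 'head_dim' means that the model configuration lacks the head_dim field. This parameter defines the dimension of each attention head and is a key setting for model inference.
  Manually edit the config.json file in the model directory and add the field "head_dim": 128. For a model of a different size, compute the value from that model's actual parameters using the following formula:
  head_dim = hidden_size / num_attention_heads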
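The manual edit can also be scripted. The sketch below reads config.json, computes head_dim from hidden_size and num_attention_heads with the formula above, and writes the field back only if it is missing. The function name ensure_head_dim is hypothetical, and the sketch assumes both source fields are present in the file and divide evenly.

```python
import json
from pathlib import Path


def ensure_head_dim(config_path) -> int:
    """Add head_dim to a model config.json if absent and return its value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if "head_dim" not in config:
        hidden = config["hidden_size"]
        heads = config["num_attention_heads"]
        if hidden % heads != 0:
            raise ValueError("hidden_size is not divisible by num_attention_heads")
        # head_dim = hidden_size / num_attention_heads
        config["head_dim"] = hidden // heads
        path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config["head_dim"]
```

For this guide the call would be ensure_head_dim("/models/DeepSeek-R1-Distill-Qwen-1.5B/config.json"). Back up the original file first, because the script rewrites it in place.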