
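Updated: 2026-07-10 GMT+08:00

Deployment Guide for ModelServing with Mooncake

This document describes a best practice for deploying the DeepSeek-R1-Distill-Qwen-1.5B model in a 1P1D (1 Prefill + 1 Decode) disaggregated architecture on an A3 Ascend cluster (single supernode), using vLLM-ascend and the Kthena inference platform. The architecture physically isolates the Prefill and Decode stages and transfers the KV Cache between them through the Mooncake Connector, which improves resource utilization and inference performance.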

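Background

Compared with a traditional deployment, PD-disaggregated deployment (Mooncake + KV Cache) has the following advantages:

Physical isolation: Prefill and Decode run as separate instances (independent Pods with dedicated NPUs) and do not interfere with each other.
Reduced latency jitter: processing a long prompt no longer blocks token generation for other requests.
High concurrency: the KV Cache is transferred separately between the two stages, which allows resources to be scheduled efficiently.

The core components are described below.

Component	Purpose
vLLM-ascend	A version of vLLM optimized for Ascend NPUs.
Kthena	Huawei Cloud model serving orchestration platform, used to manage ModelServing instances in a unified way.
Mooncake Connector	KV Cache disaggregated transfer protocol that shares the KV Cache efficiently between Prefill and Decode instances.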

[Figure: deployment flowchart]

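Prerequisites

The Volcano scheduler add-on of v1.20.15 or later has been installed, and Volcano is set as the default scheduler.
Before deployment, confirm that the network between nodes is working properly. You can verify this by following the Verification Process.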
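The version requirement above can be checked with a simple comparison. The following is a minimal sketch, assuming the add-on version is a dotted numeric string with an optional leading "v"; the function names are hypothetical and are not part of any CCE or Volcano API.

```python
def parse_version(text: str) -> tuple:
    """Turn 'v1.20.15' or '1.20.15' into (1, 20, 15).

    Assumption: the version contains only dot-separated integers.
    """
    return tuple(int(part) for part in text.strip().lstrip("vV").split("."))


def meets_minimum(installed: str, minimum: str = "v1.20.15") -> bool:
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("v1.20.15"))  # True
print(meets_minimum("v1.9.30"))   # False
```

Comparing integer tuples avoids the pitfall of comparing the strings directly, where "1.9.30" would sort after "1.20.15".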

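Constraints

This verification procedure depends on bare metal servers (BMS). Running it on virtual machines (VMs) cannot guarantee normal NPU network communication, and you need to resolve any related issues yourself.

Procedure

Prepare the model.
1. Download the model locally or obtain it from the Huawei open-source mirror site, and place it in the /models/DeepSeek-R1-Distill-Qwen-1.5B directory on the node.
2. Decompress the model to the specified path.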
```
unzip <downloaded model file> -d /models
```
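After decompression, it can be worth confirming that the model directory is usable before any Pod mounts it. The following is a minimal sketch, assuming a Hugging Face style checkpoint layout (a config.json plus *.safetensors or *.bin weight files); that layout and the function name are assumptions of this example.

```python
from pathlib import Path


def check_model_dir(model_dir: str) -> list:
    """Return a list of problems found; an empty list means the directory looks usable.

    Assumption: the checkpoint uses config.json and *.safetensors or *.bin weights.
    """
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    if not has_weights:
        problems.append("no *.safetensors or *.bin weight files found")
    return problems
```

For example, `check_model_dir("/models/DeepSeek-R1-Distill-Qwen-1.5B")` should return an empty list on a correctly prepared node.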

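Create a ConfigMap.

Create a config.yaml file that defines the Prefill and Decode startup scripts. Adjust the startup script parameters for the model you use. For details, see the official vLLM Ascend documentation.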

kind: ConfigMap
apiVersion: v1
metadata:
  name: deepseek-pd-cm
data:
  prefill.sh: |
    nic_name="enp23s0f3"  # network card name
    local_ip=$POD_IP
    export HCCL_IF_IP=$local_ip
    export GLOO_SOCKET_IFNAME=$nic_name
    export TP_SOCKET_IFNAME=$nic_name
    export HCCL_SOCKET_IFNAME=$nic_name
    export OMP_PROC_BIND=false
    export OMP_NUM_THREADS=10
    export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
    export HCCL_BUFFSIZE=256
    export TASK_QUEUE_ENABLE=1
    export HCCL_OP_EXPANSION_MODE="AIV"
    export VLLM_USE_V1=1
    export MOONCAKE_ENGINE_ID="${GROUP_NAME}_${ROLE_ID}"
    vllm serve $MODEL_LOCATION \
      --host $POD_IP \
      --port "7100" \
      --data-parallel-size 4 \
      --data-parallel-size-local 4 \
      --data-parallel-address $POD_IP \
      --data-parallel-rpc-port 12321 \
      --tensor-parallel-size 2 \
      --seed 1024 \
      --served-model-name ds_r1 \
      --max-model-len 40000 \
      --max-num-batched-tokens 16384 \
      --max-num-seqs 8 \
      --enforce-eager \
      --trust-remote-code \
      --gpu-memory-utilization 0.9  \
      --no-enable-prefix-caching \
      --additional-config '{"recompute_scheduler_enable":true}' \
      --kv-transfer-config \
      '{"kv_connector": "MooncakeConnectorV1",
      "kv_role": "kv_producer",
      "kv_port": "28000",
      "engine_id": "'"${MOONCAKE_ENGINE_ID}"'",
      "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
      "kv_connector_extra_config": {
                "use_ascend_direct": true,
                "prefill": {
                        "dp_size": 4,
                        "tp_size": 2
                },
                "decode": {
                        "dp_size": 4,
                        "tp_size": 2
                }
          }
      }'
  decode.sh: |
    nic_name="enp23s0f3"  # network card name
    local_ip=$POD_IP
    export HCCL_IF_IP=$local_ip
    export GLOO_SOCKET_IFNAME=$nic_name
    export TP_SOCKET_IFNAME=$nic_name
    export HCCL_SOCKET_IFNAME=$nic_name
    export OMP_PROC_BIND=false
    export OMP_NUM_THREADS=10
    export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
    export HCCL_BUFFSIZE=600
    export TASK_QUEUE_ENABLE=1
    export HCCL_OP_EXPANSION_MODE="AIV"
    export VLLM_USE_V1=1
    export MOONCAKE_ENGINE_ID="${GROUP_NAME}_${ROLE_ID}"
    vllm serve $MODEL_LOCATION \
      --host $POD_IP \
      --port "7101" \
      --data-parallel-size 4 \
      --data-parallel-address $POD_IP \
      --data-parallel-rpc-port 12322 \
      --tensor-parallel-size 2 \
      --seed 1024 \
      --served-model-name ds_r1 \
      --max-model-len 40000 \
      --max-num-batched-tokens 256 \
      --max-num-seqs 40 \
      --trust-remote-code \
      --gpu-memory-utilization 0.94  \
      --no-enable-prefix-caching \
      --additional-config '{"recompute_scheduler_enable":true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":2}}' \
      --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
      --kv-transfer-config \
      '{"kv_connector": "MooncakeConnectorV1",
      "kv_role": "kv_consumer",
      "kv_port": "28100",
      "engine_id": "'"${MOONCAKE_ENGINE_ID}"'",
      "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
      "kv_connector_extra_config": {
                "use_ascend_direct": true,
                "prefill": {
                        "dp_size": 4,
                        "tp_size": 2
                },
                "decode": {
                        "dp_size": 4,
                        "tp_size": 2
                }
          }
      }'

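The ConfigMap contains the following two key scripts:

prefill.sh (startup script for the Prefill stage)

# Core settings
Network interface (nic_name): enp23s0f3 (run the ip route | grep default command on the node to obtain it)
Service port (port): 7100
KV port (kv_port): 28000
Data parallel size (data-parallel-size): 4
Tensor parallel size (tensor-parallel-size): 2

decode.sh (startup script for the Decode stage)

# Core settings
Network interface (nic_name): enp23s0f3
Service port (port): 7101
KV port (kv_port): 28100
Maximum number of concurrent sequences (max-num-seqs): 40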
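In both scripts, the value passed to --kv-transfer-config is a JSON string into which the shell splices the engine ID. The sketch below builds the same structure in Python and serializes it, which makes quoting mistakes easier to spot. The field names and values are copied from the scripts above; the helper name and the sample engine ID are hypothetical.

```python
import json


def kv_transfer_config(role: str, kv_port: str, engine_id: str,
                       dp_size: int = 4, tp_size: int = 2) -> str:
    """Build the JSON string passed to --kv-transfer-config for one role."""
    # This guide uses kv_producer for Prefill and kv_consumer for Decode.
    if role not in ("kv_producer", "kv_consumer"):
        raise ValueError("this guide uses kv_producer (Prefill) or kv_consumer (Decode)")
    parallel = {"dp_size": dp_size, "tp_size": tp_size}
    config = {
        "kv_connector": "MooncakeConnectorV1",
        "kv_role": role,
        "kv_port": kv_port,
        "engine_id": engine_id,
        "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
        "kv_connector_extra_config": {
            "use_ascend_direct": True,
            "prefill": dict(parallel),
            "decode": dict(parallel),
        },
    }
    return json.dumps(config)


# The engine ID below is a made-up example of "${GROUP_NAME}_${ROLE_ID}".
print(kv_transfer_config("kv_producer", "28000", "example-group_example-role"))
```

Both scripts declare the same prefill and decode parallel sizes, so the two sides agree on how the KV Cache is partitioned.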

Run the following command to create the ConfigMap.
```
kubectl apply -f config.yaml
```

Deploy ModelServing.

Create a deepseek-serv.yaml file that defines the Prefill and Decode service instances.

apiVersion: workload.serving.volcano.sh/v1alpha1
kind: ModelServing
metadata:
  name: deepseek-pd
  namespace: default
spec:
  schedulerName: volcano
  replicas: 1
  recoveryPolicy: ServingGroupRecreate
  template:
    restartGracePeriodSeconds: 60
    roles:
    - name: prefill
      replicas: 1
      workerReplicas: 0
      entryTemplate:
        spec:
          hostNetwork: true
          containers:
          - name: prefill
            image: quay.io/ascend/vllm-ascend:v0.13.0-a3
            command:
              - /bin/bash
            args:
              - '-c'
              - cd /workspace && ./prefill.sh
            env:
            - name: ROLE
              value: "prefill"
            - name: GROUP_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['modelserving.volcano.sh/group-name']
            - name: ROLE_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['modelserving.volcano.sh/role-id']
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: NODE_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: MODEL_LOCATION
              value: /models/DeepSeek-R1-Distill-Qwen-1.5B
            - name: TP_SIZE
              value: "2"
            - name: DP_SIZE
              value: "4"
            readinessProbe:
              httpGet:
                path: /health
                port: 7100
                scheme: HTTP
              initialDelaySeconds: 60
              periodSeconds: 10
              timeoutSeconds: 2
              failureThreshold: 3
            resources:
              limits:
                cpu: '94'
                huawei.com/ascend-1980: '8'
                memory: 900Gi
              requests:
                cpu: '32'
                huawei.com/ascend-1980: '8'
                memory: 350Gi
            ports:
            - containerPort: 7100
              name: server
            volumeMounts:
            - name: model
              mountPath: /models
            - name: dshm
              mountPath: /dev/shm
            - name: hccn-conf
              mountPath: /etc/hccn.conf
            - name: hccn-tool
              mountPath: /usr/local/Ascend/driver/tools/hccn_tool
            - name: ascend-install-info
              mountPath: /etc/ascend_install.info
            - name: config
              mountPath: /workspace/prefill.sh
              subPath: prefill.sh
          volumes:
          - name: model
            hostPath:
              path: /models
              type: Directory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: hccn-conf
            hostPath:
              path: /etc/hccn.conf
          - name: hccn-tool
            hostPath:
              path: /usr/local/Ascend/driver/tools/hccn_tool
          - name: ascend-install-info
            hostPath:
              path: /etc/ascend_install.info
          - name: config
            configMap:
              name: deepseek-pd-cm
              defaultMode: 0777
    - name: decode
      replicas: 1
      workerReplicas: 0
      entryTemplate:
        spec:
          hostNetwork: true
          containers:
          - name: decode
            image: quay.io/ascend/vllm-ascend:v0.13.0-a3
            command:
              - /bin/bash
            args:
              - '-c'
              - cd /workspace && ./decode.sh
            env:
            - name: ROLE
              value: "decode"
            - name: ENGINE_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: NODE_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: GROUP_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['modelserving.volcano.sh/group-name']
            - name: ROLE_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['modelserving.volcano.sh/role-id']
            - name: MODEL_LOCATION
              value: /models/DeepSeek-R1-Distill-Qwen-1.5B
            - name: TP_SIZE
              value: "2"
            - name: DP_SIZE
              value: "4"
            readinessProbe:
              httpGet:
                path: /health
                port: 7101
                scheme: HTTP
              initialDelaySeconds: 60
              periodSeconds: 10
              timeoutSeconds: 2
              failureThreshold: 3
            ports:
            - containerPort: 7101
              name: server
            resources:
              limits:
                cpu: '94'
                huawei.com/ascend-1980: '8'
                memory: 900Gi
              requests:
                cpu: '32'
                huawei.com/ascend-1980: '8'
                memory: 350Gi
            volumeMounts:
            - name: model
              mountPath: /models
            - name: dshm
              mountPath: /dev/shm
            - name: hccn-conf
              mountPath: /etc/hccn.conf
            - name: hccn-tool
              mountPath: /usr/local/Ascend/driver/tools/hccn_tool
            - name: ascend-install-info
              mountPath: /etc/ascend_install.info
            - name: config
              mountPath: /workspace/decode.sh
              subPath: decode.sh
          volumes:
          - name: model
            hostPath:
              path: /models
              type: Directory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: hccn-conf
            hostPath:
              path: /etc/hccn.conf
          - name: hccn-tool
            hostPath:
              path: /usr/local/Ascend/driver/tools/hccn_tool
          - name: ascend-install-info
            hostPath:
              path: /etc/ascend_install.info
          - name: config
            configMap:
              name: deepseek-pd-cm
              defaultMode: 0777

Run the following command to deploy ModelServing.

kubectl apply -f deepseek-serv.yaml
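Each role in deepseek-serv.yaml requests 8 NPUs (huawei.com/ascend-1980: '8'), which lines up with the parallel layout in the startup scripts: data parallel size 4 multiplied by tensor parallel size 2. The sketch below shows that arithmetic, assuming each parallel rank occupies exactly one requested NPU; that mapping is an assumption of this example, and the function names are hypothetical.

```python
def required_npus(dp_size: int, tp_size: int) -> int:
    """One rank per (data-parallel, tensor-parallel) pair.

    Assumption: each rank uses exactly one requested NPU.
    """
    return dp_size * tp_size


def check_npu_request(dp_size: int, tp_size: int, requested: int) -> None:
    """Raise ValueError if the resource request does not match the parallel layout."""
    needed = required_npus(dp_size, tp_size)
    if needed != requested:
        raise ValueError(
            f"DP {dp_size} x TP {tp_size} needs {needed} NPUs, but {requested} are requested"
        )


# Values from this guide: DP_SIZE=4, TP_SIZE=2, 8 NPUs per role. Passes silently.
check_npu_request(dp_size=4, tp_size=2, requested=8)
```

If you change the parallel sizes for a different model, adjust the resource requests and limits of the corresponding role accordingly.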

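The key mount points are listed in the following table.

Mount path	Description
/models	Model file directory.
/dev/shm	Shared memory, used for inter-process communication.
/etc/hccn.conf	NPU network configuration file.
/workspace/prefill.sh or /workspace/decode.sh	Startup script path.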

Configure the load balancing proxy.

Run the following command to download the proxy server script.

wget https://raw.githubusercontent.com/vllm-project/vllm-ascend/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py

Run the following command to obtain the Pod IP addresses.

kubectl get pods -owide

Example output:

NAME                        READY   STATUS    RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
deepseek-pd-0-decode-0-0    1/1     Running   0          20h   192.168.0.25   192.168.0.25   <none>           <none>
deepseek-pd-0-prefill-0-0   1/1     Running   0          20h   192.168.0.25   192.168.0.25   <none>           <none>
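The IP addresses needed by the proxy can be picked out of this table programmatically. The following is a minimal sketch, assuming the Pod name contains "-prefill-" or "-decode-" as in the sample output and that no column value before IP contains spaces (a RESTARTS value such as "1 (5h ago)" would break the simple split). The function name is hypothetical.

```python
SAMPLE = """\
NAME                        READY   STATUS    RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
deepseek-pd-0-decode-0-0    1/1     Running   0          20h   192.168.0.25   192.168.0.25   <none>           <none>
deepseek-pd-0-prefill-0-0   1/1     Running   0          20h   192.168.0.25   192.168.0.25   <none>           <none>
"""


def pod_ips_by_role(table: str) -> dict:
    """Map 'prefill' and 'decode' to the IPs of Running Pods in `kubectl get pods -owide` text."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    status_col, ip_col = header.index("STATUS"), header.index("IP")
    ips = {"prefill": [], "decode": []}
    for line in lines[1:]:
        cols = line.split()
        if cols[status_col] != "Running":
            continue  # skip Pods that are not ready to serve
        for role in ips:
            # Assumption: the role name is embedded in the Pod name.
            if f"-{role}-" in cols[0]:
                ips[role].append(cols[ip_col])
    return ips


print(pod_ips_by_role(SAMPLE))
```

The resulting lists correspond to the values passed to --prefiller-hosts and --decoder-hosts when starting the proxy.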

Start the proxy server. Modify the ports and IP addresses to match your deployment environment.

python3 load_balance_proxy_server_example.py \
  --port 8080 \
  --host 0.0.0.0 \
  --prefiller-hosts 192.168.0.25 \
  --prefiller-ports 7100 \
  --decoder-hosts 192.168.0.25 \
  --decoder-ports 7101
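The proxy can only forward requests successfully once both vLLM services respond on /health, the endpoint that the readinessProbe in deepseek-serv.yaml also uses. The following is a minimal sketch of such a check using only the Python standard library; the function names and the `opener` parameter (a hook for substituting the HTTP call) are illustrative.

```python
import urllib.request


def health_url(host: str, port: int) -> str:
    """URL of the health endpoint used by the readinessProbe."""
    return f"http://{host}:{port}/health"


def is_healthy(host: str, port: int, timeout: float = 2.0,
               opener=urllib.request.urlopen) -> bool:
    """Return True if GET /health answers with HTTP 200."""
    try:
        with opener(health_url(host, port), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers URLError, HTTPError, timeouts, and refused connections.
        return False
```

For the addresses used in this guide, `is_healthy("192.168.0.25", 7100)` checks the Prefill service and `is_healthy("192.168.0.25", 7101)` checks the Decode service.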

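Verify and test. Send requests to the IP address and port of the proxy server. Replace the IP address and port with your actual values.

Send a test request.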

curl -X POST http://192.168.0.25:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ds_r1",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you?"
      }
    ],
    "max_tokens": 100
  }'
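The same request can be sent from Python, which is convenient for scripted checks. The following is a minimal sketch using only the standard library. The endpoint, model name, and payload mirror the curl command above; the function names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(base_url: str, model: str, prompt: str,
              max_tokens: int = 100, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_chat_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `send_chat("http://192.168.0.25:8080", "ds_r1", "Hello, how are you?")` issues the same request as the curl command, with the address adjusted to your proxy.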

Information similar to the following is returned.

{
  "id": "chatcmpl-53cf0580-0e68-4623-80aa-1cf0fd923034",
  "object": "chat.completion",
  "created": 1776425897,
  "model": "ds_r1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Okay, so I just received a message from someone asking, \"Hello, how are you?\" I need to respond appropriately. Let me think about the best way to handle this.\n\nFirst, I should consider the context. The user is greeting me, which is friendly. They're probably new or just reaching out for the first time. I should keep it warm and open-ended to encourage them to share more.\n\nI should acknowledge their greeting and express my greeting in a friendly manner. Maybe something like,",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null,
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 11,
    "total_tokens": 111,
    "completion_tokens": 100,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
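Two fields in this response are worth checking. A finish_reason of "length" means generation stopped because it reached max_tokens (100 in the request), and the usage counters should add up (11 + 100 = 111). The following is a minimal sketch of those checks; the function name is hypothetical, and the sample dictionary is a trimmed copy of the response above.

```python
def summarize(response: dict) -> dict:
    """Extract the generated text and two sanity checks from a chat completion response."""
    choice = response["choices"][0]
    usage = response["usage"]
    return {
        "text": choice["message"]["content"],
        # "length" means the output was cut off at max_tokens.
        "truncated": choice["finish_reason"] == "length",
        "usage_consistent": (
            usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
        ),
    }


# Trimmed copy of the sample response (content shortened for illustration).
sample = {
    "choices": [
        {"message": {"content": "Okay, so I just received a message..."},
         "finish_reason": "length"}
    ],
    "usage": {"prompt_tokens": 11, "total_tokens": 111, "completion_tokens": 100},
}
print(summarize(sample)["truncated"])  # True
```

If the truncated flag is set and you need the full answer, increase max_tokens in the request.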

View the log output.

Proxy logs

INFO:     Started server process [417174]
INFO:     Waiting for application startup.
Initialized 1 prefill clients and 1 decode clients.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO:     192.168.0.25:53050 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     192.168.0.25:43946 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     192.168.0.25:56470 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     192.168.0.25:36910 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     192.168.0.25:51070 - "POST /v1/chat/completions HTTP/1.1" 200 OK

If the HTTP status code in the output is 200 OK, the request was processed successfully.
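When the proxy has handled many requests, scanning the access log by eye becomes impractical. The following is a minimal sketch that counts requests by path and status code, assuming the access-log line format shown above; the names are hypothetical.

```python
import re
from collections import Counter

# Matches the quoted request line and the status code that follows it.
ACCESS_LINE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+" (\d{3})')


def count_by_path_and_status(log_text: str) -> Counter:
    """Count access-log entries keyed by (path, status code)."""
    return Counter(ACCESS_LINE.findall(log_text))


SAMPLE = '''\
INFO:     192.168.0.25:53050 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     192.168.0.25:43946 - "POST /v1/chat/completions HTTP/1.1" 200 OK
'''

counts = count_by_path_and_status(SAMPLE)
# Any entry whose status is not 200 deserves a closer look.
failures = {key: n for key, n in counts.items() if key[1] != "200"}
print(counts)
print(failures)
```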

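Prefill Pod logs

Run the following command to view the Prefill Pod logs.

kubectl logs deepseek-pd-0-prefill-0-0

The key output is shown below. Check for log lines that contain Engine 000 and Delaying free. If both are present, the prefill computation is working and the KV Cache has been generated and is ready for transfer.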

(EngineCore_DP0 pid=142) INFO 04-17 11:38:17 [mooncake_connector.py:1062] Delaying free of 1 blocks for request chatcmpl-53cf0580-0e68-4623-80aa-1cf0fd923034
(APIServer pid=7) INFO:     192.168.0.25:52240 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=7) INFO:     192.168.1.191:34968 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=7) INFO 04-17 11:38:24 [loggers.py:248] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 0.0%
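The check for the two markers can be automated. The following is a minimal sketch that reports which expected markers are absent from the Prefill log text. The marker strings come from the log lines above; the names are hypothetical.

```python
# Strings expected somewhere in a healthy Prefill log, taken from the sample lines.
PREFILL_MARKERS = ("Delaying free", "Engine 000")


def missing_prefill_markers(log_text: str) -> list:
    """Return the expected markers that do not appear anywhere in the log text."""
    return [marker for marker in PREFILL_MARKERS if marker not in log_text]
```

An empty result for the output of `kubectl logs deepseek-pd-0-prefill-0-0` suggests that prefill ran and handed its KV Cache blocks over for transfer.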

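Decode Pod logs

Run the following command to view the Decode Pod logs.

kubectl logs deepseek-pd-0-decode-0-0

The key output is shown below. Check the Avg generation throughput value. A value greater than 0 indicates that the model is running properly.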

(APIServer pid=8) INFO:     192.168.1.191:46710 - "GET /metrics HTTP/1.1" 200 OK
I0417 11:38:17.681255  1302 ascend_direct_transport.cpp:605] Transfer to:192.168.0.25:20294, cost: 5289 us
(Worker_DP0_TP0 pid=304) INFO 04-17 11:38:17 [mooncake_connector.py:561] KV cache transfer for request chatcmpl-53cf0580-0e68-4623-80aa-1cf0fd923034 took 5.85 ms (1 groups, 1 blocks). local_ip 192.168.0.25 local_device_id 0 remote_session_id 192.168.0.25:15910
I0417 11:38:17.700887  1272 ascend_direct_transport.cpp:605] Transfer to:192.168.0.25:21685, cost: 26734 us
(Worker_DP0_TP1 pid=307) INFO 04-17 11:38:17 [mooncake_connector.py:561] KV cache transfer for request chatcmpl-53cf0580-0e68-4623-80aa-1cf0fd923034 took 27.27 ms (1 groups, 1 blocks). local_ip 192.168.0.25 local_device_id 1 remote_session_id 192.168.0.25:16045
(APIServer pid=8) INFO:     192.168.0.25:47876 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     192.168.0.25:47878 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=8) INFO 04-17 11:38:25 [loggers.py:248] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 10.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 100.0%
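The throughput and KV Cache transfer times can be extracted from these lines with regular expressions. The following is a minimal sketch, assuming the log message formats shown above stay the same; the names are hypothetical, and the sample text is shortened from the lines above.

```python
import re

THROUGHPUT = re.compile(r"Avg generation throughput: ([\d.]+) tokens/s")
TRANSFER = re.compile(r"KV cache transfer for request \S+ took ([\d.]+) ms")


def generation_throughputs(log_text: str) -> list:
    """Return every reported generation throughput in tokens/s."""
    return [float(value) for value in THROUGHPUT.findall(log_text)]


def transfer_times_ms(log_text: str) -> list:
    """Return every reported KV Cache transfer time in milliseconds."""
    return [float(value) for value in TRANSFER.findall(log_text)]


SAMPLE = """\
[mooncake_connector.py:561] KV cache transfer for request chatcmpl-53cf0580-0e68-4623-80aa-1cf0fd923034 took 5.85 ms (1 groups, 1 blocks).
[mooncake_connector.py:561] KV cache transfer for request chatcmpl-53cf0580-0e68-4623-80aa-1cf0fd923034 took 27.27 ms (1 groups, 1 blocks).
[loggers.py:248] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 10.0 tokens/s, Running: 0 reqs
"""

print(generation_throughputs(SAMPLE))
print(transfer_times_ms(SAMPLE))
```

A Decode Pod is generating tokens if any value returned by generation_throughputs is greater than 0.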

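FAQ

Error: AttributeError: 'Qwen2Config' object has no attribute 'head_dim'
An error such as AttributeError: 'Qwen2Config' object has no attribute 'head_dim' means that the head_dim field is missing from the model configuration. This parameter defines the dimension of each attention head and is a key setting for model inference.

Manually edit the config.json file in the model directory and add the "head_dim": 128 field. For a model of a different size, calculate the value from its actual parameters using the following formula: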
```
head_dim = hidden_size / num_attention_heads 
```
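The formula and the manual edit can be combined in a short script. The following is a minimal sketch, assuming config.json contains hidden_size and num_attention_heads; the function names are hypothetical, and the numbers in the example call (1536 and 12) are illustrative values that yield 128.

```python
import json
from pathlib import Path


def compute_head_dim(config: dict) -> int:
    """head_dim = hidden_size / num_attention_heads, required to divide evenly."""
    hidden, heads = config["hidden_size"], config["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError(f"hidden_size {hidden} is not divisible by {heads} heads")
    return hidden // heads


def add_head_dim(config_path: str) -> int:
    """Add head_dim to config.json if it is missing and return the value in effect."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # An existing head_dim is kept; only a missing one is filled in.
    config.setdefault("head_dim", compute_head_dim(config))
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config["head_dim"]


# Illustrative values: 1536 / 12 = 128.
print(compute_head_dim({"hidden_size": 1536, "num_attention_heads": 12}))  # 128
```

Back up config.json before running `add_head_dim("/models/DeepSeek-R1-Distill-Qwen-1.5B/config.json")`, because the file is rewritten in place.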
