Updated: 2026-05-21 GMT+08:00

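Deploying ModelServing with Mooncake

This document describes the best practice for deploying the DeepSeek-R1-Distill-Qwen-1.5B model in a 1P1D (1 Prefill + 1 Decode) disaggregated architecture on an A3 Ascend cluster (single supernode), using vLLM-ascend and the Kthena inference platform. The architecture physically isolates the Prefill and Decode stages and uses Mooncake Connector to transfer the KV Cache between them, improving resource utilization and inference performance.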

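Background

Compared with a traditional deployment, PD-disaggregated deployment (Mooncake + KV Cache) offers the following advantages:

  • Physical isolation: Prefill and Decode run on separate nodes and do not interfere with each other.

  • Lower latency jitter: processing a long prompt does not block token generation for other requests.

  • High concurrency: transferring the KV Cache separately enables efficient resource scheduling.

The core components are described below.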

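Component            Description
vLLM-ascend          The Ascend NPU-optimized version of vLLM.
Kthena               Huawei Cloud model serving orchestration platform, used to manage ModelServing instances in a unified way.
Mooncake Connector   KV Cache disaggregated transfer protocol that lets the Prefill and Decode nodes share the KV Cache efficiently.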

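Flowchart

Prerequisites

  • The Volcano scheduler add-on v1.20.15 or later is installed, and Volcano is set as the default scheduler.
  • Before deployment, confirm that the network between nodes is working properly. You can verify it by following Verification Process.

Constraints and Limitations

This verification procedure relies on bare metal servers (BMS). On virtual machines (VMs), NPU network communication is not guaranteed to work, and you must resolve any related issues yourself.

Procedure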

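  1. Prepare the model.

    1. Download the model locally or obtain it from the Huawei open-source mirror site. The model files must end up in the /models/DeepSeek-R1-Distill-Qwen-1.5B directory on the node.
    2. If the model was downloaded as an archive, decompress it to the specified path.
      unzip <downloaded model file> -d /models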

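  2. Create the ConfigMap.

    1. Create a config.yaml file that defines the Prefill and Decode startup scripts. Adjust the script parameters for the model you use. For details, see the official vLLM Ascend documentation.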
      kind: ConfigMap
      apiVersion: v1
      metadata:
        name: deepseek-pd-cm
      data:
        prefill.sh: |
          nic_name="enp23s0f3"  # network card name
          local_ip=$POD_IP
          export HCCL_IF_IP=$local_ip
          export GLOO_SOCKET_IFNAME=$nic_name
          export TP_SOCKET_IFNAME=$nic_name
          export HCCL_SOCKET_IFNAME=$nic_name
          export OMP_PROC_BIND=false
          export OMP_NUM_THREADS=10
          export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
          export HCCL_BUFFSIZE=256
          export TASK_QUEUE_ENABLE=1
          export HCCL_OP_EXPANSION_MODE="AIV"
          export VLLM_USE_V1=1
          export MOONCAKE_ENGINE_ID="${GROUP_NAME}_${ROLE_ID}"
          vllm serve $MODEL_LOCATION \
            --host $POD_IP \
            --port "7100" \
            --data-parallel-size 4 \
            --data-parallel-size-local 4 \
            --data-parallel-address $POD_IP \
            --data-parallel-rpc-port 12321 \
            --tensor-parallel-size 2 \
            --seed 1024 \
            --served-model-name ds_r1 \
            --max-model-len 40000 \
            --max-num-batched-tokens 16384 \
            --max-num-seqs 8 \
            --enforce-eager \
            --trust-remote-code \
            --gpu-memory-utilization 0.9  \
            --no-enable-prefix-caching \
            --additional-config '{"recompute_scheduler_enable":true}' \
            --kv-transfer-config \
            '{"kv_connector": "MooncakeConnectorV1",
            "kv_role": "kv_producer",
            "kv_port": "28000",
            "engine_id": "'"${MOONCAKE_ENGINE_ID}"'",
            "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
            "kv_connector_extra_config": {
                      "use_ascend_direct": true,
                      "prefill": {
                              "dp_size": 4,
                              "tp_size": 2
                      },
                      "decode": {
                              "dp_size": 4,
                              "tp_size": 2
                      }
                }
            }'
        decode.sh: |
          nic_name="enp23s0f3"  # network card name
          local_ip=$POD_IP
          export HCCL_IF_IP=$local_ip
          export GLOO_SOCKET_IFNAME=$nic_name
          export TP_SOCKET_IFNAME=$nic_name
          export HCCL_SOCKET_IFNAME=$nic_name
          export OMP_PROC_BIND=false
          export OMP_NUM_THREADS=10
          export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
          export HCCL_BUFFSIZE=600
          export TASK_QUEUE_ENABLE=1
          export HCCL_OP_EXPANSION_MODE="AIV"
          export VLLM_USE_V1=1
          export MOONCAKE_ENGINE_ID="${GROUP_NAME}_${ROLE_ID}"
          vllm serve $MODEL_LOCATION \
            --host $POD_IP \
            --port "7101" \
            --data-parallel-size 4 \
            --data-parallel-address $POD_IP \
            --data-parallel-rpc-port 12322 \
            --tensor-parallel-size 2 \
            --seed 1024 \
            --served-model-name ds_r1 \
            --max-model-len 40000 \
            --max-num-batched-tokens 256 \
            --max-num-seqs 40 \
            --trust-remote-code \
            --gpu-memory-utilization 0.94  \
            --no-enable-prefix-caching \
            --additional-config '{"recompute_scheduler_enable":true,"finegrained_tp_config": {"lmhead_tensor_parallel_size":2}}' \
            --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
            --kv-transfer-config \
            '{"kv_connector": "MooncakeConnectorV1",
            "kv_role": "kv_consumer",
            "kv_port": "28100",
            "engine_id": "'"${MOONCAKE_ENGINE_ID}"'",
            "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
            "kv_connector_extra_config": {
                      "use_ascend_direct": true,
                      "prefill": {
                              "dp_size": 4,
                              "tp_size": 2
                      },
                      "decode": {
                              "dp_size": 4,
                              "tp_size": 2
                      }
                }
            }'

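      The ConfigMap contains the following two key scripts:

      • prefill.sh (startup script for the Prefill stage)
        # Core settings
        Network interface (nic_name): enp23s0f3 (run ip route | grep default on the node to obtain it)
        Service port (port): 7100
        KV port (kv_port): 28000
        Data parallel size (data-parallel-size): 4
        Tensor parallel size (tensor-parallel-size): 2
      • decode.sh (startup script for the Decode stage)
        # Core settings
        Network interface (nic_name): enp23s0f3
        Service port (port): 7101
        KV port (kv_port): 28100
        Maximum concurrent sequences (max-num-seqs): 40
    2. Run the following command to create the ConfigMap.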
      kubectl apply -f config.yaml
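
As a quick sanity check on the parallelism settings in the two scripts, the number of NPUs a role needs is its data-parallel size multiplied by its tensor-parallel size, and that product should match the huawei.com/ascend-1980 value requested for the Pod (8 in this guide). The Python sketch below works through that arithmetic. The function name npus_required is hypothetical, and the rule that each data-parallel rank occupies tensor-parallel-size devices is the usual vLLM convention (with no pipeline parallelism), not something stated in this guide.

```python
def npus_required(dp_size: int, tp_size: int) -> int:
    """Return the NPUs one role needs: one device per (DP rank, TP rank) pair."""
    if dp_size < 1 or tp_size < 1:
        raise ValueError("dp_size and tp_size must be positive integers")
    return dp_size * tp_size


# Values taken from prefill.sh and decode.sh in this guide.
roles = {
    "prefill": {"dp_size": 4, "tp_size": 2},
    "decode": {"dp_size": 4, "tp_size": 2},
}

for name, cfg in roles.items():
    print(f"{name}: {npus_required(**cfg)} NPUs")
```

If you change --data-parallel-size or --tensor-parallel-size for another model, the NPU request in the ModelServing manifest and the dp_size/tp_size values in kv_connector_extra_config would need to change with them.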

  3. Deploy ModelServing.

    1. Create a deepseek-serv.yaml file that defines the Prefill and Decode service instances.
      apiVersion: workload.serving.volcano.sh/v1alpha1
      kind: ModelServing
      metadata:
        name: deepseek-pd
        namespace: default
      spec:
        schedulerName: volcano
        replicas: 1
        recoveryPolicy: ServingGroupRecreate
        template:
          restartGracePeriodSeconds: 60
          roles:
          - name: prefill
            replicas: 1
            workerReplicas: 0
            entryTemplate:
              spec:
                hostNetwork: true
                containers:
                - name: prefill
                  image: quay.io/ascend/vllm-ascend:v0.13.0-a3
                  command:
                    - /bin/bash
                  args:
                    - '-c'
                    - cd /workspace && ./prefill.sh
                  env:
                  - name: ROLE
                    value: "prefill"
                  - name: GROUP_NAME
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.labels['modelserving.volcano.sh/group-name']
                  - name: ROLE_ID
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.labels['modelserving.volcano.sh/role-id']
                  - name: POD_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                  - name: NODE_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.hostIP
                  - name: MODEL_LOCATION
                    value: /models/DeepSeek-R1-Distill-Qwen-1.5B
                  - name: TP_SIZE
                    value: "2"
                  - name: DP_SIZE
                    value: "4"
                  readinessProbe:
                    httpGet:
                      path: /health
                      port: 7100
                      scheme: HTTP
                    initialDelaySeconds: 60
                    periodSeconds: 10
                    timeoutSeconds: 2
                    failureThreshold: 3
                  resources:
                    limits:
                      cpu: '94'
                      huawei.com/ascend-1980: '8'
                      memory: 900Gi
                    requests:
                      cpu: '32'
                      huawei.com/ascend-1980: '8'
                      memory: 350Gi
                  ports:
                  - containerPort: 7100
                    name: server
                  volumeMounts:
                  - name: model
                    mountPath: /models
                  - name: dshm
                    mountPath: /dev/shm
                  - name: hccn-conf
                    mountPath: /etc/hccn.conf
                  - name: hccn-tool
                    mountPath: /usr/local/Ascend/driver/tools/hccn_tool
                  - name: ascend-install-info
                    mountPath: /etc/ascend_install.info
                  - name: config
                    mountPath: /workspace/prefill.sh
                    subPath: prefill.sh
                volumes:
                - name: model
                  hostPath:
                    path: /models
                    type: Directory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: hccn-conf
                  hostPath:
                    path: /etc/hccn.conf
                - name: hccn-tool
                  hostPath:
                    path: /usr/local/Ascend/driver/tools/hccn_tool
                - name: ascend-install-info
                  hostPath:
                    path: /etc/ascend_install.info
                - name: config
                  configMap:
                    name: deepseek-pd-cm
                    defaultMode: 0777
          - name: decode
            replicas: 1
            workerReplicas: 0
            entryTemplate:
              spec:
                hostNetwork: true
                containers:
                - name: decode
                  image: quay.io/ascend/vllm-ascend:v0.13.0-a3
                  command:
                    - /bin/bash
                  args:
                    - '-c'
                    - cd /workspace && ./decode.sh
                  env:
                  - name: ROLE
                    value: "decode"
                  - name: ENGINE_ID
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.name
                  - name: POD_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                  - name: NODE_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.hostIP
                  - name: GROUP_NAME
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.labels['modelserving.volcano.sh/group-name']
                  - name: ROLE_ID
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.labels['modelserving.volcano.sh/role-id']
                  - name: MODEL_LOCATION
                    value: /models/DeepSeek-R1-Distill-Qwen-1.5B
                  - name: TP_SIZE
                    value: "2"
                  - name: DP_SIZE
                    value: "4"
                  readinessProbe:
                    httpGet:
                      path: /health
                      port: 7101
                      scheme: HTTP
                    initialDelaySeconds: 60
                    periodSeconds: 10
                    timeoutSeconds: 2
                    failureThreshold: 3
                  ports:
                  - containerPort: 7101
                    name: server
                  resources:
                    limits:
                      cpu: '94'
                      huawei.com/ascend-1980: '8'
                      memory: 900Gi
                    requests:
                      cpu: '32'
                      huawei.com/ascend-1980: '8'
                      memory: 350Gi
                  volumeMounts:
                  - name: model
                    mountPath: /models
                  - name: dshm
                    mountPath: /dev/shm
                  - name: hccn-conf
                    mountPath: /etc/hccn.conf
                  - name: hccn-tool
                    mountPath: /usr/local/Ascend/driver/tools/hccn_tool
                  - name: ascend-install-info
                    mountPath: /etc/ascend_install.info
                  - name: config
                    mountPath: /workspace/decode.sh
                    subPath: decode.sh
                volumes:
                - name: model
                  hostPath:
                    path: /models
                    type: Directory
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: hccn-conf
                  hostPath:
                    path: /etc/hccn.conf
                - name: hccn-tool
                  hostPath:
                    path: /usr/local/Ascend/driver/tools/hccn_tool
                - name: ascend-install-info
                  hostPath:
                    path: /etc/ascend_install.info
                - name: config
                  configMap:
                    name: deepseek-pd-cm
                    defaultMode: 0777
    2. Run the following command to deploy ModelServing.
      kubectl apply -f deepseek-serv.yaml

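      The key mount points are listed below.

      Mount path                                      Description
      /models                                         Model file directory.
      /dev/shm                                        Shared memory, used for inter-process communication.
      /etc/hccn.conf                                  NPU network configuration file.
      /workspace/prefill.sh or /workspace/decode.sh   Startup script path.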

  4. Configure the load-balancing proxy.

    1. Run the following command to download the proxy server script.
      wget https://raw.githubusercontent.com/vllm-project/vllm-ascend/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py
    2. Run the following command to obtain the Pod IP addresses.
      kubectl get pods -owide

      Example output:

      NAME                        READY   STATUS    RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
      deepseek-pd-0-decode-0-0    1/1     Running   0          20h   192.168.0.25   192.168.0.25   <none>           <none>
      deepseek-pd-0-prefill-0-0   1/1     Running   0          20h   192.168.0.25   192.168.0.25   <none>           <none>
    3. Start the proxy server. Change the ports and IP addresses to match your deployment environment.
      python3 load_balance_proxy_server_example.py \
        --port 8080 \
        --host 0.0.0.0 \
        --prefiller-hosts 192.168.0.25 \
        --prefiller-ports 7100 \
        --decoder-hosts 192.168.0.25 \
        --decoder-ports 7101

  5. Verify and test. Send requests to the IP address and port of the proxy server, and change both to match your environment.

    1. Send a test request.
      curl -X POST http://192.168.0.25:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "ds_r1",
          "messages": [
            {
              "role": "user",
              "content": "Hello, how are you?"
            }
          ],
          "max_tokens": 100
        }'

      Information similar to the following is returned.

      {
        "id": "chatcmpl-53cf0580-0e68-4623-80aa-1cf0fd923034",
        "object": "chat.completion",
        "created": 1776425897,
        "model": "ds_r1",
        "choices": [
          {
            "index": 0,
            "message": {
              "role": "assistant",
              "content": "Okay, so I just received a message from someone asking, \"Hello, how are you?\" I need to respond appropriately. Let me think about the best way to handle this.\n\nFirst, I should consider the context. The user is greeting me, which is friendly. They're probably new or just reaching out for the first time. I should keep it warm and open-ended to encourage them to share more.\n\nI should acknowledge their greeting and express my greeting in a friendly manner. Maybe something like,",
              "refusal": null,
              "annotations": null,
              "audio": null,
              "function_call": null,
              "tool_calls": [],
              "reasoning": null,
              "reasoning_content": null
            },
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null,
            "token_ids": null
          }
        ],
        "service_tier": null,
        "system_fingerprint": null,
        "usage": {
          "prompt_tokens": 11,
          "total_tokens": 111,
          "completion_tokens": 100,
          "prompt_tokens_details": null
        },
        "prompt_logprobs": null,
        "prompt_token_ids": null,
        "kv_transfer_params": null
      }
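
To check the response programmatically rather than by eye, a short Python sketch such as the following may help. It reads only fields that appear in the sample response above (model, choices, finish_reason, usage). The helper name summarize_completion is hypothetical, and the sample body is a trimmed copy of that response.

```python
import json


def summarize_completion(body: str) -> dict:
    """Extract the fields worth checking from a chat completion response body."""
    data = json.loads(body)
    choice = data["choices"][0]
    usage = data["usage"]
    return {
        "model": data["model"],
        "finish_reason": choice["finish_reason"],
        "content_empty": not choice["message"]["content"],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        # total_tokens should equal prompt_tokens + completion_tokens
        "usage_consistent": usage["total_tokens"]
        == usage["prompt_tokens"] + usage["completion_tokens"],
    }


# Trimmed copy of the sample response shown in this guide.
sample = json.dumps({
    "model": "ds_r1",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Okay, so I just received a message"},
        "finish_reason": "length",
    }],
    "usage": {"prompt_tokens": 11, "total_tokens": 111, "completion_tokens": 100},
})
print(summarize_completion(sample))
```

A finish_reason of "length" means generation stopped at the max_tokens limit (100 in the test request), which is expected here.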
    2. Check the log output.
      • Proxy logs
        INFO:     Started server process [417174]
        INFO:     Waiting for application startup.
        Initialized 1 prefill clients and 1 decode clients.
        INFO:     Application startup complete.
        INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
        INFO:     192.168.0.25:53050 - "POST /v1/completions HTTP/1.1" 200 OK
        INFO:     192.168.0.25:43946 - "POST /v1/chat/completions HTTP/1.1" 200 OK
        INFO:     192.168.0.25:56470 - "POST /v1/completions HTTP/1.1" 200 OK
        INFO:     192.168.0.25:36910 - "POST /v1/chat/completions HTTP/1.1" 200 OK
        INFO:     192.168.0.25:51070 - "POST /v1/chat/completions HTTP/1.1" 200 OK

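        An HTTP status code of "200 OK" in the output indicates that the request was processed successfully.

      • Prefill Pod logs

        Run the following command to view the Prefill Pod logs.

        kubectl logs deepseek-pd-0-prefill-0-0

        Key output is shown below. Check for log entries that contain Engine 000 and Delaying free. If both are present, the prefill computation is working and the KV Cache has been generated and is ready for transfer.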

        (EngineCore_DP0 pid=142) INFO 04-17 11:38:17 [mooncake_connector.py:1062] Delaying free of 1 blocks for request chatcmpl-53cf0580-0e68-4623-80aa-1cf0fd923034
        (APIServer pid=7) INFO:     192.168.0.25:52240 - "POST /v1/chat/completions HTTP/1.1" 200 OK
        (APIServer pid=7) INFO:     192.168.1.191:34968 - "GET /metrics HTTP/1.1" 200 OK
        (APIServer pid=7) INFO 04-17 11:38:24 [loggers.py:248] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 0.0%
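      • Decode Pod logs

        Run the following command to view the Decode Pod logs.

        kubectl logs deepseek-pd-0-decode-0-0

        Key output is shown below. Check the Avg generation throughput value. A value greater than 0 indicates that the model is running properly.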

        (APIServer pid=8) INFO:     192.168.1.191:46710 - "GET /metrics HTTP/1.1" 200 OK
        I0417 11:38:17.681255  1302 ascend_direct_transport.cpp:605] Transfer to:192.168.0.25:20294, cost: 5289 us
        (Worker_DP0_TP0 pid=304) INFO 04-17 11:38:17 [mooncake_connector.py:561] KV cache transfer for request chatcmpl-53cf0580-0e68-4623-80aa-1cf0fd923034 took 5.85 ms (1 groups, 1 blocks). local_ip 192.168.0.25 local_device_id 0 remote_session_id 192.168.0.25:15910
        I0417 11:38:17.700887  1272 ascend_direct_transport.cpp:605] Transfer to:192.168.0.25:21685, cost: 26734 us
        (Worker_DP0_TP1 pid=307) INFO 04-17 11:38:17 [mooncake_connector.py:561] KV cache transfer for request chatcmpl-53cf0580-0e68-4623-80aa-1cf0fd923034 took 27.27 ms (1 groups, 1 blocks). local_ip 192.168.0.25 local_device_id 1 remote_session_id 192.168.0.25:16045
        (APIServer pid=8) INFO:     192.168.0.25:47876 - "POST /v1/chat/completions HTTP/1.1" 200 OK
        (APIServer pid=8) INFO:     192.168.0.25:47878 - "GET /health HTTP/1.1" 200 OK
        (APIServer pid=8) INFO 04-17 11:38:25 [loggers.py:248] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 10.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 100.0%
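
The log checks above can also be scripted. The Python sketch below pulls the Avg generation throughput value out of an Engine 000 line and looks for the KV cache transfer message. The regular expressions are written against the sample lines in this guide, so the log format of other vLLM or vLLM-ascend versions is an assumption, and the function names are hypothetical.

```python
import re

THROUGHPUT_RE = re.compile(r"Avg generation throughput: ([0-9.]+) tokens/s")
TRANSFER_RE = re.compile(r"KV cache transfer for request (\S+) took ([0-9.]+) ms")


def generation_throughputs(log_text: str) -> list[float]:
    """Return every 'Avg generation throughput' value found in the log text."""
    return [float(m.group(1)) for m in THROUGHPUT_RE.finditer(log_text)]


def kv_transfers(log_text: str) -> list[tuple[str, float]]:
    """Return (request id, duration in ms) for each KV cache transfer line."""
    return [(m.group(1), float(m.group(2))) for m in TRANSFER_RE.finditer(log_text)]


# Two lines shortened from the Decode Pod log sample in this guide.
sample_log = (
    "(Worker_DP0_TP0 pid=304) INFO 04-17 11:38:17 [mooncake_connector.py:561] "
    "KV cache transfer for request chatcmpl-53cf0580 took 5.85 ms (1 groups, 1 blocks).\n"
    "(APIServer pid=8) INFO 04-17 11:38:25 [loggers.py:248] Engine 000: "
    "Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 10.0 tokens/s\n"
)

decode_ok = any(value > 0 for value in generation_throughputs(sample_log))
print("decode generating:", decode_ok)
print("transfers:", kv_transfers(sample_log))
```

In practice, the text passed to these functions would be the output of the kubectl logs command for the Decode Pod instead of the sample string.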

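FAQ

  • Error: AttributeError: 'Qwen2Config' object has no attribute 'head_dim'

    An error such as AttributeError: 'Qwen2Config' object has no attribute 'head_dim' means that the head_dim field is missing from the model configuration. This parameter defines the dimension of each attention head and is required during model inference.

    Manually edit the config.json file in the model directory and add the "head_dim": 128 field. For a model of a different size, calculate the value from its actual parameters using the following formula: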

    head_dim = hidden_size / num_attention_heads 
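
    A minimal Python sketch of this fix follows, assuming config.json contains the hidden_size and num_attention_heads fields used in the formula. The helper name add_head_dim is hypothetical. The demonstration writes to a temporary file, and its two input values are illustrative, chosen so that the result is 128 as in the instruction above.

```python
import json
import tempfile
from pathlib import Path


def add_head_dim(config_path: Path) -> int:
    """Add head_dim = hidden_size / num_attention_heads to config.json if missing."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    if "head_dim" not in config:
        hidden, heads = config["hidden_size"], config["num_attention_heads"]
        if hidden % heads != 0:
            raise ValueError("hidden_size is not divisible by num_attention_heads")
        config["head_dim"] = hidden // heads
        config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config["head_dim"]


# Demonstration on a temporary file; the two values are illustrative.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"
    path.write_text(
        json.dumps({"hidden_size": 1536, "num_attention_heads": 12}),
        encoding="utf-8",
    )
    result = add_head_dim(path)
    saved = json.loads(path.read_text(encoding="utf-8"))
print(result)
```

    For real use, pass the path of the config.json file under the model directory (for example, /models/DeepSeek-R1-Distill-Qwen-1.5B/config.json) and back up the file first.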

Related Documents