Deploying a Qwen Inference Service with PD Separation

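To deploy Qwen model inference in a PD-separated (prefill/decode) setup on Snt9b23 resources, first generate the infer_vllm_kubeinfer.yaml file required for deployment using the command below. The options accepted by the "--xx-params" arguments are listed in Table 1; configure them as needed.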

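# Example scenario: generate a 1P1D deployment YAML for Qwen3-32B with 2 instance replicas, quantized w8a8c8 weights, and the startup parameters recommended for this release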
python3 gen_pd_deploy_kubeinfer_yaml_with_omni.py \
    --prefill-pod-num=1 \
    --decode-pod-num=1 \
    --replicas=2 \
    --pd-resource="{\"resource-cpu\": 88, \"resource-npu\": 8, \"resource-mem\": \"500Gi\"}" \
    --image-name="ascend_vllm:latest" \
    --mount-path=/mnt/deepseek \
    --script-path=/mnt/deepseek/deploy \
    --common-params="--extra-env-vars='ENABLE_PHASE_AWARE_QKVO_QUANT=1,ENABLE_QWEN_MICROBATCH=1,USE_ZMQ_BROADCAST=1,VALIDATORS_CONFIG_PATH=/home/ma-user/AscendCloud/AscendCloud-LLM/llm_inference/ascend_vllm/ascend_vllm/middlewares/validator_config.json,KV_CACHE_RETRY_TIMES=1,KV_CACHE_RETRY_WAIT_SECOND=0,SYNC_KV_TIMEOUT=6000' \
                     --prefill-extra-env-vars='PREFILL_STOP_SCHEDULE_TOKENS=8000,PROMPT_CROP_LAST_LAYER=1,CACHE_APC_NUM=4' \
                     --pd-port=9100 \
                     --vllm-log-path=/mnt/deepseek/vllm_log \
                     --time-window-ms=90000 \
                     --hang-threshold-sec=13 \
                     --model=/mnt/deepseek/model/Qwen3-32B-w8a8c8 \
                     --served-model-name=qwen \
                     --max-model-len=65536 \
                     --tensor-parallel-size=8 \
                     --gpu-memory-utilization=0.9 \
                     --no-enable-chunked-prefill \
                     --num-gpu-blocks-override=15000 \
                     --middleware=omni.adaptors.vllm.entrypoints.middleware.param_check.ValidateSamplingParams \
                     --quantization=compressed-tensors \
                     --kv-cache-dtype=int8 \
                     --enable-reasoning \
                     --reasoning-parser=qwen3" \
    --proxy-params="--port=9000" \
    --prefill-params="--max-num-seqs=64 --additional-config='{\"ascend_turbo_graph_config\": {\"enabled\": true, \"compile_models\": [\"prefill\", \"prefill-opt\"]}, \"async_pull_kv\": true, \"combine_block\": 8}'" \
    --decode-params="--max-num-seqs=48 --no-enable-prefix-caching --additional-config='{\"ascend_turbo_graph_config\": {\"enabled\": true, \"compile_models\": [\"decode\", \"decode-ge\"], \"compiled_ge_gear\": [32, 40, 48]}, \"ascend_scheduler_config\": {\"enabled\": true}, \"async_pull_kv\": true, \"multi_step\": true, \"combine_block\": 8}'"
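
The JSON arguments in the command above are written with backslash-escaped double quotes, which is easy to get wrong when editing values. As a minimal sketch, the same string can be kept in a single-quoted shell variable and checked before use. The variable name `PD_RESOURCE` and the helper `is_valid_json` are illustrative and are not part of the generator script.

```shell
# Single quotes let the JSON be written verbatim, without backslash escapes.
# PD_RESOURCE is an illustrative variable name, not something the script defines.
PD_RESOURCE='{"resource-cpu": 88, "resource-npu": 8, "resource-mem": "500Gi"}'

# Hypothetical helper: returns 0 when stdin parses as JSON.
# It relies on the same python3 interpreter used to run the generator.
is_valid_json() {
    python3 -c 'import json, sys; json.load(sys.stdin)' 2>/dev/null
}

if printf '%s' "$PD_RESOURCE" | is_valid_json; then
    echo "pd-resource OK: $PD_RESOURCE"
else
    echo "pd-resource is not valid JSON" >&2
fi

# The variable can then replace the escaped literal:
#   --pd-resource="$PD_RESOURCE"
```

The single-quoted value expands to exactly the same string as the escaped form in the command, so the generated YAML should be unchanged.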

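Depending on your deployment architecture, run the following Kubernetes command on a worker node or a control node to deploy the third-party open-source LLM inference instance.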
```
kubectl apply -f infer_vllm_kubeinfer.yaml
```
Run the following command to check the deployment status. The deployment has succeeded when the "READY" column shows "1/1" for every Pod.
```
kubectl get po | grep infer
```
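The "1/1" check can also be scripted rather than read by eye. The sketch below assumes the default table output of `kubectl get po`, where READY is the second column. The function name `all_infer_pods_ready` is illustrative.

```shell
# Hypothetical helper: reads `kubectl get po` table output on stdin.
# Succeeds only if at least one line containing "infer" is present and
# every such line shows 1/1 in the second (READY) column.
all_infer_pods_ready() {
    awk '/infer/ { seen = 1; if ($2 != "1/1") bad = 1 }
         END { exit !(seen && !bad) }'
}

# Possible usage, polling every 10 seconds until all Pods are ready:
#   until kubectl get po | all_infer_pods_ready; do sleep 10; done
```

Like the `grep infer` filter above, the match is on any line containing "infer", so unrelated Pods with that substring in their names would also be counted.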
Run the following command to obtain the "CLUSTER-IP" of the Service.
```
kubectl get svc
```
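To reuse the address in later commands, it can be captured in a shell variable. The sketch below assumes the default table output of `kubectl get svc`, where CLUSTER-IP is the third column. The function name `cluster_ip_of` and the Service name in the usage line are placeholders; use the name that the generated YAML actually creates.

```shell
# Hypothetical helper: reads `kubectl get svc` table output on stdin and
# prints the third column (CLUSTER-IP) of the Service whose NAME equals $1.
cluster_ip_of() {
    awk -v name="$1" '$1 == name { print $3; exit }'
}

# Possible usage (the Service name "infer-svc" is a placeholder):
#   CLUSTER_IP=$(kubectl get svc | cluster_ip_of infer-svc)
#   echo "$CLUSTER_IP"
```

An alternative that avoids parsing the table is `kubectl get svc <service-name> -o jsonpath='{.spec.clusterIP}'`.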

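Test the inference API

Send a request to the proxy port (9000, as set by --proxy-params), replacing ${CLUSTER_IP} with the CLUSTER-IP value of the Service. The "model" field should match the --served-model-name value, which is "qwen" in the example above.

curl -ik -H 'Content-Type: application/json' -d '{"messages":[{"role":"user","content":"hello"}],"model":"qwen","temperature":0.6,"top_p": 0.95,"top_k": 20,"max_tokens":1024}' -X POST http://${CLUSTER_IP}:9000/v1/chat/completions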
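
For repeated tests, the request can be wrapped in small shell functions. This is a sketch only: `build_chat_payload` and `chat` are illustrative names, the model name is assumed to equal the --served-model-name value, and the prompt is inserted without JSON escaping, so it must not contain double quotes or backslashes.

```shell
# Hypothetical helper: prints a chat-completions request body.
#   $1 = model name (assumed to equal --served-model-name), $2 = user prompt
# The prompt is NOT JSON-escaped; keep it free of double quotes and backslashes.
build_chat_payload() {
    printf '{"messages":[{"role":"user","content":"%s"}],"model":"%s","temperature":0.6,"top_p":0.95,"top_k":20,"max_tokens":1024}' "$2" "$1"
}

# Hypothetical wrapper: POSTs the payload to the proxy and prints the HTTP
# status code on a final line after the response body.
#   $1 = CLUSTER-IP of the Service, $2 = model name, $3 = user prompt
chat() {
    curl -sS -H 'Content-Type: application/json' \
         -d "$(build_chat_payload "$2" "$3")" \
         -w '\n%{http_code}\n' \
         -X POST "http://$1:9000/v1/chat/completions"
}

# Possible usage:
#   chat "$CLUSTER_IP" qwen hello
```

A final line of 200 indicates that the proxy accepted the request; other codes usually point to a wrong address, port, or model name.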
