
Updated: 2025-07-29 GMT+08:00


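Guide to Adapting moondream2 for PyTorch NPU Inference on Lite Server

Solution Overview

This document describes how to deploy the moondream2 model on ModelArts Lite Server for NPU inference, covering environment configuration, model conversion, and model inference.

This solution is currently available only to some enterprise customers. Before deploying it, contact the Huawei technical support for your enterprise.

Resource Specification Requirements

For inference deployment, a Lite Server resource with a single Ascend Snt9B card on a single node is recommended.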

Table 1 Environment requirements
Name	Version
CANN	cann_8.0.rc1
PyTorch	pytorch_2.1.0

Obtaining the Image

Table 2 Obtaining the image
Category	Name	How to obtain
Base image	CN Southwest-Guiyang1: swr.cn-southwest-2.myhuaweicloud.com/atelier/pytorch_2_1_ascend:pytorch_2.1.0-cann_8.0.rc1-py_3.9-hce_2.0.2312-aarch64-snt9b-20240516142953-ca51f42	Pull from SWR.
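
As a quick consistency check between Table 1 and Table 2, the image tag can be split into its version fields. The sketch below assumes the tag follows the `<name>_<version>` pattern, separated by `-`, that is visible in Table 2. The helper name `parse_image_tag` is hypothetical.

```python
def parse_image_tag(image_url: str) -> dict:
    """Split an image tag such as 'pytorch_2.1.0-cann_8.0.rc1-...' into fields.

    Assumption: fields are separated by '-' and a versioned field looks like
    '<name>_<version>'. Fields without '_' (architecture, build id) are ignored.
    """
    tag = image_url.rsplit(":", 1)[1]
    fields = {}
    for part in tag.split("-"):
        name, sep, version = part.partition("_")
        if sep:
            fields[name] = version
    return fields


IMAGE_URL = (
    "swr.cn-southwest-2.myhuaweicloud.com/atelier/pytorch_2_1_ascend:"
    "pytorch_2.1.0-cann_8.0.rc1-py_3.9-hce_2.0.2312-aarch64-snt9b-"
    "20240516142953-ca51f42"
)

fields = parse_image_tag(IMAGE_URL)
# Compare against the versions listed in Table 1.
assert fields["pytorch"] == "2.1.0"
assert fields["cann"] == "8.0.rc1"
print(fields)
```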

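Step1 Prepare the Environment

Purchase Lite Server resources as described in "Lite Server resource provisioning". Make sure the machines have been provisioned, the password has been obtained, you can log in over SSH, and the machines can reach each other over the network.

If the container serves multiple users, or is shared by multiple users, restrict its access to the OpenStack management address (169.254.169.254) so that the container cannot obtain host metadata. For details, see "Preventing containers from obtaining host metadata".
Check the environment.
1. Log in to the machine over SSH and check the NPU device status. Run the following commands, which return the NPU device information.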
```
npu-smi info                    # Run on each instance node to view the NPU card status
npu-smi info -l | grep Total    # Run on each instance node to view the total number of cards
```
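  If an error occurs, the NPU device on the machine may not be installed correctly, or the NPU may already be mounted by another container. Install the firmware and driver correctly, or release the mounted NPU.
2. Check whether Docker is installed.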
```
docker -v   # Check whether Docker is installed
```
  If it is not installed, run the following command to install Docker.
```
yum install -y docker-engine.aarch64 docker-engine-selinux.noarch docker-runc.aarch64
```
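3. Configure IP forwarding, which is used for network access from inside the container. Run the following command to check the value of net.ipv4.ip_forward. If the value is 1, skip this step.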
```
sysctl -p | grep net.ipv4.ip_forward
```
  If the value of net.ipv4.ip_forward is not 1, run the following commands to configure IP forwarding.
```
sed -i 's/net\.ipv4\.ip_forward=0/net\.ipv4\.ip_forward=1/g' /etc/sysctl.conf 
sysctl -p | grep net.ipv4.ip_forward
```
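
The sed command above only rewrites an existing `net.ipv4.ip_forward=0` line. If the key is missing from /etc/sysctl.conf, or is written with spaces around `=`, the file is left unchanged. The following is a minimal sketch of a more tolerant rewrite that works on the file content as text. The function name `enable_ip_forward` is hypothetical, and the sketch does not write to the file itself.

```python
import re

KEY = "net.ipv4.ip_forward"
# Match an uncommented assignment of the key, with optional spaces around '='.
PATTERN = re.compile(r"^[ \t]*net\.ipv4\.ip_forward[ \t]*=.*$", re.MULTILINE)


def enable_ip_forward(conf_text: str) -> str:
    """Return sysctl.conf text with net.ipv4.ip_forward set to 1."""
    if PATTERN.search(conf_text):
        return PATTERN.sub(f"{KEY}=1", conf_text)
    # Key not present: append it, making sure the previous line is terminated.
    if conf_text and not conf_text.endswith("\n"):
        conf_text += "\n"
    return conf_text + f"{KEY}=1\n"


print(enable_ip_forward("kernel.sysrq=0\nnet.ipv4.ip_forward = 0\n"))
print(enable_ip_forward("kernel.sysrq=0\n"))
```

To apply it, read /etc/sysctl.conf, write the returned text back as root, and then run `sysctl -p` as in the step above.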

Step2 Obtain the Base Image

Use the officially provided image to deploy the service. For the image address {image_url}, see Table 2.

```
docker pull {image_url}
```

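Step3 Start the Container

Start a container from the image. Before starting it, modify the parameters in ${} according to the parameter description.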
```
docker run -itd \
        --device=/dev/davinci1 \
        --device=/dev/davinci_manager \
        --device=/dev/devmm_svm \
        --device=/dev/hisi_hdc \
        -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
        -v /usr/local/dcmi:/usr/local/dcmi \
        -v /etc/ascend_install.info:/etc/ascend_install.info \
        -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
        -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
        --shm-size 32g \
        --net=bridge \
        -v ${work_dir}:${container_work_dir} \
        --name ${container_name} \
        ${image_name} bash
```
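Parameter description:
- -v ${work_dir}:${container_work_dir}: mounts a host directory into the container. The host and the container use different file systems. work_dir is the working directory on the host, which holds the code, data, and other files needed for inference. container_work_dir is the directory in the container to mount to. For convenience, the two paths can be the same.
  - Do not mount to the /home/ma-user directory in the container, which is the home directory of the ma-user user. Mounting to /home/ma-user conflicts with the base image when the container starts and makes the base image unusable.
  - The driver and npu-smi must both be mounted into the container.
- --name ${container_name}: the container name, which is used when entering the container. You can choose your own name.
- ${image_name}: the name of the container image.
Enter the container by its name.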
```
docker exec -it ${container_name} bash
```
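
The docker run command and the mount restriction described above can also be assembled programmatically, which makes it harder to mount onto /home/ma-user by accident. This is a minimal sketch: the function name `build_docker_run` and the example values are hypothetical, while the flags are copied from the command in this step.

```python
import shlex
from pathlib import PurePosixPath

RESERVED_HOME = PurePosixPath("/home/ma-user")


def build_docker_run(work_dir, container_work_dir, container_name,
                     image_name, device_id=1):
    """Build the docker run argument list used in this step."""
    target = PurePosixPath(container_work_dir)
    # The base image owns /home/ma-user, so refuse to mount onto or under it.
    if target == RESERVED_HOME or RESERVED_HOME in target.parents:
        raise ValueError(f"do not mount onto {RESERVED_HOME}: {target}")
    return [
        "docker", "run", "-itd",
        f"--device=/dev/davinci{device_id}",
        "--device=/dev/davinci_manager",
        "--device=/dev/devmm_svm",
        "--device=/dev/hisi_hdc",
        "-v", "/usr/local/bin/npu-smi:/usr/local/bin/npu-smi",
        "-v", "/usr/local/dcmi:/usr/local/dcmi",
        "-v", "/etc/ascend_install.info:/etc/ascend_install.info",
        "-v", "/sys/fs/cgroup:/sys/fs/cgroup:ro",
        "-v", "/usr/local/Ascend/driver:/usr/local/Ascend/driver",
        "--shm-size", "32g",
        "--net=bridge",
        "-v", f"{work_dir}:{container_work_dir}",
        "--name", container_name,
        image_name, "bash",
    ]


# Hypothetical values for illustration only.
cmd = build_docker_run("/home/temp", "/home/temp", "moondream2", "image:tag")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on the host once the placeholder values are replaced with real ones.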

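Step4 Download the Original Model Package

Download the moondream2 model package from Hugging Face to your local machine. Download address: https://huggingface.co/vikhyatk/moondream2/tree/2024-03-06

On the host, create the directory /home/temp, place the downloaded model package in /home/temp/moondream2, change the directory permissions, and copy the directory into the container.

```
mkdir -p /home/temp          # Create the directory, then place the downloaded model package in /home/temp/moondream2
cd /home/temp
chmod -R 777 moondream2      # Change the permissions of the moondream2 directory
docker cp moondream2 ${container_name}:/home/ma-user/     # Copy the moondream2 directory into the container
```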
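
If the machine used for downloading has Python and network access to Hugging Face, the download can be scripted. This is a minimal sketch, assuming the `huggingface_hub` package is installed (it is not among the dependencies listed in this guide). The function name `download_moondream2` is hypothetical, and the default revision is the tag from the download address above.

```python
def download_moondream2(local_dir="/home/temp/moondream2", revision="2024-03-06"):
    """Download the moondream2 snapshot for the given tag into local_dir."""
    # Imported lazily so this module loads even where huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="vikhyatk/moondream2",
        revision=revision,
        local_dir=local_dir,
    )


# Example (requires network access):
# path = download_moondream2()
# print(path)
```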

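Step5 Prepare Test Data

Prepare your own test images.

Place the test images in /home/temp/data on the host, change the directory permissions, and copy the directory into the container.

```
cd /home/temp
chmod -R 777 data    # Change the permissions of the data directory
docker cp data ${container_name}:/home/ma-user/   # Copy the data directory into the container
```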
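
The infer.py scripts in the appendices pass every entry of ./data to `Image.open`, so a stray non-image file or subdirectory would stop the run. A minimal sketch of a filter that could be applied before copying the data; the helper name `list_images` and the extension set are assumptions.

```python
import os

# Assumption: these extensions cover the test images; extend as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def list_images(data_dir):
    """Return the sorted file names in data_dir that look like images."""
    return sorted(
        name for name in os.listdir(data_dir)
        if os.path.isfile(os.path.join(data_dir, name))
        and os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
    )


if os.path.isdir("/home/temp/data"):
    print(list_images("/home/temp/data"))
```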

Step6 Install Dependencies

Run the following commands to install the inference dependencies.

```
pip install transformers timm einops torch==2.1.0 &&
pip install --upgrade sympy
```
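
After installation, the environment can be checked from Python before starting inference. This is a minimal sketch using the standard-library `importlib.metadata`. The package list mirrors the pip command above, and the helper name `check_packages` is hypothetical.

```python
from importlib import metadata

# Packages from the pip command above; None means "any version".
REQUIRED = {"transformers": None, "timm": None, "einops": None,
            "torch": "2.1.0", "sympy": None}


def check_packages(required):
    """Return {package: problem} for missing or mismatched packages."""
    problems = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        # Ignore local build suffixes such as '+cpu' when comparing.
        if wanted is not None and installed.split("+")[0] != wanted:
            problems[name] = f"found {installed}, expected {wanted}"
    return problems


print(check_packages(REQUIRED) or "all dependencies look fine")
```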

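Step7 Start Inference

In the container, run the inference script infer.py from /home/ma-user. For the content of the NPU inference script, see Appendix 1: infer.py for Running on NPU.

```
python infer.py
```

When the run finishes, the script prints the average latency over the predicted images. The first image is treated as a warm-up run and is excluded from the average.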
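
The scripts in the appendices skip the first image (`not_num = 1`) before accumulating timings. A minimal sketch of that averaging rule in isolation; the function name and the sample timings are hypothetical.

```python
def average_latency(costs, warmup=1):
    """Average per-image latency, skipping the first `warmup` measurements."""
    measured = costs[warmup:]
    if not measured:
        raise ValueError("need more images than warm-up runs")
    return sum(measured) / len(measured)


# Hypothetical timings in seconds; the first run may include one-time setup cost.
print(average_latency([12.0, 1.0, 2.0, 3.0]))  # 2.0
```

With a single test image nothing is measured, so at least two images are needed to get an average.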

After a run on the NPU, the results are saved to /home/ma-user/result.txt.
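
The scripts in the appendices write each entry to result.txt as the file name followed by a colon, the description on the next line, and then a blank line. A minimal sketch of reading that file back into a dictionary; the function name and the sample text are hypothetical, and the sketch assumes a description never contains a blank line.

```python
def parse_results(text):
    """Parse result.txt content into {file name: description}.

    Assumption: a description contains no blank line, because entries
    are separated by blank lines.
    """
    results = {}
    for block in text.split("\n\n"):
        if not block.strip():
            continue
        header, _, description = block.partition("\n")
        name = header[:-1] if header.endswith(":") else header
        results[name] = description.strip()
    return results


sample = "a.jpg:\nA person walks a dog.\n\nb.jpg:\nA red car is parked.\n\n"
print(parse_results(sample))
```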

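To run on a GPU, running directly on the GPU host is recommended, so no container is needed: copy the model and data to the corresponding directories, install the pip dependencies, and run the script. For the content of the GPU inference script, see Appendix 2: infer.py for Running on GPU.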

Appendix 1: infer.py for Running on NPU

The infer.py script for running inference on the NPU is as follows:

```
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import torch
import os
import time

import torch_npu  # importing torch_npu registers the NPU backend with PyTorch
# from torch_npu.contrib import transfer_to_npu

import torchair as tng
from torchair.configs.compiler_config import CompilerConfig
# import logging
# from torchair.core.utils import logger
# Uncomment to enable DEBUG logging
# logger.setLevel(logging.DEBUG)

model_id = "./moondream2"  # local model directory copied in Step4
revision = "2024-03-13"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)

device = 'npu:0'
model = model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

# Compile the text transformer with the torchair NPU backend
config = CompilerConfig()
npu_backend = tng.get_npu_backend(compiler_config=config)
model.text_model.transformer = torch.compile(model.text_model.transformer, backend=npu_backend, dynamic=True, fullgraph=True)

filenames = sorted(os.listdir('./data'))
count = 0
total_time = 0.0
not_num = 1  # number of leading images treated as warm-up and excluded from timing
with open("./result.txt", 'w+') as f:
    for file in filenames:
        t1 = time.time()
        image = Image.open('./data/' + file)
        enc_image = model.encode_image(image)
        enc_image = enc_image.to(device)
        result = model.answer_question(enc_image, "Describe in detail what is in the video frame. The rule is: first describe the main body of the character in the video frame, including action, state, characteristics, etc., do not make associations or summarize. Then describe the environment, such as the background; then describe how the video was shot, such as close-ups. Do not appear 'seems', 'may' and other words, need to be sure of the description, do not need to be ambiguous description.", tokenizer)
        cost = time.time() - t1
        if not_num <= 0:
            # Accumulate timing only after the warm-up images
            count = count + 1
            total_time += cost
            print("infer time:" + str(cost))
            print("average infer time:" + str(total_time / count), " total count:" + str(count))
        else:
            not_num = not_num - 1
        # Write the file name and its description, followed by a blank line
        f.write(file + ":" + "\n")
        f.write(result + "\n\n")
```

Appendix 2: infer.py for Running on GPU

The infer.py script for running inference on the GPU is as follows:

```
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import torch
import os
import time

model_id = "./moondream2"  # local model directory
revision = "2024-03-13"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)

device = 'cuda:0'
model = model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

filenames = sorted(os.listdir('./data'))
count = 0
total_time = 0.0
not_num = 1  # number of leading images treated as warm-up and excluded from timing
with open("./result.txt", 'w+') as f:
    for file in filenames:
        t1 = time.time()
        image = Image.open('./data/' + file)
        enc_image = model.encode_image(image)
        enc_image = enc_image.to(device)
        result = model.answer_question(enc_image, "Describe in detail what is in the video frame. The rule is: first describe the main body of the character in the video frame, including action, state, characteristics, etc., do not make associations or summarize. Then describe the environment, such as the background; then describe how the video was shot, such as close-ups. Do not appear 'seems', 'may' and other words, need to be sure of the description, do not need to be ambiguous description.", tokenizer)
        cost = time.time() - t1
        if not_num <= 0:
            # Accumulate timing only after the warm-up images
            count = count + 1
            total_time += cost
            print("infer time:" + str(cost))
            print("average infer time:" + str(total_time / count), " total count:" + str(count))
        else:
            not_num = not_num - 1
        # Write the file name and its description, followed by a blank line
        f.write(file + ":" + "\n")
        f.write(result + "\n\n")
```
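
The NPU and GPU scripts differ in the device string and in the torchair imports and compile step of the NPU version. A minimal sketch of choosing the device at run time instead of hard-coding it follows. The function name `pick_device` is hypothetical, and the sketch assumes that importing `torch_npu` makes `torch.npu.is_available()` available, as in the Ascend PyTorch adapter.

```python
import importlib.util


def pick_device():
    """Return 'npu:0', 'cuda:0', or 'cpu' depending on what is available."""
    if importlib.util.find_spec("torch") is None:
        # torch is not installed, so there is no accelerator to select.
        return "cpu"
    import torch
    if importlib.util.find_spec("torch_npu") is not None:
        import torch_npu  # noqa: F401  (importing registers torch.npu)
        if torch.npu.is_available():
            return "npu:0"
    if torch.cuda.is_available():
        return "cuda:0"
    return "cpu"


device = pick_device()
print(device)
```

The torchair compile step from Appendix 1 would still need to be applied only when the selected device is an NPU.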
