
Using Ray, Docker, and vLLM to Manually Deploy a DeepSeek-R1 or DeepSeek-V3 Model on Multi-GPU Linux ECSs

Scenarios

DeepSeek-V3 and DeepSeek-R1 are two high-performance large language models launched by DeepSeek. DeepSeek-R1 is a reasoning model designed for math, code generation, and complex logical inference; reinforcement learning (RL) is applied to improve its reasoning performance. DeepSeek-V3 focuses on general tasks such as natural language processing, knowledge Q&A, and content creation. It aims to balance high performance with low cost and suits applications such as intelligent customer service and personalized recommendation. In this section, we will learn how to use Ray, Docker, and vLLM to quickly deploy a DeepSeek-R1 or DeepSeek-V3 model.

Solution Architecture

Figure 1 Manual deployment of a DeepSeek-R1 or DeepSeek-V3 model using Ray, Docker, and vLLM (Linux)

Advantages

You will be able to manually deploy DeepSeek-R1 and DeepSeek-V3 models using Ray, Docker, and vLLM and better understand model dependencies. This will give you a chance to experience the superb inference performance of DeepSeek-R1 and DeepSeek-V3 models.

Resource Planning

Table 1 Resources and costs

| Resource | Description | Cost |
| --- | --- | --- |
| VPC | CIDR block: 192.168.0.0/16 | Free |
| VPC subnet | AZ: AZ1; CIDR block: 192.168.0.0/24 | Free |
| Security group | Inbound rule: Priority 1; Action: Allow; Type: IPv4; Protocol & Port: TCP:80; Source: 0.0.0.0/0 | Free |
| ECS | Billing mode: Yearly/Monthly; AZ: AZ1; Flavor: see Table 2; System disk: 200 GiB; Data disk: 1000 GiB; EIP: Auto assign; EIP type: Dynamic BGP; Billed by: Traffic; Bandwidth: 100 Mbit/s | Cloud servers, EVS disks, and EIPs generate costs. For billing details, see Billing Mode Overview. |

Table 2 GPU ECS flavors available for running a DeepSeek-R1 or DeepSeek-V3 model

| No. | Model Name | Minimum Flavor | GPU | Nodes |
| --- | --- | --- | --- | --- |
| 1 | DeepSeek-R1, DeepSeek-V3 | p2s.16xlarge.8 | V100 (32 GiB) × 8 | 8 |
|   |   | p2v.16xlarge.8 | V100 (16 GiB) × 8 | 16 |
|   |   | pi2.4xlarge.4 | T4 (16 GiB) × 8 | 16 |

Contact Huawei Cloud technical support to select GPU ECSs suitable for your deployment.

Implementation Procedure

  1. Create two GPU ECSs.

    1. Select a public image without a GPU driver installed, such as Huawei Cloud EulerOS 2.0 or Ubuntu 22.04. Ubuntu 22.04 is used as an example.
      Figure 2 Selecting an image
    2. Select Auto assign for EIP. The EIPs are used for downloading dependencies and calling model APIs.

  2. Install the GPU driver and CUDA.

    Install driver version 535 and CUDA 12.2. For details, see Manually Installing a Tesla Driver on a GPU-accelerated ECS.
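
    After the installation, you can verify both versions with nvidia-smi. The header of its output shows the driver version (535.x) and the highest CUDA version the driver supports (12.2):

      nvidia-smi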

  3. Install Ray.

    1. Install Python 3 and pip.
      apt-get install -y python3 python3-pip
    2. Install Ray.
      pip install -U ray
    3. Verify the installation.
      python3 -c "import ray; ray.init()"
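
      If the installation succeeds, ray.init() starts a local Ray instance without errors. You can also print the installed version as a quick check:

      python3 -c "import ray; print(ray.__version__)"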

  4. Install Docker.

    1. Update the package index and install dependencies.
      apt-get update
      apt-get install -y ca-certificates curl gnupg lsb-release
    2. Add Docker's official GPG key.
      mkdir -p /etc/apt/keyrings
      curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
    3. Set the APT source for Docker.
      echo \
      "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
      $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
    4. Install Docker Engine.
      apt-get update
      apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
    5. Configure registry mirrors for Docker Hub.
      cat <<EOF > /etc/docker/daemon.json
      {
        "registry-mirrors": [
          "https://docker.m.daocloud.io",
          "https://registry.cn-hangzhou.aliyuncs.com"
        ]
      }
      EOF
      systemctl restart docker
    6. Check whether Docker is installed successfully.
      docker --version
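
      To confirm that the registry mirrors configured above took effect, check the Docker daemon information. The Registry Mirrors field should list both mirrors:

      docker info | grep -A 2 "Registry Mirrors"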

  5. Install the NVIDIA Container Toolkit.

    Ubuntu 22.04 is used as an example. For details about how to install the NVIDIA Container Toolkit on other OSs, see Installing the NVIDIA Container Toolkit.

    1. Add NVIDIA's official GPG key and APT repository.
      curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
      sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
      sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    2. (Optional) Enable the experimental packages in the repository list.
      sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
    3. Update the package index and install the NVIDIA Container Toolkit.
      apt update
      apt install -y nvidia-container-toolkit
    4. Configure Docker to use the NVIDIA Container Toolkit.
      nvidia-ctk runtime configure --runtime=docker
      systemctl restart docker
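
      To verify that containers can access the GPUs, you can run nvidia-smi inside a minimal CUDA container. The nvidia/cuda:12.2.0-base-ubuntu22.04 image tag below is only an example; any CUDA 12.2 base image works:

      docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi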

  6. Install dependencies, such as modelscope.

    1. Update pip.
      python3 -m pip install --upgrade pip -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
    2. Install modelscope.
      pip install modelscope -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

      ModelScope is an open-source model platform based in China, so downloading from it is fast if you are in China. If you are outside China, download the model from Hugging Face instead.
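
      If you download from Hugging Face instead, one option is the huggingface_hub CLI. The commands below are an example; adjust the model name and target directory as needed:

      pip install -U huggingface_hub
      huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir /root/deepseek-ai/DeepSeek-R1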

  7. Download a Docker image.

    You can download the latest container image provided by vLLM or the image provided by the Huawei Cloud heterogeneous computing team.

    • Download the latest image from the vLLM official website.
      docker pull vllm/vllm-openai:latest
    • Download an image provided by the Huawei Cloud heterogeneous computing team.
      docker pull swr.cn-north-4.myhuaweicloud.com/hgcs/vllm0.7.3-pt312-ray2.43-cuda12.2:latest

      This image ships vLLM v0.7.3 and will continue to be updated.

  8. Download the run_cluster.sh script.

    Download the run_cluster.sh file from https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh to start the Ray cluster.
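
    For example, you can fetch the raw file with wget:

      wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/online_serving/run_cluster.sh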

  9. Download the DeepSeek-R1 or DeepSeek-V3 model file.

    1. Create a script for downloading a model.
      vim download_models.py

      Write the following content into the script:

      from modelscope import snapshot_download
      model_dir = snapshot_download('deepseek-ai/DeepSeek-R1', cache_dir='/root', revision='master')

      The model name DeepSeek-R1 is used as an example. You can change it to DeepSeek-V3. The local directory for storing the model is /root. You can change it as needed.

    2. Download a model.
      python3 download_models.py

      The total size of the model is 642 GB. The download may take 24 hours or longer, depending on the EIP bandwidth.
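
      Before starting the download, confirm that the target directory has enough free space (more than 642 GB for the model files alone):

      df -h /root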

  10. Start the head node and all worker nodes of the Ray cluster.

    1. Start the head node of the Ray cluster.
      bash run_cluster.sh ${image-name} ${IP-address-of-the-head-node} --head ${model-directory} -e VLLM_HOST_IP=${IP-address-of-the-head-node} -e ${communications-library-environment-variable}

      Example:

      bash run_cluster.sh swr.cn-north-4.myhuaweicloud.com/hgcs/vllm0.7.3-pt312-ray2.43-cuda12.2 192.168.200.249 --head /root/deepseek-ai/DeepSeek-R1 -e VLLM_HOST_IP=192.168.200.249 -e GLOO_SOCKET_IFNAME=eth0 &
    2. Wait until the container is started, then access it.
      docker exec -it ${container-name} /bin/bash
    3. Check the Ray cluster status.
      ray status
    4. Start all worker nodes in the Ray cluster.
      bash run_cluster.sh ${image-name} ${IP-address-of-the-head-node} --worker ${model-directory} -e VLLM_HOST_IP=${IP-address-of-the-current-node} -e ${communications-library-environment-variable}

      Example:

      bash run_cluster.sh swr.cn-north-4.myhuaweicloud.com/hgcs/vllm0.7.3-pt312-ray2.43-cuda12.2 192.168.200.249 --worker /root/deepseek-ai/DeepSeek-R1 -e VLLM_HOST_IP=192.168.200.211 -e GLOO_SOCKET_IFNAME=eth0 &
    5. Wait until the container is started, then access it.
      docker exec -it ${container-name} /bin/bash
    6. Check the Ray cluster status.
      ray status

      Once all worker nodes have joined, the output should list every node and the total number of GPUs in the cluster (for example, 16 GPUs across two 8-GPU nodes).

  11. In the container on the head node, start the vLLM server to run the large model.

    1. Access the container on the head node.
    2. Run the large model.
      vllm serve ${model-mapping-address} --served-model-name ${model-name} --tensor-parallel-size ${GPU_NUM} --gpu-memory-utilization 0.9 --max-model-len 20480 --dtype float16 --enforce-eager

      ${model-mapping-address} indicates the path inside the container that the model directory is mapped to by the run_cluster.sh script. By default, it is /root/.cache/huggingface/. If you changed the mapping path in the script, change ${model-mapping-address} accordingly.
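
      The following is an example, assuming two 8-GPU nodes (16 GPUs in total), the default mapping path, and DeepSeek-R1 as the served model name:

      vllm serve /root/.cache/huggingface --served-model-name DeepSeek-R1 --tensor-parallel-size 16 --gpu-memory-utilization 0.9 --max-model-len 20480 --dtype float16 --enforce-eager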

  12. Call a model API to test the model performance.

    1. Call an API to chat.
      curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "DeepSeek-R1",
        "messages": [{"role": "user", "content": "hello\n"}]
      }'

    2. If a streaming conversation is required, add the stream parameter.
      curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "DeepSeek-R1",
        "messages": [{"role": "user", "content": "hello\n"}],
        "stream": true
      }'

      You can also call the model API through the EIP from your local Postman client or your own service.
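
      To call the API through the EIP, first add an inbound security group rule that allows the port vLLM listens on (8000 by default), then replace localhost with the EIP. <EIP> below is a placeholder:

      curl http://<EIP>:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "DeepSeek-R1",
        "messages": [{"role": "user", "content": "hello\n"}]
      }'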