
Using Ray, Docker, and vLLM to Manually Deploy a DeepSeek-R1 or DeepSeek-V3 Model on Multi-GPU Linux ECSs

Scenarios

DeepSeek-V3 and DeepSeek-R1 are two high-performance large language models launched by DeepSeek. DeepSeek-R1 is a reasoning model designed for math, code generation, and complex logical inference; reinforcement learning (RL) is applied to improve its reasoning performance. DeepSeek-V3 focuses on general tasks such as natural language processing, knowledge Q&A, and content creation. It aims to balance high performance with low cost and suits applications such as intelligent customer service and personalized recommendation. In this section, we will learn how to use Ray, Docker, and vLLM to quickly deploy a DeepSeek-R1 or DeepSeek-V3 model.

Solution Architecture

Figure 1 Manual deployment of a DeepSeek-R1 or DeepSeek-V3 model using Ray, Docker, and vLLM (Linux)

Advantages

You will be able to manually deploy DeepSeek-R1 and DeepSeek-V3 models using Ray, Docker, and vLLM and better understand model dependencies. This will give you a chance to experience the superb inference performance of DeepSeek-R1 and DeepSeek-V3 models.

Resource Planning

Table 1 Resources and costs

| Resource | Description | Cost |
| --- | --- | --- |
| VPC | CIDR block: 192.168.0.0/16 | Free |
| VPC subnet | AZ: AZ1; CIDR block: 192.168.0.0/24 | Free |
| Security group | Inbound rule: Priority 1; Action: Allow; Type: IPv4; Protocol & Port: TCP:80; Source: 0.0.0.0/0 | Free |
| ECS | Billing mode: Yearly/Monthly; AZ: AZ1; Flavor: see Table 2; System disk: 200 GiB; Data disk: 1000 GiB; EIP: Auto assign; EIP type: Dynamic BGP; Billed by: Traffic; Bandwidth: 100 Mbit/s | Cloud servers, EVS disks, and EIPs generate costs. For billing details, see Billing Mode Overview. |

Table 2 GPU ECS flavors available for running a DeepSeek-R1 or DeepSeek-V3 model

| No. | Model Name | Minimum Flavor | GPU | Nodes |
| --- | --- | --- | --- | --- |
| 1 | DeepSeek-R1, DeepSeek-V3 | p2s.16xlarge.8 | V100 (32 GiB) × 8 | 8 |
|   |   | p2v.16xlarge.8 | V100 (16 GiB) × 8 | 16 |
|   |   | pi2.4xlarge.4 | T4 (16 GiB) × 8 | 16 |

Contact Huawei Cloud technical support to select GPU ECSs suitable for your deployment.

Implementation Procedure

  1. Create two GPU ECSs.

    1. Select a public image without a GPU driver installed, such as Huawei Cloud EulerOS 2.0 or Ubuntu 22.04. Ubuntu 22.04 is used as an example.
      Figure 2 Selecting an image
    2. Select Auto assign for EIP. The EIPs are used for downloading dependencies and calling model APIs.

  2. Install the GPU driver and CUDA.

    Install driver version 535 and CUDA 12.2. For details, see Manually Installing a Tesla Driver on a GPU-accelerated ECS.
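
    After the installation, you can verify both versions with nvidia-smi. The header of its output shows the driver version (535.x) and the highest CUDA version the driver supports (12.2):

      nvidia-smi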

  3. Install Ray.

    1. Install Python 3 and pip.
      apt-get install -y python3 python3-pip
    2. Install Ray.
      pip install -U ray
    3. Verify the installation.
      python3 -c "import ray; ray.init()"
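
      If the installation succeeds, ray.init() starts a local Ray instance without errors. You can also print the installed version as a quick check:

      python3 -c "import ray; print(ray.__version__)"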

  4. Install Docker.

    1. Update the package index and install dependencies.
      apt-get update
      apt-get install -y ca-certificates curl gnupg lsb-release
    2. Add Docker's official GPG key.
      mkdir -p /etc/apt/keyrings
      curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
    3. Set the APT source for Docker.
      echo \
      "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
      $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
    4. Install Docker Engine.
      apt-get update
      apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
    5. Configure registry mirrors for Docker Hub.
      cat <<EOF > /etc/docker/daemon.json
      {
        "registry-mirrors": [
          "https://docker.m.daocloud.io",
          "https://registry.cn-hangzhou.aliyuncs.com"
        ]
      }
      EOF
      systemctl restart docker
    6. Check whether Docker is installed successfully.
      docker --version
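
      To confirm that the registry mirrors configured above took effect, check the Docker daemon information. The Registry Mirrors field should list both mirrors:

      docker info | grep -A 2 "Registry Mirrors"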

  5. Install the NVIDIA Container Toolkit.

    Ubuntu 22.04 is used as an example. For details about how to install the NVIDIA Container Toolkit on other OSs, see Installing the NVIDIA Container Toolkit.

    1. Add NVIDIA's official GPG key and APT repository.
      curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
      sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
      sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    2. (Optional) Enable the experimental packages in the repository list.
      sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
    3. Update the package index and install the NVIDIA Container Toolkit.
      apt update
      apt install -y nvidia-container-toolkit
    4. Configure Docker to use the NVIDIA Container Toolkit.
      nvidia-ctk runtime configure --runtime=docker
      systemctl restart docker
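
      To verify that containers can access the GPUs, you can run nvidia-smi inside a minimal CUDA container. The nvidia/cuda:12.2.0-base-ubuntu22.04 image tag below is only an example; any CUDA 12.2 base image works:

      docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi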

  6. Install dependencies, such as modelscope.

    1. Update pip.
      python3 -m pip install --upgrade pip -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
    2. Install modelscope.
      pip install modelscope -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

      ModelScope is an open-source model platform based in China, so downloading from it is fast if you are in China. If you are outside China, download the model from Hugging Face instead.
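
      If you download from Hugging Face instead, one option is the huggingface_hub CLI. The commands below are an example; adjust the model name and target directory as needed:

      pip install -U huggingface_hub
      huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir /root/deepseek-ai/DeepSeek-R1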

  7. Download a Docker image.

    You can download the latest container image provided by vLLM or the image provided by the Huawei Cloud heterogeneous computing team.

    • Download the latest image from the vLLM official website.
      docker pull vllm/vllm-openai:latest
    • Download an image provided by the Huawei Cloud heterogeneous computing team.
      docker pull swr.cn-north-4.myhuaweicloud.com/hgcs/vllm0.7.3-pt312-ray2.43-cuda12.2:latest

      This image ships vLLM v0.7.3 and will continue to be updated.

  8. Download the run_cluster.sh script.

    Download the run_cluster.sh file from https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh to start the Ray cluster.
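
    For example, you can fetch the raw file with wget:

      wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/online_serving/run_cluster.sh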

  9. Download the DeepSeek-R1 or DeepSeek-V3 model file.

    1. Create a script for downloading a model.
      vim download_models.py

      Write the following content into the script:

      from modelscope import snapshot_download
      model_dir = snapshot_download('deepseek-ai/DeepSeek-R1', cache_dir='/root', revision='master')

      The model name DeepSeek-R1 is used as an example. You can change it to DeepSeek-V3. The local directory for storing the model is /root. You can change it as needed.

    2. Download a model.
      python3 download_models.py

      The total size of the model is 642 GB. The download may take 24 hours or longer, depending on the EIP bandwidth.
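
      Before starting the download, confirm that the target directory has enough free space (more than 642 GB for the model files alone):

      df -h /root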

  10. Start the head node and all worker nodes of the Ray cluster.

    1. Start the head node of the Ray cluster.
      bash run_cluster.sh ${image-name} ${IP-address-of-the-head-node} --head ${model-directory} -e VLLM_HOST_IP=${IP-address-of-the-head-node} -e ${communications-library-environment-variable}

      Example:

      bash run_cluster.sh swr.cn-north-4.myhuaweicloud.com/hgcs/vllm0.7.3-pt312-ray2.43-cuda12.2 192.168.200.249 --head /root/deepseek-ai/DeepSeek-R1 -e VLLM_HOST_IP=192.168.200.249 -e GLOO_SOCKET_IFNAME=eth0 &
    2. Wait until the container is started, then access it.
      docker exec -it ${container-name} /bin/bash
    3. Check the Ray cluster status.
      ray status
    4. Start all worker nodes in the Ray cluster.
      bash run_cluster.sh ${image-name} ${IP-address-of-the-head-node} --worker ${model-directory} -e VLLM_HOST_IP=${IP-address-of-the-current-node} -e ${communications-library-environment-variable}

      Example:

      bash run_cluster.sh swr.cn-north-4.myhuaweicloud.com/hgcs/vllm0.7.3-pt312-ray2.43-cuda12.2 192.168.200.249 --worker /root/deepseek-ai/DeepSeek-R1 -e VLLM_HOST_IP=192.168.200.211 -e GLOO_SOCKET_IFNAME=eth0 &
    5. Wait until the container is started, then access it.
      docker exec -it ${container-name} /bin/bash
    6. Check the Ray cluster status.
      ray status

      Once all worker nodes have joined, the output should list every node and the total number of GPUs in the cluster (for example, 16 GPUs across two 8-GPU nodes).

  11. In the container on the head node, start the vLLM server to run the large model.

    1. Access the container on the head node.
    2. Run the large model.
      vllm serve ${model-mapping-address} --served-model-name ${model-name} --tensor-parallel-size ${GPU_NUM} --gpu-memory-utilization 0.9 --max-model-len 20480 --dtype float16 --enforce-eager

      ${model-mapping-address} indicates the path inside the container that the model directory is mapped to by the run_cluster.sh script. By default, it is /root/.cache/huggingface/. If you changed the mapping path in the script, change ${model-mapping-address} accordingly.
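
      The following is an example, assuming two 8-GPU nodes (16 GPUs in total), the default mapping path, and DeepSeek-R1 as the served model name:

      vllm serve /root/.cache/huggingface --served-model-name DeepSeek-R1 --tensor-parallel-size 16 --gpu-memory-utilization 0.9 --max-model-len 20480 --dtype float16 --enforce-eager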

  12. Call a model API to test the model performance.

    1. Call an API to chat.
      curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "DeepSeek-R1",
        "messages": [{"role": "user", "content": "hello\n"}]
      }'

    2. If a streaming conversation is required, add the stream parameter.
      curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "DeepSeek-R1",
        "messages": [{"role": "user", "content": "hello\n"}],
        "stream": true
      }'

      You can also call the model API through the EIP from your local Postman client or your own service.
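
      To call the API through the EIP, first add an inbound security group rule that allows the port vLLM listens on (8000 by default), then replace localhost with the EIP. <EIP> below is a placeholder:

      curl http://<EIP>:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "DeepSeek-R1",
        "messages": [{"role": "user", "content": "hello\n"}]
      }'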