
Using SGLang and Docker to Manually Deploy a DeepSeek-R1 or DeepSeek-V3 Model on Multi-GPU Linux ECSs

Scenarios

DeepSeek-V3 and DeepSeek-R1 are two high-performance large language models launched by DeepSeek. DeepSeek-R1 is a reasoning model designed for math, code generation, and complex logical inference, and reinforcement learning (RL) is applied to it to improve inference performance. DeepSeek-V3 focuses on general tasks such as natural language processing, knowledge Q&A, and content creation. It aims to balance high performance with low cost and is suitable for applications such as intelligent customer service and personalized recommendation. This section describes how to use SGLang to quickly deploy a DeepSeek-R1 or DeepSeek-V3 model.

Solution Architecture

This solution requires at least one primary node and one secondary node.

Figure 1 Manual deployment of a DeepSeek-R1 or DeepSeek-V3 model using SGLang and Docker (Linux)

Advantages

You will be able to manually deploy DeepSeek-R1 and DeepSeek-V3 models using SGLang and Docker, better understand the model dependencies, and experience the inference performance of these models first-hand. The SGLang framework has a built-in data parallel router that automatically connects to all workers after they are started, so you can use SGLang without an external Ray cluster framework.

Resource Planning

Table 1 Resources and costs

Resource: VPC
Description: VPC CIDR block: 192.168.0.0/16
Cost: Free

Resource: VPC subnet
Description:
  • AZ: AZ1
  • CIDR block: 192.168.0.0/24
Cost: Free

Resource: Security group
Description: Inbound rule:
  • Priority: 1
  • Action: Allow
  • Type: IPv4
  • Protocol & Port: TCP:80
  • Source: 0.0.0.0/0
Cost: Free

Resource: ECS
Description:
  • Billing mode: Yearly/Monthly
  • AZ: AZ1
  • Specifications: See Table 2.
  • System disk: 200 GiB
  • Data disk: 1,000 GiB
  • EIP: Auto assign
  • EIP type: Dynamic BGP
  • Billed by: Traffic
  • Bandwidth: 100 Mbit/s
Cost: The following resources generate costs:
  • Cloud servers
  • EVS disks
  • EIP
For billing details, see Billing Mode Overview.

Table 2 GPU ECS flavors available for running DeepSeek-R1 or DeepSeek-V3 models

Model Name: DeepSeek-R1, DeepSeek-V3
Minimum Flavor and GPU:
  • p2s.16xlarge.8: V100 (32 GiB) × 8 GPUs × 8 nodes
  • p2v.16xlarge.8: V100 (16 GiB) × 8 GPUs × 16 nodes
  • pi2.4xlarge.4: T4 (16 GiB) × 8 GPUs × 16 nodes

Manually Deploying a DeepSeek-R1 or DeepSeek-V3 Model Using SGLang and Docker on Multi-GPU Linux ECSs

To use SGLang and Docker to manually deploy a DeepSeek-R1 or DeepSeek-V3 model on multi-GPU Linux ECSs, perform the following steps:

  1. Create a GPU ECS.
  2. Check the GPU driver and CUDA versions.
  3. Install Docker.
  4. Install the NVIDIA Container Toolkit.
  5. Install dependencies, such as modelscope.
  6. Download a Docker image.
  7. Download the DeepSeek-R1 or DeepSeek-V3 model file.
  8. Start the primary and all secondary nodes of SGLang.
  9. Call a model API to test the model performance.

Procedure

  1. Create a GPU ECS.

    1. Select the public image Huawei Cloud EulerOS 2.0 or Ubuntu 22.04 without a driver installed. Ubuntu 22.04 is used as an example in this section.
      Figure 2 Selecting an image
    2. Select Auto assign for EIP. An EIP will be assigned for downloading dependencies and calling model APIs.

  2. Check the GPU driver and CUDA versions.

    Install driver version 535 and CUDA 12.2. For details, see Manually Installing a Tesla Driver on a GPU-accelerated ECS.
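
    After the installation, you can run the following checks. The exact version strings depend on your environment.

      nvidia-smi    # The output header shows the driver version (for example, 535.xx) and the CUDA version supported by the driver (12.2).
      nvcc --version    # Shows the CUDA toolkit version if the toolkit is installed.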

  3. Install Docker.

    1. Update the package index and install dependencies.
      apt-get update
      apt-get install -y ca-certificates curl gnupg lsb-release
    2. Add Docker's official GPG key.
      mkdir -p /etc/apt/keyrings
      curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
    3. Set the APT source for Docker.
      echo \
      "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
        $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
    4. Install Docker Engine.
      apt update
      apt install docker-ce docker-ce-cli containerd.io docker-compose-plugin
    5. Configure registry mirrors for pulling images from Docker Hub.
      cat <<EOF > /etc/docker/daemon.json
      {
        "registry-mirrors": [
          "https://docker.m.daocloud.io",      
          "https://registry.cn-hangzhou.aliyuncs.com" 
        ]
      }
      EOF
      systemctl restart docker 
    6. Check whether Docker is installed successfully.
      docker --version
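
      If you configured registry mirrors in /etc/docker/daemon.json, you can also confirm that Docker has loaded them. This is an optional check.

      docker info | grep -A 3 "Registry Mirrors"    # Lists the configured registry mirrors.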

  4. Install the NVIDIA Container Toolkit.

    1. Add NVIDIA's official GPG key.
      curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
        && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
          sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
          sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    2. Set the APT source for the NVIDIA Container Toolkit.
      sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
    3. Update the package index and install the NVIDIA Container Toolkit.
      apt update
      apt install -y nvidia-container-toolkit
    4. Configure Docker to use the NVIDIA Container Toolkit.
      nvidia-ctk runtime configure --runtime=docker
      systemctl restart docker
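
      To verify that containers can access the GPUs, you can run nvidia-smi inside a CUDA container. The image tag below is only an example; use any CUDA base image available to you.

      docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi    # The GPU list should be displayed inside the container.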

  5. Install dependencies, such as modelscope.

    1. Upgrade pip.
      python -m pip install --upgrade pip -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
    2. Install modelscope.
      pip install modelscope -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

      ModelScope is an open-source model community in China. Downloading models from ModelScope is fast if you are in China. If you are outside China, download the models from Hugging Face instead, as shown in the example below.
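
      For example, outside China you can use the huggingface_hub command-line tool instead of modelscope. This is only a sketch; the target directory is an example and can be changed as needed.

      pip install -U huggingface_hub
      huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir /root/DeepSeek-R1    # Download the model from Hugging Face to /root/DeepSeek-R1.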

  6. Download a Docker image.

    You can download the latest container image provided by SGLang or the image created by the Huawei Cloud heterogeneous computing team.

    1. Download the latest image from the SGLang official website.
      docker pull lmsysorg/sglang:latest
    2. Download an image created by the Huawei Cloud heterogeneous computing team.
      docker pull swr.cn-north-9.myhuaweicloud.com/hgcs/lmsysorg/sglang:latest
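
      After the pull completes, you can confirm that the image is available locally:

      docker images | grep sglang    # The sglang image should be listed with its tag and size.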

  7. Download the DeepSeek-R1 or DeepSeek-V3 model file.

    1. Create a script for downloading a model.
      vim download_models.py

      Add the following content to the script:

      from modelscope import snapshot_download
      model_dir = snapshot_download('deepseek-ai/DeepSeek-R1', cache_dir='/root', revision='master')

      The model name DeepSeek-R1 is used as an example. You can change it to DeepSeek-V3. The local directory for storing the model is /root. You can change it as needed.

    2. Download the model.
      python3 download_models.py

      The total size of the model is 642 GB. The download may take 24 hours or longer, depending on the EIP bandwidth.
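
      You can check the progress and final size of the download. The path below assumes the cache_dir used in the script; the exact subdirectory may differ depending on the modelscope version.

      du -sh /root/deepseek-ai/DeepSeek-R1    # About 642 GB when complete; adjust the path to where modelscope stored the model.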

  8. Start the primary and all secondary nodes of SGLang.

    Case 1: A RoCE network is available.

    1. Start the primary node of SGLang.
      The inline comments explain the parameters. Delete them before running the command on node 0.
      docker run --gpus all \ # Use all GPUs.
             --shm-size 512g \ # Size of the shared memory between the server and container. Set it based on the server configuration.
             -e NCCL_IB_GID_INDEX=3 \ # Use index 3, which corresponds to RoCE v2 routing.
             -e NCCL_IB_HCA='^=mlx5_bond_0' \ # Exclude the bond NIC (mlx5_bond_0) so that the remaining RoCE NICs are used for RDMA communication. Adjust based on the NICs on your server.
             -e NCCL_SOCKET_IFNAME=bond0  \  # Specify the TCP/IP NIC (for example, bond0) for NCCL, which should be separated from the IB NIC to avoid interference.
             -e GLOO_SOCKET_IFNAME=bond0  \ # Specify the TCP/IP NIC (for example, bond0) for GLOO, which should be separated from the IB NIC to avoid interference.
             -v /mnt/paas/models:/root/.cache/huggingface \ # Mount point of the model path.
             --ipc=host --network=host --privileged \ # Share host IPC, use the host network, and run in privileged mode.
             swr.cn-north-9.myhuaweicloud.com/hgcs/lmsysorg/sglang:latest \ # Container image name.
             python3 -m sglang.launch_server --model /root/.cache/huggingface/DeepSeek-V3 \ # Specify the model path.
             --served-model-name deepseek-v3 \ # Specify the model name.
             --tp 16 \ # Two nodes with eight GPUs each; set tensor parallelism (TP) to 16.
             --nnodes 2 \ # The total number of nodes.
             --node-rank 0 \ # The rank of the primary node is 0. Assign ranks to the other nodes incrementally in order.
             --host 0.0.0.0 --port 8000 \ # Set the IP address and port of the current node.
             --dist-init-addr 192.168.1.143:30000 \ # Set the IP address and port of the primary node.
             --trust-remote-code
    2. Start the secondary node of SGLang.
      docker run --gpus all \
      --shm-size 512g \
      -e NCCL_IB_GID_INDEX=3 \
      -e NCCL_IB_HCA='^=mlx5_bond_0' \
      -e NCCL_SOCKET_IFNAME=bond0 \
      -e GLOO_SOCKET_IFNAME=bond0 \
      -v /mnt/paas/models:/root/.cache/huggingface \
      --ipc=host --network=host --privileged \
      swr.cn-north-9.myhuaweicloud.com/hgcs/lmsysorg/sglang:latest \
      python3 -m sglang.launch_server \
      --model /root/.cache/huggingface/DeepSeek-V3 \
      --served-model-name deepseek-v3 \
      --tp 16 \
      --nnodes 2 \
      --node-rank 1 \
      --host 0.0.0.0 --port 8000 \
      --dist-init-addr 192.168.1.143:30000 \
      --trust-remote-code
    3. Add optimization parameters. They are appended to the launch command on each node, as shown in the sketch after this list.
      • Explicitly specify FP8 quantization.

        --quantization fp8

        --kv-cache-dtype fp8_e5m2

      • Enable MLA optimization.

        --enable-torch-compile # Enable PyTorch just-in-time (JIT) compilation optimization.

        --enable-flashinfer-mla # Enable FlashInfer MLA attention acceleration.
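
      For example, with FP8 and MLA optimization enabled, the launch command on the primary node ends as follows. This is only a sketch; append the same flags on each node.

        python3 -m sglang.launch_server --model /root/.cache/huggingface/DeepSeek-V3 \
        --served-model-name deepseek-v3 \
        --tp 16 --nnodes 2 --node-rank 0 \
        --host 0.0.0.0 --port 8000 \
        --dist-init-addr 192.168.1.143:30000 \
        --quantization fp8 --kv-cache-dtype fp8_e5m2 \
        --enable-torch-compile --enable-flashinfer-mla \
        --trust-remote-code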

    Case 2: No RoCE network is available.

    1. Start the primary node of SGLang.

      The inline comments explain the parameters. Delete them before running the command on node 0.

      docker run --gpus all \ # Use all GPUs.
             --shm-size 512g \  # Size of the shared memory between the server and container. Set this parameter based on the server configuration.
             -e GLOO_SOCKET_IFNAME=eth0 \
             -v /mnt/paas/models:/root/.cache/huggingface \ # Mount point of the model path.
             --ipc=host --network=host   \ # Use the host network.
             swr.cn-north-9.myhuaweicloud.com/hgcs/lmsysorg/sglang:latest \ # Container image name.
             python3 -m sglang.launch_server --model /root/.cache/huggingface/DeepSeek-V3 \ # Specify the model path.
             --served-model-name deepseek-v3 \ # Specify the model name.
             --tp 16 \ # Two nodes with eight GPUs each; set tensor parallelism (TP) to 16.
             --nnodes 2 \ # The total number of nodes.
             --node-rank 0 \ # The rank of the primary node is 0. Assign ranks to the other nodes incrementally in order.
             --host 0.0.0.0 --port 8000 \ # Set the IP address and port of the current node.
             --dist-init-addr 192.168.1.143:30000 \ # Set the IP address and port of the primary node.
             --trust-remote-code
    2. Start the secondary node of SGLang.
      docker run --gpus all \ 
      --shm-size 512g \
      -e GLOO_SOCKET_IFNAME=eth0 \
      -v /mnt/paas/models:/root/.cache/huggingface \
      --ipc=host --network=host \
      swr.cn-north-9.myhuaweicloud.com/hgcs/lmsysorg/sglang:latest \
      python3 -m sglang.launch_server \
      --model /root/.cache/huggingface/DeepSeek-V3 \
      --served-model-name deepseek-v3 \
      --tp 16 \
      --nnodes 2 \
      --node-rank 1 \
      --host 0.0.0.0 --port 8000 \
      --dist-init-addr 192.168.1.143:30000 \
      --trust-remote-code
    3. Add optimization parameters in the same way as in Case 1.
      • Explicitly specify FP8 quantization.

        --quantization fp8

        --kv-cache-dtype fp8_e5m2

      • Enable MLA optimization.

        --enable-torch-compile # Enable PyTorch JIT compilation optimization.

        --enable-flashinfer-mla # Enable FlashInfer MLA attention acceleration.
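
    Before testing the model, you can check that the server has finished loading. Recent SGLang versions expose the following endpoints; the paths may differ in older releases.

      curl http://localhost:8000/health    # Returns HTTP 200 once the server is ready.
      curl http://localhost:8000/get_model_info    # Returns the model path and the served model name.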

  9. Call a model API to test the model performance.

    1. Call an API to chat. The model value must match the name set by --served-model-name when the server was started (deepseek-v3 in the preceding examples); DeepSeek-R1 is used here only as an example.
      curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "DeepSeek-R1",
        "messages": [{"role": "user", "content": "hello\n"}]
      }'

    2. If streaming responses are required, add the stream parameter.
      curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "DeepSeek-R1",
        "messages": [{"role": "user", "content": "hello\n"}],
        "stream": true
      }'

    You can also call the model API through the EIP from a local Postman client or your own service, as shown in the example below. Ensure that the security group allows inbound traffic on the port used by SGLang (8000 in this example).
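
    For example, from a local machine you can first list the served model name and then send a chat request through the EIP. This is only a sketch; replace <EIP> with the actual EIP of the primary node.

      curl http://<EIP>:8000/v1/models    # Lists the served model name, for example, deepseek-v3.
      curl http://<EIP>:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "deepseek-v3",
        "messages": [{"role": "user", "content": "hello\n"}]
      }'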

FAQs

Symptom: When the primary SGLang node is started, the error message "SGLang only supports sm75 and above" is displayed.

Cause: The GPU compute capability is insufficient. The compute capability must be at least SM7.5.

Solution: Replace the GPU with one whose compute capability is SM7.5 or higher, for example, T4. You can check the compute capability as shown below.
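
You can check the compute capability of the current GPU before starting SGLang. The query field below requires a reasonably recent driver; on older drivers, look up the compute capability of your GPU model instead.

nvidia-smi --query-gpu=name,compute_cap --format=csv    # For example, T4 reports 7.5.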