
Deploying a Distilled DeepSeek Model with vLLM on a Single Server (Linux)

Scenarios

Distillation is a technique that transfers the knowledge of a large pre-trained model into a smaller model. It is suitable for scenarios that require smaller, more efficient models without a significant loss of accuracy. This section describes how to use vLLM to quickly deploy a distilled DeepSeek model.

Solution Architecture

Figure 1 Deploying distilled DeepSeek models with vLLM (Linux)

Advantages

vLLM is used to deploy distilled DeepSeek models from scratch in a conda environment, which gives you a clear understanding of the model runtime dependencies. With a small amount of resources, vLLM can quickly and efficiently serve models for production and provides more refined performance and cost control.

Resource Planning

Table 1 Resources and costs

VPC
  • Description: VPC CIDR block: 192.168.0.0/16
  • Cost: Free

VPC subnet
  • Description: AZ: AZ1; CIDR block: 192.168.0.0/24
  • Cost: Free

Security group
  • Description: Inbound rule (Priority: 1; Action: Allow; Type: IPv4; Protocol & Port: TCP:80; Source: 0.0.0.0/0)
  • Cost: Free

ECS
  • Description: Billing mode: Yearly/Monthly; AZ: AZ1; Specifications: see Table 2; System disk: 200 GiB; EIP: Auto assign; EIP type: Dynamic BGP; Billed by: Traffic; Bandwidth: 100 Mbit/s
  • Cost: The cloud servers, EVS disks, and EIP generate costs. For billing details, see Billing Mode Overview.

Table 2 GPU ECS flavors available for running distilled DeepSeek models

  • DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B
    Minimum flavor: p2s.2xlarge.8 (V100 32 GiB × 1), p2v.4xlarge.8 (V100 16 GiB × 2), pi2.4xlarge.4 (T4 16 GiB × 2), or g6.18xlarge.7 (T4 16 GiB × 2)
  • DeepSeek-R1-Distill-Qwen-14B
    Minimum flavor: p2s.4xlarge.8 (V100 32 GiB × 2), p2v.8xlarge.8 (V100 16 GiB × 4), or pi2.8xlarge.4 (T4 16 GiB × 4)
  • DeepSeek-R1-Distill-Qwen-32B
    Minimum flavor: p2s.8xlarge.8 (V100 32 GiB × 4) or p2v.16xlarge.8 (V100 16 GiB × 8)
  • DeepSeek-R1-Distill-Llama-70B
    Minimum flavor: p2s.16xlarge.8 (V100 32 GiB × 8)

Contact Huawei Cloud technical support to select GPU ECSs suitable for your deployment.

Manually Deploying a Distilled DeepSeek Model with vLLM

To manually deploy a distilled DeepSeek model on a Linux ECS with vLLM, do as follows:

  1. Create a GPU ECS.
  2. Check the GPU driver and CUDA versions.
  3. Create a conda virtual environment.
  4. Install dependencies, such as vLLM.
  5. Download the large model file.
  6. Start the vLLM API server to run the large model.
  7. Call a model API to test the model performance.

Procedure

  1. Create a GPU ECS.

    1. Select the public image Huawei Cloud EulerOS 2.0 or Ubuntu 22.04 without a driver installed.
      Figure 2 Selecting an image
    2. Select Auto assign for EIP. The EIP is required for downloading dependencies and calling model APIs.

  2. Check the GPU driver and CUDA versions.

    Install driver version 535 and CUDA 12.2. For details, see Manually Installing a Tesla Driver on a GPU-accelerated ECS.

  3. Create a conda virtual environment.

    1. Download the miniconda installation package.
      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    2. Install miniconda.
      bash Miniconda3-latest-Linux-x86_64.sh
    3. Add the conda environment variable to the startup file.
      echo 'export PATH="$HOME/miniconda3/bin:$PATH"' >> ~/.bashrc 
      source ~/.bashrc
    4. Create a Python 3.10 virtual environment.
      conda create -n vllm-ds python=3.10
      conda activate vllm-ds
      conda install numpy
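
    Optionally, you can verify that the new environment uses Python 3.10 and that numpy is importable:
      python -c "import sys, numpy; print(sys.version); print(numpy.__version__)"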

  4. Install dependencies, such as vLLM.

    1. Update pip.
      python -m pip install --upgrade pip -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
    2. Install vLLM.
      pip install vllm -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

      You can run the vllm --version command to view the installed vLLM version. A quick check that the installed stack can see the GPUs is sketched at the end of this step.

    3. Install modelscope.
      pip install modelscope -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

      ModelScope is an open-source model community based in China, so model downloads from it are fast within China. If you are outside China, download the model from Hugging Face instead (a sketch is provided at the end of Step 5).
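
    Installing vLLM also installs PyTorch. Before downloading any model, you can verify that the installed stack sees the GPUs and a compatible CUDA version. The following is a minimal sketch (the file name check_gpu.py is only an example); run it inside the vllm-ds environment.
      vim check_gpu.py

      Add the following content into the script:

      # Verify that PyTorch (installed together with vLLM) can see the GPUs.
      import torch

      # True only if the driver, the CUDA runtime, and the PyTorch build match.
      print("CUDA available:", torch.cuda.is_available())
      # CUDA version the installed PyTorch wheels were built against.
      print("CUDA version used by PyTorch:", torch.version.cuda)
      # One line per visible GPU, for example "Tesla V100-SXM2-32GB" or "Tesla T4".
      for i in range(torch.cuda.device_count()):
          print(f"GPU {i}:", torch.cuda.get_device_name(i))

      Run the script:
      python3 check_gpu.py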

  5. Download the large model file.

    1. Create a script for downloading a model.
      vim download_models.py

      Add the following content into the script:

      from modelscope import snapshot_download
      model_dir = snapshot_download('deepseek-ai/DeepSeek-R1-Distill-Qwen-7B', cache_dir='/root', revision='master')

      The model name DeepSeek-R1-Distill-Qwen-7B is used as an example. You can replace it with the required model by referring to Table 2. The local path for storing the model is /root. You can change it as needed.

    2. Download the model.
      python3 download_models.py

      Wait until the model is downloaded.
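
    If you are outside China, you can download the same model from Hugging Face instead of ModelScope. The following is a minimal sketch that assumes the huggingface_hub library and an example file name download_models_hf.py; the repository ID and the local path match the ModelScope example above.
      pip install huggingface_hub

      vim download_models_hf.py

      Add the following content into the script:

      from huggingface_hub import snapshot_download
      # Download DeepSeek-R1-Distill-Qwen-7B from Hugging Face into the same local path
      # that the vLLM commands below expect.
      model_dir = snapshot_download(
          repo_id='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
          local_dir='/root/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B'
      )

      Download the model:
      python3 download_models_hf.py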

  6. Start the vLLM API server to run the large model.

    Run the foundation model.
    python -m vllm.entrypoints.openai.api_server --model /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --served-model-name DeepSeek-R1-Distill-Qwen-7B --max-model-len=2048 &
    1. If the ECS uses multiple GPUs, add -tp ${number-of-GPUs}. For example, if there are two GPUs, add -tp 2.
    2. V100 or T4 GPUs cannot use BF16 precision and can only use float16. The --dtype float16 parameter must be added.
      python -m vllm.entrypoints.openai.api_server --model /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --served-model-name DeepSeek-R1-Distill-Qwen-7B --max-model-len=2048 --dtype float16 -tp 2 &
    3. If GPU memory is insufficient when the model is loaded, add --enforce-eager to run in eager mode and disable CUDA graphs, which reduces GPU memory usage.
      python -m vllm.entrypoints.openai.api_server --model /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --served-model-name DeepSeek-R1-Distill-Qwen-7B --max-model-len=2048 --dtype float16 --enforce-eager &
    4. If the GPU memory is still insufficient, replace the ECS with another one that has a larger flavor based on Table 2.
    5. After the server starts, it listens on port 8000 by default. You can use the script below to check when it is ready.
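
    Recent vLLM versions expose a /health endpoint on the OpenAI-compatible server. The following is a minimal sketch (the file name wait_for_vllm.py is only an example) that polls this endpoint until the server is ready, assuming the default port 8000.
      vim wait_for_vllm.py

      Add the following content into the script:

      # Poll the vLLM server until it is ready to accept requests.
      import time
      import urllib.request

      URL = "http://localhost:8000/health"  # default vLLM listening port

      for _ in range(120):  # wait up to about 10 minutes
          try:
              with urllib.request.urlopen(URL, timeout=5) as resp:
                  if resp.status == 200:
                      print("vLLM server is ready")
                      break
          except OSError:
              pass  # server not up yet; keep waiting
          time.sleep(5)
      else:
          print("vLLM server did not become ready in time")

      Run the script:
      python3 wait_for_vllm.py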

  7. Call a model API to test the model performance.

    1. Call an API to view the running model.
      curl http://localhost:8000/v1/models

    2. Call an API to chat.
      curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "DeepSeek-R1-Distill-Qwen-7B",
        "messages": [{"role": "user", "content": "hello\n"}]
      }'

    The model is now deployed and verified. You can also call the model API through the ECS's EIP from Postman on your local PC or from your own service, as shown in the sketch below.
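
    Any OpenAI-compatible client can talk to the server. The following is a minimal sketch using the openai Python package (an extra dependency installed with pip install openai); replace <EIP> with the EIP bound to the ECS and make sure the security group allows inbound traffic on the listening port (8000 by default).
      pip install openai

      # Call the deployed model through the OpenAI-compatible API.
      from openai import OpenAI

      # Replace <EIP> with the EIP of the ECS. Any non-empty string works as the API key
      # because the server was started without authentication.
      client = OpenAI(base_url="http://<EIP>:8000/v1", api_key="EMPTY")

      response = client.chat.completions.create(
          model="DeepSeek-R1-Distill-Qwen-7B",  # must match --served-model-name
          messages=[{"role": "user", "content": "hello\n"}],
      )
      print(response.choices[0].message.content)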