
Deploying a Quantized DeepSeek Model with Ollama on a Single Server (Linux)

Scenarios

Quantization reduces the precision of model parameters, for example by converting 32-bit floating-point numbers into 8-bit or 4-bit integers. This compresses the model, cuts the GPU memory and compute required to run it, improves efficiency, and lowers energy consumption, at the cost of some accuracy. This section describes how to quickly deploy a quantized DeepSeek model with Ollama.
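
As a rough, illustrative calculation (actual memory use is higher because of runtime overhead and the KV cache), a 7B-parameter model stored as 32-bit floats needs about 7 billion × 4 bytes ≈ 28 GB for its weights alone, while the same weights quantized to 4 bits need only about 3.5 GB.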

Solution Architecture

Figure 1 Deploying quantized DeepSeek models with Ollama (Linux)

Advantages

Deploying distilled DeepSeek models from scratch with Ollama gives you a thorough understanding of the model's runtime dependencies. With only a small amount of resources, you can quickly and efficiently connect the model to production services and keep fine-grained control over performance and cost.

Resource Planning

Table 1 Resources and costs

| Resource | Description | Cost |
| --- | --- | --- |
| VPC | CIDR block: 192.168.0.0/16 | Free |
| VPC subnet | AZ: AZ1; CIDR block: 192.168.0.0/24 | Free |
| Security group | Inbound rule: Priority 1, Action Allow, Type IPv4, Protocol & Port TCP:80, Source 0.0.0.0/0 | Free |
| ECS | Billing mode: Yearly/Monthly; AZ: AZ1; Specifications: see Table 2; System disk: 200 GiB; EIP: Auto assign; EIP type: Dynamic BGP; Billed by: Traffic; Bandwidth: 100 Mbit/s | The cloud server, EVS disks, and EIP generate costs. For billing details, see Billing Mode Overview. |

Table 2 GPU ECS flavors available for running distillation models

| No. | Model Name | Minimum Flavor | GPU |
| --- | --- | --- | --- |
| 0 | deepseek-r1:7b, deepseek-r1:8b | p2s.2xlarge.8 | V100 (32 GiB) × 1 |
|   |   | p2v.4xlarge.8 | V100 (16 GiB) × 1 |
|   |   | pi2.4xlarge.4 | T4 (16 GiB) × 1 |
|   |   | g6.18xlarge.7 | T4 (16 GiB) × 1 |
| 1 | deepseek-r1:14b | p2s.4xlarge.8 | V100 (32 GiB) × 1 |
|   |   | p2v.8xlarge.8 | V100 (16 GiB) × 1 |
|   |   | pi2.8xlarge.4 | T4 (16 GiB) × 1 |
| 2 | deepseek-r1:32b | p2s.8xlarge.8 | V100 (32 GiB) × 1 |
|   |   | p2v.16xlarge.8 | V100 (16 GiB) × 2 |
| 3 | deepseek-r1:70b | p2s.16xlarge.8 | V100 (32 GiB) × 2 |

Contact Huawei Cloud technical support to select GPU ECSs suitable for your deployment.

Deploying a DeepSeek Distillation Model with Ollama

To manually deploy a quantized DeepSeek model on a Linux ECS with Ollama, perform the following operations:

  1. Create a GPU ECS.
  2. Check the GPU driver and CUDA versions.
  3. Install Ollama.
  4. Download the large model file.
  5. Run the large model using Ollama.
  6. Call a model API to test the model performance.

Implementation Procedure

  1. Create a GPU ECS.

    1. Select a public image, Huawei Cloud EulerOS 2.0 or Ubuntu 22.04, that does not have a GPU driver pre-installed.
      Figure 2 Selecting an image
    2. Select Auto assign for EIP. An EIP will be assigned for downloading dependencies and calling model APIs.

  2. Check the GPU driver and CUDA versions.

    Install GPU driver 535 and CUDA 12.2. For details, see Manually Installing a Tesla Driver on a GPU-accelerated ECS.
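
    After the installation, you can verify the versions with nvidia-smi, the standard NVIDIA driver utility. Its output header reports the installed driver version and the CUDA version the driver supports.

    # Check the GPU driver version and the CUDA version it supports
    nvidia-smi
    # The output header should show "Driver Version: 535.xx" and "CUDA Version: 12.2".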

  3. Install Ollama.

    1. Download the Ollama installation script.
      curl -fsSL https://ollama.com/install.sh -o ollama_install.sh
      chmod +x ollama_install.sh
    2. Install Ollama. The sed command below points the installation script at the Ollama v0.5.7 release on GitHub before running it.
      sed -i 's|https://ollama.com/download/|https://github.com/ollama/ollama/releases/download/v0.5.7/|' ollama_install.sh
      sh ollama_install.sh
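    3. (Optional) Verify the installation. The installation script normally registers Ollama as a systemd service, so you can check the client version and the service status with standard commands:
      # Print the installed Ollama version
      ollama --version
      # Check whether the ollama systemd service is running
      systemctl status ollama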

  4. Download the large model file.

    Download the model required for your deployment. Choose the one that matches your GPU flavor in Table 2.

    ollama pull deepseek-r1:7b
    ollama pull deepseek-r1:14b
    ollama pull deepseek-r1:32b
    ollama pull deepseek-r1:70b
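
    To confirm that a download completed, you can list the models stored locally with the standard ollama list command:

    # List locally available models and their sizes
    ollama list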

  5. Run the large model using Ollama.

    Run the large model.

    ollama run deepseek-r1:7b
    ollama run deepseek-r1:14b
    ollama run deepseek-r1:32b
    ollama run deepseek-r1:70b
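
    ollama run starts an interactive chat prompt in the terminal and loads the model into GPU memory. To see which models are currently loaded, you can use the standard ollama ps command:

    # Show the models that are currently loaded and the memory they occupy
    ollama ps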

  6. Call a model API to test the model. Ollama provides OpenAI-compatible APIs.

    1. Call an API to list the available models.
      curl http://localhost:11434/v1/models

    2. Call an API to chat.
      curl http://localhost:11434/api/chat -d '{"model": "deepseek-r1:7b", "messages": [{"role": "user", "content": "hello!"}]}'

    The model is deployed and verified. You can use an EIP to call a model API for chats from your local Postman or your own service.
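
    For example, the OpenAI-compatible chat completions endpoint can be called as shown below. This is a sketch that assumes the command is run on the ECS itself; to call it from a remote client, replace localhost with the EIP, allow the port in the security group, and note that Ollama listens only on 127.0.0.1 by default, so remote access typically also requires setting OLLAMA_HOST (for example, to 0.0.0.0:11434) in the service environment.

      # OpenAI-compatible chat completion request against the local Ollama service
      curl http://localhost:11434/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "deepseek-r1:7b", "messages": [{"role": "user", "content": "hello!"}]}'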

Related Operations

  1. To run on multiple GPUs, edit the Ollama service file and set CUDA_VISIBLE_DEVICES to the IDs of the GPUs to be used (see the sample snippet after these steps).
    vim /etc/systemd/system/ollama.service

  2. Restart Ollama.
    systemctl daemon-reload
    systemctl stop ollama.service
    systemctl start ollama.service
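
A minimal sketch of the change to make in /etc/systemd/system/ollama.service, assuming GPUs 0 and 1 are to be used (Environment is a standard systemd directive and CUDA_VISIBLE_DEVICES is the standard CUDA device-selection variable; adjust the IDs to match your server):

  [Service]
  # Expose only GPUs 0 and 1 to the Ollama server process
  Environment="CUDA_VISIBLE_DEVICES=0,1"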