
Deploying a Quantized DeepSeek Model with Ollama on a Single Server (Linux)

Scenarios

Quantization reduces the precision of model parameters, for example by converting 32-bit floating-point numbers into 8-bit or 4-bit integers. This compresses the model, cuts the GPU memory and compute required to run it, improves efficiency, and lowers energy consumption, at the cost of some accuracy. This section describes how to quickly deploy a quantized DeepSeek model with Ollama.
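
As a rough, illustrative calculation (actual memory use is higher because of runtime overhead and the KV cache), a 7B-parameter model stored as 32-bit floats needs about 7 billion × 4 bytes ≈ 28 GB for its weights alone, while the same weights quantized to 4 bits need only about 3.5 GB.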

Solution Architecture

Figure 1 Deploying quantized DeepSeek models with Ollama (Linux)

Advantages

Deploying distilled DeepSeek models from scratch with Ollama gives you a thorough understanding of the model's runtime dependencies. With only a small amount of resources, you can quickly and efficiently connect the model to production services and keep fine-grained control over performance and cost.

Resource Planning

Table 1 Resources and costs

| Resource | Description | Cost |
| --- | --- | --- |
| VPC | CIDR block: 192.168.0.0/16 | Free |
| VPC subnet | AZ: AZ1; CIDR block: 192.168.0.0/24 | Free |
| Security group | Inbound rule: Priority 1, Action Allow, Type IPv4, Protocol & Port TCP:80, Source 0.0.0.0/0 | Free |
| ECS | Billing mode: Yearly/Monthly; AZ: AZ1; Specifications: see Table 2; System disk: 200 GiB; EIP: Auto assign; EIP type: Dynamic BGP; Billed by: Traffic; Bandwidth: 100 Mbit/s | The cloud server, EVS disks, and EIP generate costs. For billing details, see Billing Mode Overview. |

Table 2 GPU ECS flavors available for running distillation models

| No. | Model Name | Minimum Flavor | GPU |
| --- | --- | --- | --- |
| 0 | deepseek-r1:7b, deepseek-r1:8b | p2s.2xlarge.8 | V100 (32 GiB) × 1 |
|   |   | p2v.4xlarge.8 | V100 (16 GiB) × 1 |
|   |   | pi2.4xlarge.4 | T4 (16 GiB) × 1 |
|   |   | g6.18xlarge.7 | T4 (16 GiB) × 1 |
| 1 | deepseek-r1:14b | p2s.4xlarge.8 | V100 (32 GiB) × 1 |
|   |   | p2v.8xlarge.8 | V100 (16 GiB) × 1 |
|   |   | pi2.8xlarge.4 | T4 (16 GiB) × 1 |
| 2 | deepseek-r1:32b | p2s.8xlarge.8 | V100 (32 GiB) × 1 |
|   |   | p2v.16xlarge.8 | V100 (16 GiB) × 2 |
| 3 | deepseek-r1:70b | p2s.16xlarge.8 | V100 (32 GiB) × 2 |

Contact Huawei Cloud technical support to select GPU ECSs suitable for your deployment.

Deploying a DeepSeek Distillation Model with Ollama

To manually deploy a quantized DeepSeek model on a Linux ECS with Ollama, perform the following operations:

  1. Create a GPU ECS.
  2. Check the GPU driver and CUDA versions.
  3. Install Ollama.
  4. Download the large model file.
  5. Run the large model using Ollama.
  6. Call a model API to test the model performance.

Implementation Procedure

  1. Create a GPU ECS.

    1. Select a public image, Huawei Cloud EulerOS 2.0 or Ubuntu 22.04, that does not have a GPU driver pre-installed.
      Figure 2 Selecting an image
    2. Select Auto assign for EIP. An EIP will be assigned for downloading dependencies and calling model APIs.

  2. Check the GPU driver and CUDA versions.

    Install GPU driver 535 and CUDA 12.2. For details, see Manually Installing a Tesla Driver on a GPU-accelerated ECS.
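
    After the installation, you can verify the versions with nvidia-smi, the standard NVIDIA driver utility. Its output header reports the installed driver version and the CUDA version the driver supports.

    # Check the GPU driver version and the CUDA version it supports
    nvidia-smi
    # The output header should show "Driver Version: 535.xx" and "CUDA Version: 12.2".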

  3. Install Ollama.

    1. Download the Ollama installation script.
      curl -fsSL https://ollama.com/install.sh -o ollama_install.sh
      chmod +x ollama_install.sh
    2. Install Ollama. The sed command below points the installation script at the Ollama v0.5.7 release on GitHub before running it.
      sed -i 's|https://ollama.com/download/|https://github.com/ollama/ollama/releases/download/v0.5.7/|' ollama_install.sh
      sh ollama_install.sh
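    3. (Optional) Verify the installation. The installation script normally registers Ollama as a systemd service, so you can check the client version and the service status with standard commands:
      # Print the installed Ollama version
      ollama --version
      # Check whether the ollama systemd service is running
      systemctl status ollama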

  4. Download the large model file.

    Download the model required for your deployment. Choose the one that matches your GPU flavor in Table 2.

    ollama pull deepseek-r1:7b
    ollama pull deepseek-r1:14b
    ollama pull deepseek-r1:32b
    ollama pull deepseek-r1:70b
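
    To confirm that a download completed, you can list the models stored locally with the standard ollama list command:

    # List locally available models and their sizes
    ollama list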

  5. Run the large model using Ollama.

    Run the large model.

    ollama run deepseek-r1:7b
    ollama run deepseek-r1:14b
    ollama run deepseek-r1:32b
    ollama run deepseek-r1:70b
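
    ollama run starts an interactive chat prompt in the terminal and loads the model into GPU memory. To see which models are currently loaded, you can use the standard ollama ps command:

    # Show the models that are currently loaded and the memory they occupy
    ollama ps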

  6. Call a model API to test the model. Ollama provides OpenAI-compatible APIs.

    1. Call an API to list the available models.
      curl http://localhost:11434/v1/models

    2. Call an API to chat.
      curl http://localhost:11434/api/chat -d '{"model": "deepseek-r1:7b", "messages": [{"role": "user", "content": "hello!"}]}'

    The model is deployed and verified. You can use an EIP to call a model API for chats from your local Postman or your own service.
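
    For example, the OpenAI-compatible chat completions endpoint can be called as shown below. This is a sketch that assumes the command is run on the ECS itself; to call it from a remote client, replace localhost with the EIP, allow the port in the security group, and note that Ollama listens only on 127.0.0.1 by default, so remote access typically also requires setting OLLAMA_HOST (for example, to 0.0.0.0:11434) in the service environment.

      # OpenAI-compatible chat completion request against the local Ollama service
      curl http://localhost:11434/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "deepseek-r1:7b", "messages": [{"role": "user", "content": "hello!"}]}'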

Related Operations

  1. To run on multiple GPUs, edit the Ollama service file and set CUDA_VISIBLE_DEVICES to the IDs of the GPUs to be used (see the sample snippet after these steps).
    vim /etc/systemd/system/ollama.service

  2. Restart Ollama.
    systemctl daemon-reload
    systemctl stop ollama.service
    systemctl start ollama.service
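
A minimal sketch of the change to make in /etc/systemd/system/ollama.service, assuming GPUs 0 and 1 are to be used (Environment is a standard systemd directive and CUDA_VISIBLE_DEVICES is the standard CUDA device-selection variable; adjust the IDs to match your server):

  [Service]
  # Expose only GPUs 0 and 1 to the Ollama server process
  Environment="CUDA_VISIBLE_DEVICES=0,1"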