Minimum Number of NPUs and Maximum Sequence Length Supported by Each Model
The table below lists the minimum number of NPUs each model requires and the maximum supported max-model-len when the inference service is deployed with vLLM.
The values were obtained with gpu-memory-utilization set to 0.95. They represent the minimum number of NPUs required for service deployment and the recommended maximum max-model-len for that NPU count; they do not represent optimal-performance configurations.
For example, qwen3-14b on NPUs with 64 GB of memory requires at least one NPU for inference. With a single NPU, set max-model-len to 32K, where 1K equals 1,024 tokens, so 32K corresponds to 32 × 1,024 = 32,768 tokens.
Test method: with gpu-memory-utilization set to 0.95, max-model-len was increased in steps (4K, 8K, 16K, and so on) until the static benchmark reached the highest value that could still run successfully.
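As a minimal sketch of how the table's values map onto a vLLM launch command (the model path `Qwen/Qwen3-14B` is illustrative; substitute your own local weights or model ID):

```shell
# Illustrative single-NPU deployment of qwen3-14b using the values above:
# gpu-memory-utilization 0.95 matches the test condition, and
# 32768 tokens = 32K (32 * 1024), the recommended max-model-len.
# For multi-NPU models (e.g. qwen2.5-72b), set --tensor-parallel-size
# to the "Minimum Number of NPUs" value from the table instead of 1.
vllm serve Qwen/Qwen3-14B \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768 \
  --tensor-parallel-size 1
```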
All values below apply to NPUs with 64 GB of video memory each.

| No. | Model | Minimum Number of NPUs | Maximum Sequence Length (K) max-model-len |
|---|---|---|---|
| 1 | DeepSeek-R1-Distill-Llama-8B | 1 | 32 |
| 2 | DeepSeek-R1-Distill-Llama-70B | 4 | 32 |
| 3 | DeepSeek-R1-Distill-Qwen-1.5B | 1 | 32 |
| 4 | DeepSeek-R1-Distill-Qwen-7B | 1 | 32 |
| 5 | DeepSeek-R1-Distill-Qwen-14B | 1 | 32 |
| 6 | DeepSeek-R1-0528-Qwen3-8B | 1 | 32 |
| 7 | glm-4-9b | 1 | 32 |
| 8 | llama3-8b | 1 | 32 |
| 9 | llama3-70b | 4 | 32 |
| 10 | llama3.1-8b | 1 | 32 |
| 11 | llama3.1-70b | 4 | 32 |
| 12 | llama-3.2-1B | 1 | 32 |
| 13 | llama-3.2-3B | 1 | 32 |
| 14 | qwen2-0.5b | 1 | 32 |
| 15 | qwen2-1.5b | 1 | 32 |
| 16 | qwen2-7b | 1 | 32 |
| 17 | qwen2-72b | 4 | 32 |
| 18 | qwen2.5-0.5b | 1 | 32 |
| 19 | qwen2.5-1.5b | 1 | 32 |
| 20 | qwen2.5-3b | 1 | 32 |
| 21 | qwen2.5-7b | 1 | 32 |
| 22 | qwen2.5-14b | 1 | 32 |
| 23 | qwen2.5-32b | 2 | 32 |
| 24 | qwen2.5-72b | 4 | 32 |
| 25 | qwen3-0.6b | 1 | 32 |
| 26 | qwen3-1.7b | 1 | 32 |
| 27 | qwen3-4b | 1 | 32 |
| 28 | qwen3-8b | 1 | 32 |
| 29 | qwen3-14b | 1 | 32 |
| 30 | qwen3-30b-a3b | 2 | 32 |
| 31 | qwen3-32b | 2 | 32 |
| 32 | qwen3-235b-a22b | 16 | 64 |
| 33 | QwQ-32B | 2 | 32 |
| 34 | bge-reranker-v2-m3 | 1 | 8 |
| 35 | bge-base-en-v1.5 | 1 | 0.5 |
| 36 | bge-base-zh-v1.5 | 1 | 0.5 |
| 37 | bge-large-en-v1.5 | 1 | 0.5 |
| 38 | bge-large-zh-v1.5 | 1 | 0.5 |
| 39 | bge-m3 | 1 | 8 |
| 40 | qwen2-vl-2B | 1 | 8 |
| 41 | qwen2-vl-7B | 1 | 32 |
| 42 | qwen2-vl-72B | 4 | 32 |
| 43 | qwen2.5-vl-7B | 1 | 32 |
| 44 | qwen2.5-vl-32B | 1 | 32 |
| 45 | qwen2.5-vl-72B | 4 | 48 |
| 46 | internvl2.5-26B | 1 | 8 |
| 47 | InternVL2-Llama3-76B-AWQ | 2 | 8 |
| 48 | gemma3-27B | 1 | 4 |
| 49 | Qwen3-Embedding-0.6B | 1 | 32 |
| 50 | Qwen3-Embedding-4B | 1 | 40 |
| 51 | Qwen3-Embedding-8B | 1 | 40 |
| 52 | Qwen3-Reranker-0.6B | 1 | 40 |
| 53 | Qwen3-Reranker-4B | 1 | 40 |
| 54 | Qwen3-Reranker-8B | 1 | 40 |
| 55 | Qwen3-Coder-480B-A35B | 32 | 64 |
| 56 | internvl3-8B | 1 | 16 |
| 57 | internvl3-14B | 1 | 16 |
| 58 | internvl3-38B | 2 | 16 |
| 59 | internvl3-78B | 4 | 32 |