Updated on 2025-11-04 GMT+08:00

Minimum Number of PUs and Maximum Sequence Length Supported by Each Model

The table below lists, for each model, the minimum number of NPUs required and the maximum supported max-model-len when the inference service is deployed based on vLLM.

The values were obtained with gpu-memory-utilization set to 0.95. They represent the minimum number of NPUs required for service deployment and the recommended maximum max-model-len for that number of NPUs; they do not represent the configuration with the best performance.

For example, Qwen3-14B on NPUs with 64 GB of memory requires at least one NPU for inference. With a single NPU, set max-model-len to 32K, that is, 32 × 1,024 = 32,768 tokens (1K = 1,024).
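The K-to-token conversion used above can be sketched as follows. The helper name k_to_tokens is illustrative, not part of vLLM:

```python
def k_to_tokens(k: float) -> int:
    # Convert a "K" sequence length from the table into the token count
    # to pass as max-model-len (1K = 1,024 tokens).
    return int(k * 1024)

# 32K from the table corresponds to max-model-len 32768.
print(k_to_tokens(32))   # 32 x 1024 = 32768
# The bge-base/large embedding models list 0.5K, i.e. 512 tokens.
print(k_to_tokens(0.5))
```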

Test method: With gpu-memory-utilization set to 0.95, increase max-model-len step by step (4K, then 8K, then 16K, and so on) and run the static benchmark at each step until the highest length that can still be deployed is found.
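As an illustration only, a single-NPU Qwen3-14B deployment under these settings might look like the command below. The flag names follow the upstream vLLM CLI; the model path is a placeholder, not a value from this document:

```shell
# Hypothetical vLLM launch for qwen3-14b on a single 64 GB NPU.
# --gpu-memory-utilization matches the 0.95 test setting above;
# the 32K limit from the table is passed as 32 * 1024 = 32768 tokens.
vllm serve /path/to/qwen3-14b \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 32768
```

For the larger models in the table, raise --tensor-parallel-size to the listed minimum NPU count instead of 1.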

Table 1 Minimum number of NPUs and maximum sequence length supported by vLLM-based inference

| No. | Model | Minimum Number of NPUs (64 GB Memory) | Maximum Sequence Length max-model-len (K) |
|-----|-------|---------------------------------------|-------------------------------------------|
| 1 | DeepSeek-R1-Distill-Llama-8B | 1 | 32 |
| 2 | DeepSeek-R1-Distill-Llama-70B | 4 | 32 |
| 3 | DeepSeek-R1-Distill-Qwen-1.5B | 1 | 32 |
| 4 | DeepSeek-R1-Distill-Qwen-7B | 1 | 32 |
| 5 | DeepSeek-R1-Distill-Qwen-14B | 1 | 32 |
| 6 | DeepSeek-R1-0528-Qwen3-8B | 1 | 32 |
| 7 | glm-4-9b | 1 | 32 |
| 8 | llama3-8b | 1 | 32 |
| 9 | llama3-70b | 4 | 32 |
| 10 | llama3.1-8b | 1 | 32 |
| 11 | llama3.1-70b | 4 | 32 |
| 12 | llama-3.2-1B | 1 | 32 |
| 13 | llama-3.2-3B | 1 | 32 |
| 14 | qwen2-0.5b | 1 | 32 |
| 15 | qwen2-1.5b | 1 | 32 |
| 16 | qwen2-7b | 1 | 32 |
| 17 | qwen2-72b | 4 | 32 |
| 18 | qwen2.5-0.5b | 1 | 32 |
| 19 | qwen2.5-1.5b | 1 | 32 |
| 20 | qwen2.5-3b | 1 | 32 |
| 21 | qwen2.5-7b | 1 | 32 |
| 22 | qwen2.5-14b | 1 | 32 |
| 23 | qwen2.5-32b | 2 | 32 |
| 24 | qwen2.5-72b | 4 | 32 |
| 25 | qwen3-0.6b | 1 | 32 |
| 26 | qwen3-1.7b | 1 | 32 |
| 27 | qwen3-4b | 1 | 32 |
| 28 | qwen3-8b | 1 | 32 |
| 29 | qwen3-14b | 1 | 32 |
| 30 | qwen3-30b-a3b | 2 | 32 |
| 31 | qwen3-32b | 2 | 32 |
| 32 | qwen3-235b-a22b | 16 | 64 |
| 33 | QwQ-32B | 2 | 32 |
| 34 | bge-reranker-v2-m3 | 1 | 8 |
| 35 | bge-base-en-v1.5 | 1 | 0.5 |
| 36 | bge-base-zh-v1.5 | 1 | 0.5 |
| 37 | bge-large-en-v1.5 | 1 | 0.5 |
| 38 | bge-large-zh-v1.5 | 1 | 0.5 |
| 39 | bge-m3 | 1 | 8 |
| 40 | qwen2-vl-2B | 1 | 8 |
| 41 | qwen2-vl-7B | 1 | 32 |
| 42 | qwen2-vl-72B | 4 | 32 |
| 43 | qwen2.5-vl-7B | 1 | 32 |
| 44 | qwen2.5-vl-32B | 1 | 32 |
| 45 | qwen2.5-vl-72B | 4 | 48 |
| 46 | internvl2.5-26B | 1 | 8 |
| 47 | InternVL2-Llama3-76B-AWQ | 2 | 8 |
| 48 | gemma3-27B | 1 | 4 |
| 49 | Qwen3-Embedding-0.6B | 1 | 32 |
| 50 | Qwen3-Embedding-4B | 1 | 40 |
| 51 | Qwen3-Embedding-8B | 1 | 40 |
| 52 | Qwen3-Reranker-0.6B | 1 | 40 |
| 53 | Qwen3-Reranker-4B | 1 | 40 |
| 54 | Qwen3-Reranker-8B | 1 | 40 |
| 55 | Qwen3-Coder-480B-A35B | 32 | 64 |
| 56 | internvl3-8B | 1 | 16 |
| 57 | internvl3-14B | 1 | 16 |
| 58 | internvl3-38B | 2 | 16 |
| 59 | internvl3-78B | 4 | 32 |