W4A16 Quantization
In foundation model inference, the data types of the model weights (weight), inference-time computations (activation), and the KV cache are generally half-precision floating point (FP16 or BF16). Quantization is the process of converting these high-bit floating-point values to lower-bit data types, such as int8 or int4.
Model quantization includes weight-only quantization, weight-activation quantization, and KV cache quantization.
The general steps for quantization are: 1. Quantize the floating-point weights and save the quantized weights. 2. Use the quantized weights for inference deployment.
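The following minimal sketch (illustrative only, not part of the workflow described below) shows the basic idea: a floating-point tensor is mapped to int8 with a per-tensor scale and then dequantized back, with the difference being the quantization error.
import torch

# Illustrative only: symmetric per-tensor int8 round-trip quantization.
w = torch.randn(4, 8, dtype=torch.float16)

# The scale maps the largest absolute weight onto the int8 range [-127, 127].
scale = w.abs().max() / 127.0
w_int8 = torch.clamp(torch.round(w.float() / scale), -127, 127).to(torch.int8)

# Dequantize and compare; the difference is the quantization error.
w_dequant = (w_int8.float() * scale).to(torch.float16)
print((w - w_dequant).abs().max())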
What Is W4A16 Quantization?
W4A16 quantization is a large model compression optimization technique. In this technique, W4 indicates that the model's weights are quantized to 4-bit integers (int4), and A16 indicates that the activations (or inputs/outputs) are kept as 16-bit floating-point numbers (FP16 or BF16).
This method quantizes only the weight parameters to 4 bits, while the activation values keep FP16 precision. Its advantages are a significant reduction in the model's memory usage and in the number of PUs required for deployment (by approximately 75%), as well as a substantial reduction in incremental (decode-phase) inference latency at small batch sizes.
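As a rough illustration of where the memory saving comes from (exact numbers also depend on group scales, zero points, and other overhead), the following sketch compares the weight footprint of a hypothetical 32B-parameter model in FP16 and int4:
# Back-of-the-envelope weight memory for a hypothetical 32B-parameter model.
num_params = 32e9

fp16_bytes = num_params * 2    # 16 bits = 2 bytes per weight
int4_bytes = num_params * 0.5  # 4 bits = 0.5 bytes per weight (ignoring scales and zero points)

print(f"FP16 weights: {fp16_bytes / 2**30:.0f} GiB")  # ~60 GiB
print(f"INT4 weights: {int4_bytes / 2**30:.0f} GiB")  # ~15 GiB, roughly a 75% reduction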
Constraints
- The llm-compressor tool supports W4A16 per-group quantization (group size = 128), but does not support setting the actorder parameter to group.
- Currently, W4A16 quantization is supported only for Qwen series models. For the list of supported models, see Table 1.
- Qwen series models quantized with W4A16 can be launched only in graph mode and do not support the Qwen series performance optimization environment variables. The specific configuration is described in detail in the following sections.
Obtaining Quantized Model Weights
You can obtain quantized model weights in either of the following ways:
Method 1: Download the llm-compressor quantized model from the Hugging Face community.
Method 2: After obtaining the FP16/BF16 model weights, use the llm-compressor tool for quantization.
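For Method 1, one common way to download a pre-quantized model is the huggingface_hub Python API, as in the sketch below; the repository ID and local directory are placeholders and must be replaced with the actual quantized model repository and your own path.
from huggingface_hub import snapshot_download

# Placeholders: replace with the actual W4A16-quantized model repo and your target directory.
snapshot_download(
    repo_id="<org>/<model>-W4A16",
    local_dir="<model-dir>",
)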
Using the llm-compressor Tool to Quantize Models
This section describes how to use the open-source quantization tool llm-compressor to quantize model weights on an NPU server and then run quantized inference on the same server. For more information about the open-source quantization tool, see llm-compressor.
- To use the quantization tool, create and activate a dedicated conda environment:
conda create --name llmcompressor --clone PyTorch-2.5.1
conda activate llmcompressor
- Install llm-compressor.
pip install llmcompressor==0.6.0
pip install transformers==4.51.3
pip install zstandard
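You can confirm that the expected versions are installed in the llmcompressor environment with a quick check (a convenience snippet, not required by the workflow):
from importlib.metadata import version

# Expected: 0.6.0 and 4.51.3, as installed above.
print(version("llmcompressor"))
print(version("transformers"))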
- Create the quantization script. For the complete llm_compressor_W4A16.py script, see llm_compressor_W4A16.py Script.
- Configure the calibration dataset. By default, the W4A16 quantization script uses the mit-han-lab/pile-val-backup dataset for calibration. If you need to specify a different dataset, modify the following fields in the llm_compressor_W4A16.py script:
DATASET_ID = "mit-han-lab/pile-val-backup"
DATASET_SPLIT = "validation"
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512
If your current environment cannot access the Hugging Face website to download the dataset, you can download it manually using a web browser and upload it to the server. Then, set DATASET_ID to the server path. The dataset download link is: https://huggingface.co/datasets/mit-han-lab/pile-val-backup/tree/main.
After downloading, upload the val.jsonl.zst file to a custom directory on the server, for example <local-dir>, and decompress it:
cd <local-dir>
zstd -d val.jsonl.zst
The decompressed file is named val.jsonl. After decompression, delete the original compressed file so that it does not interfere with data loading during quantization:
rm -f val.jsonl.zst
Set DATASET_ID to the directory containing the val.jsonl file:
DATASET_ID = "<local-dir>"
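Before running the full quantization, you can optionally confirm that the datasets library resolves the local directory and its validation split (the split name is inferred from the val.jsonl file name); <local-dir> below is the same directory that contains val.jsonl:
from datasets import load_dataset

# <local-dir> is the directory containing the decompressed val.jsonl file.
ds = load_dataset("<local-dir>", split="validation[:4]")
print(ds)
print(ds[0]["text"][:200])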
- Configure the quantization algorithm. By default, the W4A16 quantization script uses the AWQ (Activation-aware Weight Quantization) algorithm with an asymmetric scheme:
# Configure the quantization algorithm to run.
recipe = [
    AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
]
If you want to use symmetric quantization, configure scheme="W4A16" instead.
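Conceptually, the asymmetric W4A16 scheme quantizes each group of 128 weights with its own scale and zero point. The sketch below shows this per-group arithmetic on a toy tensor; it is a simplified illustration, not the llm-compressor implementation:
import torch

# Illustrative per-group asymmetric 4-bit quantization of one weight row (group size 128).
w = torch.randn(1, 256, dtype=torch.float32)
group_size = 128

for g in range(w.shape[1] // group_size):
    grp = w[:, g * group_size:(g + 1) * group_size]
    w_min, w_max = grp.min(), grp.max()
    scale = (w_max - w_min) / 15.0                      # 4-bit asymmetric range: 0..15
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(grp / scale) + zero_point, 0, 15)
    dequant = (q - zero_point) * scale                  # what is reconstructed at inference time
    print(f"group {g}: max error {(grp - dequant).abs().max().item():.4f}")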
- Run the llm_compressor_W4A16.py script to quantize the model. Modify the model weight paths according to your actual environment. Quantization time depends on the model size and typically takes 30 minutes to 3 hours.
# Specify the PU number for quantization based on the available PUs on the server. If not set, it defaults to PU 0.
export ASCEND_RT_VISIBLE_DEVICES=0
python llm_compressor_W4A16.py --model-path /home/ma-user/Qwen3-32B/ --quant-path /home/ma-user/Qwen3-32B-quant/
Parameters:
- --model-path: Path to the original model weights.
- --quant-path: Path to save the quantized weights.
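After the script finishes, you can optionally verify that quantization metadata was written into the output directory. The sketch below uses the example output path from the command above; depending on the llm-compressor version, the metadata key in config.json may be quantization_config or compression_config:
import json

# Path from the --quant-path argument used above.
with open("/home/ma-user/Qwen3-32B-quant/config.json", "r", encoding="utf-8") as f:
    cfg = json.load(f)

# Expect compressed-tensors metadata written by llm-compressor.
quant_cfg = cfg.get("quantization_config") or cfg.get("compression_config") or {}
print(quant_cfg.get("quant_method"))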
- When starting the service, add the following quantization parameters. For details, see Starting a Real-Time Inference Service.
-q compressed-tensors or --quantization compressed-tensors
W4A16 models do not support eager or acl-graph modes. The service must be started in ascend_turbo graph mode. The configuration is as follows:
--additional-config: {"ascend_turbo_graph_config": {"enabled": true}}
W4A16 models do not support the Qwen series performance optimization environment variables. The following environment variables must not be set when starting the service:
unset ENABLE_QWEN_HYPERDRIVE_OPT
unset ENABLE_QWEN_MICROBATCH
unset ENABLE_PHASE_AWARE_QKVO_QUANT
unset DISABLE_QWEN_DP_PROJ
For the Qwen2.5-72B-instruct and Qwen2-72B-instruct models, modify the config.json file of the quantized W4A16 model before starting the service: change the intermediate_size value from 29568 to 29696. The reason is that the FFN intermediate size of these models is 29568 and the W4A16 quantization group size is 128, which gives 29568 / 128 = 231 groups. Because 231 groups cannot be evenly divided across tensor-parallel ranks when TP > 1, the service cannot run normally in TP > 1 scenarios. Changing intermediate_size to 29696 gives 29696 / 128 = 232 groups, which divides evenly for TP = 4 or TP = 8 and allows multi-PU inference (refer to the community documentation). After the configuration is modified, the service automatically pads the weights to the specified shape. Modify the file as follows:
vim config.json
# Modify the following configuration item:
"intermediate_size": 29696,
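If you prefer to patch the file programmatically rather than editing it in vim, a minimal sketch is shown below; <quant-path> is a placeholder for the directory of the quantized W4A16 weights:
import json

# <quant-path> is the directory that contains the quantized model's config.json.
config_path = "<quant-path>/config.json"

with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

# Pad the FFN intermediate size so that 29696 / 128 = 232 groups split evenly across TP ranks.
config["intermediate_size"] = 29696

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)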
llm_compressor_W4A16.py Script
The complete code of the llm_compressor_W4A16.py script is as follows:
import argparse
import os
from functools import partial
import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from transformers import AutoModelForCausalLM, AutoTokenizer
os.environ['PYTORCH_NPU_ALLOC_CONF'] = 'expandable_segments:False'
# Select calibration dataset.
DATASET_ID = "mit-han-lab/pile-val-backup"
DATASET_SPLIT = "validation"
# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512
def preprocess(example, tokenizer):
    return {
        "text": tokenizer.apply_chat_template(
            [{"role": "user", "content": example["text"]}],
            tokenize=False,
        )
    }


def quantize(model_id, quantized_path):
    # Select model and load it.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

    ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
    ds = ds.shuffle(seed=42)
    preprocess_func = partial(preprocess, tokenizer=tokenizer)
    ds = ds.map(preprocess_func)

    # Configure the quantization algorithm to run.
    recipe = [
        AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
    ]

    oneshot(
        model=model,
        dataset=ds,
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )

    # Save to disk compressed.
    model.save_pretrained(quantized_path, save_compressed=True)
    tokenizer.save_pretrained(quantized_path)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", type=str, required=True, help="Path to the input model")
    parser.add_argument("--quant-path", type=str, required=True, help="Path to save the compressed model")
    args = parser.parse_args()
    quantize(args.model_path, args.quant_path)