W8A8 Quantization
What Is W8A8 Quantization?
W8A8 quantization is a technique that quantizes both model weights and activations to 8-bit data. It converts weights and activations from high-precision floating-point formats, typically 16-bit or 32-bit floats, to the 8-bit integer (int8) format. After quantization, the model weights occupy less memory, and matrix multiplication (MatMul) operations can run on int8 data, which reduces the computational load and improves inference performance.
W8A8 quantization reduces the model's memory usage and the number of PUs required for deployment. It also lowers both the first-token latency and the incremental (per-token) inference latency.
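The following is a minimal sketch of the int8 conversion described above, shown with symmetric quantization of a single hypothetical FP16 weight tensor. It is an illustration only, not how llm-compressor or the inference engine implements quantization.
import torch
# Hypothetical FP16 weight used purely for illustration.
w_fp16 = torch.randn(4096, 4096, dtype=torch.float16)
# Symmetric quantization: one positive scale, zero-point fixed at 0.
scale = w_fp16.float().abs().amax() / 127.0
w_int8 = torch.clamp((w_fp16.float() / scale).round(), -127, 127).to(torch.int8)
# The int8 tensor needs half the memory of the FP16 original.
print(w_fp16.nelement() * w_fp16.element_size())  # 33554432 bytes
print(w_int8.nelement() * w_int8.element_size())  # 16777216 bytes
# Dequantize to inspect the error introduced by rounding.
w_deq = w_int8.float() * scale
print((w_fp16.float() - w_deq).abs().max())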
Constraints
- Supported Models lists the models supported for W8A8 quantization.
- Activation quantization supports dynamic per-token, symmetric quantization.
- Weight quantization supports per-channel, symmetric quantization. The sketch after this list illustrates both granularities.
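The following hypothetical PyTorch sketch makes the two granularities concrete: per-channel symmetric scales for the weight and dynamic per-token symmetric scales for the activation. llm-compressor and the inference engine handle this internally; the shapes here are illustrative only.
import torch
w = torch.randn(1024, 4096)   # weight: [out_features, in_features], hypothetical shape
x = torch.randn(8, 4096)      # activation: [tokens, in_features], hypothetical shape
# Per-channel, symmetric weight quantization: one scale per output channel, computed offline.
w_scale = w.abs().amax(dim=1, keepdim=True) / 127.0
w_q = torch.clamp((w / w_scale).round(), -127, 127).to(torch.int8)
# Dynamic per-token, symmetric activation quantization: one scale per token, computed at run time.
x_scale = x.abs().amax(dim=1, keepdim=True) / 127.0
x_q = torch.clamp((x / x_scale).round(), -127, 127).to(torch.int8)
# Integer matmul, then rescale with the outer product of the two scale vectors.
y_int32 = x_q.to(torch.int32) @ w_q.to(torch.int32).t()
y = y_int32.float() * (x_scale @ w_scale.t())
print((y - x @ w.t()).abs().max())  # small residual quantization error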
Obtaining Quantized Model Weights
You can obtain quantized model weights in either of the following ways:
Method 1: Download a model already quantized with llm-compressor from the Hugging Face community (see the example after this list).
Method 2: Obtain the FP16/BF16 model weights and quantize them with the llm-compressor tool.
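For Method 1, the following is a hedged example of downloading an already-quantized checkpoint with the huggingface_hub library. The repository name and local directory are placeholders, not recommendations.
from huggingface_hub import snapshot_download
# <org>/<model>-W8A8 is a placeholder repository ID; replace it with the checkpoint you need.
snapshot_download(repo_id="<org>/<model>-W8A8", local_dir="/home/ma-user/<model>-W8A8")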
Using the llm-compressor Tool to Quantize Models
This section describes how to use the open-source quantization tool llm-compressor to quantize model weights on an NPU server and then perform quantized inference on the same server. For more information about the open-source quantization tool, see llm-compressor.
- Create and activate a dedicated conda environment for the quantization tool:
conda create --name llmcompressor --clone PyTorch-2.5.1
conda activate llmcompressor
- Install llm-compressor and its dependencies.
pip install llmcompressor==0.6.0
pip install transformers==4.51.3
pip install zstandard
- By default, W8A8 quantization uses the HuggingFaceH4/ultrachat_200k dataset as the calibration dataset. To specify a dataset, modify the following fields in the llm_compressor_W8A8.py script:
DATASET_ID = "HuggingFaceH4/ultrachat_200k" DATASET_SPLIT = "train_sft" NUM_CALIBRATION_SAMPLES = 512 MAX_SEQUENCE_LENGTH = 2048
If your current environment cannot access the Hugging Face website to download the dataset, you can download it manually using a web browser and upload it to the server. Then, set DATASET_ID to the server path. The dataset download link is: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k/tree/main.
The original data directory must be retained for the downloaded dataset. The following is an example of the directory structure after the dataset is downloaded, where <local-data-dir> is the custom directory for storing the dataset.
<local-data-dir>
  -- ultrachat_200k
    -- data
      -- train_sft-xxx.parquet
      -- train_gen-xxx.parquet
Set DATASET_ID to the local dataset directory above the data folder, but do not include the data folder itself:
DATASET_ID = "<local-data-dir>/ultrachat_200k"
For details about the complete llm_compressor_W8A8.py script, see llm_compressor_W8A8.py Script.
- Run the llm_compressor_W8A8.py file to quantize the model. The quantization time depends on the model size and typically ranges from about 30 minutes to 3 hours.
# Specify the PU number for quantization based on the available PUs on the server. If not set, it defaults to PU 0.
export ASCEND_RT_VISIBLE_DEVICES=0
python llm_compressor_W8A8.py --model-path /home/ma-user/Qwen2.5-72B/ --quant-path /home/ma-user/Qwen2.5-72B-quant/
Parameters:
- --model-path: Path to the original model weights.
- --quant-path: Path to save the quantized weights.
If the following warning is logged during quantization, ignore it; quantization is not affected.
get_GPU_usage_nv | WARNING - Pynml library error: NVML Shared Library Not Found
- Add the following option when starting the service; an illustrative launch command follows below. For details, see Starting a Real-Time Inference Service.
-q compressed-tensors or --quantization compressed-tensors
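As an illustration only, assuming the service is launched through a vLLM-compatible entry point (the actual command is described in Starting a Real-Time Inference Service), the option is appended like this:
vllm serve /home/ma-user/Qwen2.5-72B-quant/ --quantization compressed-tensors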
llm_compressor_W8A8.py Script
The complete code of the llm_compressor_W8A8.py script is as follows:
import argparse
import os
from functools import partial
import torch
import torch_npu
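# Importing transfer_to_npu automatically redirects CUDA API calls to the Ascend NPU.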
from torch_npu.contrib import transfer_to_npu
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from transformers import AutoModelForCausalLM, AutoTokenizer
os.environ['PYTORCH_NPU_ALLOC_CONF'] = 'expandable_segments:False'
# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
def preprocess(example, tokenizer):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }
# Tokenize inputs.
def tokenize(sample, tokenizer):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )
def quantize(model_id, quantized_path):
    # Select model and load it
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Load dataset and preprocess.
    ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
    ds = ds.shuffle(seed=42)
    preprocess_func = partial(preprocess, tokenizer=tokenizer)
    ds = ds.map(preprocess_func)
    tokenize_func = partial(tokenize, tokenizer=tokenizer)
    ds = ds.map(tokenize_func, remove_columns=ds.column_names)
    # Configure algorithms. In this case, we:
    # * apply SmoothQuant to make the activations easier to quantize
    # * quantize the weights to int8 with GPTQ (static per channel)
    # * quantize the activations to int8 (dynamic per token)
    recipe = [
        SmoothQuantModifier(smoothing_strength=0.8),
        GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
    ]
    oneshot(
        model=model,
        dataset=ds,
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )
    # Save to disk compressed.
    model.save_pretrained(quantized_path, save_compressed=True)
    tokenizer.save_pretrained(quantized_path)
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", type=str, required=True, help="Path to the input model")
    parser.add_argument("--quant-path", type=str, required=True, help="Path to save the compressed model")
    args = parser.parse_args()
    quantize(args.model_path, args.quant_path)
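As an optional sanity check after quantization, assuming the output path from the example above and the config layout written by save_compressed, you can confirm that the saved config.json records the quantization method:
import json
import os
quant_path = "/home/ma-user/Qwen2.5-72B-quant/"  # assumed output path from the example above
with open(os.path.join(quant_path, "config.json")) as f:
    cfg = json.load(f)
# compressed-tensors checkpoints record their scheme under quantization_config.
print(cfg.get("quantization_config", {}).get("quant_method"))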