Transformers documentation
SpQR
The SpQR (Sparse-Quantized Representation) algorithm quantizes weights to 3 bits in 16x16 tiles using a bi-level group quantization structure, while a small set of sparse outlier weights is kept in higher precision.
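The ideas above can be illustrated with a toy NumPy sketch. This is not the actual SpQR kernel or storage format, just an assumed simplification: each group of 16 weights gets 3-bit codes with its own scale and zero point, the per-group scales are themselves quantized a second time (the "bi-level" part; real SpQR also requantizes the zero points), and a fixed number of large-magnitude outliers is stored separately in full precision instead of SpQR's sparse CSR layout.

```python
import numpy as np

def quantize(x, bits):
    """Uniform asymmetric quantization: integer codes plus (scale, zero)."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels or 1.0
    codes = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.int32)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes * scale + lo

rng = np.random.default_rng(0)
w = rng.standard_normal(256)          # one 16x16 tile, flattened

# Sparse outliers: keep the largest-magnitude weights in full precision
# and zero them out before quantizing the dense part.
outlier_idx = np.argsort(np.abs(w))[-8:]
dense = w.copy()
dense[outlier_idx] = 0.0

# First level: 3-bit codes per group of 16 weights, one (scale, zero) each.
groups = dense.reshape(-1, 16)
codes, scales, zeros = zip(*(quantize(g, bits=3) for g in groups))

# Second level: the per-group scales are quantized again (also 3-bit here).
scale_codes, s_scale, s_zero = quantize(np.array(scales), bits=3)
approx_scales = dequantize(scale_codes, s_scale, s_zero)

# Reconstruct: dequantize each group with its doubly-quantized scale,
# then splice the sparse outliers back in.
w_hat = np.concatenate(
    [dequantize(c, s, z) for c, s, z in zip(codes, approx_scales, zeros)]
)
w_hat[outlier_idx] = w[outlier_idx]

err = float(np.abs(w - w_hat).max())
print(f"max reconstruction error: {err:.3f}")
```

The outliers are exact by construction; the remaining error comes from the 3-bit group codes plus the second-level quantization of the scales.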

To quantize a model with SpQR, refer to the Vahe1994/SpQR repository.
Load a SpQR-quantized model with from_pretrained().
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

quantized_model = AutoModelForCausalLM.from_pretrained(
    "elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf",
    dtype=torch.half,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf")
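The loaded model behaves like any other Transformers causal LM. Continuing the snippet above, text generation uses the standard generate() API (the prompt is arbitrary; running this requires downloading the quantized checkpoint):

```python
# Continues from the loading snippet above: uses `quantized_model` and `tokenizer`.
inputs = tokenizer("The capital of France is", return_tensors="pt").to(quantized_model.device)
outputs = quantized_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```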