Instructions to use varjosoft/GLM-5.1-Open-TQ3 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use varjosoft/GLM-5.1-Open-TQ3 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="varjosoft/GLM-5.1-Open-TQ3")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("varjosoft/GLM-5.1-Open-TQ3")
model = AutoModelForCausalLM.from_pretrained("varjosoft/GLM-5.1-Open-TQ3")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use varjosoft/GLM-5.1-Open-TQ3 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "varjosoft/GLM-5.1-Open-TQ3"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "varjosoft/GLM-5.1-Open-TQ3",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
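
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the usual OpenAI-style chat completion shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="varjosoft/GLM-5.1-Open-TQ3"):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]
```

With the server from the previous step running, `chat("What is the capital of France?")` should return the model's reply. For the SGLang server, pass `base_url="http://localhost:30000"`.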

Use Docker

docker model run hf.co/varjosoft/GLM-5.1-Open-TQ3

SGLang

How to use varjosoft/GLM-5.1-Open-TQ3 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "varjosoft/GLM-5.1-Open-TQ3" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "varjosoft/GLM-5.1-Open-TQ3",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "varjosoft/GLM-5.1-Open-TQ3" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "varjosoft/GLM-5.1-Open-TQ3",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use varjosoft/GLM-5.1-Open-TQ3 with Docker Model Runner:
```
docker model run hf.co/varjosoft/GLM-5.1-Open-TQ3
```

GLM-5.1 TQ3 (3-bit weight compression)

Native TQ3 checkpoint of zai-org/GLM-5 (769B MoE, 40B active).

Compression

	BF16	TQ3
Checkpoint size	~1,510 GB	309 GB
Compression ratio	1x	4.9x

Created with the streaming checkpoint creation path of turboquant-plus-vllm on a $0.11/hr CPU instance. Total cost: $0.84.
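
The figures above can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch: it assumes "GB" means 10^9 bytes and that the 769B parameter count applies to the whole checkpoint, neither of which the card states explicitly.

```python
BF16_GB = 1510        # approximate BF16 checkpoint size from the table
TQ3_GB = 309          # TQ3 checkpoint size from the table
TOTAL_PARAMS = 769e9  # total parameter count stated on this card

# Compression ratio between the two checkpoints.
ratio = BF16_GB / TQ3_GB

# Average stored bits per parameter (assumes 1 GB = 1e9 bytes).
bits_per_weight = TQ3_GB * 1e9 * 8 / TOTAL_PARAMS

# Implied wall-clock time on the $0.11/hr instance.
hours = 0.84 / 0.11

print(f"compression ratio: {ratio:.2f}x")
print(f"average bits per weight: {bits_per_weight:.2f}")
print(f"implied runtime: {hours:.1f} h")
```

An average slightly above 3 bits per weight is what one would expect if per-group norms and any unquantized tensors are stored alongside the 3-bit indices.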

Status

Not yet tested on GPU. This checkpoint was created and uploaded automatically. Quality validation on a multi-GPU setup is pending.

The same code path was validated on GLM-4.7-Flash (30B, same MoE architecture with 64 experts), where the checkpoint loaded successfully and answered all test prompts correctly using 13.3 GB of GPU memory.

Architecture

GLM-5.1 uses the Glm4MoeLiteNaiveMoe architecture:

769B total parameters, 40B active per token
256 routed experts, 8 active per token, 1 shared expert
78 layers, hidden_size=6144
Multi-head Latent Attention (MLA)
First 3 layers are dense (not MoE)
200K context window
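
To make "256 routed experts, 8 active per token" concrete, here is a minimal top-k routing sketch in plain Python. It is an illustration only: the softmax over the selected logits is an assumption, and the scoring function GLM actually uses in its router is not described on this card.

```python
import math
import random

NUM_ROUTED = 256  # routed experts per MoE layer (from the list above)
TOP_K = 8         # routed experts active per token


def route(logits, top_k=TOP_K):
    """Return (expert_id, weight) pairs for the top_k highest-scoring experts."""
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    # Softmax over the selected logits only (assumed normalization).
    m = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - m) for e in top]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(top, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]  # stand-in router scores
chosen = route(logits)

print(f"{len(chosen)} of {NUM_ROUTED} routed experts active, plus 1 shared expert")
print(f"fraction of routed experts used per token: {TOP_K / NUM_ROUTED:.3%}")
```

Because only a small fraction of the routed experts runs for each token, the active parameter count (40B) is far below the total (769B), while all expert weights still have to be resident in memory.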

How it works

The method combines a Walsh-Hadamard transform (WHT) rotation with a Gaussian Lloyd-Max codebook, following TurboQuant (ICLR 2026). After a random Walsh-Hadamard rotation, weight distributions become near-Gaussian, so each weight can be quantized efficiently to one of 8 centroids (3 bits), with weights processed in 128-element groups. No calibration data is needed.
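
The idea can be sketched for a single 128-element group in plain Python. This is an illustrative sketch, not the turboquant-plus-vllm implementation: the random sign flip before the transform, the rescaling by the square root of the group size, and the function names are assumptions, and the centroids are the commonly tabulated 8-level Lloyd-Max points for a unit-variance Gaussian, rounded to four decimals.

```python
import math
import random

GROUP = 128
# 8-level Lloyd-Max reconstruction points for a unit-variance Gaussian (rounded).
CENTROIDS = [-2.1520, -1.3439, -0.7560, -0.2451, 0.2451, 0.7560, 1.3439, 2.1520]


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (it is its own inverse)."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]


def quantize_group(w, signs):
    """Return (3-bit indices, group norm) for one group of weights."""
    norm = math.sqrt(sum(v * v for v in w)) or 1.0
    rotated = fwht([s * v / norm for s, v in zip(signs, w)])
    # Unit-norm rotated coordinates have variance ~1/len(w); rescale to ~N(0, 1).
    scaled = [v * math.sqrt(len(w)) for v in rotated]
    idx = [min(range(8), key=lambda k: abs(CENTROIDS[k] - v)) for v in scaled]
    return idx, norm


def dequantize_group(idx, norm, signs):
    """Invert the steps above: centroid lookup, inverse rotation, rescale."""
    n = len(idx)
    rotated = [CENTROIDS[k] / math.sqrt(n) for k in idx]
    return [s * v * norm for s, v in zip(signs, fwht(rotated))]


rng = random.Random(0)
signs = [rng.choice((-1.0, 1.0)) for _ in range(GROUP)]    # random sign flips
weights = [rng.uniform(-0.05, 0.05) for _ in range(GROUP)]  # toy, non-Gaussian weights
idx, norm = quantize_group(weights, signs)
restored = dequantize_group(idx, norm, signs)
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(weights, restored))) / norm
print(f"relative reconstruction error: {rel_err:.3f}")
```

The toy weights are drawn from a uniform distribution on purpose: the rotation mixes all 128 values, so the rotated coordinates look roughly Gaussian even when the inputs do not, which is why a fixed Gaussian codebook can be used without calibration data.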

The checkpoint stores packed 3-bit indices + per-group norms. The loader handles:

Per-expert 2D → fused 3D regrouping (gate_proj + up_proj → gate_up_proj fusion)
Router/gate weight decompression in-place
Meta-device model creation for low-memory loading
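
As an illustration of what "packed 3-bit indices + per-group norms" can look like in storage, the sketch below packs eight 3-bit indices into three bytes and unpacks them again. The little-endian bit layout and the 16-bit norm are assumptions made for the example; the actual on-disk layout used by turboquant-plus-vllm is not documented on this card.

```python
def pack_3bit(indices):
    """Pack 3-bit values (0..7) into bytes, 8 values per 3 bytes."""
    if len(indices) % 8:
        raise ValueError("length must be a multiple of 8")
    out = bytearray()
    for i in range(0, len(indices), 8):
        word = 0
        for k, v in enumerate(indices[i:i + 8]):
            if not 0 <= v < 8:
                raise ValueError("index out of 3-bit range")
            word |= v << (3 * k)
        out += word.to_bytes(3, "little")
    return bytes(out)


def unpack_3bit(data):
    """Inverse of pack_3bit."""
    out = []
    for i in range(0, len(data), 3):
        word = int.from_bytes(data[i:i + 3], "little")
        out.extend((word >> (3 * k)) & 0b111 for k in range(8))
    return out


group = [i % 8 for i in range(128)]  # one 128-element group of indices
packed = pack_3bit(group)

NORM_BYTES = 2  # assumption: one 16-bit norm per group
bits_per_weight = (len(packed) + NORM_BYTES) * 8 / len(group)
print(len(packed), "bytes of indices per group")
print(f"{bits_per_weight} bits per weight including the norm")
```

Under these assumptions a group costs 48 bytes of indices plus 2 bytes of norm, slightly more than 3 bits per weight.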

Usage

pip install turboquant-plus-vllm@git+https://github.com/varjoranta/turboquant-vllm.git

from turboquant_vllm import load_tq3_model

model, tokenizer = load_tq3_model("varjosoft/GLM-5.1-Open-TQ3", device="cuda")
# Requires multi-GPU setup — see requirements below

GPU requirements for inference

Setup	Total VRAM	Per-GPU	Cost/hr (Verda)
8× A100 80GB	640 GB	45 GB	$10.32
4× H200 141GB	564 GB	90 GB	$13.56
2× B300 262GB	524 GB	180 GB	$13.98

Without TQ3, the BF16 model requires about 1,510 GB of VRAM for the weights (minimum 8× B300 at $55.92/hr).
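
The table can be cross-checked with a short calculation. The sketch below assumes the 309 GB of weights are sharded evenly across GPUs; under that assumption the gap between the per-GPU weight share and the "Per-GPU" column would be headroom for KV cache and activations, which is an interpretation rather than something the card states.

```python
TQ3_GB = 309  # checkpoint size from the compression table

# (name, GPU count, GB per GPU, cost per hour) taken from the table above.
SETUPS = [
    ("8x A100 80GB", 8, 80, 10.32),
    ("4x H200 141GB", 4, 141, 13.56),
    ("2x B300 262GB", 2, 262, 13.98),
]


def summarize(name, count, gb_per_gpu, cost_per_hr):
    """Derive total VRAM, even weight share per GPU, and cost per GPU-hour."""
    return {
        "name": name,
        "total_vram_gb": count * gb_per_gpu,
        "weights_per_gpu_gb": TQ3_GB / count,  # assumes even sharding
        "cost_per_gpu_hr": cost_per_hr / count,
    }


rows = [summarize(*s) for s in SETUPS]
for r in rows:
    print(f"{r['name']}: {r['total_vram_gb']} GB total, "
          f"{r['weights_per_gpu_gb']:.1f} GB of weights per GPU, "
          f"${r['cost_per_gpu_hr']:.2f} per GPU-hour")

# Cost of eight B300 GPUs at the same per-GPU rate as the 2x B300 row.
bf16_cost = rows[2]["cost_per_gpu_hr"] * 8
print(f"8x B300: ${bf16_cost:.2f}/hr")
```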

Software requirements

transformers >= 5.5.0
turboquant-plus-vllm (GitHub)
PyTorch with CUDA

Comparison with other quantizations

Method	Size	Calibration	Format	Target
This (TQ3)	309 GB (4.9x)	None	Safetensors	GPU serving (vLLM/PyTorch)
Unsloth Dynamic 2-bit	236 GB (6.4x)	300K+ tokens	GGUF	Local/CPU (llama.cpp)
BF16 original	1,510 GB	N/A	Safetensors	8× B300+

License

MIT (same as base model). Created by Varjosoft Oy.


Safetensors

Model size: 289B params
Tensor type: F32, F16

Model tree for varjosoft/GLM-5.1-Open-TQ3

Base model: zai-org/GLM-5
Finetuned (37): this model

Paper for varjosoft/GLM-5.1-Open-TQ3

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (arXiv 2504.19874, published Apr 28, 2025)