GLM-5.1 TQ3 (3-bit weight compression)
Native TQ3 checkpoint of zai-org/GLM-5 (769B MoE, 40B active).
Compression
| | BF16 | TQ3 |
|---|---|---|
| Checkpoint size | ~1,510 GB | 309 GB |
| Compression ratio | 1x | 4.9x |
Created using turboquant-plus-vllm streaming checkpoint creation on a $0.11/hr CPU instance. Total cost: $0.84.
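The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch using the numbers quoted in this card; the "average bits per weight" assumes the whole BF16 checkpoint is 16-bit weights, which the card does not state explicitly.

```python
# Figures quoted in this model card.
bf16_gb = 1510      # approximate BF16 checkpoint size
tq3_gb = 309        # TQ3 checkpoint size
hourly_usd = 0.11   # CPU instance price per hour
total_usd = 0.84    # reported total cost

ratio = bf16_gb / tq3_gb
# Average bits per weight, assuming every original weight used 16 bits.
avg_bits = 16 / ratio
# Instance time implied by the reported cost.
instance_hours = total_usd / hourly_usd

print(f"compression ratio: {ratio:.1f}x")
print(f"average bits per weight: {avg_bits:.2f}")
print(f"instance hours: {instance_hours:.1f}")
```

The average comes out slightly above 3 bits, which would be consistent with per-group norms and any tensors kept at higher precision, although the card does not break this down.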
Status
Not yet tested on GPU. This checkpoint was created and uploaded automatically. Quality validation on a multi-GPU setup is pending.
The same code path was validated on GLM-4.7-Flash (355B, same MoE architecture with 64 experts), where the checkpoint loaded successfully and answered all test prompts correctly using 13.3 GB of GPU memory.
Architecture
GLM-5.1 uses the Glm4MoeLiteNaiveMoe architecture:
- 769B total parameters, 40B active per token
- 256 routed experts, 8 active per token, 1 shared expert
- 78 layers, hidden_size=6144
- Multi-head Latent Attention (MLA)
- First 3 layers are dense (not MoE)
- 200K context window
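To make "256 routed experts, 8 active per token" concrete, here is a minimal top-k routing sketch. The scoring function, the weight normalization, and the `route` helper are illustrative assumptions, not the model's actual router; the shared expert is not shown because it runs for every token regardless of routing.

```python
import random

NUM_EXPERTS = 256  # routed experts (from the list above)
TOP_K = 8          # experts active per token (from the list above)

def route(scores, k=TOP_K):
    """Pick the k highest-scoring experts and normalize their weights to sum to 1.

    Hypothetical helper: assumes positive scores and simple sum-normalization.
    """
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    total = sum(scores[e] for e in top)
    return [(e, scores[e] / total) for e in top]

rng = random.Random(0)
# Stand-in for one token's positive router scores.
scores = [rng.random() for _ in range(NUM_EXPERTS)]
chosen = route(scores)

for expert_id, weight in chosen:
    print(f"expert {expert_id:3d}  weight {weight:.3f}")
print(f"fraction of routed experts used: {TOP_K / NUM_EXPERTS:.3%}")
```

Only the chosen experts' weights are read for a given token, which is why the active parameter count is far smaller than the total.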
How it works
TQ3 applies the WHT rotation and Gaussian Lloyd-Max codebook from TurboQuant (ICLR 2026). After a random Walsh-Hadamard transform (WHT) rotation, weight distributions become near-Gaussian, so they quantize efficiently with 8 centroids (3-bit) per 128-element group. No calibration data is needed.
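The idea can be sketched in pure Python for a single 128-element group. This is an illustration under stated assumptions, not the turboquant-plus-vllm implementation: the random rotation is modeled as random sign flips followed by an orthonormal Walsh-Hadamard transform, the codebook is fitted to standard-normal samples with Lloyd's algorithm, and each group is rescaled to unit variance before lookup. All function names are hypothetical.

```python
import math
import random

GROUP = 128   # elements per quantization group (from the text above)
LEVELS = 8    # 3-bit codebook

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (its own inverse)."""
    x = list(x)
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    scale = 1 / math.sqrt(len(x))
    return [v * scale for v in x]

def nearest(codebook, v):
    """Index of the centroid closest to v."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - v))

def lloyd_max_gaussian(levels=LEVELS, samples=4000, iters=25, seed=0):
    """Fit a scalar codebook to N(0, 1) samples with Lloyd's algorithm."""
    rng = random.Random(seed)
    data = sorted(rng.gauss(0.0, 1.0) for _ in range(samples))
    # Start from evenly spaced sample quantiles.
    codebook = [data[(2 * k + 1) * samples // (2 * levels)] for k in range(levels)]
    for _ in range(iters):
        sums, counts = [0.0] * levels, [0] * levels
        for v in data:
            k = nearest(codebook, v)
            sums[k] += v
            counts[k] += 1
        codebook = [sums[k] / counts[k] if counts[k] else codebook[k]
                    for k in range(levels)]
    return codebook

def quantize_group(w, signs, codebook):
    """Rotate, rescale to unit variance, and map each value to a 3-bit index."""
    rotated = fwht([s * v for s, v in zip(signs, w)])
    norm = math.sqrt(sum(v * v for v in rotated))
    scale = norm / math.sqrt(len(w)) or 1.0
    return [nearest(codebook, v / scale) for v in rotated], norm

def dequantize_group(idx, norm, signs, codebook):
    """Look up centroids, undo the scaling, then undo the rotation."""
    scale = norm / math.sqrt(len(idx))
    rotated = [codebook[k] * scale for k in idx]
    return [s * v for s, v in zip(signs, fwht(rotated))]

rng = random.Random(1)
codebook = lloyd_max_gaussian()
signs = [rng.choice((-1.0, 1.0)) for _ in range(GROUP)]
# Toy weights: small Gaussian values plus a few outliers (not real model data).
weights = [rng.gauss(0.0, 0.02) for _ in range(GROUP)]
for pos in (3, 40, 77, 101):
    weights[pos] *= 8

idx, norm = quantize_group(weights, signs, codebook)
restored = dequantize_group(idx, norm, signs, codebook)
err = sum((a - b) ** 2 for a, b in zip(weights, restored))
rel_err = math.sqrt(err / sum(a * a for a in weights))
print(f"relative reconstruction error: {rel_err:.3f}")
```

Because the codebook depends only on the Gaussian shape and the per-group norm carries the scale, nothing in this sketch needs calibration data.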
The checkpoint stores packed 3-bit indices + per-group norms. The loader handles:
- Per-expert 2D → fused 3D regrouping (gate_proj + up_proj → gate_up_proj fusion)
- Router/gate weight decompression in-place
- Meta-device model creation for low-memory loading
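As a rough illustration of "packed 3-bit indices", the sketch below packs eight indices into three bytes and unpacks them again. The bit order and the `pack3`/`unpack3` names are assumptions made for this example; the checkpoint's actual on-disk layout may differ.

```python
def pack3(indices):
    """Pack values in 0..7 into bytes, eight indices per three bytes."""
    if len(indices) % 8:
        raise ValueError("length must be a multiple of 8")
    out = bytearray()
    for i in range(0, len(indices), 8):
        word = 0
        for j, v in enumerate(indices[i:i + 8]):
            if not 0 <= v < 8:
                raise ValueError("index out of 3-bit range")
            word |= v << (3 * j)   # assumed layout: index j occupies bits 3j..3j+2
        out += word.to_bytes(3, "little")
    return bytes(out)

def unpack3(data):
    """Inverse of pack3: recover the list of 3-bit indices."""
    out = []
    for i in range(0, len(data), 3):
        word = int.from_bytes(data[i:i + 3], "little")
        out.extend((word >> (3 * j)) & 7 for j in range(8))
    return out

group = [k % 8 for k in range(128)]   # one 128-element group of toy indices
packed = pack3(group)
print(len(packed), "bytes for", len(group), "indices")
```

A 128-element group therefore takes 48 bytes of indices, plus whatever the per-group norm occupies.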
Usage
pip install turboquant-plus-vllm@git+https://github.com/varjoranta/turboquant-vllm.git
from turboquant_vllm import load_tq3_model
model, tokenizer = load_tq3_model("varjosoft/GLM-5.1-Open-TQ3", device="cuda")
# Requires multi-GPU setup — see requirements below
GPU requirements for inference
| Setup | Total VRAM | Per-GPU | Cost/hr (Verda) |
|---|---|---|---|
| 8× A100 80GB | 640 GB | 45 GB | $10.32 |
| 4× H200 141GB | 564 GB | 90 GB | $13.56 |
| 2× B300 262GB | 524 GB | 180 GB | $13.98 |
Without TQ3, the BF16 model requires 1,510 GB VRAM (minimum 8× B300 at $55.92/hr).
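A quick way to sanity-check these setups is to divide the checkpoint size by the GPU count. The sketch below assumes the weights shard evenly across GPUs and ignores KV cache, activations, and framework overhead, so it gives a lower bound rather than a measured figure.

```python
TQ3_GB = 309  # checkpoint size from the compression table

# (GPU count, VRAM per GPU in GB), taken from the table above.
setups = {
    "8x A100 80GB": (8, 80),
    "4x H200 141GB": (4, 141),
    "2x B300 262GB": (2, 262),
}

shares = {}
for name, (gpus, vram_gb) in setups.items():
    share = TQ3_GB / gpus   # weights per GPU under even sharding (assumption)
    shares[name] = share
    print(f"{name}: {share:.1f} GB of weights per GPU, "
          f"{vram_gb - share:.1f} GB left for everything else")
```

The Per-GPU column in the table is higher than these weight-only shares, which would be consistent with it including runtime overhead; the card does not say how it was derived.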
Software requirements
- transformers >= 5.5.0
- turboquant-plus-vllm (GitHub)
- PyTorch with CUDA
Comparison with other quantizations
| Method | Size | Calibration | Format | Target |
|---|---|---|---|---|
| This (TQ3) | 309 GB (4.9x) | None | Safetensors | GPU serving (vLLM/PyTorch) |
| Unsloth Dynamic 2-bit | 236 GB (6.4x) | 300K+ tokens | GGUF | Local/CPU (llama.cpp) |
| BF16 original | 1,510 GB | N/A | Safetensors | 8× B300+ |
License
MIT (same as base model). Created by Varjosoft Oy.