Configuration Parsing Warning:Invalid JSON for config file config.json

Nemotron-3-Nano-30B-A3B - RotorQuant MLX 8-bit

8-bit weight-quantized MLX version of nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with RotorQuant KV-cache quantization. Optimized for Apple Silicon inference via the MLX framework. RotorQuant delivers 5.3x faster prefill and 28% faster decode compared to TurboQuant. Only 3.2B parameters are active per token despite 30.7B total, making this model significantly more efficient at inference time than its parameter count suggests. The hybrid Mamba-2 + Transformer MoE architecture supports up to 1M context length.

Approximate model size: ~30 GB

Model Specifications

Property Value
Base Model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Parameters 30.7 billion total (3.2 billion active per token)
Architecture Hybrid Mamba-2 + Transformer MoE (3.2B active per token)
Context Length 1,048,576 tokens (1M)
License NVIDIA Open Model License (commercial use OK)
Weight Quantization 8-bit (~30 GB)
KV-Cache Quantization RotorQuant
Framework MLX (Apple Silicon)

Quickstart

from mlx_lm import load, generate
from rotorquant import IsoQuantCache

model, tokenizer = load("majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit")

prompt = "Explain the theory of relativity."
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)

What is RotorQuant?

RotorQuant applies block-diagonal rotations (Clifford algebra) for KV cache compression. Combined with 8-bit weight quantization in MLX, this provides a dual compression strategy with superior KV-cache performance: smaller model weights plus faster compressed KV cache for efficient long-context generation.

Key advantages over TurboQuant:

  • 5.3x faster prefill
  • 28% faster decode
  • Equivalent memory savings

KV-Cache Quantization Comparison

Method Prefill Speed Decode Speed Memory Savings Reference
TurboQuant 1x (baseline) 1x (baseline) High arXiv: 2504.19874
RotorQuant 5.3x faster 28% faster High GitHub

Memory Estimates (Nemotron-3-Nano-30B-A3B)

Precision Approximate Size MLX Variant
BF16 (original) ~60 GB --
8-bit quantized ~30 GB This model
4-bit quantized ~17 GB RotorQuant-MLX-4bit
2-bit quantized ~9 GB RotorQuant-MLX-2bit

Hardware Requirements

This model requires approximately 30 GB of unified memory. Recommended hardware:

  • Apple M2 Ultra (64 GB+)
  • Apple M3 Ultra (64 GB+)
  • Apple M4 Max (48 GB+)
  • Any Apple Silicon Mac with 48 GB+ unified memory

See Also

Downloads last month
65
Safetensors
Model size
32B params
Tensor type
BF16
·
U32
·
F32
·
MLX
Hardware compatibility
Log In to add your hardware

Quantized

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit

Finetuned
(43)
this model

Paper for majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit