Instructions to use majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit"
        }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory: pi
- Hermes Agent
How to use majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit
Run Hermes
hermes
- MLX LM
How to use majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm

# Start the server (it listens on port 8080 unless --port is given)
mlx_lm.server --model "majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit"

# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
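The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server was started as above and is reachable on port 8080 (the port used in the Pi and Hermes configurations). The helper names `build_chat_request` and `chat`, and the `max_tokens` value, are illustrative choices rather than part of mlx-lm.

```python
import json
import urllib.request

# Assumption: mlx_lm.server is running locally on port 8080.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit"


def build_chat_request(user_text, max_tokens=256):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text, max_tokens=256):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(user_text, max_tokens)
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    # OpenAI-style responses carry the reply in choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(chat("Hello"))
```

Keeping request construction separate from sending makes it possible to inspect the payload without a live server.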
Nemotron-3-Nano-30B-A3B - RotorQuant MLX 8-bit
This is an 8-bit weight-quantized MLX build of nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with RotorQuant KV-cache quantization, optimized for inference on Apple Silicon via the MLX framework. RotorQuant delivers 5.3x faster prefill and 28% faster decode than TurboQuant. Only 3.2B of the model's 30.7B parameters are active per token, so inference is considerably cheaper than the total parameter count suggests. The hybrid Mamba-2 + Transformer MoE architecture supports context lengths of up to 1M tokens.
Model size: ~30 GB
Model Specifications
| Property | Value |
|---|---|
| Base Model | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| Parameters | 30.7 billion total (3.2 billion active per token) |
| Architecture | Hybrid Mamba-2 + Transformer MoE |
| Context Length | 1,048,576 tokens (1M) |
| License | NVIDIA Open Model License (commercial use OK) |
| Weight Quantization | 8-bit (~30 GB) |
| KV-Cache Quantization | RotorQuant |
| Framework | MLX (Apple Silicon) |
Quickstart
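from mlx_lm import load, generate
from rotorquant import IsoQuantCache  # imported but not passed to generate() below

# Download (if needed) and load the 8-bit weights and the tokenizer
model, tokenizer = load("majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit")

prompt = "Explain the theory of relativity."
# Generate up to 512 new tokens
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)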
What is RotorQuant?
RotorQuant compresses the KV cache using block-diagonal rotations based on Clifford algebra. Combined with 8-bit weight quantization in MLX, this gives two layers of compression: smaller model weights and a compressed KV cache that is faster to process, which together make long-context generation more efficient.
Key advantages over TurboQuant:
- 5.3x faster prefill
- 28% faster decode
- Equivalent memory savings
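To make the idea of rotating before quantizing concrete, here is a toy sketch in pure Python. It is not RotorQuant's implementation: plain 2x2 planar rotations stand in for the Clifford-algebra rotors, the angles are random, and the 8-bit scheme is a simple symmetric per-vector quantizer. All function names are hypothetical.

```python
import math
import random


def rotate_pairs(vec, angles, inverse=False):
    """Apply a block-diagonal rotation built from 2x2 blocks.

    Block i rotates the pair (vec[2i], vec[2i+1]) by angles[i];
    inverse=True applies the transpose (rotation by -angles[i]).
    """
    out = []
    for i, theta in enumerate(angles):
        if inverse:
            theta = -theta
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([c * x - s * y, s * x + c * y])
    return out


def quantize_int8(vec):
    """Symmetric per-vector 8-bit quantization: returns (codes, scale)."""
    scale = max(abs(v) for v in vec) / 127 or 1.0
    return [round(v / scale) for v in vec], scale


def dequantize(codes, scale):
    """Map integer codes back to floats."""
    return [c * scale for c in codes]


random.seed(0)
dim = 8
key = [random.gauss(0, 1) for _ in range(dim)]  # stand-in for one cached key vector
angles = [random.uniform(0, 2 * math.pi) for _ in range(dim // 2)]

rotated = rotate_pairs(key, angles)             # rotate
codes, scale = quantize_int8(rotated)           # store 8-bit codes plus one scale
restored = rotate_pairs(dequantize(codes, scale), angles, inverse=True)

max_err = max(abs(a - b) for a, b in zip(key, restored))
print(f"max reconstruction error: {max_err:.4f} (scale {scale:.4f})")
```

Because each block is orthogonal, the rotation preserves vector norms and is undone exactly by its transpose, so the only loss comes from rounding. A block-diagonal rotation also touches only a few coordinates per block, which makes it cheaper to apply than a dense rotation over the full vector.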
KV-Cache Quantization Comparison
| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |
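As a rough illustration of what these ratios mean for wall-clock time, the sketch below scales a TurboQuant baseline by the quoted speedups. It assumes "28% faster" means 1.28x decode throughput, and the baseline timings are hypothetical numbers chosen only for the example.

```python
def rotorquant_time(turbo_prefill_s, turbo_decode_s,
                    prefill_speedup=5.3, decode_speedup=1.28):
    """Scale TurboQuant baseline timings by the speedups quoted in the table.

    Assumption: "28% faster" is read as 1.28x throughput.
    """
    prefill = turbo_prefill_s / prefill_speedup
    decode = turbo_decode_s / decode_speedup
    return prefill, decode, prefill + decode


# Hypothetical baseline: 10 s of prefill and 20 s of decode under TurboQuant.
prefill, decode, total = rotorquant_time(10.0, 20.0)
print(f"prefill {prefill:.2f} s, decode {decode:.2f} s, total {total:.2f} s")
```

The prefill gain dominates for long prompts with short answers, while decode-heavy workloads see a smaller end-to-end improvement.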
Memory Estimates (Nemotron-3-Nano-30B-A3B)
| Precision | Approximate Size | MLX Variant |
|---|---|---|
| BF16 (original) | ~60 GB | -- |
| 8-bit quantized | ~30 GB | This model |
| 4-bit quantized | ~17 GB | RotorQuant-MLX-4bit |
| 2-bit quantized | ~9 GB | RotorQuant-MLX-2bit |
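The figures above can be approximated from the parameter count. The sketch below assumes group-wise quantization with groups of 64 weights, each storing a 16-bit scale and a 16-bit bias (about 0.5 extra bits per weight), and it treats every parameter as quantized. Both are simplifying assumptions, so the results are estimates only.

```python
TOTAL_PARAMS = 30.7e9  # total parameter count from the specifications table


def weight_size_gb(bits, group_size=64, overhead_bits_per_group=32):
    """Estimate weight storage in GB (10^9 bytes).

    Assumes each group of `group_size` weights carries
    `overhead_bits_per_group` extra bits (scale plus bias).
    Pass group_size=None for unquantized weights.
    """
    bits_per_weight = bits
    if group_size is not None:
        bits_per_weight += overhead_bits_per_group / group_size
    return TOTAL_PARAMS * bits_per_weight / 8 / 1e9


print(f"BF16:  {weight_size_gb(16, group_size=None):.1f} GB")
for bits in (8, 4, 2):
    print(f"{bits}-bit: {weight_size_gb(bits):.1f} GB")
```

Actual checkpoint sizes depend on which tensors are quantized and on the group size used, so they will not match this estimate exactly.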
Hardware Requirements
The weights alone occupy approximately 30 GB of unified memory, so the machine needs additional headroom for the KV cache and the operating system. Recommended hardware:
- Apple M2 Ultra (64 GB+)
- Apple M3 Ultra (64 GB+)
- Apple M4 Max (48 GB+)
- Any Apple Silicon Mac with 48 GB+ unified memory
See Also
- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 -- Base model
- majentik/Nemotron-3-Nano-30B-A3B-RotorQuant -- RotorQuant KV-cache only (transformers)
- majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-4bit -- MLX 4-bit variant
- majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-2bit -- MLX 2-bit variant
- majentik/Nemotron-3-Nano-30B-A3B-TurboQuant-MLX-8bit -- TurboQuant MLX 8-bit variant
- RotorQuant GitHub
- MLX Framework