Instructions for using majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit with libraries, inference providers, notebooks, and local apps.

Libraries

How to use majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit"
        }
      ]
    }
  }
}
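
If ~/.pi/agent/models.json already lists other providers, the entry can be merged in rather than pasted over the file. The following is a minimal sketch, assuming the file layout shown above; the helper name add_mlx_provider is hypothetical, and it only prints the merged JSON instead of writing to disk.

```python
import copy
import json

MODEL_ID = "majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit"


def add_mlx_provider(config, model_id=MODEL_ID, base_url="http://localhost:8080/v1"):
    """Return a copy of `config` with an "mlx-lm" provider entry added.

    Hypothetical helper: the keys mirror the models.json snippet above.
    Existing providers in `config` are left untouched.
    """
    merged = copy.deepcopy(config)
    providers = merged.setdefault("providers", {})
    providers["mlx-lm"] = {
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }
    return merged


# Start from an empty config here; in practice, load the existing file first.
print(json.dumps(add_mlx_provider({}), indent=2))
```

Review the printed JSON before saving it to ~/.pi/agent/models.json.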

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit

Run Hermes

hermes

MLX LM

How to use majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit"
# From a second terminal, call the OpenAI-compatible server with curl
# (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
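
The same chat request can be sent from Python. This is a minimal sketch using only the standard library; the helper names build_chat_request and chat are hypothetical, and the default base URL matches the one used in the Pi configuration above, so adjust it if your server reports a different host or port at startup.

```python
import json
import urllib.request

MODEL_ID = "majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit"


def build_chat_request(prompt, base_url="http://localhost:8080/v1", model=MODEL_ID):
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes the OpenAI-style response shape: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
#   print(chat("Hello"))
```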

Configuration Parsing Warning: Invalid JSON for config file config.json

Nemotron-3-Nano-30B-A3B - RotorQuant MLX 8-bit

An 8-bit weight-quantized MLX build of nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with RotorQuant KV-cache quantization, optimized for Apple Silicon inference via the MLX framework. RotorQuant delivers 5.3x faster prefill and 28% faster decode compared to TurboQuant. The model has 30.7B total parameters but activates only 3.2B per token, so per-token compute is far lower than the total parameter count suggests, although all 30.7B parameters must still fit in memory. The hybrid Mamba-2 + Transformer MoE architecture supports context lengths of up to 1M tokens.

Model size: ~30 GB

Model Specifications

Property	Value
Base Model	nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Parameters	30.7 billion total (3.2 billion active per token)
Architecture	Hybrid Mamba-2 + Transformer MoE
Context Length	1,048,576 tokens (1M)
License	NVIDIA Open Model License (commercial use OK)
Weight Quantization	8-bit (~30 GB)
KV-Cache Quantization	RotorQuant
Framework	MLX (Apple Silicon)

Quickstart

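from mlx_lm import load, generate
# IsoQuantCache comes from the rotorquant package, which is separate from mlx-lm.
# It is imported here but not passed to generate() in this minimal example.
from rotorquant import IsoQuantCache

model, tokenizer = load("majentik/Nemotron-3-Nano-30B-A3B-RotorQuant-MLX-8bit")

# Wrap the prompt in the model's chat template, as in the MLX example above.
messages = [{"role": "user", "content": "Explain the theory of relativity."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)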

What is RotorQuant?

RotorQuant compresses the KV cache using block-diagonal rotations derived from Clifford algebra. Combined with 8-bit weight quantization in MLX, this gives two layers of compression: smaller model weights and a compressed KV cache, which together make long-context generation more efficient.

Key advantages over TurboQuant:

5.3x faster prefill
28% faster decode
Equivalent memory savings
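
To illustrate the general idea of rotating a vector block by block before quantizing it, here is a toy sketch in plain Python. It is not the RotorQuant implementation: the 2x2 block size, the rotation angles, the symmetric uniform quantizer, and all function names are assumptions chosen for brevity. Because each block is an orthogonal rotation, it preserves the vector norm and can be undone exactly, so only the quantization step loses information.

```python
import math


def rotate_pairs(vec, angles, inverse=False):
    """Apply a block-diagonal rotation built from 2x2 blocks.

    Block i rotates the pair (vec[2i], vec[2i+1]) by angles[i];
    inverse=True applies the opposite rotation.
    """
    out = []
    for i, theta in enumerate(angles):
        if inverse:
            theta = -theta
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend((c * x - s * y, s * x + c * y))
    return out


def quantize(vec, bits=8):
    """Symmetric uniform quantization: integer codes plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vec) / qmax or 1.0  # avoid a zero scale
    return [round(v / scale) for v in vec], scale


def compress(vec, angles, bits=8):
    """Rotate, then quantize (illustrative stand-in for a KV-cache entry)."""
    return quantize(rotate_pairs(vec, angles), bits)


def decompress(codes, scale, angles):
    """Dequantize, then undo the rotation."""
    return rotate_pairs([c * scale for c in codes], angles, inverse=True)


key = [1.0, -2.0, 0.5, 3.0]   # toy 4-dimensional vector
angles = [0.3, 1.1]           # one arbitrary angle per 2x2 block
codes, scale = compress(key, angles)
restored = decompress(codes, scale, angles)
print(codes, [round(v, 3) for v in restored])
```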

KV-Cache Quantization Comparison

Method	Prefill Speed	Decode Speed	Memory Savings	Reference
TurboQuant	1x (baseline)	1x (baseline)	High	arXiv: 2504.19874
RotorQuant	5.3x faster	28% faster	High	GitHub

Memory Estimates (Nemotron-3-Nano-30B-A3B)

Precision	Approximate Size	MLX Variant
BF16 (original)	~60 GB	--
8-bit quantized	~30 GB	This model
4-bit quantized	~17 GB	RotorQuant-MLX-4bit
2-bit quantized	~9 GB	RotorQuant-MLX-2bit
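
The sizes in this table can be roughly reproduced from the parameter count. The sketch below assumes group-wise quantization with 64 weights per group and a 16-bit scale plus a 16-bit bias stored per group (MLX's default group size); the model card does not state which settings this repository used, so treat the output as an estimate. The function name weight_size_gb is hypothetical.

```python
def weight_size_gb(n_params, bits, group_size=None, overhead_bits=32):
    """Approximate weight storage in GB (10^9 bytes).

    When group_size is given, each group of weights also stores
    overhead_bits of metadata (assumed: 16-bit scale + 16-bit bias).
    """
    bits_per_weight = bits
    if group_size:
        bits_per_weight += overhead_bits / group_size
    return n_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 30.7e9  # total parameter count from the specifications table

for label, bits, group in [
    ("BF16", 16, None),
    ("8-bit", 8, 64),
    ("4-bit", 4, 64),
    ("2-bit", 2, 64),
]:
    print(f"{label}: ~{weight_size_gb(TOTAL_PARAMS, bits, group):.1f} GB")
```

Under these assumptions the estimates land within a few GB of the table's rounded figures; the remaining gap likely reflects rounding and any layers kept at a different precision.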

Hardware Requirements

The weights alone occupy approximately 30 GB of unified memory; the KV cache and the operating system need additional headroom. Recommended hardware:

Apple M2 Ultra (64 GB+)
Apple M3 Ultra (64 GB+)
Apple M4 Max (48 GB+)
Any Apple Silicon Mac with 48 GB+ unified memory


Paper

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. arXiv: 2504.19874, published Apr 28, 2025.
