Text Generation
MLX
Safetensors
English
gpt_oss
apple-silicon
Mixture of Experts
mixture-of-experts
4-bit precision
quantized
gpt-oss
context-retrieval
Eval Results (legacy)
Instructions to use foadmk/context-1-MLX-MXFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use foadmk/context-1-MLX-MXFP4 with MLX:
# Make sure mlx-lm is installed # pip install --upgrade mlx-lm # if on a CUDA device, also pip install mlx[cuda] # Generate text with mlx-lm from mlx_lm import load, generate model, tokenizer = load("foadmk/context-1-MLX-MXFP4") prompt = "Once upon a time in" text = generate(model, tokenizer, prompt=prompt, verbose=True) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- MLX LM
How to use foadmk/context-1-MLX-MXFP4 with MLX LM:
Generate or start a chat session
# Install MLX LM uv tool install mlx-lm # Generate some text mlx_lm.generate --model "foadmk/context-1-MLX-MXFP4" --prompt "Once upon a time"
chromadb/context-1 MLX MXFP4
This model was converted from chromadb/context-1 to MLX format with MXFP4 (4-bit) quantization for efficient inference on Apple Silicon.
Model Description
- Base Model: chromadb/context-1 (fine-tuned from openai/gpt-oss-20b)
- Architecture: 20B parameter Mixture of Experts (MoE) with 32 experts, 4 active per token
- Format: MLX with MXFP4 quantization
- Quantization: 4.504 bits per weight
Performance (Apple M1 Max, 64GB)
| Metric | Value |
|---|---|
| Model Size | 11 GB |
| Peak Memory | 12 GB |
| Generation Speed | ~69 tokens/sec |
| Prompt Processing | ~70 tokens/sec |
| Latency | ~14.5 ms/token |
Usage
from mlx_lm import load, generate
model, tokenizer = load("foadmk/context-1-MLX-MXFP4")
response = generate(model, tokenizer, prompt="What is the capital of France?", max_tokens=100, verbose=True)
Conversion Notes
The chromadb/context-1 model uses a different weight format than the original openai/gpt-oss-20b, which required custom conversion logic:
Key Differences from Original Format
- Dense BF16 tensors (not quantized blocks with
_blockssuffix) - gate_up_proj shape:
(experts, hidden, intermediate*2)with interleaved gate/up weights
Weight Transformations Applied
gate_up_proj
(32, 2880, 5760):- Transpose to
(32, 5760, 2880) - Interleaved split:
[:, ::2, :]for gate,[:, 1::2, :]for up - Result:
gate_proj.weightandup_proj.weighteach(32, 2880, 2880)
- Transpose to
down_proj
(32, 2880, 2880):- Transpose to match MLX expected format
Bypass mlx_lm sanitize: Pre-naming weights with
.weightsuffix to skip incorrect splitting
Conversion Script
A conversion script is included in this repo: convert_context1_to_mlx.py
python convert_context1_to_mlx.py --output ./context1-mlx-mxfp4
Intended Use
This model is optimized for:
- Context-aware retrieval and search tasks
- Running locally on Apple Silicon Macs
- Low-latency inference without GPU requirements
Limitations
- Requires Apple Silicon Mac with MLX support
- Best performance on M1 Pro/Max/Ultra or newer with 32GB+ RAM
- Model outputs structured JSON-like responses (inherited from base model training)
Citation
If you use this model, please cite the original:
@misc{chromadb-context-1,
author = {Chroma},
title = {Context-1: A Fine-tuned GPT-OSS Model for Retrieval},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/chromadb/context-1}
}
Acknowledgments
- chromadb for the original context-1 model
- OpenAI for the gpt-oss-20b base model
- Apple MLX team for the MLX framework
- mlx-community for MLX model conversion tools
- Downloads last month
- 30
Model size
21B params
Tensor type
BF16
·
U32 ·
Hardware compatibility
Log In to add your hardware
4-bit
Model tree for foadmk/context-1-MLX-MXFP4
Evaluation results
- Tokens per second (M1 Max)self-reported69.000
- Peak Memory (GB)self-reported12.000