AI & ML interests
None defined yet.
Recent Activity
Articles
Smith42 updated a dataset about 5 hours ago
mubashir1837 published an article 1 day ago
Article
GeneFix-AI: AI-Powered CRISPR-Cas9 System for Real-Time Detection and Correction of Mutations in Non-Human Species
hugging-science
• aritraroy24 authored a paper 4 days ago
Smith42 updated a dataset 11 days ago
Smith42 published a dataset 11 days ago
Add README with skymap as hero figure
#2 opened 13 days ago by tbussozungri
Post
3109
Inspired by the Nemotron Diffusion recipe, check out dhara-250m, a 250M experimental language model that supports three decoding modes from one set of weights: autoregressive, block-diffusion, and self-speculation.
It is small, easy to try, and meant for exploring diffusion-style decoding and latency tradeoffs in compact LMs.
Model: codelion/dhara-250m
Try the chat demo here: codelion/dhara-chat
tom-hehir updated a dataset 18 days ago
pedrocurvo authored 2 papers 21 days ago
pedrocurvo submitted a paper to Daily Papers 22 days ago
cgeorgiaw updated a dataset 23 days ago
specimba updated a model 24 days ago
specimba published a model 24 days ago
aritraroy24 authored a paper 27 days ago
Post
3415
Scaling Pedagogical Pre-training to 10 Billion Tokens
New blog post exploring what happens when you take optimal data mixing insights and scale up the data generation itself.
We built Sutra, a multi-stage framework for generating pedagogical pre-training data guided by a knowledge graph of ~2,000 concepts across 9 domains. The pipeline includes structured content generation, six-dimension quality evaluation, diversity management across 20 content styles, and a cleaning stage to prevent collapse.
The result is codelion/sutra-10B, a 10.2 billion token pedagogical dataset with rich metadata (domain, complexity, prerequisites, quality scores) on every entry.
We trained codelion/SmolLM2-70M on it for 3 full epochs (30.6B tokens) on a single A10 GPU in ~78 hours.
Key finding: perplexity kept improving across epochs, but benchmark gains plateaued fast. At 70M parameters, the model hits a representational ceiling that more data alone can't break through.
Full writeup with comparisons against 7 other datasets, detailed benchmark breakdowns, and connections to recent work on synthetic data scaling, curriculum learning, and data mixing laws: https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens
All datasets at multiple scales (10M, 100M, 1B, 10B) plus seed concepts and an SFT variant are in the Sutra Pedagogical Datasets collection.
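As a quick sanity check on the training figures quoted above, the following sketch reproduces the 30.6B-token total and derives the implied average throughput. The constants come from the post; the function names are hypothetical, and the throughput is only a back-of-the-envelope average that assumes the ~78 hours were spent entirely on training.

```python
# Figures quoted in the post (the ~78 hours is approximate).
DATASET_TOKENS = 10.2e9   # size of codelion/sutra-10B in tokens
EPOCHS = 3                # full passes over the dataset
WALL_CLOCK_HOURS = 78     # reported single-A10 training time


def total_tokens_seen(dataset_tokens: float, epochs: int) -> float:
    """Tokens processed when the whole dataset is repeated for `epochs` passes."""
    return dataset_tokens * epochs


def average_tokens_per_second(total_tokens: float, hours: float) -> float:
    """Average throughput, assuming all wall-clock time was spent training."""
    return total_tokens / (hours * 3600)


if __name__ == "__main__":
    total = total_tokens_seen(DATASET_TOKENS, EPOCHS)
    rate = average_tokens_per_second(total, WALL_CLOCK_HOURS)
    print(f"total tokens: {total / 1e9:.1f}B")
    print(f"average throughput: {rate:,.0f} tokens/s")
```

The total matches the 30.6B tokens stated in the post; the throughput figure is derived here and is not reported by the author.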
BioMike authored a paper 3 months ago
AyushM6 authored a paper 4 months ago
Post
3280
Reverse Engineering a $500M Mystery: From HashHop to Memory-Augmented Language Models
I wrote a deep dive into how Magic AI's 100M token context window might work, starting from their HashHop benchmark and building up to MALM, a Memory-Augmented Language Model.
Key insight: treating each key as a single token enables perfect retrieval at unlimited context lengths.
The article covers:
- How HashHop works and why its perfect accuracy is suspicious
- Building a tokenized solver that achieves 100% accuracy
- Scaling to MALM for real code search tasks
- Why this approach could handle 100M+ tokens
Read the full article: https://huggingface.co/blog/codelion/reverse-engineering-magic-hashhop
Try the model: codelion/malm-165m
Code: https://github.com/codelion/hash-hop
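The "each key as a single token" idea can be illustrated with a toy multi-hop lookup. This is a minimal sketch under stated assumptions, not the code from the linked repository or Magic AI's benchmark format: the hash length, the chain layout, and every function name here are hypothetical. It only shows why retrieval becomes exact once a whole hash maps to one token ID, because each hop reduces to an exact-match table lookup.

```python
import random
import string


def random_hash(rng: random.Random, length: int = 8) -> str:
    """Generate a random lowercase string standing in for a hash."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def build_chain_context(rng: random.Random, num_chains: int, hops: int):
    """Create shuffled (key, value) pairs forming `num_chains` chains of `hops` links."""
    seen, chains, pairs = set(), [], []
    for _ in range(num_chains):
        chain = []
        while len(chain) < hops + 1:
            h = random_hash(rng)
            if h not in seen:  # keep every hash unique across the context
                seen.add(h)
                chain.append(h)
        chains.append(chain)
        pairs.extend(zip(chain, chain[1:]))
    rng.shuffle(pairs)  # order in the context carries no information
    return pairs, chains


def tokenize_whole_keys(pairs):
    """Assign one token ID per whole hash instead of splitting it into pieces."""
    vocab = {}
    token_pairs = []
    for key, value in pairs:
        k = vocab.setdefault(key, len(vocab))
        v = vocab.setdefault(value, len(vocab))
        token_pairs.append((k, v))
    return vocab, token_pairs


def follow_hops(token_pairs, start_token: int, hops: int) -> int:
    """Resolve a multi-hop query with exact-match lookups on token IDs."""
    table = dict(token_pairs)
    current = start_token
    for _ in range(hops):
        current = table[current]
    return current


if __name__ == "__main__":
    rng = random.Random(0)
    pairs, chains = build_chain_context(rng, num_chains=100, hops=4)
    vocab, token_pairs = tokenize_whole_keys(pairs)
    correct = sum(
        follow_hops(token_pairs, vocab[c[0]], 4) == vocab[c[-1]] for c in chains
    )
    print(f"{correct}/{len(chains)} chains resolved")
```

In this toy setup accuracy does not depend on how many pairs are in the context, since nothing is ever matched approximately. A real model would have to learn the lookup rather than use a Python dictionary.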
Nionio authored a paper 5 months ago