Hugging Science

Team

community

AI & ML interests

None defined yet.

Recent Activity

Smith42 updated a dataset about 5 hours ago

hugging-science/mmu_legacysurvey_dr10_south_21

mubashir1837 published an article 1 day ago

GeneFix-AI: AI-Powered CRISPR-Cas9 System for Real-Time Detection and Correction of Mutations in Non-Human Species

Smith42 updated a dataset 11 days ago

hugging-science/mmu_apogee_dr17

Articles

GeneFix-AI: AI-Powered CRISPR-Cas9 System for Real-Time Detection and Correction of Mutations in Non-Human Species

1 day ago

The Pharmome Map: a comprehensive public dataset for drug-target interaction modeling

Nov 18, 2025 • 15

Advancing Predictive ADMET Modeling Through Community-Driven Science: The ExpansionRx-OpenADMET Blind Challenge

Oct 27, 2025 • 12

Promoter-GPT: Writing DNA Instructions with Language Models

Oct 22, 2025 • 25

AI for Food Allergies

Oct 16, 2025 • 32

Smith42

updated a dataset about 5 hours ago

hugging-science/mmu_legacysurvey_dr10_south_21

Preview • Updated about 5 hours ago • 67.3k • 1

mubashir1837

published an article 1 day ago

Article

GeneFix-AI: AI-Powered CRISPR-Cas9 System for Real-Time Detection and Correction of Mutations in Non-Human Species

hugging-science • 1 day ago

aritraroy24

authored a paper 4 days ago

Beyond Text and Tables: Vision-Language Model Integration in ComProScanner for Extracting Materials Data from Scientific Figures with High Accuracy

Paper • 2606.00065 • Published 21 days ago

Smith42

updated a dataset 11 days ago

hugging-science/mmu_apogee_dr17

Viewer • Updated 11 days ago • 720k • 3.15k

Smith42

published a dataset 11 days ago

hugging-science/mmu_apogee_dr17

Viewer • Updated 11 days ago • 720k • 3.15k

Smith42

in hugging-science/mmu_legacysurvey_dr10_south_21 13 days ago

Add README with skymap as hero figure

#2 opened 13 days ago by tbussozungri

codelion

posted an update 13 days ago

Post

3109

Inspired by the Nemotron Diffusion recipe, check out dhara-250m: a 250M experimental language model that supports three decoding modes from one set of weights: autoregressive, block-diffusion, and self-speculation.

It is small, easy to try, and meant for exploring diffusion-style decoding and latency tradeoffs in compact LMs.

Model: codelion/dhara-250m

Try the chat demo here: codelion/dhara-chat

3 replies

tom-hehir

updated a dataset 18 days ago

hugging-science/mmu_manga

Updated 18 days ago • 6.49k

pedrocurvo

authored 2 papers 21 days ago

MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention

Paper • 2512.01738 • Published Mar 9 • 1

Follow the Mean: Reference-Guided Flow Matching

Paper • 2605.10302 • Published 28 days ago • 5

pedrocurvo

submitted a paper to Daily Papers 22 days ago

Follow the Mean: Reference-Guided Flow Matching

Paper • 2605.10302 • Published 28 days ago • 5

cgeorgiaw

updated a dataset 23 days ago

hugging-science/m-boltz-submissions

Viewer • Updated 23 days ago • 10 • 35

specimba

updated a model 24 days ago

hugging-science/sulphur_prompt_enhancer-Q4_K_M-imatrix.gguf

Updated 24 days ago • 251 • 1

specimba

published a model 24 days ago

hugging-science/sulphur_prompt_enhancer-Q4_K_M-imatrix.gguf

Updated 24 days ago • 251 • 1

aritraroy24

authored a paper 27 days ago

From Knowledge to Action: Outcomes of the 2025 Large Language Model (LLM) Hackathon for Applications in Materials Science and Chemistry

Paper • 2605.03205 • Published May 4

codelion

posted an update 3 months ago

Post

3415

Scaling Pedagogical Pre-training to 10 Billion Tokens

New blog post exploring what happens when you take optimal data mixing insights and scale up the data generation itself.

We built Sutra, a multi-stage framework for generating pedagogical pre-training data guided by a knowledge graph of ~2,000 concepts across 9 domains. The pipeline includes structured content generation, six-dimension quality evaluation, diversity management across 20 content styles, and a cleaning stage to prevent collapse.

The result is codelion/sutra-10B, a 10.2 billion token pedagogical dataset with rich metadata (domain, complexity, prerequisites, quality scores) on every entry.

We trained codelion/SmolLM2-70M on it for 3 full epochs (30.6B tokens) on a single A10 GPU in ~78 hours.
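The figures in this post can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above (10.2 billion tokens, 3 epochs, roughly 78 hours); the variable names are illustrative, and the throughput is a rough estimate because the wall-clock time is approximate.

```python
# Figures quoted in the post; everything below them is derived arithmetic.
DATASET_TOKENS = 10.2e9    # size of codelion/sutra-10B, in tokens
EPOCHS = 3                 # full passes over the dataset
WALL_CLOCK_HOURS = 78      # approximate ("~78 hours" on a single A10 GPU)

# Total tokens seen during training: dataset size times number of epochs.
total_tokens = DATASET_TOKENS * EPOCHS

# Average training throughput implied by the quoted wall-clock time.
tokens_per_second = total_tokens / (WALL_CLOCK_HOURS * 3600)

print(f"total tokens seen: {total_tokens / 1e9:.1f}B")
print(f"approximate throughput: {tokens_per_second:,.0f} tokens/s")
```

The total matches the 30.6B tokens stated in the post, and the implied average throughput is on the order of 100k tokens per second.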

Key finding: perplexity kept improving across epochs, but benchmark gains plateaued fast. At 70M parameters, the model hits a representational ceiling that more data alone can't break through.

Full writeup with comparisons against 7 other datasets, detailed benchmark breakdowns, and connections to recent work on synthetic data scaling, curriculum learning, and data mixing laws: https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens

All datasets at multiple scales (10M, 100M, 1B, 10B) plus seed concepts and an SFT variant are in the Sutra Pedagogical Datasets collection.

2 replies

BioMike

authored a paper 3 months ago

The Million-Label NER: Breaking Scale Barriers with GLiNER bi-encoder

Paper • 2602.18487 • Published Feb 11 • 6

AyushM6

authored a paper 4 months ago

MAEB: Massive Audio Embedding Benchmark

Paper • 2602.16008 • Published Feb 17 • 25

codelion

posted an update 5 months ago

Post

3280

Reverse Engineering a $500M Mystery: From HashHop to Memory-Augmented Language Models

I wrote a deep dive into how Magic AI's 100M token context window might work, starting from their HashHop benchmark and building up to MALM - a Memory-Augmented Language Model.

Key insight: treating each key as a single token enables perfect retrieval at unlimited context lengths.

The article covers:

- How HashHop works and why its perfect accuracy is suspicious
- Building a tokenized solver that achieves 100% accuracy
- Scaling to MALM for real code search tasks
- Why this approach could handle 100M+ tokens

Read the full article: https://huggingface.co/blog/codelion/reverse-engineering-magic-hashhop

Try the model: codelion/malm-165m

Code: https://github.com/codelion/hash-hop
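To illustrate the key insight, here is a toy sketch of a multi-hop hash-lookup task in which every hash is treated as one atomic key. It is not the author's solver and not MALM; the task layout, the function names (`make_hash`, `build_context`, `follow`), and the parameters are all assumptions made for illustration. It only shows why exact retrieval becomes trivial once each key is a single unit: the context reduces to a lookup table, and a chain of hops is a chain of exact lookups.

```python
import random
import string

def make_hash(rng, seen, length=8):
    """Return a random lowercase string not generated before (hypothetical 'hash')."""
    while True:
        h = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        if h not in seen:
            seen.add(h)
            return h

def build_context(rng, n_chains, hops):
    """Create shuffled (key, value) pairs that encode n_chains chains of `hops` hops."""
    seen = set()
    chains = [[make_hash(rng, seen) for _ in range(hops + 1)] for _ in range(n_chains)]
    pairs = [pair for chain in chains for pair in zip(chain, chain[1:])]
    rng.shuffle(pairs)  # order in the context carries no information
    return pairs, chains

def follow(table, start, hops):
    """Resolve a chain by repeated exact lookup; each key is one atomic unit."""
    current = start
    for _ in range(hops):
        current = table[current]
    return current

rng = random.Random(0)
pairs, chains = build_context(rng, n_chains=100, hops=3)
table = dict(pairs)  # the whole "context" becomes a key -> value map

correct = sum(follow(table, chain[0], 3) == chain[-1] for chain in chains)
print(f"accuracy: {correct / len(chains):.2f}")
```

Because every key is unique and matched exactly, accuracy here does not depend on how many pairs the context holds. That says nothing about how a learned model performs on real retrieval tasks.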

1 reply

Nionio

authored a paper 5 months ago

MMGP: a Mesh Morphing Gaussian Process-based machine learning method for regression of physical problems under non-parameterized geometrical variability

Paper • 2305.12871 • Published May 22, 2023

Articles

GeneFix-AI: AI-Powered CRISPR-Cas9 System for Real-Time Detection and Correction of Mutations in Non-Human Species

Agentic Scientific Machine Learning for Neural Operators

How to Build a Benchmark with a Private Test Set on Hugging Face

Why You Should Care About Partial Differential Equations (PDEs)

SARLO-80: Worldwide Slant SAR Language Optic Dataset at 80 cm Resolution

The Pharmome Map: a comprehensive public dataset for drug-target interaction modeling

Advancing Predictive ADMET Modeling Through Community-Driven Science: The ExpansionRx-OpenADMET Blind Challenge

Promoter-GPT: Writing DNA Instructions with Language Models

AI for Food Allergies

Team members 1,174
