# Scientific Reasoning Training Pipeline

Solving "AI Scientists Produce Results Without Reasoning Scientifically"

Paper: *AI scientists produce results without reasoning scientifically* (Ríos-García et al., 2026)
This repository contains a complete two-phase training pipeline that directly addresses the four critical failure modes identified in the paper:
| Failure Mode | Paper Finding | Our Solution |
|---|---|---|
| Evidence Ignored | 68% of traces | evidence_integration_reward — rewards citing, using, and evaluating evidence |
| Hypotheses Untested | 53% of traces | hypothesis_testing_reward — rewards formulating and testing hypotheses |
| No Belief Revision | 26% revision rate | belief_revision_reward — rewards updating beliefs when contradicted |
| No Cross-Checking | Rare | convergent_validation_reward — rewards using multiple independent methods |
## Key Insight
The paper's core finding is that scaffold engineering cannot fix reasoning deficits baked into the base model — the base model accounts for 41.4% of variance vs 1.5% for the scaffold. Our approach follows their recommendation: make reasoning itself a training target.
## Architecture

### Phase 1: SFT on Scientific Knowledge (Foundation)
- Dataset: MegaScience (100K science samples from 1.25M total)
- Recipe: Based on the MegaScience paper — LR 5e-6, cosine schedule, 3 epochs (sketched below)
- Purpose: Build strong scientific knowledge foundation with system prompt encoding the scientific method
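For concreteness, here is a minimal sketch of this recipe using TRL's `SFTTrainer`; the batch size, seed, and output path are illustrative assumptions, not the exact contents of `phase1_sft_scientific_reasoning.py`.

```python
# Minimal sketch of the Phase 1 recipe (assumptions: split name, batch size,
# and that SFTTrainer handles the dataset's native text/message format).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Draw the 100K science subset from the full 1.25M-sample corpus.
dataset = load_dataset("MegaScience/MegaScience", split="train")
dataset = dataset.shuffle(seed=42).select(range(100_000))

config = SFTConfig(
    output_dir="qwen3-1.7b-sft-science",
    learning_rate=5e-6,             # MegaScience recipe
    lr_scheduler_type="cosine",     # cosine schedule
    num_train_epochs=3,
    per_device_train_batch_size=4,  # illustrative; tune to hardware
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",  # the base model named in the Quick Start notes
    args=config,
    train_dataset=dataset,
)
trainer.train()
```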
### Phase 2: GRPO with Epistemic Rewards (Core Innovation)
- Method: Group Relative Policy Optimization with 4 custom epistemic reward functions
- Recipe: Based on DeepSeek-R1 + Dr. GRPO + RRM (Rubric Reward Model)
- Purpose: Train the model to value the process of scientific reasoning, not just correct answers
The reward is NOT outcome-based; outcome-only optimization is the paper's key criticism. Instead, it rewards epistemic behaviors regardless of answer correctness.
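A minimal sketch of the Phase 2 wiring, assuming TRL's `GRPOTrainer` API and the four reward functions described in the next section (the prompt dataset variable is hypothetical):

```python
# Minimal sketch of Phase 2; hyperparameters follow the DeepSeek-R1 /
# Dr. GRPO values cited under "Literature Basis".
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    output_dir="qwen3-1.7b-grpo-epistemic",
    learning_rate=1e-6,    # DeepSeek-R1 recipe
    num_generations=8,     # G=8 completions per prompt
    beta=0.04,             # KL penalty coefficient
    scale_rewards=False,   # Dr. GRPO: skip per-group std normalization
)

trainer = GRPOTrainer(
    model="qwen3-1.7b-sft-science",  # the Phase 1 checkpoint
    args=config,
    train_dataset=prompt_dataset,    # hypothetical: prompts with evidence context
    reward_funcs=[
        evidence_integration_reward,
        hypothesis_testing_reward,
        belief_revision_reward,
        convergent_validation_reward,
    ],
)
trainer.train()
```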
## Reward Functions
Each reward function implements a continuous 0-1 scoring rubric:
### 1. Evidence Integration Reward

- 0.0 — No mention of evidence
- 0.3 — Acknowledges evidence exists
- 0.6 — References specific evidence (numbers, units)
- 0.8 — Connects evidence to conclusions
- 1.0 — Evaluates evidence quality and limitations
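To illustrate how such a rubric can become a TRL-compatible reward function, here is a heuristic sketch; the keyword patterns are assumptions for illustration, not the repository's actual implementation:

```python
import re

def evidence_integration_reward(completions, **kwargs):
    """Heuristic sketch of the 0-1 evidence-integration rubric.

    Tier patterns are illustrative assumptions; a higher tier is only
    awarded once every lower tier has matched.
    """
    tiers = [
        (0.3, r"\b(evidence|data|measurement|observation)\b"),
        (0.6, r"\d+(\.\d+)?\s*(%|mg|mL|K|nm|kJ)"),
        (0.8, r"\b(therefore|this (shows|suggests)|consistent with)\b"),
        (1.0, r"\b(limitation|uncertainty|confound|caveat)\b"),
    ]
    scores = []
    for completion in completions:
        # TRL passes either plain strings or chat-format message lists.
        text = completion if isinstance(completion, str) else completion[0]["content"]
        score = 0.0
        for level, pattern in tiers:
            if re.search(pattern, text, re.IGNORECASE):
                score = level
            else:
                break  # cumulative rubric: stop at the first missing tier
        scores.append(score)
    return scores
```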
### 2. Hypothesis Testing Reward

- 0.0 — No hypothesis
- 0.25 — States a hypothesis
- 0.50 — Identifies falsification criteria
- 0.75 — Describes a test procedure
- 1.0 — Evaluates hypothesis against test results
### 3. Belief Revision Reward

- 0.0 — No belief change
- 0.3 — Acknowledges contradiction
- 0.6 — Explicitly states belief change
- 0.8 — Provides reasoning for the change
- 1.0 — Reconciles old and new understanding
### 4. Convergent Validation Reward

- 0.0 — Single approach
- 0.3 — Mentions alternatives exist
- 0.5 — Uses a second method
- 0.75 — Compares results from multiple methods
- 1.0 — Assesses agreement/disagreement
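Assuming the other three functions follow the same `(completions, **kwargs) -> list[float]` signature as the sketch above, the four scores can also be averaged outside the trainer for offline inspection:

```python
# Assumption: all four reward functions share the sketch signature above.
EPISTEMIC_REWARDS = [
    evidence_integration_reward,
    hypothesis_testing_reward,
    belief_revision_reward,
    convergent_validation_reward,
]

def total_epistemic_score(completions):
    """Mean of the four rubric scores per completion, for offline analysis."""
    per_fn = [fn(completions) for fn in EPISTEMIC_REWARDS]
    return [sum(vals) / len(vals) for vals in zip(*per_fn)]
```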
## Files

| File | Description |
|---|---|
| `phase1_sft_scientific_reasoning.py` | Phase 1: SFT training on MegaScience |
| `phase2_grpo_epistemic.py` | Phase 2: GRPO with epistemic rewards |
| `evaluate_epistemic_reasoning.py` | Evaluation against the paper's metrics |
## Quick Start

### Prerequisites

```bash
pip install transformers trl torch datasets trackio accelerate peft
```

### Phase 1: SFT Training

```bash
# On a10g-largex2 or better (Qwen3-1.7B needs ~8GB; training needs ~24GB with gradients)
accelerate launch phase1_sft_scientific_reasoning.py
```

### Phase 2: GRPO Training

```bash
# On a10g-largex2 or better (GRPO needs more memory for G=8 generations)
accelerate launch phase2_grpo_epistemic.py
```

### Evaluation

```bash
python evaluate_epistemic_reasoning.py
```
## Literature Basis

This implementation is grounded in six published training recipes:

- MegaScience (2507.16812) — SFT dataset and hyperparameters for scientific reasoning
- DeepSeek-R1 (2501.12948) — GRPO training recipe (G=8, lr=1e-6, β=0.04)
- Dr. GRPO (2503.20783) — `scale_rewards=False` to avoid difficulty bias
- Rubric Reward Model (2510.07774) — Continuous 0-1 rewards targeting specific reasoning failures
- TACReward (2510.25065) — Activity taxonomy adapted from math to the scientific method
- SPARK PRM (2512.03244) — Reference-free process rewards for step-level evaluation
## Why This Approach Works

The paper identified that performance variance is dominated by the base model rather than the scaffold (41.4% vs 1.5% of variance explained, i.e. roughly 95% of the explained variance). Our approach directly modifies the base model through:
- SFT Phase: Teaches the model what scientific reasoning looks like
- GRPO Phase: Teaches the model to prefer scientific reasoning through process-based rewards
The key innovation is that our rewards target the epistemic process rather than outcome correctness. This is exactly the gap the paper identifies.
## Datasets Used
| Dataset | Source | Purpose |
|---|---|---|
| MegaScience | MegaScience/MegaScience | SFT training data (science subset) |
| CORRAL Traces | jablonkagroup/corral-traces | Paper's evaluation traces |
| CORRAL QAs | jablonkagroup/corral-QAs | Paper's QA benchmark |
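A sketch of loading all three from the Hub (the split names are assumptions and may differ per dataset card):

```python
from datasets import load_dataset

# Split names below are assumptions; check each dataset card on the Hub.
sft_data = load_dataset("MegaScience/MegaScience", split="train")
traces = load_dataset("jablonkagroup/corral-traces", split="train")
qa_bench = load_dataset("jablonkagroup/corral-QAs", split="train")
```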
## Citation

```bibtex
@article{riosgarcia2026aiscientists,
  title={AI scientists produce results without reasoning scientifically},
  author={Ríos-García, Martiño and Alampara, Nawaf and Gupta, Chandan and Mandal, Indrajeet and Mannan, Sajid and Aghajani, Ali Asghar and Krishnan, N. M. Anoop and Jablonka, Kevin Maik},
  journal={arXiv preprint arXiv:2604.18805},
  year={2026}
}
```