# Scientific Reasoning Training Pipeline

Solving "AI Scientists Produce Results Without Reasoning Scientifically"

Paper: *AI scientists produce results without reasoning scientifically* (Ríos-García et al., 2026)
This repository contains a complete two-phase training pipeline that directly addresses the four critical failure modes identified in the paper:
| Failure Mode | Paper Finding | Our Solution |
|---|---|---|
| Evidence Ignored | 68% of traces | evidence_integration_reward — rewards citing, using, and evaluating evidence |
| Hypotheses Untested | 53% of traces | hypothesis_testing_reward — rewards formulating and testing hypotheses |
| No Belief Revision | 26% revision rate | belief_revision_reward — rewards updating beliefs when contradicted |
| No Cross-Checking | Rare | convergent_validation_reward — rewards using multiple independent methods |
## Key Insight
The paper's core finding is that scaffold engineering cannot fix reasoning deficits baked into the base model — the base model accounts for 41.4% of variance vs 1.5% for the scaffold. Our approach follows their recommendation: make reasoning itself a training target.
## Architecture

### Phase 1: SFT on Scientific Knowledge (Foundation)
- Dataset: MegaScience (100K science samples from 1.25M total)
- Recipe: Based on the MegaScience paper — LR 5e-6, cosine schedule, 3 epochs (sketched below)
- Purpose: Build strong scientific knowledge foundation with system prompt encoding the scientific method
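For concreteness, here is a minimal sketch of this recipe using TRL's `SFTTrainer`; the batch size, seed, and output path are illustrative assumptions, not the exact contents of `phase1_sft_scientific_reasoning.py`.

```python
# Minimal sketch of the Phase 1 recipe (assumptions: split name, batch size,
# and that SFTTrainer handles the dataset's native text/message format).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Draw the 100K science subset from the full 1.25M-sample corpus.
dataset = load_dataset("MegaScience/MegaScience", split="train")
dataset = dataset.shuffle(seed=42).select(range(100_000))

config = SFTConfig(
    output_dir="qwen3-1.7b-sft-science",
    learning_rate=5e-6,             # MegaScience recipe
    lr_scheduler_type="cosine",     # cosine schedule
    num_train_epochs=3,
    per_device_train_batch_size=4,  # illustrative; tune to hardware
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",  # the base model named in the Quick Start notes
    args=config,
    train_dataset=dataset,
)
trainer.train()
```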
### Phase 2: GRPO with Epistemic Rewards (Core Innovation)
- Method: Group Relative Policy Optimization with 4 custom epistemic reward functions
- Recipe: Based on DeepSeek-R1 + Dr. GRPO + RRM (Rubric Reward Model)
- Purpose: Train the model to value the process of scientific reasoning, not just correct answers
The reward is NOT outcome-based; outcome-only optimization is the paper's key criticism. Instead, it rewards epistemic behaviors regardless of answer correctness.
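A minimal sketch of the Phase 2 wiring, assuming TRL's `GRPOTrainer` API and the four reward functions described in the next section (the prompt dataset variable is hypothetical):

```python
# Minimal sketch of Phase 2; hyperparameters follow the DeepSeek-R1 /
# Dr. GRPO values cited under "Literature Basis".
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    output_dir="qwen3-1.7b-grpo-epistemic",
    learning_rate=1e-6,    # DeepSeek-R1 recipe
    num_generations=8,     # G=8 completions per prompt
    beta=0.04,             # KL penalty coefficient
    scale_rewards=False,   # Dr. GRPO: skip per-group std normalization
)

trainer = GRPOTrainer(
    model="qwen3-1.7b-sft-science",  # the Phase 1 checkpoint
    args=config,
    train_dataset=prompt_dataset,    # hypothetical: prompts with evidence context
    reward_funcs=[
        evidence_integration_reward,
        hypothesis_testing_reward,
        belief_revision_reward,
        convergent_validation_reward,
    ],
)
trainer.train()
```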
## Reward Functions
Each reward function implements a continuous 0-1 scoring rubric:
### 1. Evidence Integration Reward

- 0.0 — No mention of evidence
- 0.3 — Acknowledges evidence exists
- 0.6 — References specific evidence (numbers, units)
- 0.8 — Connects evidence to conclusions
- 1.0 — Evaluates evidence quality and limitations
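To illustrate how such a rubric can become a TRL-compatible reward function, here is a heuristic sketch; the keyword patterns are assumptions for illustration, not the repository's actual implementation:

```python
import re

def evidence_integration_reward(completions, **kwargs):
    """Heuristic sketch of the 0-1 evidence-integration rubric.

    Tier patterns are illustrative assumptions; a higher tier is only
    awarded once every lower tier has matched.
    """
    tiers = [
        (0.3, r"\b(evidence|data|measurement|observation)\b"),
        (0.6, r"\d+(\.\d+)?\s*(%|mg|mL|K|nm|kJ)"),
        (0.8, r"\b(therefore|this (shows|suggests)|consistent with)\b"),
        (1.0, r"\b(limitation|uncertainty|confound|caveat)\b"),
    ]
    scores = []
    for completion in completions:
        # TRL passes either plain strings or chat-format message lists.
        text = completion if isinstance(completion, str) else completion[0]["content"]
        score = 0.0
        for level, pattern in tiers:
            if re.search(pattern, text, re.IGNORECASE):
                score = level
            else:
                break  # cumulative rubric: stop at the first missing tier
        scores.append(score)
    return scores
```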
### 2. Hypothesis Testing Reward

- 0.0 — No hypothesis
- 0.25 — States a hypothesis
- 0.50 — Identifies falsification criteria
- 0.75 — Describes a test procedure
- 1.0 — Evaluates hypothesis against test results
### 3. Belief Revision Reward

- 0.0 — No belief change
- 0.3 — Acknowledges contradiction
- 0.6 — Explicitly states belief change
- 0.8 — Provides reasoning for the change
- 1.0 — Reconciles old and new understanding
### 4. Convergent Validation Reward

- 0.0 — Single approach
- 0.3 — Mentions alternatives exist
- 0.5 — Uses a second method
- 0.75 — Compares results from multiple methods
- 1.0 — Assesses agreement/disagreement
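Assuming the other three functions follow the same `(completions, **kwargs) -> list[float]` signature as the sketch above, the four scores can also be averaged outside the trainer for offline inspection:

```python
# Assumption: all four reward functions share the sketch signature above.
EPISTEMIC_REWARDS = [
    evidence_integration_reward,
    hypothesis_testing_reward,
    belief_revision_reward,
    convergent_validation_reward,
]

def total_epistemic_score(completions):
    """Mean of the four rubric scores per completion, for offline analysis."""
    per_fn = [fn(completions) for fn in EPISTEMIC_REWARDS]
    return [sum(vals) / len(vals) for vals in zip(*per_fn)]
```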
## Files

| File | Description |
|---|---|
| `phase1_sft_scientific_reasoning.py` | Phase 1: SFT training on MegaScience |
| `phase2_grpo_epistemic.py` | Phase 2: GRPO with epistemic rewards |
| `evaluate_epistemic_reasoning.py` | Evaluation against the paper's metrics |
## Quick Start

### Prerequisites

```bash
pip install transformers trl torch datasets trackio accelerate peft
```

### Phase 1: SFT Training

```bash
# On a10g-largex2 or better (Qwen3-1.7B needs ~8GB; training needs ~24GB with gradients)
accelerate launch phase1_sft_scientific_reasoning.py
```

### Phase 2: GRPO Training

```bash
# On a10g-largex2 or better (GRPO needs more memory for G=8 generations)
accelerate launch phase2_grpo_epistemic.py
```

### Evaluation

```bash
python evaluate_epistemic_reasoning.py
```
## Literature Basis

This implementation is grounded in six published training recipes:

- MegaScience (2507.16812) — SFT dataset and hyperparameters for scientific reasoning
- DeepSeek-R1 (2501.12948) — GRPO training recipe (G=8, lr=1e-6, β=0.04)
- Dr. GRPO (2503.20783) — `scale_rewards=False` to avoid difficulty bias
- Rubric Reward Model (2510.07774) — Continuous 0-1 rewards targeting specific reasoning failures
- TACReward (2510.25065) — Activity taxonomy adapted from math to the scientific method
- SPARK PRM (2512.03244) — Reference-free process rewards for step-level evaluation
## Why This Approach Works

The paper identified that performance variance is dominated by the base model rather than the scaffold (41.4% vs 1.5% of variance explained, i.e. roughly 95% of the explained variance). Our approach directly modifies the base model through:
- SFT Phase: Teaches the model what scientific reasoning looks like
- GRPO Phase: Teaches the model to prefer scientific reasoning through process-based rewards
The key innovation is that our rewards target the epistemic process rather than outcome correctness. This is exactly the gap the paper identifies.
## Datasets Used
| Dataset | Source | Purpose |
|---|---|---|
| MegaScience | MegaScience/MegaScience | SFT training data (science subset) |
| CORRAL Traces | jablonkagroup/corral-traces | Paper's evaluation traces |
| CORRAL QAs | jablonkagroup/corral-QAs | Paper's QA benchmark |
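A sketch of loading all three from the Hub (the split names are assumptions and may differ per dataset card):

```python
from datasets import load_dataset

# Split names below are assumptions; check each dataset card on the Hub.
sft_data = load_dataset("MegaScience/MegaScience", split="train")
traces = load_dataset("jablonkagroup/corral-traces", split="train")
qa_bench = load_dataset("jablonkagroup/corral-QAs", split="train")
```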
## Citation

```bibtex
@article{riosgarcia2026aiscientists,
  title={AI scientists produce results without reasoning scientifically},
  author={Ríos-García, Martiño and Alampara, Nawaf and Gupta, Chandan and Mandal, Indrajeet and Mannan, Sajid and Aghajani, Ali Asghar and Krishnan, N. M. Anoop and Jablonka, Kevin Maik},
  journal={arXiv preprint arXiv:2604.18805},
  year={2026}
}
```