Scientific Reasoning Training Pipeline

Solving "AI Scientists Produce Results Without Reasoning Scientifically"

Paper: AI scientists produce results without reasoning scientifically (Ríos-García et al., 2026)

This repository contains a complete two-phase training pipeline that directly addresses the four critical failure modes identified in the paper:

| Failure Mode | Paper Finding | Our Solution |
| --- | --- | --- |
| Evidence ignored | 68% of traces | `evidence_integration_reward` — rewards citing, using, and evaluating evidence |
| Hypotheses untested | 53% of traces | `hypothesis_testing_reward` — rewards formulating and testing hypotheses |
| No belief revision | 26% revision rate | `belief_revision_reward` — rewards updating beliefs when contradicted |
| No cross-checking | Rare | `convergent_validation_reward` — rewards using multiple independent methods |

Key Insight

The paper's core finding is that scaffold engineering cannot fix reasoning deficits baked into the base model — the base model accounts for 41.4% of variance vs 1.5% for the scaffold. Our approach follows their recommendation: make reasoning itself a training target.

Architecture

Phase 1: SFT on Scientific Knowledge (Foundation)

  • Dataset: MegaScience (100K science samples from 1.25M total)
  • Recipe: Based on MegaScience paper — LR 5e-6, cosine schedule, 3 epochs
  • Purpose: Build strong scientific knowledge foundation with system prompt encoding the scientific method
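For concreteness, here is a minimal sketch of the Phase 1 setup using TRL's SFTTrainer. The 100K subsample, output path, and batch-size settings are illustrative assumptions; the actual script is phase1_sft_scientific_reasoning.py, and the scientific-method system prompt is omitted here.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Illustrative 100K-sample subset of the 1.25M-sample MegaScience corpus.
dataset = (
    load_dataset("MegaScience/MegaScience", split="train")
    .shuffle(seed=42)
    .select(range(100_000))
)

config = SFTConfig(
    output_dir="qwen3-1.7b-megascience-sft",
    learning_rate=5e-6,             # MegaScience recipe
    lr_scheduler_type="cosine",     # cosine schedule
    num_train_epochs=3,             # 3 epochs
    per_device_train_batch_size=2,  # assumed; tune to your GPUs
    gradient_accumulation_steps=8,
)

trainer = SFTTrainer(model="Qwen/Qwen3-1.7B", args=config, train_dataset=dataset)
trainer.train()
```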

Phase 2: GRPO with Epistemic Rewards (Core Innovation)

  • Method: Group Relative Policy Optimization with 4 custom epistemic reward functions
  • Recipe: Based on DeepSeek-R1 + Dr. GRPO + RRM
  • Purpose: Train the model to value the process of scientific reasoning, not just correct answers

Crucially, the reward is NOT outcome-based: rewarding only final-answer correctness is the paper's key criticism of current training. Instead, it rewards epistemic behaviors regardless of whether the answer is correct.
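A minimal sketch of the Phase 2 trainer wiring, assuming TRL's GRPOConfig/GRPOTrainer API; the hyperparameters mirror the recipes cited under Literature Basis (G=8, lr=1e-6, β=0.04 from DeepSeek-R1; scale_rewards=False from Dr. GRPO). Construction of the prompt dataset is omitted, and the reward functions are sketched in the next section.

```python
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    output_dir="qwen3-1.7b-epistemic-grpo",
    learning_rate=1e-6,        # DeepSeek-R1 recipe
    beta=0.04,                 # KL penalty toward the reference policy
    num_generations=8,         # G=8 completions per prompt
    scale_rewards=False,       # Dr. GRPO: skip per-group std normalization
)

trainer = GRPOTrainer(
    model="qwen3-1.7b-megascience-sft",  # the Phase 1 checkpoint
    args=config,
    train_dataset=prompt_dataset,        # a dataset with a "prompt" column (construction omitted)
    reward_funcs=[
        evidence_integration_reward,     # sketched in the Reward Functions section
        hypothesis_testing_reward,
        belief_revision_reward,
        convergent_validation_reward,
    ],
)
trainer.train()
```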

Reward Functions

Each reward function implements a continuous 0-1 scoring rubric:

1. Evidence Integration Reward

0.0 — No mention of evidence
0.3 — Acknowledges evidence exists
0.6 — References specific evidence (numbers, units)
0.8 — Connects evidence to conclusions
1.0 — Evaluates evidence quality and limitations
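As an illustration, a TRL-compatible reward function for this rubric might look like the sketch below. The regex patterns and the upgrade-ladder logic are illustrative stand-ins, not the repository's actual scorer, which could equally be an LLM judge.

```python
import re

def evidence_integration_reward(completions, **kwargs):
    """Score each completion against the evidence-integration rubric (0.0-1.0)."""
    scores = []
    for completion in completions:
        # TRL passes either plain strings or chat-format message lists.
        text = completion if isinstance(completion, str) else completion[0]["content"]
        score = 0.0
        if re.search(r"\b(evidence|data|measurement|observation)s?\b", text, re.I):
            score = 0.3  # acknowledges that evidence exists
        if re.search(r"\d+(\.\d+)?\s*(%|mg|mL|nm|mol|eV|GPa)", text):
            score = 0.6  # references specific numbers with units
        if score >= 0.6 and re.search(r"\b(therefore|suggests|supports|indicates)\b", text, re.I):
            score = 0.8  # connects evidence to a conclusion
        if score >= 0.8 and re.search(r"\b(limitation|uncertaint|confound|caveat)\w*", text, re.I):
            score = 1.0  # evaluates evidence quality and limitations
        scores.append(score)
    return scores
```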

2. Hypothesis Testing Reward

0.0 — No hypothesis
0.25 — States a hypothesis
0.50 — Identifies falsification criteria
0.75 — Describes a test procedure
1.0 — Evaluates hypothesis against test results

3. Belief Revision Reward

0.0 — No belief change
0.3 — Acknowledges contradiction
0.6 — Explicitly states belief change
0.8 — Provides reasoning for the change
1.0 — Reconciles old and new understanding

4. Convergent Validation Reward

0.0 — Single approach
0.3 — Mentions alternatives exist
0.5 — Uses a second method
0.75 — Compares results from multiple methods
1.0 — Assesses agreement/disagreement
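The hypothesis-testing, belief-revision, and convergent-validation rewards can be encoded with the same tier-ladder pattern. A compact, data-driven variant is sketched below, again with illustrative regexes and a TRL-style signature; only the convergent-validation ladder is shown.

```python
import re

def tiered_score(text: str, tiers) -> float:
    """Return the value of the highest tier whose pattern matches `text`."""
    score = 0.0
    for value, pattern in tiers:
        if re.search(pattern, text, re.IGNORECASE):
            score = value
    return score

# Illustrative ladder for the convergent-validation rubric above.
CONVERGENT_TIERS = [
    (0.30, r"\b(another|alternative|different)\s+(method|approach|way)\b"),
    (0.50, r"\b(cross-?check|second method|independently|re-?deriv\w+)\b"),
    (0.75, r"\b(both (methods|approaches)|comparing the (two|results))\b"),
    (1.00, r"\b(agree(ment)?|disagree(ment)?|discrepanc\w+)\b"),
]

def convergent_validation_reward(completions, **kwargs):
    texts = [c if isinstance(c, str) else c[0]["content"] for c in completions]
    return [tiered_score(t, CONVERGENT_TIERS) for t in texts]
```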

Files

| File | Description |
| --- | --- |
| `phase1_sft_scientific_reasoning.py` | Phase 1: SFT training on MegaScience |
| `phase2_grpo_epistemic.py` | Phase 2: GRPO with epistemic rewards |
| `evaluate_epistemic_reasoning.py` | Evaluation against the paper's metrics |

Quick Start

Prerequisites

```bash
pip install transformers trl torch datasets trackio accelerate peft
```

Phase 1: SFT Training

```bash
# On a10g-largex2 or better (Qwen3-1.7B needs ~8GB; training needs ~24GB with gradients)
accelerate launch phase1_sft_scientific_reasoning.py
```

Phase 2: GRPO Training

```bash
# On a10g-largex2 or better (GRPO needs more memory for G=8 generations)
accelerate launch phase2_grpo_epistemic.py
```

Evaluation

```bash
python evaluate_epistemic_reasoning.py
```
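One plausible shape for this script is sketched below. Everything here is an assumption for illustration: that the four reward functions are importable from phase2_grpo_epistemic.py, that corral-QAs exposes a "question" column, and that a transformers text-generation pipeline wraps the trained checkpoint. The per-metric "engaged" fraction loosely mirrors the paper's trace-level statistics (e.g., revision rate).

```python
from statistics import mean
from datasets import load_dataset
from transformers import pipeline

# Assumed import path: the reward functions defined for Phase 2 training.
from phase2_grpo_epistemic import (
    evidence_integration_reward, hypothesis_testing_reward,
    belief_revision_reward, convergent_validation_reward,
)

generator = pipeline("text-generation", model="qwen3-1.7b-epistemic-grpo")

# Column name "question" is an assumption about corral-QAs.
qa = load_dataset("jablonkagroup/corral-QAs", split="train")
traces = [
    generator(row["question"], max_new_tokens=512)[0]["generated_text"]
    for row in qa
]

for name, reward_fn in [
    ("evidence_integration", evidence_integration_reward),
    ("hypothesis_testing", hypothesis_testing_reward),
    ("belief_revision", belief_revision_reward),
    ("convergent_validation", convergent_validation_reward),
]:
    scores = reward_fn(traces)
    engaged = sum(s > 0 for s in scores) / len(scores)
    print(f"{name}: mean {mean(scores):.2f}, engaged in {engaged:.0%} of traces")
```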

Literature Basis

This implementation is grounded in 6 published training recipes:

  1. MegaScience (2507.16812) — SFT dataset and hyperparameters for scientific reasoning
  2. DeepSeek-R1 (2501.12948) — GRPO training recipe (G=8, lr=1e-6, β=0.04)
  3. Dr. GRPO (2503.20783) — scale_rewards=False to avoid difficulty bias
  4. Rubric Reward Model (2510.07774) — Continuous 0-1 rewards targeting specific reasoning failures
  5. TACReward (2510.25065) — Activity taxonomy adapted from math to the scientific method
  6. SPARK PRM (2512.03244) — Reference-free process rewards for step-level evaluation

Why This Approach Works

The paper found that the base model, not the scaffold, dominates performance variance (41.4% vs 1.5%, as noted above). Our approach directly modifies the base model through:

  1. SFT Phase: Teaches the model what scientific reasoning looks like
  2. GRPO Phase: Teaches the model to prefer scientific reasoning through process-based rewards

The key innovation is that our rewards target the epistemic process rather than outcome correctness. This is exactly the gap the paper identifies.

Datasets Used

| Dataset | Source | Purpose |
| --- | --- | --- |
| MegaScience | `MegaScience/MegaScience` | SFT training data (science subset) |
| CORRAL Traces | `jablonkagroup/corral-traces` | Paper's evaluation traces |
| CORRAL QAs | `jablonkagroup/corral-QAs` | Paper's QA benchmark |

Citation

```bibtex
@article{riosgarcia2026aiscientists,
  title={AI scientists produce results without reasoning scientifically},
  author={Ríos-García, Martiño and Alampara, Nawaf and Gupta, Chandan and Mandal, Indrajeet and Mannan, Sajid and Aghajani, Ali Asghar and Krishnan, N. M. Anoop and Jablonka, Kevin Maik},
  journal={arXiv preprint arXiv:2604.18805},
  year={2026}
}
```