arxiv:2510.00977

It Takes Two: Your GRPO Is Secretly DPO

Published on Oct 1, 2025

Submitted by Yihong Wu on Oct 2, 2025

Mila – Quebec Artificial Intelligence Institute


Authors: Liheng Ma, Xinyu Wang, Zhan Su

Abstract

Reframing Group Relative Policy Optimization as contrastive learning reveals its connection to Direct Preference Optimization, enabling minimal two-rollout GRPO to achieve performance comparable to larger group sizes with reduced computational cost.

AI-generated summary

Group Relative Policy Optimization (GRPO) is a prominent reinforcement learning algorithm for post-training Large Language Models (LLMs). It is commonly believed that GRPO necessitates a large group size to ensure stable training via precise statistical estimation, which incurs substantial computational overhead. In this work, we challenge this assumption by reframing GRPO as a form of contrastive learning, which reveals a fundamental connection to Direct Preference Optimization (DPO). Motivated by DPO's empirical success, we investigate the minimal two-rollout case (2-GRPO), a configuration previously deemed infeasible. We provide a rigorous theoretical analysis to validate 2-GRPO and demonstrate empirically that it achieves performance on par with 16-GRPO, despite using only 1/8 of the rollouts and reducing training time by over 70%.


Community

Yihong7788

Paper submitter Oct 2, 2025

pazyorkcc

Oct 16, 2025 (edited Oct 16, 2025)

I suggest you read this paper: https://arxiv.org/abs/2506.10947

librarian-bot

Oct 3, 2025

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space.

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

sheliak

about 7 hours ago (edited about 7 hours ago)

I'm confused as to how your experimental results are supposed to show the validity of the method. Going from a batch size of 32 and LR of 1e-6 to a batch size of 256 and LR of 8e-6 is a gargantuan difference which completely changes the nature of the training run.

A more realistic test would have been training with the same 256 batch size, 8e-6 LR, and group size 2 in both experiments with the -1,0,1 advantage values being the only change.

When normalizing reward values in standard GRPO with a group size of 2 and the batch standard deviation, you also get positive/negative advantage pairs, but with magnitude information, which would seem to be inherently superior to -1/1 values. The difference may also be that the baseline advantage values tend to be less than 1 in magnitude, so forcing them to -1/1 could lead to faster convergence due to the sheer size of the advantage being different.
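The contrast drawn here can be illustrated with a minimal sketch. It is not the paper's code or TRL's implementation: the function names `group_normalized` and `batch_normalized`, the `eps` constant, and the reward values are assumptions made for illustration, and the population standard deviation is used throughout. With two rollouts per prompt, per-group normalization collapses every non-tied pair to roughly +1/-1 (an implementation using the sample standard deviation would give about ±0.707 instead), while dividing by a standard deviation pooled over the batch keeps the relative size of each reward gap.

```python
from statistics import mean, pstdev


def group_normalized(rewards, eps=1e-6):
    """Normalize one prompt's rewards by that group's own mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def batch_normalized(groups, eps=1e-6):
    """Center each group on its own mean, then divide by a batch-wide std."""
    centered = [[r - mean(g) for r in g] for g in groups]
    flat = [c for g in centered for c in g]
    sigma = pstdev(flat)  # std pooled over every rollout in the batch
    return [[c / (sigma + eps) for c in g] for g in centered]


# Three hypothetical prompts with two rollouts each:
# a large reward gap, a small reward gap, and a tie.
groups = [[1.0, 0.0], [0.6, 0.4], [1.0, 1.0]]

per_group = [group_normalized(g) for g in groups]
per_batch = batch_normalized(groups)

for g, a, b in zip(groups, per_group, per_batch):
    print(g, [round(x, 3) for x in a], [round(x, 3) for x in b])
```

In this toy batch, the large-gap and small-gap pairs receive nearly identical per-group advantages, whereas the batch-normalized advantages differ by the same factor as the reward gaps. Tied pairs get zero advantage under both schemes.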

I attempted to replicate this on Mistral Nemo, training in Unsloth on a 24GB card with a TRL GRPOTrainer modified to accommodate the -1, 0, 1 advantage values, and got far worse convergence when keeping all variables the same besides the number of generations per prompt.