Abstract
Reframing Group Relative Policy Optimization as contrastive learning reveals its connection to Direct Preference Optimization, enabling minimal two-rollout GRPO to achieve performance comparable to larger group sizes with reduced computational cost.
Group Relative Policy Optimization (GRPO) is a prominent reinforcement learning algorithm for post-training Large Language Models (LLMs). It is commonly believed that GRPO necessitates a large group size to ensure stable training via precise statistical estimation, which incurs substantial computational overhead. In this work, we challenge this assumption by reframing GRPO as a form of contrastive learning, which reveals a fundamental connection to Direct Preference Optimization (DPO). Motivated by DPO's empirical success, we investigate the minimal two-rollout case (2-GRPO), a configuration previously deemed infeasible. We provide a rigorous theoretical analysis to validate 2-GRPO and demonstrate empirically that it achieves performance on par with 16-GRPO, despite using only 1/8 of the rollouts and reducing training time by over 70%.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Improving Sampling Efficiency in RLVR through Adaptive Rollout and Response Reuse (2025)
- Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends (2025)
- Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation (2025)
- NGRPO: Negative-enhanced Group Relative Policy Optimization (2025)
- COPO: Consistency-Aware Policy Optimization (2025)
- No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping (2025)
- Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling (2025)
I'm confused as to how your experimental results are supposed to show the validity of the method. Going from a batch size of 32 and LR of 1e-6 to a batch size of 256 and LR of 8e-6 is a gargantuan difference which completely changes the nature of the training run.
A more realistic test would have been training with the same 256 batch size, 8e-6 LR, and group size 2 in both experiments with the -1,0,1 advantage values being the only change.
If you normalize rewards in standard GRPO with a group size of 2 and the batch standard deviation, you also get positive/negative advantage pairs, but they keep their magnitude information, which would seem inherently superior to fixed -1/1 values. Another possible explanation for the difference is that the baseline advantage values tend to be less than 1 in magnitude, so forcing them to -1/1 could speed up convergence simply because the advantages are larger.
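To make the comparison above concrete, here is a minimal NumPy sketch of the two normalizations for groups of two rollouts. The reward values and the function names `group_normalized_advantages` and `batch_normalized_advantages` are made up for illustration, and the choice of population standard deviation (`ddof=0`) is an assumption; this is not code from the paper or from TRL.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """Center each group on its own mean and divide by its own std (ddof=0)."""
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

def batch_normalized_advantages(rewards, eps=1e-8):
    """Center each group on its own mean, but divide by the std of the whole batch."""
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean(axis=1, keepdims=True)
    return centered / (rewards.std() + eps)

# Hypothetical rewards: one row per prompt, two rollouts per prompt.
rewards = np.array([
    [1.0, 0.0],   # large reward gap
    [0.2, 0.8],   # smaller reward gap
    [0.5, 0.5],   # tie: no learning signal
])

group_adv = group_normalized_advantages(rewards)
batch_adv = batch_normalized_advantages(rewards)

# Per-group normalization collapses every non-tied pair to roughly +1/-1,
# so the size of the reward gap is lost.
print(group_adv)

# Batch-level scaling keeps the pairs opposite in sign but preserves
# the relative size of the gap between prompts.
print(batch_adv)
```

Under these assumptions, per-group normalization with two rollouts can only ever produce approximately -1, 0, or 1, while batch-level scaling yields larger advantages for the prompt with the larger reward gap. If the sample standard deviation (`ddof=1`) were used for the per-group case instead, the non-tied pairs would come out as ±1/√2 rather than ±1.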
I attempted to replicate this on Mistral Nemo, training in Unsloth on a 24GB card with a modified TRL GRPOTrainer to accommodate the -1,0,1 advantage values, and got far worse convergence when keeping all variables the same besides the number of generations per prompt.