# Qwen3-4B-ReMax-math-reasoning

This model is a fine-tuned version of Qwen3-4B, trained with ReMax (without a KL penalty) for mathematical reasoning.

Trained with PipelineRL.

## Training Details

### Datasets

| Split | Datasets                |
|-------|-------------------------|
| Train | gsm8k_train, math_train |
| Test  | gsm8k_test, math_500    |

### RL Algorithm

| Parameter                    | Value                              |
|------------------------------|------------------------------------|
| Algorithm                    | ReMax                              |
| Advantage Baseline           | Greedy-decoded response reward     |
| Extra Inference              | 1 deterministic rollout per prompt |
| Group Structure              | Not required                       |
| Policy Loss                  | ppo                                |
| KL Coefficient               | 0.0                                |
| Epsilon (clip)               | 0.2                                |
| Discount Factor (gamma)      | 1.0                                |
| Divide Advantage by Std      | False                              |
| Filter Zero Advantage Groups | False                              |
| Rollouts per Problem         | 16                                 |

ReMax uses the reward of a greedy-decoded response as the advantage baseline, which removes the need for a learned value model: each sampled rollout's reward is compared against the greedy rollout's reward for the same prompt.
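An illustrative sketch of this baseline subtraction (not the actual training code; function and variable names are made up for the example):

```python
def remax_advantages(sampled_rewards, greedy_reward):
    """ReMax-style advantages: subtract the reward of the single
    greedy-decoded rollout from each sampled rollout's reward.

    sampled_rewards: rewards of the K sampled rollouts for one prompt
                     (K = 16 per problem in this run).
    greedy_reward:   reward of the extra deterministic rollout.
    """
    return [r - greedy_reward for r in sampled_rewards]

# Binary correctness rewards, greedy rollout happened to be correct:
advs = remax_advantages([1.0, 0.0, 1.0, 0.0], greedy_reward=1.0)
print(advs)  # → [0.0, -1.0, 0.0, -1.0]
```

Because the baseline is a per-prompt scalar rather than a group statistic, no group structure or advantage normalization is required, matching the table above.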

### Training Hyperparameters

| Parameter             | Value         |
|-----------------------|---------------|
| Base Model            | Qwen/Qwen3-4B |
| Learning Rate         | 1e-06         |
| LR Scheduler          | cosine        |
| Warmup Steps          | 25            |
| Max Training Steps    | 1500          |
| Micro Batch Size      | 2             |
| Gradient Accumulation | 128           |
| Effective Batch Size  | 256           |
| Sequence Length       | 8192          |
| Gradient Clipping     | 0.3           |
| Weight Decay          | 0.01          |
| Optimizer             | adamw_torch   |
| Precision             | bf16          |
| DeepSpeed             | ZeRO Stage 3  |
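How the effective batch size follows from the other two entries (ignoring any data-parallel replication, which would multiply it further):

```python
# Effective batch size = micro batch size × gradient accumulation steps,
# as reported in the hyperparameter table.
micro_batch_size = 2
grad_accum_steps = 128
effective_batch_size = micro_batch_size * grad_accum_steps
print(effective_batch_size)  # → 256
```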

## Training Curves

*(Training metrics plot; interactive curves are available in the W&B run below.)*

## W&B Run

Full training logs: https://wandb.ai/jaygala24-team/rl-post-training/runs/qwen3_4b_remax_3a1f_4xh100_219047_finetune_78c145f4

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "jaygala24/Qwen3-4B-ReMax-math-reasoning",
    revision="step-0200",  # optional branch, e.g. "step-0400"
)
tokenizer = AutoTokenizer.from_pretrained(
    "jaygala24/Qwen3-4B-ReMax-math-reasoning", revision="step-0200"
)

prompt = "Please reason step by step, and put your final answer within \\boxed{}.\n\nWhat is the sum of 123 and 456?"
inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(model="jaygala24/Qwen3-4B-ReMax-math-reasoning", revision="step-0200")  # optional branch, e.g. "step-0400"
sampling_params = SamplingParams(temperature=0.7, max_tokens=4096)

prompt = "Please reason step by step, and put your final answer within \\boxed{}.\n\nWhat is the sum of 123 and 456?"
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
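The prompt asks the model to put its final answer inside `\boxed{}`. A minimal sketch for recovering that answer from the generated text (the helper and its regex are an illustration, not part of this model or its evaluation pipeline):

```python
import re

def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in the model output,
    or None if no boxed answer is present. Assumes no nested braces
    inside the box (e.g. it will not fully capture \\boxed{\\frac{1}{2}})."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

print(extract_boxed(r"123 + 456 = 579, so the answer is \boxed{579}."))  # → 579
```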

