SPARK-Code · Three-Adapter Demo
Interactive demo of three LoRA adapters for Qwen2.5-Coder-3B-Instruct. Each adapter was trained on MBPP with execution-grounded GRPO and evaluated on HumanEval and a held-out MBPP slice.
- A (Exec-only GRPO) — model card — strongest baseline; +0.85 pp HumanEval pass@1 with bounded KL.
- C-light (Naive Co-Evolve) — model card — demonstrates the policy-drift failure mode (−2.3 pp on HumanEval).
- C-reg (Regularized Co-Evolve) — model card — bounded drift; matches the baseline on HumanEval and gains +4 pp on MBPP pass@5.
Key finding: C-light exhibits policy drift; C-reg recovers by using a lower aux_loss_scale and a higher kl_coeff.
Source code: https://github.com/amarsaikhanb/spark-code
Note: the first request after an idle period incurs a ZeroGPU cold start of about 30 s.
| Prompt | Test cases (optional, Python asserts) |
|---|---|
Runs the same prompt through all four conditions sequentially: the three adapters and the base model. Max tokens is capped at 512 here to stay within the ZeroGPU window.
A (Exec-only GRPO)
C-light (Naive Co-Evolve)
C-reg (Regularized Co-Evolve)
Base (no adapter)
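The optional test cases are plain Python asserts, so a completion can be checked by running it together with those asserts and seeing whether the process exits cleanly. The sketch below illustrates that idea; it is not the demo's actual harness, and the function name `passes_tests` and the 5-second timeout are assumptions. A subprocess with a timeout is not a security sandbox, so untrusted model output would need stronger isolation in practice.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(completion: str, test_code: str, timeout_s: float = 5.0) -> bool:
    """Return True if `completion` followed by `test_code` runs without error.

    Hypothetical helper: a failed assert, any exception, or a timeout
    counts as a failure.
    """
    program = completion + "\n\n" + test_code + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Treat runaway generations (e.g. infinite loops) as failures.
            return False
    # A non-zero exit code means an assert or other exception fired.
    return result.returncode == 0
```

Calling `passes_tests("def add(a, b):\n    return a + b", "assert add(1, 2) == 3")` returns `True`, and the same asserts against a wrong implementation return `False`. Running each of the four conditions' outputs through a check like this gives a per-condition pass/fail for the prompt.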
Inspect the saved per-problem eval results. Select a benchmark, an iteration, and a specific problem to see how each condition's trained adapter performed on it. At iter 0 all three adapter conditions share the untrained base-model baseline; differences emerge from iter 1 onward.