
# ForgeEnv 🔧

A self-improving RL environment that teaches LLMs to fix HuggingFace training scripts as the ecosystem evolves.

ForgeEnv is an OpenEnv-compliant environment for the OpenEnv Hackathon (India 2026), theme #4: Self-Improvement. Two LLM roles co-evolve inside a single environment:

- a Drift Generator that proposes realistic library-version breakages (renamed APIs, deprecated imports, changed argument signatures, dataset schema drift, tokenizer kwarg drift, …), and
- a Repair Agent that emits a unified diff to restore the script (an example episode is sketched below).
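
For concreteness, here is a minimal sketch of one episode. The seed script is hypothetical, and the drift uses a real transformers deprecation (`evaluation_strategy` was renamed to `eval_strategy`); the actual task corpus lives in `forgeenv/tasks/*`.

```python
# Illustrative episode only; not taken from the real seed corpus.
# Phase 1 (Drift Generator): the script below breaks against transformers
# releases that renamed `evaluation_strategy` -> `eval_strategy`.
broken_script = '''from transformers import TrainingArguments
args = TrainingArguments(output_dir="out", evaluation_strategy="epoch")
'''

# Phase 2 (Repair Agent): the agent answers with a minimal unified diff.
repair_patch = '''--- a/train.py
+++ b/train.py
@@ -1,2 +1,2 @@
 from transformers import TrainingArguments
-args = TrainingArguments(output_dir="out", evaluation_strategy="epoch")
+args = TrainingArguments(output_dir="out", eval_strategy="epoch")
'''
```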

The reward is multi-component (execution + AST checks + held-out evaluator), which both produces a rich training gradient and makes reward hacking expensive, following the recommendations in the Hackathon Self-Serve Guide.

## Why it matters

LLM agents that write training code are silently broken by HF library upgrades: a Trainer method is renamed, a tokenizer kwarg disappears, a dataset column is restructured. Today, humans patch these breakages by hand. ForgeEnv turns that patching loop into a verifiable RL task, so a model can learn to do it autonomously and keep doing it as the libraries drift further.

## Live links

| Artifact | URL |
| --- | --- |
| Environment Space (Docker) | https://huggingface.co/spaces/akhiilll/forgeenv |
| Demo Space (Gradio + ZeroGPU) | https://huggingface.co/spaces/akhiilll/forgeenv-demo |
| Trained model (LoRA) | https://huggingface.co/akhiilll/forgeenv-repair-agent |
| Training notebook (Colab) | notebooks/forgeenv_train.ipynb |

## Architecture

ForgeEnv is split into four deployable artifacts (two Spaces, one Jobs run, one Model repo):

- Environment Space: `akhiilll/forgeenv` (OpenEnv FastAPI server)
- Training run: Hugging Face Jobs (GPU) runs warm-start SFT + GRPO
- Model repo: `akhiilll/forgeenv-repair-agent` (LoRA + artifacts)
- Demo Space: `akhiilll/forgeenv-demo` (Gradio UI)

### End-to-end (as deployed)

```mermaid
flowchart LR
  U[User / Judge] -->|broken script + error trace| D["Demo Space\nakhiilll/forgeenv-demo"]
  D -->|unified diff patch| U

  subgraph TrainOnce["Training (HF Jobs GPU)"]
    J["Training Job\n(SFT + GRPO)"]
    E["Environment Space\nakhiilll/forgeenv"]
    M["Model Repo\nakhiilll/forgeenv-repair-agent"]
    J <-->|reset/step, obs/reward| E
    J -->|push LoRA + artifacts| M
  end

  D -. optional model usage .-> M
```

### Environment Space internals (OpenEnv server → env hub → verifier)

```mermaid
flowchart TB
  API["OpenEnv FastAPI server\nforgeenv/env/server.py\n/health + reset + step"] --> ENV["ForgeEnvironment (hub)\nforgeenv/env/forge_environment.py"]

  ENV --> TASKS["Task sampler + seed corpus\nforgeenv/tasks/*"]
  ENV --> ROLES["Roles (prompting + parsing)\nforgeenv/roles/*"]
  ENV --> PRIMS["Primitives (break + repair)\nforgeenv/primitives/*"]
  ENV --> DRIFT["Library drift engine\nforgeenv/drift/library_drift_engine.py"]
  ENV --> VERIFY["Verifiers\nvisible + held-out\nforgeenv/verifier/*"]

  VERIFY --> SANDBOX["Sandbox execution\nAST validator + simulation\nforgeenv/sandbox/*"]
```
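
To poke the deployed routes by hand, something like the following works. This is a minimal sketch assuming the OpenEnv-style HTTP endpoints named in the diagram (`/health`, reset, step); the exact payload and response fields are assumptions, not the schemas in `forgeenv/env/server.py`.

```python
# Minimal manual client for a locally running env (see Quick start, step 3).
# Route names come from the diagram above; the JSON fields ("action",
# "reward", "done") are assumptions, not the real schema.
import requests

BASE = "http://localhost:7860"

assert requests.get(f"{BASE}/health").ok

obs = requests.post(f"{BASE}/reset", json={}).json()   # broken script + error trace
print(obs)

result = requests.post(f"{BASE}/step", json={"action": "<unified diff here>"}).json()
print(result.get("reward"), result.get("done"))
```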

### Training pipeline internals (what actually runs today)

In the current codebase, the Repair Agent (Solver) GRPO loop is fully implemented. The Drift Generator (Challenger) GRPO logic exists as a reward loop plus a CPU dry-run, but full "LLM Drift GRPO" is intentionally not wired up as a single-GPU training path yet.

```mermaid
flowchart TB
  SETUP["Install deps\n(torch/trl/unsloth/openenv…)"] --> SFT["SFT warmstart\nformat + basics"]
  SFT --> SAVE1[Save SFT adapter]
  SAVE1 --> GRPO_REPAIR["GRPO Repair Agent (Solver)\nforgeenv/training/grpo_repair.py"]
  GRPO_REPAIR <-->|episodes + rewards| ENVSPACE["Env Space\nakhiilll/forgeenv"]
  GRPO_REPAIR --> PUSH["Upload\nadapter + tokenizer + plots + repair_library"]
  PUSH --> HUB["Model Repo\nakhiilll/forgeenv-repair-agent"]
```
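
For orientation, a stripped-down Solver loop with TRL might look like the sketch below. The model id, hyperparameters, dataset, and reward wiring are all illustrative; the real loop lives in `forgeenv/training/grpo_repair.py`.

```python
# Hedged sketch of the GRPO Repair Agent loop with TRL; everything concrete
# here (model id, dataset, weights) is illustrative, not the project's config.
from trl import GRPOConfig, GRPOTrainer

def repair_reward(completions, **kwargs):
    # In ForgeEnv this would step the environment with each candidate diff
    # and return the multi-component visible reward per completion.
    return [score_with_env(c) for c in completions]  # score_with_env: hypothetical

config = GRPOConfig(
    output_dir="grpo-repair",
    num_generations=8,          # candidate diffs sampled per broken script
    max_completion_length=512,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",   # illustrative base model
    reward_funcs=repair_reward,
    args=config,
    train_dataset=broken_script_prompts,  # hypothetical prompt dataset of env observations
)
trainer.train()
```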

### Target architecture (two-role co-evolution: Challenger/Solver)

This is the intended architecture, following R-Zero / SPIRAL-style self-play:

```mermaid
flowchart TB
  SFT2[SFT warmstart] --> CH["GRPO Drift Generator (Challenger)"]
  CH --> FILTER["Filter/select breakages\nusing p_hat from multiple solver attempts"]
  FILTER --> SOLVER["GRPO Repair Agent (Solver)"]
  SOLVER --> CH
```

The two-step episode flow (Phase 1 = drift, Phase 2 = repair) is the core Challenger/Solver loop: generate a hard breakage → attempt a repair → score it. The difficulty filter is sketched below.
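
A minimal sketch of that filter, assuming `p_hat` is the Solver's empirical success rate over several repair attempts on one generated breakage (the R-Zero-style uncertainty reward):

```python
# R-Zero-style uncertainty reward: breakages the Solver fixes about half the
# time are most informative; trivial (p_hat ~ 1) and impossible (p_hat ~ 0)
# breakages both score near zero. The threshold below is illustrative.
def uncertainty_reward(p_hat: float) -> float:
    return 1.0 - 2.0 * abs(p_hat - 0.5)

def estimate_p_hat(successes: int, attempts: int) -> float:
    # Empirical solve rate over multiple Solver attempts on one breakage.
    return successes / max(attempts, 1)

def keep_breakage(p_hat: float, min_reward: float = 0.5) -> bool:
    return uncertainty_reward(p_hat) >= min_reward
```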

## Reward design

```text
visible_reward
 ├─ execution_success        (sandboxed run / heuristic simulator)
 ├─ ast_well_formed          (parses + no forbidden globals)
 ├─ format_compliance        (valid unified diff or full-script replacement)
 ├─ minimality               (smaller diffs preferred; anti-rewrite)
 └─ no_forbidden_globals     (locked-down execution check)

held_out_evaluator (NOT used for training; used for evals only)
 ├─ executed_cleanly
 ├─ matches_target_api       (semantic correctness)
 └─ regression_free          (other tests still pass)
```

Multiple independent components, plus a held-out evaluator the trainer never sees, make it hard for the agent to game its way to the top of the curve. A sketch of one possible aggregation follows.
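
As a shape for how the visible components might combine, here is a minimal weighted-sum sketch. The weights and overall scale are pure assumptions (the real aggregation lives in `forgeenv/verifier/*`; note the Results table below reports means above 1.0):

```python
# Illustrative aggregation only: the component names mirror the tree above,
# but the weights and scale are assumptions, not the real verifier's.
VISIBLE_WEIGHTS = {
    "execution_success": 1.0,
    "ast_well_formed": 0.5,
    "format_compliance": 0.25,
    "minimality": 0.25,
    "no_forbidden_globals": 0.5,
}

def visible_reward(scores: dict[str, float]) -> float:
    """Combine per-component scores (each in [0, 1]) into one scalar reward."""
    return sum(w * scores.get(name, 0.0) for name, w in VISIBLE_WEIGHTS.items())
```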

## Results (50 episodes per agent; oracle as upper-bound proxy for trained)

After warm-start SFT + GRPO, the trained Repair Agent (here scored via its oracle upper-bound proxy) beats the no-op baseline on both metrics we track:

| Agent | Mean visible reward | Success rate (held-out exec) |
| --- | --- | --- |
| Baseline (no-op) | 0.90 | 50% |
| Trained (oracle) | 1.51 | 86% |

Three plots (committed to artifacts/plots/):

- `baseline_vs_trained.png`: reward distribution, baseline vs trained.
- `training_reward_curve.png`: reward trajectory across episodes.
- `success_by_category.png`: per-primitive success rates.

A 43-entry repair_library.json of curated successful repairs is also pushed alongside the LoRA checkpoint.

## Quick start

```bash
# 1. install (env-only deps, no torch needed for the env itself)
pip install -e .[openenv]
pip install -e .[dev]

# 2. run the test suite
pytest -q                 # 74 tests: full env + roles + reward + training

# 3. spin up the environment locally
uvicorn forgeenv.env.server:app --port 7860

# 4. generate the demo artifacts (plots + repair_library.json + eval JSON)
python scripts/generate_artifacts.py --n_baseline 50 --n_trained 50

# 5. push to HF Spaces
export HF_TOKEN=hf_...
python scripts/deploy_spaces.py --user akhiilll
```

Training can run via:

- HF Jobs GPU: `scripts/jobs/train_repair_agent.py` (what we used for the successful run)
- Notebook: `notebooks/forgeenv_train.ipynb` (useful for iteration)

## Repository layout

```text
forgeenv/                       # importable Python package (env + roles + training)
  env/                          # OpenEnv wrapper: actions, observations, server
  sandbox/                      # AST validator + heuristic simulator
  verifier/                     # visible verifier + held-out evaluator
  primitives/                   # 8 breakage + 8 repair primitives + drift taxonomy
  tasks/                        # 10-script HF seed corpus + sampler
  roles/                        # Drift Generator + Repair Agent + Teacher
  drift/                        # Library drift engine (non-stationary verification)
  training/                     # SFT, GRPO repair, GRPO drift, rollout, plots
  artifacts/                    # repair-library curation
forgeenv-space/                 # files we push to the OpenEnv Space (Docker)
demo-space/                     # files we push to the Gradio demo Space
notebooks/forgeenv_train.ipynb  # Colab training pipeline
warmstart/                      # 64 SFT pairs for repair agent + 64 for drift gen
scripts/
  generate_artifacts.py         # plots + eval_results.json + repair_library.json
  deploy_spaces.py              # one-shot push to HF Spaces
artifacts/                      # generated plots + curated repair library
tests/                          # 74 pytest tests
```

## Anti-cheat / reward-hacking safeguards

Following the Hackathon Self-Serve Guide explicitly:

1. Multiple independent reward functions (5 visible + 3 held-out).
2. Held-out evaluator the trainer never sees, used only for plots.
3. Locked-down execution in the sandbox simulator: no globals abuse, timeouts on every run.
4. AST validator rejects forbidden constructs (network calls, `os.system`, etc.) before any reward is computed (see the sketch below).
5. Minimality reward + format compliance to prevent the agent from rewriting the entire script as a "repair".
6. The Drift Generator is itself trained against an R-Zero composite reward (uncertainty minus repetition), so it can't trivially game the agent.
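
Safeguard 4 can be pictured with a few lines of stdlib `ast`. This is a toy version with an illustrative denylist; the real validator in `forgeenv/sandbox/*` is more thorough.

```python
# Toy forbidden-construct check: reject scripts that import or call anything
# on a denylist before any reward is computed. Denylists are illustrative.
import ast

FORBIDDEN_CALLS = {"eval", "exec", "system", "popen"}
FORBIDDEN_IMPORTS = {"socket", "subprocess"}

def violates_sandbox(source: str) -> bool:
    """Return True if the script must be rejected before reward is computed."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return True  # unparseable scripts also fail ast_well_formed
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            mods = [alias.name for alias in node.names]
            mods.append(getattr(node, "module", None))
            if any(m and m.split(".")[0] in FORBIDDEN_IMPORTS for m in mods):
                return True
        elif isinstance(node, ast.Call):
            fn = node.func
            name = fn.attr if isinstance(fn, ast.Attribute) else getattr(fn, "id", None)
            if name in FORBIDDEN_CALLS:
                return True
    return False

# violates_sandbox("import os\nos.system('rm -rf /')")  -> True
```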

## References

- Huang et al., *R-Zero: Self-Evolving Reasoning LLM From Zero Data*, 2025.
- Zhao et al., *Absolute Zero: Reinforced Self-play Reasoning with Zero Data*, 2025.
- Liu et al., *SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning…*, 2025.
- Ibrahim et al., *Reward engineering & shaping*, arXiv:2408.10215.
- Masud et al., *Reward engineering for RL in software tasks*, arXiv:2601.19100.
- OpenEnv Hackathon Self-Serve Guide, 2026.

## License

Apache-2.0
