Concentrate or Collapse: When Reinforcement Learning Meets Diffusion Language Models for Web Planning


Most AI agents have never failed at anything.

They learn by copying. We show them expert demonstrations, they reproduce the patterns, and we call it training. But a model that has only ever seen success has no concept of what failure looks like, or how close it was to getting things right.

Two final projects I completed this semester for my research courses challenge this assumption from different angles. Both tackle web form filling: teaching small language models to navigate real websites, fill fields, click buttons, and submit forms.

The first project, “Browser in the Loop” (doi(dot)org/10.13140/RG.2.2.24922.71360), puts an 8-billion-parameter model in a feedback loop with a real browser. Instead of only imitating expert demonstrations, the model generates action plans, executes them against live web forms, and learns from the outcome. The result: reinforcement learning converts near-perfect attempts (all fields correct, submission failed) into actual successes. The gains come not from filling fields better, but from learning to cross the finish line, something imitation alone never optimized for.
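The shift the post describes, from imitation loss to outcome-based reward, can be sketched in a few lines. This is an illustrative sketch only, not the project's actual code: the names (`FormResult`, `outcome_reward`, `shaped_reward`) are hypothetical, and the 0.5/0.5 shaping weights are an arbitrary choice for the example.

```python
# Hypothetical sketch of a browser-in-the-loop training signal.
# FormResult, outcome_reward, and shaped_reward are illustrative names,
# not the API of the "Browser in the Loop" project.
from dataclasses import dataclass


@dataclass
class FormResult:
    """What the browser reports back after executing an action plan."""
    fields_correct: int
    fields_total: int
    submitted: bool


def outcome_reward(result: FormResult) -> float:
    """Binary task reward: only a submitted form counts as success.

    An imitation loss instead scores token overlap with an expert
    trajectory, so it never specifically penalizes a missed final
    submit click -- the failure mode RL fixed in the post.
    """
    return 1.0 if result.submitted else 0.0


def shaped_reward(result: FormResult) -> float:
    """Denser variant: partial credit for correct fields plus a bonus
    for actually submitting (weights here are arbitrary)."""
    field_score = result.fields_correct / max(result.fields_total, 1)
    return 0.5 * field_score + 0.5 * float(result.submitted)
```

Under `outcome_reward`, the "near-perfect attempt" from the post (all fields correct, submission failed) scores 0.0, exactly the gradient signal that pushes the model to cross the finish line.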

The second project, “Concentrate or Collapse” (doi(dot)org/10.13140/RG.2.2.11500.94088), asks a harder question: what if the model does not generate actions left to right at all? Diffusion language models refine entire action sequences in parallel, like a sculptor shaping clay simultaneously from all angles. But applying the same RL that works for autoregressive models causes these diffusion models to collapse: their outputs degrade into incoherence. Across 16 controlled comparisons, token-level RL improved results in only two. The fix required rethinking optimization at the sequence level, where one method (ESPO) finally broke through for pure diffusion architectures.
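The token-level versus sequence-level distinction comes down to where importance ratios are computed. The following is a minimal sketch of that aggregation difference only, not an implementation of ESPO itself (whose details the post does not spell out); the intuition it encodes, that per-token likelihoods of diffusion LMs are noisy approximations while a whole-sequence ratio averages that noise out, is an assumption stated for illustration.

```python
# Hedged sketch: per-token vs whole-sequence importance ratios.
# This illustrates the aggregation difference only; it is NOT ESPO.
import math


def token_level_ratios(logp_new: list[float], logp_old: list[float]) -> list[float]:
    """PPO-style per-token importance ratios exp(logp_new - logp_old).

    For diffusion LMs, per-token log-likelihoods are only loose
    approximations, so each ratio can be individually very noisy --
    one plausible reason token-level RL collapses these models.
    """
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def sequence_level_ratio(logp_new: list[float], logp_old: list[float]) -> float:
    """A single ratio for the whole sequence: sum log-probs first,
    then exponentiate. Per-token noise partially cancels in the sum,
    which is the intuition behind sequence-level objectives."""
    return math.exp(sum(logp_new) - sum(logp_old))
```

With `logp_new = [-1.0, -2.0]` and `logp_old = [-1.5, -1.5]`, the two token ratios swing to `e^0.5` and `e^-0.5` while the sequence ratio is exactly 1.0: the policies agree on the sequence even though they disagree token by token.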

The thread connecting both: we have been grading AI agents on how well they mimic experts rather than how well they accomplish the actual task. When we shift the objective from “reproduce this demonstration” to “did the form actually get submitted,” the training signal changes fundamentally. And when we change the generation paradigm itself, the RL algorithms we took for granted stop working entirely.

The uncomfortable implication for the field: most web agent benchmarks still evaluate on text similarity to reference trajectories. These projects suggest that what looks correct on paper and what actually works in a browser are different problems, and optimizing for the wrong one leaves performance on the table.

All 12 trained models and the full training pipeline are open-sourced:

Code: github(dot)com/billy-enrizky/openbrowser-ai

Models: huggingface(dot)co/billyenrizky

submitted by /u/Bright_Comedian_7528
