Scaling Agentic Browsing: A Comparative Study of Supervised vs. Reinforcement Fine-Tuning for Web Form-Filling Agents
Training a language model to automate web browsing is difficult because small mistakes cascade: if the model clicks the wrong button, the entire page changes and all subsequent actions fail. Supervised fine-tuning (SFT) teaches a model to copy human demonstrations, but the model never learns to recover from its own mistakes. We investigate whether reinforcement learning, specifically Group Relative Policy Optimization (GRPO), can improve a fine-tuned model by letting it practice on real web forms and learn from the outcomes. We use QLoRA (a memory-efficient fine-tuning method) to adapt Qwen3-8B (an 8-billion-parameter language model) for form-filling tasks on the FormFactory benchmark, which contains 1,250 tasks across eight domains such as healthcare, finance, and legal compliance. Our two-phase pipeline first trains the model via SFT on 992 demonstrations to learn the structured output format, then applies online GRPO, in which the model generates action plans, executes them in a real browser, and receives a reward based on how well it fills and submits the form. On 124 held-out validation tasks, GRPO achieves an average reward of 0.670 versus SFT's 0.614, a 9.1% relative improvement; on a separate 124-example test split, GRPO improves by 5.4% (0.669 vs. 0.635). We find that SFT initialization is essential: without it, GRPO never discovers the structured action format and earns zero reward. We also show that naïve SFT with a high learning rate memorizes training patterns and degrades generalization; a combination of a reduced learning rate, early stopping, and data shuffling yields the strongest SFT baseline for GRPO to build upon.