WIST: Web-Grounded Iterative Self-Play Tree for Domain-Targeted Reasoning Improvement
arXiv:2603.22352v1 Announce Type: new Abstract: Recent progress in reinforcement learning with verifiable rewards (RLVR) offers a practical path to self-improvement of language models, but existing methods face a key trade-off: endogenous self-play can drift over iterations, while corpus-grounded approaches rely on curated data environments. We present textbf{WIST}, a textbf{W}eb-grounded textbf{I}terative textbf{S}elf-play textbf{T}ree framework for domain-targeted reasoning improvement that learns directly from the open web without requiring any pre-arranged domain corpus. WIST incrementally expands a domain tree for exploration, […]