Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR
arXiv:2511.01937v2 Announce Type: replace-cross
Abstract: Large language models (LLMs) trained for step-by-step reasoning often become excessively verbose, raising inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out “easy” problems for training efficiency, leaving the model to train primarily on harder problems that require longer reasoning chains. This skews the output length distribution upward, resulting in a model that conflates “thinking longer” with “thinking better”. In this work, we show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer. Exposing the model to solvable short-chain tasks constrains its output distribution and prevents runaway verbosity. The result is emergent brevity for free: the model learns to solve harder problems without inflating its output length, despite the absence of any explicit length penalization. RLVR experiments using this approach on Qwen3-4B-Thinking-2507 (with a 16k token limit) match baseline pass@1 accuracy on AIME25 while generating solutions that are, on average, nearly half as long. The code is available on GitHub (https://github.com/MBZUAI-Paris/Frugal-AI), with datasets and models on Hugging Face (https://huggingface.co/collections/MBZUAI-Paris/k2-think-mini-68dcfa8b114686a4bd3dc2bc).
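
To make the data-curation idea concrete, below is a minimal Python sketch of what "retaining and up-weighting moderately easy problems" could look like in an RLVR pipeline. It is an illustration under assumptions, not the authors' released implementation (see the GitHub link above for that): the function names (`build_training_pool`, `sample_batch`), the pass-rate bands, and the `easy_weight` value are all hypothetical choices made for exposition.

```python
import random

def build_training_pool(problems, easy_weight=2.0,
                        easy_band=(0.6, 1.0), hard_band=(0.0, 0.6)):
    """Retain moderately easy problems and up-weight them, rather than
    filtering them out as standard RLVR pipelines often do.

    `problems` is a list of dicts, each with an estimated `pass_rate`:
    the fraction of sampled solutions the verifier accepts.
    (Hypothetical schema; the bands and weight are illustrative.)
    """
    pool = []
    for p in problems:
        r = p["pass_rate"]
        if r >= 1.0:
            continue  # trivially solved: no learning signal
        if easy_band[0] <= r < easy_band[1]:
            # Moderately easy: keep and up-weight. Their short verified
            # chains act as an implicit length regularizer.
            pool.append((p, easy_weight))
        elif hard_band[0] <= r < hard_band[1]:
            pool.append((p, 1.0))  # hard: keep at base weight
    return pool

def sample_batch(pool, batch_size, rng=random):
    """Draw a training batch with probability proportional to weight."""
    items, weights = zip(*pool)
    return rng.choices(items, weights=weights, k=batch_size)
```

Under this sketch, batches mix short-chain and long-chain tasks in a controlled ratio, so the reward signal never comes exclusively from problems whose correct solutions are long.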