Solver-in-the-Loop: MDP-Based Benchmarks for Self-Correction and Behavioral Rationality in Operations Research
arXiv:2601.21008v1 Announce Type: new
Abstract: Operations Research (OR) practitioners routinely debug infeasible models through an iterative process: analyzing Irreducible Infeasible Subsystems (IIS), identifying constraint conflicts, and systematically repairing formulations until feasibility is achieved. Yet existing LLM benchmarks evaluate OR as one-shot translation (given a problem description, generate solver code), ignoring this diagnostic loop entirely. We introduce two benchmarks that place the solver in the evaluation loop. ORDebug evaluates iterative self-correction on 5,000+ problems spanning 9 error types; each repair action triggers solver re-execution and IIS recomputation, providing deterministic, verifiable feedback. ORBias evaluates behavioral rationality on 2,000 newsvendor instances (1,000 in-distribution (ID) and 1,000 out-of-distribution (OOD)), measuring systematic deviations from closed-form optimal policies. Across 26 models and 12,000+ samples, we find that domain-specific RLVR (reinforcement learning with verifiable rewards) training enables an 8B model to surpass frontier APIs: 95.3% vs. 86.2% recovery rate (+9.1 points), 62.4% vs. 47.8% diagnostic accuracy (+14.6 points), and 2.25 vs. 3.78 steps to resolution (1.7x faster). On ORBias, curriculum training achieves the only negative ID-to-OOD bias drift among the models evaluated (-9.6%), reducing systematic bias by 48% (from 20.0% to 10.4%). These results demonstrate that process-level evaluation with verifiable oracles enables targeted training that outperforms scale.
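To make the "solver in the evaluation loop" idea concrete, below is a minimal sketch of an IIS-driven repair loop. It assumes Gurobi (gurobipy) as the backend solver and a hypothetical propose_fix callback standing in for the LLM agent; the abstract does not specify the solver, the repair-action interface, or the exact termination criteria used by ORDebug.

```python
# Minimal sketch of a solver-in-the-loop repair oracle (assumed gurobipy backend;
# propose_fix is a hypothetical stand-in for the LLM agent's repair policy).
import gurobipy as gp
from gurobipy import GRB


def iis_constraints(model: gp.Model) -> list[str]:
    """Return the names of constraints in an Irreducible Infeasible Subsystem."""
    model.computeIIS()
    return [c.ConstrName for c in model.getConstrs() if c.IISConstr]


def repair_loop(model: gp.Model, propose_fix, max_steps: int = 10) -> int:
    """Solve, recompute the IIS, apply an agent-proposed repair, and re-solve
    until the model leaves the infeasible state. Returns the number of steps used."""
    for step in range(1, max_steps + 1):
        model.optimize()
        if model.Status != GRB.INFEASIBLE:
            return step  # feasible (or otherwise resolved): the loop terminates
        conflict = iis_constraints(model)      # deterministic, verifiable feedback
        repair = propose_fix(model, conflict)  # agent chooses a repair action
        repair(model)                          # hypothetical callable that edits the model in place
        model.update()
    return max_steps
```

On the ORBias side, the standard closed-form oracle for a newsvendor instance is the critical-fractile quantity q* = F^{-1}(c_u / (c_u + c_o)), where F is the demand distribution and c_u, c_o are underage and overage costs; the benchmark presumably scores a model's ordered quantity by its deviation from this optimal policy.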