Stochastic Matching Bandits with Rare Optimization Updates
arXiv:2509.04194v2 Announce Type: replace
Abstract: We introduce a bandit framework for stochastic matching under the multinomial logit (MNL) choice model. In our setting, $N$ agents on one side are assigned to $K$ arms on the other side, where each arm stochastically selects an agent from its assigned pool according to unknown preferences and yields a corresponding reward over a horizon $T$. The objective is to minimize regret by maximizing the cumulative revenue from successful matches. A naive approach requires solving an NP-hard combinatorial optimization problem at every round, resulting in a prohibitive computational cost. To address this challenge, we propose batched algorithms that strategically limit the number of times matching assignments are updated to $Theta(loglog T)$ over the entire horizon. By invoking expensive combinatorial optimization only on a vanishing fraction of rounds, our algorithms substantially reduce overall computational overhead while still achieving a regret bound of $widetilde{mathcal{O}}(sqrt{T})$.