Batched Stochastic Linear Bandits with 1-Bit Communication Constraints

arXiv:2605.30976v1 Announce Type: new
Abstract: We study stochastic linear bandits under a natural combination of batching and communication constraints: the time horizon is partitioned into batches of equal size $B$, and during each batch the learner sends $B$ requested arm pulls to an agent, who then observes the corresponding $B$ rewards and responds with a single bit of feedback to the learner. For each batch, the learner specifies the 1-bit quantization rule the agent uses, which may depend on all previously received bits but not on any past rewards directly. This setting addresses a significant yet unexplored “middle ground” between previous models having per-round quantization only or total bit budgets only. We establish a minimax lower bound showing that $Omega(Bmin{d,loglvert mathcal{A} rvert})$ regret is unavoidable due to the 1-bit communication bottleneck, even in the absence of noise. Combined with standard statistical limits, this yields a general lower bound of $widetilde{Omega}(Bmin{d,loglvert mathcal{A} rvert} + sqrt{dT min{d,loglvert mathcal{A} rvert}})$. We develop two phased-elimination algorithms based on $G$-optimal designs and 1-bit mean estimation. The first achieves $widetilde{O}(dB + dsqrt{T})$ regret, matching the lower bound up to logarithmic factors when $lvert mathcal{A} rvert = exp(Omega(d))$, and the second incorporates a safe-arm identification and warm-start procedure to obtain $widetilde{O}(Bloglvert mathcal{A} rvert + d^{3/2}sqrt{B} + sqrt{dTloglvert mathcal{A} rvert})$ regret, which is near-optimal in broad scaling regimes of $(lvert mathcal{A} rvert, B, d, T)$. Together, our results demonstrate that a single bit of feedback per batch suffices to nearly match the minimax regret of unconstrained linear bandits in broad scaling regimes, even for batch sizes as large as $Theta(sqrt{T})$.

Liked Liked