Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits
arXiv:2502.06051v3 Announce Type: replace-cross Abstract: Many offline reinforcement learning algorithms are underpinned by $f$-divergence regularization, but their sample complexity *defined with respect to regularized objectives* still lacks tight analyses, especially in terms of concrete data coverage conditions. In this paper, we study the exact concentrability requirements to achieve the $tilde{Theta}(epsilon^{-1})$ sample complexity for offline $f$-divergence-regularized contextual bandits. For reverse Kullback-Leibler (KL) divergence, arguably the most commonly used one, we achieve an $tilde{O}(epsilon^{-1})$ sample complexity under single-policy concentrability for […]