DOLCE: Decomposing Off-Policy Evaluation/Learning into Lagged and Current Effects

arXiv:2505.00961v3 Announce Type: replace
Abstract: Off-policy evaluation and learning in contextual bandits use logged interaction data to estimate and optimize the value of a target policy. Most existing methods require sufficient action overlap between the logging and target policies, and violations of this requirement can bias value and policy-gradient estimates. To address this issue, we propose DOLCE (Decomposing Off-policy evaluation/learning into Lagged and Current Effects), which uses only lagged contexts already stored in bandit logs to construct lag-marginalized importance weights and to decompose the objective into a support-robust lagged correction term and a current, model-based term. This decomposition yields bias cancellation when the reward-model residual is conditionally mean-zero given the lagged context and action. With multiple candidate lags, DOLCE softly aggregates lag-specific estimates, and we introduce a moment-based training procedure that promotes the desired invariance using only logged, lag-augmented data. We show that DOLCE is unbiased in an idealized setting and that, with cross-fitting, it yields consistent and asymptotically normal estimates under standard conditions. Our experiments demonstrate that DOLCE achieves substantial improvements in both off-policy evaluation and learning, particularly as the proportion of individuals who violate support increases.
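
The abstract's pairing of a lag-based correction term with a current, model-based term suggests a doubly-robust-style combination. The sketch below shows one way such an estimate could be assembled from logged data; the function name dolce_style_estimate, the argument names (q_hat, w_lag, pi_target_probs), and the exact combination are illustrative assumptions, not the paper's definitions.

import numpy as np

def dolce_style_estimate(a, r, q_hat, w_lag, pi_target_probs):
    """
    Hedged sketch of a value estimate decomposed into a current,
    model-based term and a lagged, importance-weighted correction term.

    a:               logged actions, shape (n,)
    r:               observed rewards, shape (n,)
    q_hat:           reward-model predictions, shape (n, num_actions)
    w_lag:           lag-marginalized importance weights for the logged
                     (lagged context, action) pairs, shape (n,)
    pi_target_probs: target-policy action probabilities, shape (n, num_actions)
    """
    n = r.shape[0]
    # Current, model-based term: expected reward under the target policy
    # according to the fitted reward model.
    direct_term = np.sum(pi_target_probs * q_hat, axis=1)
    # Lagged correction term: importance-weighted reward-model residual
    # on the logged actions.
    residual = r - q_hat[np.arange(n), a]
    correction_term = w_lag * residual
    return np.mean(direct_term + correction_term)

Under this reading, if the residual has conditional mean zero given the lagged context and action, the correction term offsets the bias of the model-based term in expectation, which is the cancellation the abstract refers to.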
