A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies
arXiv:2510.16132v2 Announce Type: replace-cross
Abstract: In this work, we present the first finite-time analysis of Q-learning with time-varying learning policies (i.e., on-policy sampling) for discounted Markov decision processes under minimal assumptions, requiring only the existence of a policy that induces an irreducible Markov chain over the state space. We establish a last-iterate convergence rate for $\mathbb{E}[\|Q_k - Q^*\|_\infty^2]$, implying a sample complexity of order $\mathcal{O}(1/\xi^2)$ for achieving $\mathbb{E}[\|Q_k - Q^*\|_\infty] \le \xi$. This matches the rate of off-policy Q-learning, but with worse dependence on exploration-related parameters. We also derive a finite-time rate for $\mathbb{E}[\|Q^{\pi_k} - Q^*\|_\infty^2]$, where $\pi_k$ is the learning policy at iteration $k$, highlighting the exploration-exploitation trade-off in on-policy Q-learning. While exploration is weaker than in off-policy methods, on-policy learning enjoys an exploitation advantage as the learning policy converges to an optimal one. Numerical results support our theory. Technically, rapidly time-varying learning policies induce time-inhomogeneous Markovian noise, creating significant analytical challenges under minimal exploration. To address this, we develop a Poisson-equation-based decomposition of the Markovian noise under a lazy transition matrix, separating it into a martingale-difference term and residual terms. The residuals are controlled via sensitivity analysis of the Poisson equation solution with respect to both the Q-function estimate and the learning policy. These techniques may extend to other RL algorithms with time-varying policies, such as single-timescale actor-critic methods and learning-in-games algorithms.