Thompson Sampling for Infinite-Horizon Discounted Decision Processes
arXiv:2405.08253v3 Announce Type: replace
Abstract: This paper develops a viable notion of learning for sampling-based algorithms that applies in broader settings than previously considered. More specifically, we model discounted infinite-horizon MDPs with Borel state and action spaces whose rewards and transitions depend on an unknown parameter. To analyze adaptive learning algorithms based on sampling, we introduce a general canonical probability space for this setting. Since standard definitions of regret are inadequate for policy evaluation here, we propose new metrics that arise from decomposing the standard expected regret in discounted infinite-horizon MDPs into three terms: (i) the expected finite-time regret, (ii) the expected state regret, and (iii) the expected residual regret. Component (i) corresponds to the traditional notion of expected regret over a finite horizon. Term (ii) reflects how much future performance is compromised at a given time because earlier decisions have led the system to a less favorable state than under an optimal policy. Finally, metric (iii) measures regret with respect to the optimal reward from the current period onward, disregarding the irreversible consequences of past decisions. We further disaggregate this term by introducing the probabilistic residual regret, a finer, sample-path version of (iii) that captures the remaining loss in future performance from the current period onward, conditional on the observed history; its expectation coincides with (iii). We then focus on Thompson sampling (TS): under assumptions that extend those used in prior work on finite state and action spaces to the Borel setting, we show that component (iii) for TS converges to zero exponentially fast. We further show that, under mild conditions ensuring the existence of the relevant limits, its probabilistic counterpart converges to zero almost surely and that TS achieves complete learning.
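A schematic may help fix ideas about the three-term decomposition. The notation below (cutoff time T, discount factor \gamma, reward function r, states s_t^* and actions a_t^* under an optimal policy, states s_t and actions a_t under the learning policy \pi, and optimal value function V^*) is assumed here for illustration and need not match the paper's exact definitions:

```latex
% Schematic only: a cutoff-time-T decomposition of expected discounted regret,
% with notation assumed for illustration rather than taken from the paper.
\mathbb{E}[\mathrm{Regret}_{\infty}]
 = \underbrace{\mathbb{E}\!\left[\sum_{t=0}^{T-1}\gamma^{t}\bigl(r(s_t^{*},a_t^{*})-r(s_t,a_t)\bigr)\right]}_{\text{(i) expected finite-time regret}}
 + \underbrace{\mathbb{E}\!\left[\gamma^{T}\bigl(V^{*}(s_T^{*})-V^{*}(s_T)\bigr)\right]}_{\text{(ii) expected state regret}}
 + \underbrace{\mathbb{E}\!\left[\gamma^{T}\bigl(V^{*}(s_T)-V^{\pi}(s_T)\bigr)\right]}_{\text{(iii) expected residual regret}}
```

Read this way, term (ii) isolates the value lost because the learner occupies a worse state than the optimal trajectory would, while term (iii) compares optimal and learned continuation values from the same state, matching the abstract's description of residual regret as disregarding the irreversible consequences of past decisions.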
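The generic TS loop itself is simple to sketch. The following minimal example replaces the paper's Borel-space setting with a finite state/action space and a finite parameter set purely for illustration; the tables r and P, the true parameter true_theta, and the per-step posterior update are hypothetical simplifications (in particular, the update conditions only on observed transitions, not on rewards):

```python
# Minimal Thompson sampling sketch for a parameterized finite MDP.
# The Borel-space setting of the paper is replaced, for illustration only,
# by a finite state/action space and a finite set of candidate parameters.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9                      # discount factor
n_states, n_actions = 3, 2
thetas = [0, 1]                  # finite set of candidate parameters
true_theta = 1                   # unknown to the learner

# Hypothetical reward and transition tables indexed by the parameter.
r = {th: rng.uniform(0.0, 1.0, (n_states, n_actions)) for th in thetas}
P = {th: rng.dirichlet(np.ones(n_states), (n_states, n_actions)) for th in thetas}

def optimal_policy(th, iters=500):
    """Value iteration under the sampled parameter th."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = r[th] + gamma * P[th] @ V   # Q[s, a]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

posterior = np.ones(len(thetas)) / len(thetas)   # uniform prior
s = 0
for t in range(200):
    th = rng.choice(thetas, p=posterior)         # sample a parameter
    policy = optimal_policy(th)                  # act optimally under it
    a = policy[s]
    s_next = rng.choice(n_states, p=P[true_theta][s, a])
    # Bayes update: likelihood of the observed transition under each theta.
    lik = np.array([P[th_][s, a, s_next] for th_ in thetas])
    posterior = posterior * lik
    posterior /= posterior.sum()
    s = s_next

print("posterior over thetas:", posterior)
```

In this toy setting, "complete learning" would show up as the posterior concentrating on true_theta, so that the sampled parameters, and hence the played policies, eventually coincide with the optimal ones; the paper's results concern the far more general Borel-space case.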