On the Exponential Convergence for Offline RLHF with Pairwise Comparisons

arXiv:2406.12205v2 Announce Type: replace-cross
Abstract: We consider the problem of offline reinforcement learning from human feedback (RLHF) with pairwise comparisons proposed by Zhu et al. (2023), where the implicit reward is a linear function of an unknown parameter. Given an offline dataset, our objective is to ascertain the optimal action for each state, with the ultimate goal of minimizing the \emph{simple regret}. We propose an algorithm, \underline{RL} with \underline{L}ocally \underline{O}ptimal \underline{W}eights or \textsc{RL-LOW}, which yields a simple regret of the exponential form $\exp(-\Omega(n/H))$, where $n$ is the number of data samples and $H$ denotes an instance-dependent hardness quantity that depends explicitly on the suboptimality gap of each action. Furthermore, we derive a first-of-its-kind instance-dependent lower bound for offline RLHF with pairwise comparisons. Interestingly, the lower and upper bounds on the simple regret match order-wise in the exponent, demonstrating the order-wise optimality of \textsc{RL-LOW}. In view of privacy considerations in practical applications, we also extend \textsc{RL-LOW} to the setting of $(\varepsilon,\delta)$-differential privacy and show, somewhat surprisingly, that the hardness parameter $H$ is unchanged in the asymptotic regime as $n$ tends to infinity; this underscores the inherent efficiency of \textsc{RL-LOW} in preserving the privacy of the observed rewards. Given our focus on establishing instance-dependent bounds with exponential convergence, our work fills a gap in existing studies, which concentrate on worst-case regrets with \emph{inverse polynomial convergence} (e.g., $\widetilde{O}(\frac{1}{\sqrt{n}})$) for offline RLHF with pairwise comparisons.
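To make the contrast between the two convergence regimes concrete, the following minimal Python sketch compares an instance-dependent exponential bound of the form $\exp(-\Omega(n/H))$ with a worst-case inverse-polynomial bound of the form $\widetilde{O}(1/\sqrt{n})$. The constants and the hardness value used here are hypothetical placeholders chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

# Illustrative comparison (not from the paper): an instance-dependent
# exponential bound exp(-c * n / H) versus a worst-case inverse-polynomial
# bound C / sqrt(n). The constants c, C and the hardness value H are
# hypothetical placeholders used only to visualize the asymptotic contrast.

def exponential_bound(n, H, c=1.0):
    """Simple-regret bound of the exponential form exp(-Omega(n/H))."""
    return np.exp(-c * n / H)

def inverse_poly_bound(n, C=1.0):
    """Worst-case simple-regret bound of the form O~(1/sqrt(n))."""
    return C / np.sqrt(n)

if __name__ == "__main__":
    H = 50.0  # hypothetical instance-dependent hardness quantity
    for n in [100, 1_000, 10_000]:
        print(f"n={n:>6}: exponential={exponential_bound(n, H):.3e}, "
              f"inverse-poly={inverse_poly_bound(n):.3e}")
```

As the printout suggests, once $n$ is large relative to the hardness $H$, the exponential rate decays far faster than the inverse-polynomial rate, which is the sense in which the instance-dependent guarantee sharpens the worst-case one.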
