Learning Kernel-Based MDPs from Episodic Preferential Feedback
Human feedback often arrives as preferences rather than calibrated numeric rewards, motivating reinforcement learning from preferential feedback, also referred to as reinforcement learning from human feedback (RLHF). We present a rigorous theoretical study of preference-only learning in episodic kernel MDPs. In each episode, the learner deploys two policies from a common start state and receives a single binary label indicating which trajectory is preferred, modeled by a Bradley–Terry–Luce link on the difference of cumulative (unobserved) rewards. Under kernel-based […]
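The feedback model described above can be illustrated with a minimal sketch. It assumes the Bradley–Terry–Luce link is the standard logistic function applied to the difference of the two trajectories' cumulative rewards. The function names, the per-step reward lists, and the horizon of three steps are hypothetical and chosen only for illustration. In the actual setting the rewards are unobserved, and the learner sees only the binary label.

```python
import math
import random


def btl_preference_prob(rewards_a, rewards_b):
    """Probability that trajectory A is preferred to trajectory B.

    Assumes a logistic (Bradley-Terry-Luce) link applied to the
    difference of cumulative rewards. The rewards are hypothetical
    inputs; a real learner never observes them.
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Numerically stable logistic function: never exponentiates a
    # large positive number.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def sample_preference(rewards_a, rewards_b, rng):
    """Draw the single binary label for one episode: 1 if A is preferred, else 0."""
    p = btl_preference_prob(rewards_a, rewards_b)
    return 1 if rng.random() < p else 0


# Illustrative per-step rewards for two trajectories from a common start state.
rng = random.Random(0)
rewards_a = [0.5, 0.2, 0.9]
rewards_b = [0.1, 0.4, 0.3]

p = btl_preference_prob(rewards_a, rewards_b)
label = sample_preference(rewards_a, rewards_b, rng)
print(f"P(A preferred) = {p:.3f}, sampled label = {label}")
```

Because only the difference of cumulative rewards enters the link, a single label reveals nothing about how reward is distributed across the steps of either trajectory, which is what makes the preference-only setting harder than learning from numeric rewards.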