Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Modeling
arXiv:2603.22563v1

Abstract: Preference-based fine-tuning has become an important component in training large language models, and the data used at this stage may contain sensitive user information. A central question is how to design a differentially private pipeline that is well suited to the distinct structure of reinforcement learning from human feedback. We propose a privacy-preserving framework that imposes differential privacy only on reward learning and derives the final policy from the resulting private reward model. […]
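The appeal of this decoupling is that the privacy cost is paid once, during reward-model training: because differential privacy is closed under post-processing, any policy optimized against the frozen private reward model inherits the same guarantee. The following is a minimal, self-contained sketch of that first stage, not the authors' implementation: a toy feature-based reward model trained on a Bradley-Terry preference loss with DP-SGD (per-example gradient clipping plus Gaussian noise). All names, dimensions, and hyperparameters here are illustrative assumptions.

```python
# Sketch of DP reward learning for decoupled RLHF (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

DIM, CLIP, SIGMA, LR, STEPS, BATCH = 16, 1.0, 1.0, 0.1, 100, 32  # assumed toy values

# Toy reward model: scores a fixed-size (prompt, response) feature vector.
reward_model = nn.Sequential(nn.Linear(DIM, 64), nn.Tanh(), nn.Linear(64, 1))
params = list(reward_model.parameters())

def pairwise_loss(chosen, rejected):
    # Bradley-Terry preference loss: push the chosen response's reward above the rejected one's.
    return -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()

for step in range(STEPS):
    # Synthetic preference pairs stand in for sensitive user data.
    chosen = torch.randn(BATCH, DIM) + 0.5
    rejected = torch.randn(BATCH, DIM)

    # DP-SGD: clip each example's gradient to norm CLIP, sum, then add Gaussian noise.
    grad_sum = [torch.zeros_like(p) for p in params]
    for i in range(BATCH):
        loss_i = pairwise_loss(chosen[i : i + 1], rejected[i : i + 1])
        grads = torch.autograd.grad(loss_i, params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (CLIP / (norm + 1e-12)).clamp(max=1.0)
        for acc, g in zip(grad_sum, grads):
            acc.add_(g, alpha=float(scale))
    with torch.no_grad():
        for p, acc in zip(params, grad_sum):
            noisy = acc + SIGMA * CLIP * torch.randn_like(acc)
            p.add_(noisy, alpha=-LR / BATCH)

# Stage two (not shown): freeze reward_model and optimize a policy against it with any
# RL algorithm; by the post-processing property of DP, no further noise is needed there.
reward_model.eval()
```

Note that only the reward-learning loop touches the preference data, so it is the only place noise is injected; the subsequent policy-optimization stage can use non-private machinery without weakening the privacy guarantee.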