Generalization Limits of Reinforcement Learning Alignment
The safety of large language models (LLMs) relies on alignment techniques such as reinforcement learning from human feedback (RLHF). However, recent theoretical analyses suggest that reinforcement learning-based training does not acquire new capabilities but merely redistributes the utilization probabilities of existing ones. In this study, we propose “compound jailbreaks” targeting OpenAI gpt-oss-20b, which exploit the generalization failures of alignment. This approach combines multiple attack techniques — each individually defended against — to saturate the instruction hierarchy maintenance process. […]