Safety Training Persists Through Helpfulness Optimization in LLM Agents
arXiv:2603.02229v1. Abstract: Safety post-training has been studied extensively in single-step “chat” settings, where safety typically means refusing harmful requests. We study an “agentic” (i.e., multi-step, tool-use) setting, where safety means avoiding harmful actions taken directly by the LLM. We compare the effects of running direct preference optimization (DPO) on safety alone, on helpfulness alone, and on both metrics sequentially. As expected, training on one metric alone yields an extreme point along the safety-helpfulness frontier. However, unlike […]