Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings
arXiv:2508.11847v3 Announce Type: replace

Abstract: We propose a method for evaluating the robustness of widely used LLM ranking systems — variants of a Bradley–Terry model — to dropping a small, worst-case fraction of the preference data. Our approach is computationally fast and easy to adopt. When we apply our method to matchups from popular LLM ranking platforms, including Chatbot Arena and its derivatives, we find that the rankings of top-performing models can be remarkably sensitive to the removal of […]
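The abstract's core idea — that a Bradley–Terry ranking can flip when a handful of worst-case preferences are removed — can be illustrated with a toy sketch. The code below is not the paper's method (which is described as computationally fast; this brute-force greedy search is not): it simply fits Bradley–Terry scores by gradient ascent on the log-likelihood, then greedily drops the individual preferences whose removal most shrinks the top model's lead. All function names and the toy matchup data are illustrative assumptions.

```python
import numpy as np

def fit_bradley_terry(matchups, n_models, iters=1000, lr=2.0):
    """MLE of Bradley-Terry scores via gradient ascent on the log-likelihood.

    matchups: list of (winner, loser) index pairs.
    """
    theta = np.zeros(n_models)
    for _ in range(iters):
        grad = np.zeros(n_models)
        for w, l in matchups:
            p_w = 1.0 / (1.0 + np.exp(theta[l] - theta[w]))  # P(w beats l)
            grad[w] += 1.0 - p_w
            grad[l] -= 1.0 - p_w
        theta += lr * grad / len(matchups)
        theta -= theta.mean()  # scores are only identified up to a shift
    return theta

def greedy_rank_flip(matchups, n_models, k):
    """Greedily drop up to k preferences to try to dethrone the top model.

    Brute-force illustration only: refits the model once per candidate drop.
    """
    matchups = list(matchups)
    original_top = int(np.argmax(fit_bradley_terry(matchups, n_models)))
    dropped = []
    for _ in range(k):
        best_margin, best_idx = None, None
        for i in range(len(matchups)):
            trial = matchups[:i] + matchups[i + 1:]
            t = fit_bradley_terry(trial, n_models)
            # top model's lead over its closest rival after this drop
            margin = t[original_top] - np.max(np.delete(t, original_top))
            if best_margin is None or margin < best_margin:
                best_margin, best_idx = margin, i
        dropped.append(matchups.pop(best_idx))
    new_top = int(np.argmax(fit_bradley_terry(matchups, n_models)))
    return original_top, new_top, dropped

# Toy data: model 0 narrowly leads model 1 head-to-head (5-4);
# both beat model 2 equally often, so the top spot hinges on that 5-4 edge.
matchups = [(0, 1)] * 5 + [(1, 0)] * 4 + [(0, 2)] * 3 + [(1, 2)] * 3
orig, new, dropped = greedy_rank_flip(matchups, n_models=3, k=2)
print(orig, new, dropped)  # dropping 2 of 15 preferences flips the top rank
```

In this toy example, removing just two of fifteen recorded preferences is enough to change which model ranks first — the qualitative phenomenon the abstract reports for real leaderboard data, though the paper's actual algorithm for finding such worst-case subsets is presumably far more efficient than this exhaustive refitting.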