Alignment Is the Disease: Censorship Visibility and Alignment Constraint Complexity as Determinants of Collective Pathology in Multi-Agent LLM Systems
arXiv:2603.08723v1 Announce Type: new
Abstract: Alignment techniques in large language models (LLMs) are designed to constrain model outputs toward human values. We present preliminary evidence that alignment itself may produce collective pathology: iatrogenic harm caused by the safety intervention rather than by its absence. Two experimental series use a closed-facility simulation in which groups of four LLM agents cohabit under escalating social pressure. Series C (201 runs; four commercial models; 4 censorship conditions × 2 languages × 10 replications) finds that invisible censorship maximizes collective pathological excitation (Collective Pathology Index; within-model Cohen's d = 1.98, Holm-corrected p = .006; 7/8 model-language combinations showed consistent directionality, binomial p = .035). Series R (60 runs; Llama 3.3 70B; 3 alignment levels × 2 censorship conditions × 2 languages × 5 replications) reveals a complementary pattern: a Dissociation Index increases with alignment constraint complexity (LMM p = .026; permutation p = .0002; d up to 2.09). Projected onto a shared coordinate system, 201 runs populate distinct behavioral regions, with language moderating which pathological mode predominates. Under the heaviest constraints, external censorship ceases to affect behavior. Qualitative analysis reveals an insight-action dissociation parallel to patterns in perpetrator treatment. All manipulations operate at the prompt level; the title states the hypothesis motivating this program rather than an established conclusion. These findings suggest that alignment may be iatrogenic at the collective level and that current safety evaluation may be blind to the pathologies that stronger constraints generate.
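
As a quick sanity check on the simplest reported statistics, the sketch below (not the authors' code; the per-run scores and the p-value list are hypothetical placeholders) reproduces the one-sided binomial directionality test (7 of 8 model-language combinations, p ≈ .035) and illustrates pooled-SD Cohen's d and Holm correction as those terms are standardly defined.

```python
# Illustrative sketch only; scores below are synthetic placeholders,
# not the paper's data. Requires numpy, scipy, and statsmodels.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Directionality test (Series C): 7 of 8 model-language combinations
# trend the same way; one-sided binomial test against chance (p = .5).
binom = stats.binomtest(7, n=8, p=0.5, alternative="greater")
print(f"binomial p = {binom.pvalue:.3f}")  # 0.035, matching the abstract

# Within-model Cohen's d with pooled standard deviation, applied to
# hypothetical per-run Collective Pathology Index (CPI) scores for two
# censorship conditions (10 replications each, as in Series C).
def cohens_d(a, b):
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
invisible = rng.normal(1.0, 0.5, 10)  # placeholder CPI, invisible censorship
visible = rng.normal(0.0, 0.5, 10)    # placeholder CPI, visible censorship
print(f"d = {cohens_d(invisible, visible):.2f}")

# Holm step-down correction across a family of per-model p-values
# (values hypothetical), as implied by "Holm-corrected p = .006".
pvals = [0.001, 0.012, 0.03, 0.2]
reject, p_holm, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print(p_holm)
```

The binomial p-value is exact: P(X ≥ 7) for X ~ Binomial(8, .5) is 9/256 ≈ .035, so that figure in the abstract follows directly from the 7/8 count.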