Soft-Label Governance for Distributional Safety in Multi-Agent Systems

arXiv:2604.19752v1 Announce Type: new
Abstract: Multi-agent AI systems exhibit emergent risks that no single agent produces in isolation. Existing safety frameworks rely on binary classifications of agent behavior, discarding the uncertainty inherent in proxy-based evaluation. We introduce SWARM (\textbf{S}ystem-\textbf{W}ide \textbf{A}ssessment of \textbf{R}isk in \textbf{M}ulti-agent systems), a simulation framework that replaces binary good/bad labels with \emph{soft probabilistic labels} $p = P(v{=}+1) \in [0,1]$, enabling continuous-valued payoff computation, toxicity measurement, and governance intervention. SWARM implements a modular governance engine with configurable levers (transaction taxes, circuit breakers, reputation decay, and random audits) and quantifies their effects through probabilistic metrics, including expected toxicity $\mathbb{E}[1{-}p \mid \text{accepted}]$ and quality gap $\mathbb{E}[p \mid \text{accepted}] - \mathbb{E}[p \mid \text{rejected}]$. Across seven scenarios with five-seed replication, strict governance reduces welfare by over 40% without improving safety. In parallel, aggressively internalizing system externalities collapses total welfare from a baseline of $+262$ down to $-67$, while toxicity remains invariant. Circuit breakers require careful calibration; overly restrictive thresholds severely diminish system value, whereas an optimal threshold balances moderate welfare with minimized toxicity. Companion experiments show that soft metrics detect proxy gaming by self-optimizing agents that pass conventional binary evaluations. This basic governance layer applies to live LLM-backed agents (Concordia entities, Claude, GPT-4o Mini) without modification. Our results show that distributional safety requires \emph{continuous} risk metrics and that governance lever calibration involves quantifiable safety-welfare tradeoffs. Source code and project resources are publicly available at https://www.swarm-ai.org/.
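To make the two metrics concrete, here is a minimal sketch (not the SWARM implementation) of how expected toxicity $\mathbb{E}[1{-}p \mid \text{accepted}]$ and the quality gap $\mathbb{E}[p \mid \text{accepted}] - \mathbb{E}[p \mid \text{rejected}]$ could be computed from per-interaction soft labels; the function names and sample data are invented for illustration:

```python
# Illustrative sketch (not the SWARM codebase): computing the abstract's
# soft-label metrics from per-interaction probabilities p = P(v=+1).

def expected_toxicity(probs, accepted):
    """E[1 - p | accepted]: mean toxicity over accepted interactions."""
    acc = [1 - p for p, a in zip(probs, accepted) if a]
    return sum(acc) / len(acc)

def quality_gap(probs, accepted):
    """E[p | accepted] - E[p | rejected]: how much better the
    accepted interactions are than the rejected ones."""
    acc = [p for p, a in zip(probs, accepted) if a]
    rej = [p for p, a in zip(probs, accepted) if not a]
    return sum(acc) / len(acc) - sum(rej) / len(rej)

# Hypothetical data: soft labels p in [0, 1] and accept/reject decisions.
probs = [0.9, 0.8, 0.3, 0.6, 0.2]
accepted = [True, True, False, True, False]

print(expected_toxicity(probs, accepted))
print(quality_gap(probs, accepted))
```

A positive quality gap indicates the governance mechanism is admitting higher-quality interactions than it rejects; a gap near zero would suggest acceptance decisions are uncorrelated with the soft labels.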
