LLM-TOC: LLM-Driven Theory-of-Mind Adversarial Curriculum for Multi-Agent Generalization

Zero-shot generalization to out-of-distribution (OOD) teammates and opponents in open-ended multi-agent systems (MAS) remains a fundamental challenge for general-purpose AI. Existing multi-agent reinforcement learning (MARL) paradigms, such as self-play and population-based training, often collapse to a limited subset of Nash equilibria, leaving agents brittle when faced with semantically diverse, unseen behaviors. Recent approaches that invoke large language models (LLMs) at run time can improve adaptability but introduce substantial latency and can become less reliable as task horizons grow; in contrast, LLM-assisted reward-shaping methods remain constrained by the inefficiency of the inner reinforcement-learning loop. To address these limitations, we propose LLM-TOC (LLM-Driven Theory-of-Mind Adversarial Curriculum), which casts generalization as a bi-level Stackelberg game: in the inner loop, a MARL agent (the follower) minimizes regret against a fixed population, while in the outer loop an LLM serves as a semantic oracle that generates executable adversarial or cooperative strategies in a Turing-complete code space to maximize the agent’s regret. To cope with the absence of gradients in discrete code generation, we introduce Gradient Saliency Feedback, which transforms pixel-level value fluctuations into semantically meaningful causal cues to steer the LLM toward targeted strategy synthesis. We further provide PAC-Bayes guarantees showing that LLM-TOC converges at rate \( O(1/\sqrt{K}) \) and yields a tighter generalization error bound than parameter-space exploration. Experiments on the Melting Pot benchmark demonstrate that LLM-TOC consistently improves zero-shot performance over self-play baselines (IPPO, MAPPO) and the LLM-inference method Hypothetical Minds, while reducing training cost by more than 60%.
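To make the bi-level structure concrete, the following is a minimal Python sketch of the training loop as described above: an outer "semantic oracle" step that adds an LLM-synthesized strategy to the opponent population, an inner step that retrains the follower to minimize regret against that population, and a feedback step that summarizes the agent's weaknesses for the next oracle call. All names here (semantic_oracle, train_regret_minimizer, saliency_feedback, Strategy) are hypothetical placeholders standing in for the paper's components, not the authors' actual API, and the inner-loop "training" is a toy stand-in for IPPO/MAPPO.

```python
"""Sketch of the LLM-TOC bi-level (Stackelberg) loop.

All identifiers are illustrative placeholders; a real implementation would
call an LLM for strategy synthesis and a MARL algorithm for the inner loop.
"""

import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Strategy:
    """An executable opponent/teammate policy produced by the LLM (code space)."""
    name: str
    act: Callable[[int], int]  # maps an observation id to an action id


def semantic_oracle(regret_report: str, k: int) -> Strategy:
    """Outer loop (leader): stand-in for the LLM that synthesizes an adversarial
    or cooperative strategy in code space, conditioned on saliency feedback.
    Here it just returns a seeded random reactive policy as a placeholder."""
    rng = random.Random(k)
    return Strategy(name=f"llm_strategy_{k}",
                    act=lambda obs, r=rng: r.randrange(4))


def train_regret_minimizer(population: List[Strategy],
                           iters: int = 100) -> Callable[[int], int]:
    """Inner loop (follower): stand-in for the MARL agent that minimizes regret
    against the fixed population. A real implementation would run IPPO/MAPPO."""
    counts = [0] * 4
    for _ in range(iters):
        obs = random.randrange(10)
        for s in population:
            counts[s.act(obs)] += 1
    best = counts.index(max(counts))
    return lambda obs: best  # toy best-response-like policy


def saliency_feedback(agent: Callable[[int], int],
                      population: List[Strategy]) -> str:
    """Stand-in for Gradient Saliency Feedback: summarizes where the agent's
    value estimates fluctuate most, as a textual causal cue for the LLM."""
    return f"agent most exploitable by: {population[-1].name}"


def llm_toc(outer_rounds: int = 5) -> Callable[[int], int]:
    population: List[Strategy] = [
        Strategy("uniform", lambda obs: random.randrange(4))
    ]
    agent = train_regret_minimizer(population)
    report = "initial"
    for k in range(outer_rounds):
        # Outer loop: LLM proposes a regret-maximizing strategy in code space.
        population.append(semantic_oracle(report, k))
        # Inner loop: agent minimizes regret against the augmented population.
        agent = train_regret_minimizer(population)
        # Feedback: convert value fluctuations into semantic cues for the LLM.
        report = saliency_feedback(agent, population)
    return agent


if __name__ == "__main__":
    policy = llm_toc()
    print("final action for obs 0:", policy(0))
```

The key design point the sketch highlights is that the curriculum grows in a Turing-complete code space (each round appends an executable Strategy) rather than in parameter space, which is the property the paper's PAC-Bayes argument exploits.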
