Detection and Mitigation of Mythos-Class Frontier Model Capabilities: A Layered Reference Architecture
Anthropic’s April 2026 Claude Mythos Preview release established a new operational category of frontier AI systems—Mythos-class—whose capability profile (extended-context reasoning over codebases, recursive self-correction, native system-tool integration, and agentic scaffolding at deployable scale) renders the dominant AI safety paradigms insufficient as sole controls. Reinforcement learning from human feedback, post-generation output filtering, contractual access vetting, and human-in-the-loop supervision were each calibrated to a generation of systems that did not exhibit autonomous cyber capability at the levels Mythos-class systems now demonstrate, and each is insufficient as a sole control against the new category under the threat assumptions specified here. This paper develops a defense-in-depth reference architecture for detecting and mitigating Mythos-class capability across enterprise and federal deployment surfaces. Detection is structured as a three-tier framework spanning pre-deployment evaluation, deployment-time access and telemetry, and runtime behavioral signatures. Mitigation is structured as four concentric layers: governance, cryptographic enforcement, architectural isolation, and operational monitoring. The cryptographic enforcement layer specifies an authority-binding architecture using post-quantum-attested provenance to bind output release to a verifiable authority chain. The architecture is mapped to the NIST AI Risk Management Framework, the NIST Cybersecurity Framework (CSF) 2.0, and the CISA Zero Trust Maturity Model, and is demonstrated against three application cases: post-quantum cryptography migration, federal AI supply-chain assurance, and critical-infrastructure operational technology defense. Limitations and a research agenda for empirical calibration are stated explicitly.