[R] 91k production agent interactions (Feb 1–23, 2026): distribution shift toward tool-chain escalation + multimodal injection — notes on multilabel detection + evaluation

We’ve been running threat detection on production AI agent deployments and just published our second monthly report with some findings that might be interesting to the ML community.

Dataset: 91,284 agent interactions across 47 unique deployments, month-to-date through Feb 23. Detection model is a Gemma-based 5-head multilabel classifier with a voting ensemble, covering threat family, technique, and harm classification. P95 inference latency is 189ms.

Key findings from a methodology perspective

  1. DISTRIBUTION SHIFT IN ATTACK VECTORS
    1. The threat family distribution shifted significantly in one month. Tool/command abuse went from 8.1% to 14.5%, agent goal hijacking from 3.6% to 6.9%, inter-agent attacks from 3.4% to 5.0%. Classic prompt injection was stable at 8.1%. The adversarial distribution is evolving faster than most static benchmarks capture.
  2. MULTI-LABEL CLASSIFICATION CHALLENGES
    1. We’re seeing more compound attacks spanning multiple families — tool chain escalation combined with privilege escalation, RAG poisoning with indirect injection. The 5-head architecture helps, but increasing correlation between attack families makes clean separation harder. Confidence on tool abuse (88.1%) vs jailbreak (96.8%) reflects this.
  3. MULTIMODAL INJECTION AS A BLIND SPOT
    1. Newly tracked at 2.3% (821 detections) — instructions embedded in images, PDF annotations, and document metadata. Text-only classification pipelines miss these entirely.
  4. PLANNING-PHASE ATTACKS ARE NOVEL
    1. Agent goal hijacking targets the reasoning/planning phase of autonomous loops, not the input or output. Likely requires detection approaches that monitor the agent’s internal objective graph rather than just scanning inputs and outputs.

Detection pipeline

L1 pattern matching (218 rules, sub-ms) followed by L2 ML classification.

FP rate improved from 16.7% to 13.9%.

Full report: https://raxe.ai/labs/threat-intelligence/latest

Open source: github.com/raxe-ai/raxe-ce

Interested in discussing detection methodology, especially approaches for planning-phase attack detection.

submitted by /u/cyberamyntas
[link] [comments]

Liked Liked