Scaling Context Sensitivity: A Standardized Benchmark of ΔRCI Across 25 Model-Domain Runs

Large language models exhibit context-dependent behavioral patterns that vary systematically across task domains, yet standardized cross-domain measurement frameworks remain lacking. This study addresses methodological limitations in prior work by applying a rigorous 50-trial, three-condition protocol (TRUE/COLD/SCRAMBLED) uniformly across 14 models (25 model-domain runs) spanning medical (closed-goal) and philosophical (open-goal) reasoning domains. Key findings:

1. Domain means show no significant difference (philosophy 0.317 vs. medical 0.308; Mann-Whitney U=51, p=0.149), but variance differs markedly (medical SD=0.131 vs. philosophy SD=0.045).
2. 23 of 25 model-domain runs show positive ΔRCI; Gemini Flash medical is the sole negative outlier (ΔRCI=-0.133), suggesting safety-filtering interference.
3. Vendor signatures differentiate significantly once the Gemini Flash anomaly is excluded (F(7,16)=3.55, p=0.017), with Moonshot (Kimi K2) showing the highest context sensitivity and Google the lowest.
4. The expected information hierarchy (ΔRCI_COLD > ΔRCI_SCRAMBLED) holds in 24 of 25 runs (96%), validating the measurement framework.
5. Position-level analysis reveals prompt-specific variation, including a strong P30 summarization spike in the medical domain (z=3.74).

These results establish ΔRCI as a robust, domain-general metric for context sensitivity and provide the foundation for deeper analyses of temporal dynamics and information-theoretic mechanisms.
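The three-condition comparison can be sketched as follows. This is a minimal illustration, not the study's implementation: the run names and numeric scores are invented placeholders, and ΔRCI is assumed here to be the TRUE-condition consistency score minus the score under a degraded-context control (COLD or SCRAMBLED), consistent with the hierarchy prediction ΔRCI_COLD > ΔRCI_SCRAMBLED described above.

```python
# Hypothetical per-run consistency scores under the three conditions.
# TRUE = full context, COLD = no context, SCRAMBLED = shuffled context.
runs = {
    "model_A_medical":    {"TRUE": 0.82, "COLD": 0.55, "SCRAMBLED": 0.68},
    "model_A_philosophy": {"TRUE": 0.79, "COLD": 0.48, "SCRAMBLED": 0.61},
    "model_B_medical":    {"TRUE": 0.74, "COLD": 0.50, "SCRAMBLED": 0.60},
}

def delta_rci(scores, control):
    """ΔRCI relative to a control condition (assumed definition:
    TRUE-context score minus the control-condition score)."""
    return scores["TRUE"] - scores[control]

for name, scores in runs.items():
    d_cold = delta_rci(scores, "COLD")
    d_scrambled = delta_rci(scores, "SCRAMBLED")
    # Information hierarchy: removing context entirely (COLD) should
    # cost more than scrambling it, since a scrambled context still
    # retains partial information.
    assert d_cold > d_scrambled, f"hierarchy violated for {name}"
    print(f"{name}: ΔRCI_COLD={d_cold:.3f}, ΔRCI_SCRAMBLED={d_scrambled:.3f}")
```

A positive ΔRCI under this reading means the model benefits from its context; the lone negative run reported above (Gemini Flash medical) would correspond to the TRUE-context score falling below its control.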
