Encoding Fidelity and Coherent Misalignment in Non-English Clinical AI
Shannon’s Mathematical Theory of Communication (1948) assumes encoding fidelity: that the encoder preserves the statistical structure of the source. Large language models (LLMs) systematically violate this assumption for non-English languages, producing outputs that are internally consistent but semantically degraded. We call this failure mode Coherent Misalignment and introduce the Encoding Fidelity Index (EFI), a practical proxy for the preservation of semantic content across the encoding boundary. Across four languages (English, Kannada, Tamil, Hindi), two embedding models (384-dimensional and 768-dimensional), and two LLMs (DeepSeek V3.1 and Mistral Small 24B), we find: (1) EFI degrades by ~90% for all non-English Indian languages tested (p < 10⁻¹³), independent of language family; a European-language control (French, Spanish, German) confirms this is tokenizer-induced encoding loss rather than inherent cross-lingual distance (p = 1.6 × 10⁻⁸, Cohen’s d = 1.33); (2) variance amplification is Dravidian-specific: Kannada shows 1.72–2.05× amplification (p < 0.05 in both models) and Tamil shows partial amplification (1.63×, p = 0.016 in Mistral), while Hindi shows no amplification despite equivalent EFI degradation; (3) complex medical sentences show a paradoxical EFI increase, driven by English loanword anchoring; and (4) Mistral exhibits scenario-dependent code-switching and orthographic corruption of medical terms. These findings suggest that output-layer consistency metrics are unlikely to detect encoding-level degradation, since they measure the variance structure of responses rather than their semantic content. The dissociation between universal encoding degradation and language-specific variance amplification indicates that training-data representation, not encoding fidelity alone, determines clinical reliability, with implications for deploying clinical AI in non-English settings.
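To make the EFI proxy concrete, the sketch below shows one plausible instantiation, assuming a multilingual sentence-embedding model and cosine similarity as the preservation measure. The model name, the example Hindi translation, and the exact formula are illustrative assumptions for exposition, not the paper's specification.

```python
# Hedged sketch of an EFI-style proxy: cosine similarity between multilingual
# embeddings of an English clinical sentence and its translation. The paper's
# exact EFI definition is not reproduced here; this only illustrates the idea
# of measuring semantic preservation across the encoding boundary.
# ASSUMPTIONS: the specific model (a 384-dim multilingual encoder) and the
# cosine-similarity formula are the author of this sketch's choices.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # 384-dim

def efi_proxy(source_en: str, encoded_l2: str) -> float:
    """Cosine similarity in [-1, 1]; values near 1 indicate that semantic
    content survived the cross-lingual encoding boundary."""
    vecs = model.encode([source_en, encoded_l2], normalize_embeddings=True)
    return float(np.dot(vecs[0], vecs[1]))

if __name__ == "__main__":
    en = "The patient reports chest pain radiating to the left arm."
    # Illustrative Hindi rendering of the same sentence (approximate).
    hi = "मरीज़ सीने में दर्द की शिकायत करता है जो बाएं हाथ तक फैलता है।"
    print(f"EFI proxy (en vs. hi): {efi_proxy(en, hi):.3f}")
```

Under such a setup, the reported ~90% degradation would correspond to comparing this cross-lingual similarity against an English-only baseline; again, how the paper normalizes the index is not specified in this abstract.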