Benchmarking Cross-Lingual Semantic Alignment in Multilingual Embeddings

arXiv:2601.09732v1 Announce Type: new
Abstract: With hundreds of multilingual embedding models available, practitioners lack clear guidance on which models provide genuine cross-lingual semantic alignment and which merely achieve task performance through language-specific patterns. Task-driven benchmarks (MTEB) may mask fundamental alignment shortcomings. We introduce Semantic Affinity (SA), a bounded [0, 1] metric measuring the inter-lingual to intra-lingual spread ratio using cosine distance, combined with PHATE visualization in our Semanscope framework. Benchmarking 13 models across 4 datasets (52 experiments) reveals a three-tier structure: (1) top BERT models (LaBSE SA = 0.70, USE SA = 0.68, S-BERT SA = 0.68) achieve strong alignment via translation-pair supervision; (2) LLM embeddings plateau at SA between 0.55 and 0.61 regardless of scale, from 0.6B to 8B parameters; (3) MLM-only BERT models (mBERT, XLM-R, SA < 0.50) fail despite training on more than 100 languages. Training objective, not architecture or scale, determines alignment. Oracle Bone primitives (1200 BCE) expose semantic drift: models learn corpus patterns rather than cognitive primitives. This work provides semantic benchmarking to help practitioners select quality multilingual embeddings from hundreds of available models, showing that cross-lingual alignment requires explicit translation supervision, not merely model scale or multilingual data.
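
The abstract does not give the exact SA formula, only that it is a bounded [0, 1] ratio of inter-lingual to intra-lingual spread under cosine distance. The sketch below is one plausible reading, not the paper's definition: it takes inter-lingual spread as the mean cosine distance between translation pairs, intra-lingual spread as the mean pairwise cosine distance within each language, and assumes the form 1 minus their ratio, clipped to [0, 1] (the function name and normalization are assumptions).

```python
import numpy as np

def cosine_distance_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine distances between rows of a and rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_norm @ b_norm.T

def semantic_affinity(emb_l1: np.ndarray, emb_l2: np.ndarray) -> float:
    """Hypothetical SA sketch for (n, d) embeddings of n translation pairs.

    Row i of emb_l1 is assumed to translate row i of emb_l2.
    Inter-lingual spread: mean cosine distance between translation pairs.
    Intra-lingual spread: mean off-diagonal cosine distance within a language.
    The 1 - ratio form and the clipping are assumptions; the abstract only
    states SA is a bounded [0, 1] spread ratio.
    """
    n = emb_l1.shape[0]
    # Distances between aligned translation pairs (diagonal of the cross matrix).
    inter = np.diag(cosine_distance_matrix(emb_l1, emb_l2)).mean()
    # Average within-language spread, excluding self-distances on the diagonal.
    off_diag = ~np.eye(n, dtype=bool)
    intra = 0.5 * (
        cosine_distance_matrix(emb_l1, emb_l1)[off_diag].mean()
        + cosine_distance_matrix(emb_l2, emb_l2)[off_diag].mean()
    )
    return float(np.clip(1.0 - inter / intra, 0.0, 1.0))
```

Under this reading, well-aligned models place translations close together relative to the within-language spread, pushing SA toward 1, which is consistent with the reported ordering (LaBSE at 0.70 above mBERT below 0.50).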
