Separating Semantic Expansion from Linear Geometry for PubMed-Scale Vector Search

arXiv:2601.05268v1 Announce Type: new
Abstract: We describe a PubMed scale retrieval framework that separates semantic interpretation from metric geometry. A large language model expands a natural language query into concise biomedical phrases; retrieval then operates in a fixed, mean free, approximately isotropic embedding space. Each document and query vector is formed as a weighted mean of token embeddings, projected onto the complement of nuisance axes and compressed by a Johnson Lindenstrauss transform. No parameters are trained. The system retrieves coherent biomedical clusters across the full MEDLINE corpus (about 40 million records) using exact cosine search on 256 dimensional int8 vectors. Evaluation is purely geometric: head cosine, compactness, centroid closure, and isotropy are compared with random vector baselines. Recall is not defined, since the language-model expansion specifies the effective target set.

Liked Liked