Cost-Aware Query Routing in RAG: Empirical Analysis of Retrieval Depth Tradeoffs

Retrieval-augmented generation (RAG) is the standard technique for answering questions with a large language model (LLM) grounded in retrieved documents. It faces an inherent tradeoff: retrieving more documents improves answer accuracy but increases cost and response time, while retrieving fewer documents saves resources but may miss critical information. Most existing RAG systems sidestep this dilemma by applying the same retrieval setting to every query, regardless of how simple or complex the question actually is, wasting budget on easy questions and under-serving hard ones. This paper introduces Cost-Aware RAG (CA-RAG), a routing framework that addresses the problem by treating each query individually. For every incoming question, CA-RAG selects the most suitable retrieval strategy from a fixed menu of four options, ranging from no retrieval at all to fetching the top k = 10 most relevant documents. The selection is driven by a scoring formula that balances expected answer quality against predicted cost and response time. The weights in this formula act as dials: adjusting them shifts the system toward speed, savings, or quality without any retraining. CA-RAG is built on Facebook AI Similarity Search (FAISS) for document retrieval and the OpenAI chat and embedding application programming interfaces (APIs). We evaluate CA-RAG on a benchmark of 28 queries. The router assigns different strategies to different queries, yielding 26% fewer billed tokens than always using heavy retrieval and 34% lower response time than always answering directly without retrieval, while maintaining high answer quality in both cases. Further analysis shows that most savings come from simpler queries, where heavy retrieval was never necessary to begin with. All results are reproducible from logged comma-separated values (CSV) files.
CA-RAG demonstrates that a small but well-designed set of retrieval strategies combined with lightweight per-query routing can meaningfully reduce the cost and latency of LLM deployments without compromising the quality of answers.
