Production RAG Is Not a Vector Database: A Practical Blueprint for Retrieval You Can Trust
Define “correct” before you index anything
RAG systems are judged by answers, but you cannot improve what you cannot measure. Before you touch embeddings, pick the real tasks your assistant must handle, then define success criteria in plain language.
Keep the task list small. Common high-value tasks are Q&A over policy or product docs, troubleshooting workflows, summarisation with citations, and “what changed” across document versions. For each task type, define what a correct answer must do. Examples: it must quote the right clause and cite it; it must return the right runbook step and avoid invented commands; if evidence is weak, it must say it is unsure and ask a clarifying question.
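One way to keep those definitions honest is to encode them as data that product, engineering, and evaluation all read from. A minimal sketch, where the task names, criteria fields, and thresholds are all illustrative assumptions, not a standard schema:

```python
# Per-task success criteria as data (illustrative field names).
TASK_CRITERIA = {
    "policy_qa": {
        "must_cite": True,       # must quote the right clause and cite it
        "refuse_if_weak": True,  # weak evidence -> ask a clarifying question
    },
    "troubleshooting": {
        "must_cite": True,       # must point at the right runbook step
    },
}

def check_answer(task: str, answer: dict) -> list[str]:
    """Return the criteria an answer violates (empty list = pass)."""
    criteria = TASK_CRITERIA[task]
    violations = []
    if criteria.get("must_cite") and not answer.get("citations"):
        violations.append("missing_citation")
    if (criteria.get("refuse_if_weak")
            and answer.get("evidence_score", 1.0) < 0.3
            and not answer.get("asked_clarification")):
        violations.append("should_have_refused")
    return violations
```

A checker like this doubles as the scoring function for the eval set you build later, which is the alignment the paragraph above is asking for.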
This step forces alignment between product expectations, engineering decisions, and how you will evaluate the system later.
Ingestion is data engineering, not LLM work
Most production RAG incidents start upstream. PDFs extract poorly. Duplicate content exists across versions. Stale docs remain indexed after reorgs. Metadata is missing, so retrieval cannot tell authoritative, current content from stale copies.
Treat ingestion like ETL. Track document identity and versioning (doc_id, source, version, timestamps). Capture effective dates when policies change. Store access control metadata (team, role, tenant) so retrieval can be filtered safely. Use content hashes for deduplication and keep chunk IDs linked to the source document so you can audit citations and remove content cleanly.
If you do not track versions, you cannot debug regressions. If you cannot debug regressions, you cannot operate RAG.
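The tracking above can be sketched as a per-document record. This is a minimal sketch under assumed field names (`doc_id`, `version`, `effective_date`, `tenant`), with a content hash for deduplication:

```python
import hashlib
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DocRecord:
    # Minimal ingestion metadata; field names are illustrative assumptions.
    doc_id: str
    source: str
    version: int
    effective_date: str  # ISO date the policy takes effect
    tenant: str          # access-control scope for filtered retrieval
    text: str
    content_hash: str = field(init=False, default="")

    def __post_init__(self):
        # Content hash enables dedup across re-ingested copies of the same text.
        object.__setattr__(
            self, "content_hash",
            hashlib.sha256(self.text.encode("utf-8")).hexdigest(),
        )

def dedupe(records):
    """Keep one record per content hash (first occurrence wins)."""
    seen, unique = set(), []
    for r in records:
        if r.content_hash not in seen:
            seen.add(r.content_hash)
            unique.append(r)
    return unique
```

Because chunk IDs stay linked to `doc_id` and `version`, removing or superseding a document is a metadata operation, not an index archaeology project.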
Chunking is a retrieval strategy (not a context-window trick)
The goal of chunking is not “fit into tokens.” The goal is “make the right thing retrievable.” Chunk boundaries decide what your system can find.
Use structure-aware chunking: headings, sections, tables, and step lists. Keep chunks specific enough to match a query and complete enough to stand alone. Preserve the heading path as metadata, because it improves ranking and makes citations readable. Use overlap sparingly. Too much overlap creates near-duplicates that crowd out diversity in retrieval and can make reranking less reliable.
For procedural documents, chunk by step blocks rather than fixed character windows. Users usually want one step, not one paragraph.
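A deliberately simple sketch of structure-aware chunking for markdown-style sources, splitting at headings and carrying the heading path as metadata (real documents need table and step-list handling on top of this):

```python
def chunk_by_headings(markdown_text: str):
    """Split text into chunks at markdown headings, keeping the heading
    path ("Policy > Refunds") as metadata for ranking and citations."""
    chunks, path, buf = [], [], []

    def flush():
        if buf:
            chunks.append({
                "heading_path": " > ".join(path),
                "text": "\n".join(buf).strip(),
            })
            buf.clear()

    for line in markdown_text.splitlines():
        if line.startswith("#"):
            flush()  # close the chunk under the previous heading
            level = len(line) - len(line.lstrip("#"))
            title = line.lstrip("#").strip()
            path[:] = path[:level - 1] + [title]  # truncate to parent, append
        else:
            buf.append(line)
    flush()
    return [c for c in chunks if c["text"]]
```

The heading path is what makes a citation like "Policy > Refunds" readable to a user, and it gives the ranker a cheap authority signal.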
Retrieval that holds up: hybrid + reranking, then measure retrieval separately
Vector search is powerful, but pure embeddings struggle with acronyms, IDs, part numbers, and exact wording. Many real corpora are full of those. A strong production baseline is hybrid retrieval: lexical search (BM25) for exact matches plus vector search for semantics, merged into a candidate set.
Then rerank. This is where quality often jumps. Cross-encoder rerankers can be strong if latency allows. Otherwise, use lightweight reranking with similarity plus metadata signals like freshness and authority.
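One common way to merge lexical and vector candidates without calibrating their incompatible scores is reciprocal rank fusion (RRF), which combines positions rather than raw scores. A minimal sketch:

```python
def rrf_merge(bm25_ranked, vector_ranked, k=60, top_n=10):
    """Merge two ranked lists of chunk IDs with reciprocal rank fusion.

    Each list contributes 1 / (k + rank + 1) per item, so RRF needs no
    score calibration between the lexical and vector retrievers.
    k=60 is a conventional smoothing constant, not a tuned value.
    """
    scores = {}
    for ranked in (bm25_ranked, vector_ranked):
        for rank, chunk_id in enumerate(ranked):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The merged candidate set then goes to the reranker; items that appear high in both lists float to the top, which is exactly the behaviour you want from a hybrid baseline.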
Measure retrieval on its own. Track Recall@k or “gold passage in top-k” for a curated eval set. Separately evaluate generation quality (citation correctness, refusal correctness, safety). If retrieval is weak, the LLM is being set up to fail, and no prompt polishing will fix it.
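The "gold passage in top-k" metric above is a few lines of code once you have a curated eval set. A sketch, assuming an eval-case schema of `{"query", "gold_chunk_id"}` and a `retrieve` function returning hits with `chunk_id` (both illustrative):

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of queries whose gold chunk appears in the top-k results.

    eval_set: list of {"query": ..., "gold_chunk_id": ...} cases.
    retrieve: fn(query, top_k) -> list of hits with a "chunk_id" key.
    """
    if not eval_set:
        return 0.0
    hits = 0
    for case in eval_set:
        top_ids = [h["chunk_id"] for h in retrieve(case["query"], top_k=k)]
        if case["gold_chunk_id"] in top_ids:
            hits += 1
    return hits / len(eval_set)
```

Run this on every index rebuild and every chunking change; a drop here tells you the problem is upstream of the LLM before anyone files an "answers got worse" ticket.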
Here’s a tiny pattern that helps immediately: log what you retrieved, and enforce a minimum evidence score before answering.
def answer_with_evidence(query, retrieve, llm, min_score=0.35, k=5):
    # Each hit: {"chunk_id", "text", "score", "doc_version"}; assumed sorted by score.
    hits = retrieve(query, top_k=k)
    top = hits[0] if hits else None
    if (top is None) or (top["score"] < min_score):
        # Evidence too weak: refuse and ask, but log enough to debug why.
        return {
            "answer": "I don’t have enough evidence in the documents to answer that. Can you clarify?",
            "citations": [],
            "debug": {
                "top_score": None if top is None else top["score"],
                "retrieved": [h["chunk_id"] for h in hits],
            },
        }
    context = "\n\n".join(h["text"] for h in hits)
    prompt = f"Answer using only the context. Cite chunk_ids.\n\nContext:\n{context}\n\nQ: {query}\nA:"
    out = llm(prompt)
    return {
        "answer": out["text"],
        "citations": [h["chunk_id"] for h in hits[:2]],  # simplistic; replace with citation parsing
        "debug": {
            "top_score": top["score"],
            "retrieved": [(h["chunk_id"], h["score"], h["doc_version"]) for h in hits],
        },
    }
Guardrails and observability: stop silent failure
Production RAG needs two protections: it must know when it does not know, and you must be able to debug it fast.
Guardrails that work in practice include requiring citations for factual claims, enforcing “answers must be supported by retrieved text,” setting minimum evidence thresholds, and switching to clarifying questions when retrieval confidence is low. This is how you reduce confident hallucinations without over-restricting the assistant.
Observability is non-negotiable. Log the query, retrieved chunk IDs and scores, reranker scores, citations returned, doc versions used, and uncertainty events. Without these traces, debugging becomes guesswork, and you will not know whether failures come from ingestion changes, indexing bugs, retrieval drift, or prompt updates.
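The trace list above fits in one JSON line per request, which keeps it grep-able and cheap to ship to any log pipeline. A minimal sketch; the field names and the `sink` callback are illustrative assumptions:

```python
import json
import time

def log_rag_trace(query, hits, reranked, citations, uncertain, sink=print):
    """Emit one JSON line per request with the retrieval evidence trail.

    hits/reranked: dicts with chunk_id, score/rerank_score, doc_version.
    sink: any callable taking a string (print, file.write, log shipper).
    """
    sink(json.dumps({
        "ts": time.time(),
        "query": query,
        "retrieved": [(h["chunk_id"], h["score"], h["doc_version"]) for h in hits],
        "reranked": [(h["chunk_id"], h.get("rerank_score")) for h in reranked],
        "citations": citations,
        "uncertain": uncertain,  # True when the system refused or asked to clarify
    }))
```

With doc versions in every trace, "answers regressed on Tuesday" becomes a diff between the versions retrieved before and after Tuesday's ingestion run.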
Closing thought: RAG is search plus operations. Make ingestion reliable, chunking intentional, retrieval measurable, reranking explicit, and outputs constrained by evidence. That is how you ship RAG that behaves like a product, not a demo.