The 30-Day Roadmap to Building a Production RAG System (That Doesn’t Hallucinate)
RAG is the most practical AI pattern for most businesses. Most implementations are broken in ways teams don’t discover until users find them.

There’s a demo that plays out in conference rooms and investor decks with remarkable consistency.
Someone builds a chatbot. They load their company’s documentation into a vector database, hook it up to an LLM, ask it a few questions in front of an audience, and it gives back impressive, accurate, fluent answers. Everyone nods. Someone says “this is exactly what we need.”
Then it goes to users.
And users ask questions that are slightly different from the demo questions. They ask about things that are mentioned in the documents but never explicitly stated. They ask follow-up questions. They phrase things in ways the demo never covered. And the system starts hallucinating — confidently generating answers that sound exactly like the correct answer but aren’t.
RAG (Retrieval-Augmented Generation) is one of the most powerful and practical AI patterns available right now. But naive RAG, the version most tutorials produce, fails in ways that aren’t obvious until real users find them. This roadmap is about building the version that doesn’t.
Why Naive RAG Fails
Before the roadmap, it’s worth understanding the failure modes you’re building against. Most RAG failures fall into three categories:
Retrieval failure: The right document exists, but the system doesn’t find it. The query and the document don’t match semantically even though they’re about the same topic. This happens more often than people expect — especially when user language differs from documentation language.
Context failure: The system retrieves documents, but the retrieved chunks don’t contain enough context to answer the question. The answer is split across two chunks that were never retrieved together, or the relevant sentence requires surrounding paragraphs to make sense.
Generation failure: The retrieval worked correctly, but the LLM ignores the retrieved content, blends it with its training data, or fills in gaps it shouldn’t fill. The output sounds correct. It isn’t.
Each week of this roadmap addresses one layer of the system, and most of the advanced techniques exist specifically to prevent one of these three failure modes.
Week 1: The Foundation — Ingestion and Chunking (Days 1–7)
Goal: Get your data into a form that can actually be retrieved well. This is the least exciting week and the one that determines everything else.
Day 1–2: Understanding Your Data First
Before writing any code, read 50 representative documents from your source data. This sounds tedious. Do it anyway. You’re looking for:
- How long are typical documents?
- Are they structured (headers, sections) or freeform prose?
- Are answers typically self-contained in one paragraph, or spread across a document?
- What’s the vocabulary like? Is it technical jargon your users might not use?
- Are there tables, lists, or code that need special handling?
The answers determine your chunking strategy, your embedding model choice, and your retrieval parameters. Teams that skip this step spend weeks debugging a retrieval system that was misconfigured for their data from the start.
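Reading documents by hand is the point of this step, but a quick script can back up your impressions with numbers. The sketch below is one possible way to profile a sample; the function name `profile_documents` and the specific signals it checks (Markdown-style headers, pipe tables, fenced code) are assumptions for illustration, so adapt them to whatever markup your sources actually use.

```python
import re
import statistics

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable


def profile_documents(docs: list[str]) -> dict:
    """Summarize length and structure signals across a sample of documents."""
    if not docs:
        raise ValueError("need at least one document to profile")
    word_counts = [len(d.split()) for d in docs]
    # Markdown-style header: a line starting with 1-6 '#' followed by whitespace
    has_headers = [bool(re.search(r"^#{1,6}\s", d, flags=re.MULTILINE)) for d in docs]
    # Pipe table: a line that starts and ends with '|'
    has_tables = [bool(re.search(r"^\|.*\|\s*$", d, flags=re.MULTILINE)) for d in docs]
    has_code = [FENCE in d for d in docs]
    n = len(docs)
    return {
        "documents": n,
        "median_words": statistics.median(word_counts),
        "max_words": max(word_counts),
        "pct_with_headers": sum(has_headers) / n,
        "pct_with_tables": sum(has_tables) / n,
        "pct_with_code": sum(has_code) / n,
    }


sample = [
    "# Title\n\nSome intro text here.\n\n| a | b |\n|---|---|\n| 1 | 2 |",
    "plain prose only with six words",
]
profile = profile_documents(sample)
```

A high share of documents with headers points toward structure-aware chunking, while a long tail in `max_words` suggests that a single fixed chunk size will fit some documents poorly.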
Day 3–5: Chunking Strategies That Actually Work
from dataclasses import dataclass
import re


@dataclass
class Chunk:
    content: str
    metadata: dict
    chunk_id: str
    source_doc_id: str


def naive_fixed_chunking(text: str, chunk_size: int = 500,
                         overlap: int = 50) -> list[str]:
    """
    Simplest approach. Works okay for uniform prose.
    Fails badly for structured documents.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    # Step by (chunk_size - overlap) so consecutive chunks share `overlap` words
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks


def semantic_chunking(text: str) -> list[str]:
    """
    Split on natural boundaries: headers, paragraphs, list items.
    Preserves semantic coherence better than fixed windows.
    Better for documentation, articles, reports.
    """
    # Split before Markdown headers (levels 1-3) or on blank lines
    sections = re.split(r'(?=^#{1,3}\s)|(?:\n\n+)', text, flags=re.MULTILINE)
    chunks = []
    current_chunk = []
    current_size = 0
    target_size = 400  # words
    for section in sections:
        section = section.strip()
        if not section:
            continue  # re.split can yield empty pieces; skip them
        section_words = len(section.split())
        # Close the current chunk before it would exceed the target size
        if current_size + section_words > target_size and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_size = 0
        current_chunk.append(section)
        current_size += section_words
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    # Drop fragments of 30 characters or fewer
    return [c for c in chunks if len(c.strip()) > 30]


def hierarchical_chunking(document: dict) -> list[Chunk]:
    """
    Parent-child chunking: index small child chunks,
    return the large parent section for context.
    Retrieves precise matches, returns rich context.
    Best for detailed technical documentation.
    """
    chunks = []
    doc_id = document["id"]
    # Split into large sections (parents)
    sections = semantic_chunking(document["content"])
    for sec_idx, section in enumerate(sections):
        parent_id = f"{doc_id}_section_{sec_idx}"
        # Split each section into smaller chunks (children) of roughly 100 words
        sentences = re.split(r'(?<=[.!?])\s+', section)
        child_chunks = []
        current = []
        for sentence in sentences:
            current.append(sentence)
            if len(" ".join(current).split()) >= 100:
                child_chunks.append(" ".join(current))
                current = []
        if current:
            child_chunks.append(" ".join(current))
        for child_idx, child_content in enumerate(child_chunks):
            chunks.append(Chunk(
                content=child_content,
                metadata={
                    "parent_id": parent_id,
                    "parent_content": section,  # Full section for context
                    "source": document.get("source", ""),
                    "section_index": sec_idx,
                },
                chunk_id=f"{parent_id}_chunk_{child_idx}",
                source_doc_id=doc_id
            ))
    return chunks
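The parent-child scheme only pays off if retrieval swaps each matched child for its parent section before generation. A minimal sketch of that step follows, assuming hits arrive best-first as dicts with `id`, `content`, `metadata`, and `score` keys, and that `metadata` carries the `parent_id` and `parent_content` fields written by `hierarchical_chunking`. The helper name `expand_to_parents` is hypothetical.

```python
def expand_to_parents(results: list[dict]) -> list[dict]:
    """Replace child hits with their parent sections, one entry per parent."""
    seen = set()
    parents = []
    for hit in results:  # assumed to be sorted best-first already
        meta = hit["metadata"]
        # Fall back to the hit itself if it carries no parent information
        parent_id = meta.get("parent_id", hit["id"])
        if parent_id in seen:
            continue  # a better-ranked sibling already brought in this parent
        seen.add(parent_id)
        parents.append({
            "id": parent_id,
            "content": meta.get("parent_content", hit["content"]),
            "metadata": meta,
            "score": hit["score"],  # score of the best-ranked child
        })
    return parents


hits = [
    {"id": "d1_section_0_chunk_0", "content": "child a", "score": 0.9,
     "metadata": {"parent_id": "d1_section_0", "parent_content": "full section zero"}},
    {"id": "d1_section_0_chunk_1", "content": "child b", "score": 0.8,
     "metadata": {"parent_id": "d1_section_0", "parent_content": "full section zero"}},
    {"id": "d1_section_1_chunk_0", "content": "child c", "score": 0.7,
     "metadata": {"parent_id": "d1_section_1", "parent_content": "full section one"}},
]
parents = expand_to_parents(hits)
```

Deduplicating by parent keeps the prompt from containing the same section twice when several of its children match the query.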
Day 6–7: Build the Indexing Pipeline
import chromadb
import voyageai
from tqdm import tqdm

voyage_client = voyageai.Client()
chroma_client = chromadb.PersistentClient(path="./rag_db")
collection = chroma_client.get_or_create_collection(
    name="knowledge_base",
    metadata={"hnsw:space": "cosine"}
)


def index_documents(documents: list[dict], batch_size: int = 50):
    """
    Index documents in batches.
    Embedding API calls are the bottleneck, so batch them.
    """
    all_chunks = []
    for doc in documents:
        all_chunks.extend(hierarchical_chunking(doc))
    print(f"Indexing {len(all_chunks)} chunks...")
    for i in tqdm(range(0, len(all_chunks), batch_size)):
        batch = all_chunks[i:i + batch_size]
        texts = [chunk.content for chunk in batch]
        # voyage-3 is strong for RAG use cases
        embeddings = voyage_client.embed(
            texts, model="voyage-3"
        ).embeddings
        collection.add(
            ids=[chunk.chunk_id for chunk in batch],
            embeddings=embeddings,
            documents=texts,
            metadatas=[chunk.metadata for chunk in batch]
        )
Week 2: Retrieval That Actually Works (Days 8–14)
Goal: Make the system find the right content, not just similar content.
Day 8–10: Query Expansion and Rewriting
The most common retrieval failure: the user asks a question in different words than your documentation uses. The two texts mean the same thing, but their embeddings sit far enough apart that the right chunk falls outside the top results.
import json

import anthropic

client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment


def expand_query(user_query: str) -> list[str]:
    """
    Generate multiple phrasings of the same question.
    Retrieve for all of them. Union the results.
    Improves recall for paraphrased questions.
    """
    prompt = f"""
Generate 3 alternative phrasings of this question that preserve the meaning
but use different vocabulary. Return ONLY a JSON array of strings.
Question: {user_query}
"""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )
    try:
        alternatives = json.loads(response.content[0].text)
    except json.JSONDecodeError:
        alternatives = []  # Unparseable reply: fall back to the original query only
    return [user_query] + alternatives  # Original + up to 3 alternatives


def retrieve_with_expansion(query: str, n_results: int = 5) -> list[dict]:
    """Multi-query retrieval with deduplication"""
    expanded_queries = expand_query(query)
    seen_ids = set()
    all_results = []
    for q in expanded_queries:
        embedding = voyage_client.embed([q], model="voyage-3").embeddings[0]
        results = collection.query(
            query_embeddings=[embedding],
            n_results=n_results
        )
        for doc_id, doc, meta, distance in zip(
            results["ids"][0], results["documents"][0],
            results["metadatas"][0], results["distances"][0]
        ):
            if doc_id not in seen_ids:
                seen_ids.add(doc_id)
                all_results.append({
                    "id": doc_id,
                    "content": doc,
                    "metadata": meta,
                    "score": 1 - distance  # Convert cosine distance to similarity
                })
    # Sort by score, return top results
    return sorted(all_results, key=lambda x: x["score"], reverse=True)[:n_results]
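`expand_query` passes the model's reply to `json.loads`, which raises if the model adds a sentence of preamble or wraps the array in a code fence. One way to make that step more forgiving is sketched below. The helper name `parse_json_array` is hypothetical, and the approach (take the outermost bracketed span, keep only string items) is an assumption about what output is acceptable, not part of any library.

```python
import json
import re


def parse_json_array(text: str) -> list[str]:
    """Best-effort extraction of a JSON array of strings from model output."""
    # Take the outermost [...] span, ignoring any prose or fences around it
    match = re.search(r"\[.*\]", text, flags=re.DOTALL)
    if not match:
        return []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only non-empty strings; drop anything else the model slipped in
    return [item.strip() for item in data if isinstance(item, str) and item.strip()]


clean = parse_json_array('Here are the rephrasings:\n["How do refunds work?", "What is the refund policy?"]')
empty = parse_json_array("Sorry, I cannot help with that.")
mixed = parse_json_array('[1, "Refund timeline?", ""]')
```

Returning an empty list on failure lets the caller fall back to the original query alone, so a malformed reply degrades recall instead of crashing the request.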
Day 11–12: Hybrid Search
Pure vector search misses exact keyword matches. Pure keyword search misses semantic meaning. Hybrid search does both.
from rank_bm25 import BM25Okapi


class HybridRetriever:
    def __init__(self, chunks: list[Chunk]):
        self.chunks = chunks
        # Build BM25 index for keyword search
        tokenized = [c.content.lower().split() for c in chunks]
        self.bm25 = BM25Okapi(tokenized)

    def keyword_search(self, query: str, top_k: int = 10) -> list[tuple[int, float]]:
        """BM25 keyword search: returns (index, score) pairs, best first"""
        scores = self.bm25.get_scores(query.lower().split())
        top_indices = scores.argsort()[-top_k:][::-1]
        return [(int(i), float(scores[i])) for i in top_indices]

    def hybrid_search(self, query: str, top_k: int = 5,
                      keyword_weight: float = 0.3) -> list[dict]:
        """
        Combine keyword and semantic search with weighted Reciprocal Rank Fusion.
        RRF fuses ranks rather than raw scores, so the two score scales
        never need to be normalized against each other.
        """
        query_embedding = voyage_client.embed([query], model="voyage-3").embeddings[0]
        # Vector search results
        vector_results = collection.query(
            query_embeddings=[query_embedding],
            n_results=20
        )
        vector_ranks = {
            doc_id: rank
            for rank, doc_id in enumerate(vector_results["ids"][0])
        }
        # Keyword search results
        keyword_results = self.keyword_search(query, top_k=20)
        keyword_ranks = {
            self.chunks[idx].chunk_id: rank
            for rank, (idx, _) in enumerate(keyword_results)
        }
        # Reciprocal Rank Fusion
        all_ids = set(vector_ranks) | set(keyword_ranks)
        k = 60  # RRF constant
        rrf_scores = {}
        for doc_id in all_ids:
            # An ID missing from one list is treated as ranked just past its end
            vector_rank = vector_ranks.get(doc_id, len(vector_ranks))
            keyword_rank = keyword_ranks.get(doc_id, len(keyword_ranks))
            rrf_scores[doc_id] = (
                (1 - keyword_weight) * (1 / (k + vector_rank)) +
                keyword_weight * (1 / (k + keyword_rank))
            )
        top_ids = sorted(rrf_scores, key=rrf_scores.get, reverse=True)[:top_k]
        # Fetch full content and metadata
        fetched = collection.get(ids=top_ids, include=["documents", "metadatas"])
        by_id = {
            rid: (doc, meta)
            for rid, doc, meta in zip(
                fetched["ids"], fetched["documents"], fetched["metadatas"]
            )
        }
        # Re-apply the fused order; do not rely on get() returning IDs in request order
        return [
            {"id": rid, "content": by_id[rid][0], "metadata": by_id[rid][1],
             "score": rrf_scores[rid]}
            for rid in top_ids if rid in by_id
        ]
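To see why rank fusion favors documents that both searches agree on, it helps to run the arithmetic on a toy case. The sketch below is the plain, unweighted form of Reciprocal Rank Fusion with 1-based ranks, where an ID missing from a list simply contributes nothing. That differs slightly from the weighted variant in `hybrid_search`, and the document IDs are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several best-first ID rankings into one list of (id, score) pairs."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_ranking = ["a", "b", "c"]   # best-first IDs from vector search
keyword_ranking = ["c", "a", "d"]  # best-first IDs from BM25
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
fused_order = [doc_id for doc_id, _ in fused]  # ['a', 'c', 'b', 'd']
```

Documents "a" and "c" appear in both lists, so each collects two terms and outranks "b" and "d", which appear in only one. Note how small the scores are: with k = 60, a single list contributes at most 1/61 per document, so fused scores are useful for ordering but not as an absolute confidence value.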
Day 13–14: Contextual Retrieval
Before indexing, prepend a brief summary of each chunk’s context to the chunk itself. This helps most for chunks that only make sense alongside their surroundings, such as a paragraph that refers to “the limit” without naming which limit it means.
def add_chunk_context(chunk: Chunk, full_document: str) -> Chunk:
    """
    Generate a brief context header for each chunk before indexing.
    Anthropic reports roughly a 49% reduction in failed retrievals when
    contextualized embeddings are combined with BM25 over the same chunks.
    """
    # `client` is the anthropic.Anthropic() client used for query expansion
    doc_excerpt = full_document[:3000]  # First 3000 chars for context
    prompt = f"""
Here is a document:
<document>
{doc_excerpt}
</document>
Here is a specific chunk from that document:
<chunk>
{chunk.content}
</chunk>
Write a very brief (1-2 sentence) description of what this chunk is about
and where it sits in the document. This will be prepended to the chunk for
retrieval purposes. Reply with ONLY the description, no preamble.
"""
    context = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Cheap model: this runs once per chunk at index time
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    ).content[0].text.strip()
    return Chunk(
        content=f"{context}\n\n{chunk.content}",
        metadata=chunk.metadata,
        chunk_id=chunk.chunk_id,
        source_doc_id=chunk.source_doc_id
    )
Week 3: Generation That Doesn’t Hallucinate (Days 15–21)
Goal: Get the LLM to use the retrieved context faithfully. Build the generation layer with explicit grounding.
Day 15–17: Grounded Prompt Design
from typing import Optional


def generate_grounded_response(
    query: str,
    retrieved_chunks: list[dict],
    conversation_history: Optional[list[dict]] = None
) -> dict:
    """
    Generation with explicit grounding and citation tracking.
    Returns response + which sources were used.
    """
    # `client` is the anthropic.Anthropic() client used for query expansion
    context_blocks = []
    for i, chunk in enumerate(retrieved_chunks):
        source = chunk["metadata"].get("source", f"Source {i+1}")
        context_blocks.append(f"[{i+1}] {source}:\n{chunk['content']}")
    context_str = "\n\n".join(context_blocks)
    system_prompt = """You are a helpful assistant that answers questions
using ONLY the provided context.
Rules:
1. Only use information explicitly stated in the context.
2. If the context doesn't contain the answer, say exactly:
"I don't have enough information to answer this question."
3. Do not infer, extrapolate, or use knowledge outside the context.
4. Cite your sources using [1], [2], etc. after each claim.
5. If different sources contradict each other, say so explicitly."""
    # Copy the history so the caller's list is not mutated
    messages = list(conversation_history or [])
    messages.append({
        "role": "user",
        "content": f"Context:\n{context_str}\n\nQuestion: {query}"
    })
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=system_prompt,
        messages=messages
    )
    answer = response.content[0].text
    # Extract which sources were actually cited
    cited_sources = [
        chunk for i, chunk in enumerate(retrieved_chunks)
        if f"[{i+1}]" in answer
    ]
    return {
        "answer": answer,
        "cited_sources": cited_sources,
        "total_retrieved": len(retrieved_chunks),
        "sources_used": len(cited_sources)
    }
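Counting which `[n]` markers appear in the answer tells you which sources were cited, but not whether the model cited a source number that was never provided. A small check for that is sketched below. The function name `check_citations` is hypothetical, and it assumes citations follow the bracketed-integer format the system prompt asks for.

```python
import re


def check_citations(answer: str, num_sources: int) -> dict:
    """Find [n] markers in an answer and flag any outside the range 1..num_sources."""
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    valid = [n for n in cited if 1 <= n <= num_sources]
    invalid = [n for n in cited if n not in valid]
    return {"valid": valid, "invalid": invalid, "has_citations": bool(valid)}


report = check_citations("Refunds take 5 days [1]. A fee applies [4].", num_sources=3)
uncited = check_citations("Refunds take 5 days.", num_sources=3)
```

An entry in `invalid` is a strong signal that part of the answer is not grounded in the retrieved context, which makes it a reasonable trigger for a low-confidence label or a retry.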
Day 18–21: Handling the Hard Cases
COSINE_FALLBACK_THRESHOLD = 0.65  # Starting point; tune against your eval suite


def top_cosine_similarity(query: str) -> float:
    """Cosine similarity between the query and its nearest chunk (0.0 if none)."""
    embedding = voyage_client.embed([query], model="voyage-3").embeddings[0]
    nearest = collection.query(query_embeddings=[embedding], n_results=1)
    distances = nearest["distances"][0]
    return 1 - distances[0] if distances else 0.0


def rag_with_confidence(query: str) -> dict:
    """
    Full RAG pipeline with confidence assessment.
    Catches the cases where retrieval found something but the answer
    still isn't actually in the retrieved content.
    """
    # Retrieve (hybrid_retriever is a HybridRetriever built over the indexed chunks)
    chunks = hybrid_retriever.hybrid_search(query, top_k=5)
    # Check if retrieved content is actually relevant enough.
    # hybrid_search returns RRF scores, which top out near 1/60 and can never
    # reach a 0.65 cutoff, so gate on raw cosine similarity instead.
    top_score = top_cosine_similarity(query) if chunks else 0.0
    # Low confidence retrieval: honest fallback
    if top_score < COSINE_FALLBACK_THRESHOLD:
        return {
            "answer": "I wasn't able to find reliable information about this in the knowledge base. This question might be outside the scope of available documentation.",
            "confidence": "low",
            "retrieved_chunks": chunks,
            "fallback": True
        }
    # Generate with context
    result = generate_grounded_response(query, chunks)
    # Assess whether the answer used the context or hallucinated
    if result["sources_used"] == 0 and "don't have" not in result["answer"].lower():
        # Model gave an answer without citing any source, which is suspicious
        return {
            **result,
            "confidence": "low",
            "warning": "Response may not be grounded in retrieved sources"
        }
    return {**result, "confidence": "high"}
Week 4: Evaluation and Production Hardening (Days 22–30)
Goal: Measure whether your RAG system actually works. Catch regressions. Ship confidently.
Day 22–25: Building a RAG Evaluation Suite
from dataclasses import dataclass


@dataclass
class RAGEvalCase:
    question: str
    expected_answer_contains: list[str]  # Key facts that should appear
    relevant_doc_ids: list[str]  # Documents that should be retrieved


def evaluate_retrieval(case: RAGEvalCase) -> dict:
    """
    Measure retrieval quality independently from generation quality.
    Separates the two failure modes.
    """
    retrieved = hybrid_retriever.hybrid_search(case.question, top_k=5)
    retrieved_ids = [r["id"] for r in retrieved]
    # Recall: did we retrieve the relevant documents?
    # Chunk IDs embed their document ID, so a substring match is enough
    relevant_retrieved = [
        doc_id for doc_id in case.relevant_doc_ids
        if any(doc_id in r_id for r_id in retrieved_ids)
    ]
    recall = (
        len(relevant_retrieved) / len(case.relevant_doc_ids)
        if case.relevant_doc_ids else 0.0
    )
    # MRR: was the best document ranked highly?
    mrr = 0.0
    for rank, r_id in enumerate(retrieved_ids, 1):
        if any(rel_id in r_id for rel_id in case.relevant_doc_ids):
            mrr = 1.0 / rank
            break
    return {
        "recall": recall,
        "mrr": mrr,
        "retrieved_ids": retrieved_ids,
        "found_relevant": len(relevant_retrieved) > 0
    }


def evaluate_generation(case: RAGEvalCase, answer: str) -> dict:
    """
    Evaluate whether generated answer contains expected facts.
    Uses LLM judge for nuanced assessment.
    """
    # `client` is the anthropic.Anthropic() client used for query expansion
    facts_present = []
    for fact in case.expected_answer_contains:
        judge_prompt = f"""
Does the following answer contain this fact (even if worded differently)?
Fact: {fact}
Answer: {answer}
Reply with only: YES or NO
"""
        judgment = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=10,
            messages=[{"role": "user", "content": judge_prompt}]
        ).content[0].text.strip()
        # Tolerate replies such as "Yes." or "YES, it does"
        facts_present.append(judgment.upper().startswith("YES"))
    return {
        "factual_accuracy": (
            sum(facts_present) / len(facts_present) if facts_present else 0.0
        ),
        "facts_found": facts_present,
        "perfect_score": bool(facts_present) and all(facts_present)
    }


def run_full_eval(eval_suite: list[RAGEvalCase]) -> dict:
    if not eval_suite:
        raise ValueError("eval_suite is empty")
    retrieval_scores, generation_scores = [], []
    for case in eval_suite:
        # Evaluate retrieval
        ret_metrics = evaluate_retrieval(case)
        retrieval_scores.append(ret_metrics)
        # Evaluate generation
        result = rag_with_confidence(case.question)
        gen_metrics = evaluate_generation(case, result["answer"])
        generation_scores.append(gen_metrics)
        print(f"Q: {case.question[:60]}...")
        print(f"  Retrieval recall: {ret_metrics['recall']:.0%} | "
              f"MRR: {ret_metrics['mrr']:.2f}")
        print(f"  Answer accuracy: {gen_metrics['factual_accuracy']:.0%}")
    avg_recall = sum(s["recall"] for s in retrieval_scores) / len(retrieval_scores)
    avg_accuracy = sum(s["factual_accuracy"] for s in generation_scores) / len(generation_scores)
    print(f"\nOverall: Retrieval Recall={avg_recall:.0%} | Answer Accuracy={avg_accuracy:.0%}")
    return {"recall": avg_recall, "accuracy": avg_accuracy}
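The aggregate numbers from `run_full_eval` become a regression check once you compare them against a stored baseline. The sketch below shows one possible gate. The function name `check_regression`, the 0.02 tolerance, and the baseline values are all assumptions for illustration; the metric keys mirror the dict that `run_full_eval` returns.

```python
def check_regression(current: dict, baseline: dict, tolerance: float = 0.02) -> list[str]:
    """Return one message per metric that fell more than `tolerance` below baseline."""
    failures = []
    for metric, base_value in baseline.items():
        value = current.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from current run")
        elif value < base_value - tolerance:
            failures.append(f"{metric}: {value:.2f} < baseline {base_value:.2f}")
    return failures


baseline = {"recall": 0.85, "accuracy": 0.90}  # illustrative numbers, not real results
regressed = check_regression({"recall": 0.80, "accuracy": 0.90}, baseline)
within_noise = check_regression({"recall": 0.84, "accuracy": 0.89}, baseline)
```

Running this in CI after any change to chunking, prompts, or the embedding model turns the eval suite into a gate: a non-empty list fails the build. The tolerance exists because LLM-judged accuracy varies a little between runs.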
Day 26–30: Production Monitoring
Track these metrics per day, not just at launch:
- Retrieval confidence score distribution — if scores drift down, your data may have changed and needs re-indexing
- Fallback rate — how often are you returning “I don’t know”? If it increases, retrieval is degrading
- User satisfaction signals — thumbs up/down, follow-up questions, escalations to human support
- Latency — RAG adds retrieval time. Monitor P95, not just average. Query expansion adds extra latency — measure it.
import time


def rag_with_monitoring(query: str, user_id: str) -> dict:
    start = time.monotonic()
    chunks = hybrid_retriever.hybrid_search(query, top_k=5)
    retrieval_ms = (time.monotonic() - start) * 1000
    generation_start = time.monotonic()
    result = generate_grounded_response(query, chunks)
    generation_ms = (time.monotonic() - generation_start) * 1000
    total_ms = (time.monotonic() - start) * 1000
    # Log to your metrics system (log_rag_metrics is a placeholder for your own client)
    log_rag_metrics({
        "user_id": user_id,
        "retrieval_ms": retrieval_ms,
        "generation_ms": generation_ms,
        "total_ms": total_ms,
        "top_retrieval_score": chunks[0]["score"] if chunks else 0,
        "sources_cited": result.get("sources_used", 0),
        "is_fallback": result.get("fallback", False),
    })
    return result
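Once those records are logged, the daily numbers worth watching can be computed with a few lines. The sketch below assumes each record is a dict with the `total_ms` and `is_fallback` fields logged above. The helper names `percentile` and `summarize_day` are hypothetical, and the nearest-rank percentile is one of several common definitions.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_day(records: list[dict]) -> dict:
    """Aggregate one day of logged requests into latency and fallback metrics."""
    if not records:
        raise ValueError("no records to summarize")
    latencies = [r["total_ms"] for r in records]
    fallbacks = sum(1 for r in records if r.get("is_fallback"))
    return {
        "requests": len(records),
        "mean_ms": sum(latencies) / len(latencies),
        "p95_ms": percentile(latencies, 95),
        "fallback_rate": fallbacks / len(records),
    }


# Synthetic day: 20 requests at 100 ms, 200 ms, ..., 2000 ms, two of them fallbacks
records = [{"total_ms": 100 * i, "is_fallback": i in (5, 10)} for i in range(1, 21)]
summary = summarize_day(records)
```

In this synthetic day the mean is 1050 ms while P95 is 1900 ms, which is the gap the bullet about latency warns about: the average hides what the slowest requests feel like.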
The Uncomfortable Truth
Naive RAG — embed documents, retrieve top-K, stuff in prompt — will work in demos and fail in production. The failures are predictable: retrieval misses semantically adjacent questions, retrieved chunks lack context, the model fills gaps with confident hallucinations.
The advanced techniques in this roadmap — contextual chunking, query expansion, hybrid search, hierarchical retrieval, grounded prompts, LLM-judged evaluation — each exist to prevent one of those specific failures. None of them are theoretically interesting. All of them are practically necessary.
Teams that build the demo version ship it, see the failure rate, and conclude that “RAG doesn’t work for our use case.” Teams that build the production version ship it, run the evaluation suite, and discover that the system works well enough to replace a meaningful portion of their manual support load.
The difference isn’t the AI. It’s the engineering.
The 30-Day Roadmap to Building a Production RAG System (That Doesn’t Hallucinate) was originally published in Towards AI on Medium.