Stop Wasting PDFs — Build a RAG That Actually Understands Them
Turn messy PDFs into reliable, auditable answers — a production-ready RAG pipeline with OCR, heading-aware chunking, FAISS, cross-encoder reranking, and strict LLM prompts

TL;DR — for skimmers
- Problem: PDFs are messy — scans, tables, and long paragraphs break retrieval.
- Fix: Ingest → smart chunk → bi-encoder shortlist → cross-encoder re-rank → grounded LLM prompt.
- Result: Fewer hallucinations, auditable answers, production-grade retrieval.
- Ship in a week: Use the included end-to-end script (OCR fallback + FAISS + CrossEncoder + LLM) and run the 20-question evaluation recipe.
You’ve spent months producing product specs, compliance docs, and slide decks — yet search still returns noise. The frustration is universal: employees copy-paste from PDFs, support teams lose hours, and trust in the knowledge base quietly erodes. This article delivers a concise, battle-tested pipeline that turns messy PDFs into accurate, auditable answers.
Quick vignette: A support team at a mid-size SaaS company spent three days debating whether an old contract clause applied to a renewal. After shipping the pipeline below, they surfaced the exact paragraph — in 30 seconds — with a cited quote. The legal lead stopped asking, “Do we trust the answer?” and started asking, “How do we make more documents searchable?” Small change. Huge trust.
Why regular RAG fails on real PDFs
- Scanned pages — no extractable text without OCR.
- Bad chunking — arbitrary windows split meaning, tables, and code blocks.
- Fuzzy retrieval — bi-encoders surface topically related chunks, not the exact passage you need.
- Redundancy — near-duplicate chunks bloat the shortlist and confuse rerankers.
Bold truth: Upgrading your LLM rarely fixes these problems. Fix ingestion and retrieval first.
The disciplined, production-ready pipeline (one line)
Extract reliably → Chunk by meaning → Embed & shortlist (bi-encoder) → Re-rank with a cross-encoder → Prompt the LLM with citations.
Step by step
- Extract + OCR fallback — attempt text extraction first; OCR empty pages with pdf2image + pytesseract.
- Heading- & table-aware chunking — split on semantic boundaries; avoid slicing tables mid-cell.
- Deduplicate + attach metadata — hash chunks to dedupe; retain source, page, heading, timestamps.
- Embed & store — use a fast bi-encoder (e.g., all-MiniLM-L6-v2) and store in FAISS/Chroma/Weaviate.
- Shortlist (top 30–100) — fast, scalable candidate retrieval.
- Cross-encoder re-rank — score query–doc pairs and select the top 3–10.
- Grounded LLM prompt — require chunk IDs, include exact quotes, return NOT FOUND when absent.
- Measure — track precision@k, hallucination rate, and p95 latency.
Micro-edits for skimmers — what to tune first
- Add OCR fallback for scanned PDFs.
- Tune chunk size and overlap per doc type (specs vs. slides).
- Deduplicate chunks using SHA-1 plus an embedding-similarity threshold (see the sketch after this list).
- Retrieve 30–100 candidates, re-rank them all, and keep the top ~10.
- Force the LLM to return chunk IDs and exact quotes.
- Monitor precision@k and hallucination rate weekly.
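The main script later in this article deduplicates by SHA-1 hash only. If you also want the embedding-similarity pass, here is a minimal sketch — the dedupe_by_similarity helper and the 0.95 threshold are illustrative choices, not part of the main script:
from sentence_transformers import SentenceTransformer, util
def dedupe_by_similarity(docs, threshold: float = 0.95):
    # Greedily keep a chunk only if its cosine similarity to every
    # previously kept chunk stays below the threshold.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embeddings = model.encode([d.page_content for d in docs], convert_to_tensor=True)
    kept = []
    for i in range(len(docs)):
        if kept and float(util.cos_sim(embeddings[i], embeddings[kept]).max()) >= threshold:
            continue
        kept.append(i)
    return [docs[i] for i in kept]
Run it after the SHA-1 pass so near-identical boilerplate (repeated headers, footers, disclaimers) doesn't crowd the shortlist.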
Full working code (end-to-end)
This is a single-file, runnable script that implements the full pipeline: OCR fallback, heading-aware chunking, FAISS vector store, bi-encoder retrieval, CrossEncoder re-ranking, and a strict LLM answer step. Save as rag_pdf_pipeline.py.
"""
rag_pdf_pipeline.py
End-to-end RAG for unstructured PDFs:
- PDF text extraction with OCR fallback
- Heading-aware chunking with overlap
- Deduplication and metadata preservation
- Embeddings (HuggingFace) + FAISS vector store
- Bi-encoder retrieval -> Cross-encoder re-rank
- LLM answer with strict grounding and citations
Dependencies:
pip install langchain sentence-transformers faiss-cpu openai PyPDF2 pdf2image pytesseract python-magic
# Optional: pip install unstructured[local] pymupdf
"""
import os
import logging
import hashlib
import re
from typing import List, Tuple, Dict, Any
from langchain.docstore.document import Document
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage
from sentence_transformers import CrossEncoder
from PyPDF2 import PdfReader
from pdf2image import convert_from_path
import pytesseract
# ---------------- CONFIG ----------------
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
RERANKER_MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
RETRIEVE_K = 50
RERANK_TOP_N = 10
# --------------------------------------
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
_heading_re = re.compile(
    r'^\s*(?:#{1,6}\s*|[A-Z][\w -]{2,}\n[-=]{2,})',
re.MULTILINE,
)
def sha1_text(text: str) -> str:
return hashlib.sha1(text.encode("utf-8")).hexdigest()
def load_pdf_text(path: str, ocr_if_needed: bool = True) -> List[Tuple[int, str]]:
"""
Returns [(page_number, text)].
Uses PyPDF2 first; falls back to OCR per-page if text is missing.
"""
pages: List[Tuple[int, str]] = []
try:
reader = PdfReader(path)
for i, page in enumerate(reader.pages):
text = page.extract_text() or ""
pages.append((i + 1, text.strip() or None))
except Exception as e:
logger.warning("PDF parse failed (%s). Falling back to OCR.", e)
pages = []
if ocr_if_needed and (not pages or any(t is None for _, t in pages)):
images = convert_from_path(path, dpi=200)
for i, img in enumerate(images):
if i >= len(pages) or pages[i][1] is None:
ocr_text = pytesseract.image_to_string(img)
if i < len(pages):
pages[i] = (i + 1, ocr_text)
else:
pages.append((i + 1, ocr_text))
return pages
def simple_chunk(text: str, page: int) -> List[Document]:
chunks = []
start = 0
while start < len(text):
end = min(len(text), start + CHUNK_SIZE)
chunk = text[start:end].strip()
if chunk:
chunks.append(Document(page_content=chunk, metadata={"page": page}))
start += CHUNK_SIZE - CHUNK_OVERLAP
return chunks
def smart_chunk_pages(pages: List[Tuple[int, str]]) -> List[Document]:
"""
Heading-aware chunking with deduplication.
"""
all_docs: Dict[str, Document] = {}
for page_num, text in pages:
if not text:
continue
headings = list(_heading_re.finditer(text))
        if len(headings) >= 2:
            starts = [m.start() for m in headings]
            # Keep any text that appears before the first heading.
            spans = ([0] if starts[0] > 0 else []) + starts + [len(text)]
        else:
            spans = [0, len(text)]
for i in range(len(spans) - 1):
segment = text[spans[i]:spans[i + 1]].strip()
for doc in simple_chunk(segment, page_num):
h = sha1_text(doc.page_content)
if h not in all_docs:
all_docs[h] = doc
return list(all_docs.values())
def build_vectorstore(docs: List[Document], save_path: str = None) -> FAISS:
embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)
vectorstore = FAISS.from_documents(docs, embeddings)
if save_path:
vectorstore.save_local(save_path)
return vectorstore
def retrieve_and_rerank(
vectorstore: FAISS,
query: str,
reranker: CrossEncoder,
) -> List[Tuple[Document, float]]:
candidates = vectorstore.similarity_search(query, k=RETRIEVE_K)
if not candidates:
return []
pairs = [(query, d.page_content) for d in candidates]
scores = reranker.predict(pairs, batch_size=16)
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
return ranked[:RERANK_TOP_N]
def format_context(ranked: List[Tuple[Document, float]]) -> Tuple[str, List[str]]:
blocks, citations = [], []
for i, (doc, score) in enumerate(ranked):
cid = f"doc-{i+1}-p{doc.metadata.get('page')}"
        excerpt = doc.page_content[:400].replace("\n", " ").strip()
        blocks.append(f"--- {cid} ---\n{excerpt}")
        citations.append(f"{cid} (score={score:.4f})")
    return "\n\n".join(blocks), citations
def answer_with_llm(query: str, ranked: List[Tuple[Document, float]]) -> Dict[str, Any]:
system = (
"You are a strict summarizer. Use ONLY the provided context blocks. "
"If the answer is not present, reply NOT FOUND. "
"Include citation IDs and quote exact text using triple backticks."
)
context, citations = format_context(ranked)
user = f"Question:n{query}nnContext:n{context}"
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
resp = llm([SystemMessage(content=system), HumanMessage(content=user)])
return {"answer": resp.content, "citations": citations}
def main(pdf_path: str):
pages = load_pdf_text(pdf_path)
docs = smart_chunk_pages(pages)
logger.info("Indexed %d chunks", len(docs))
vs = build_vectorstore(docs, save_path="faiss_index")
reranker = CrossEncoder(RERANKER_MODEL_NAME)
query = "How does the authentication workflow handle token refresh?"
ranked = retrieve_and_rerank(vs, query, reranker)
if not ranked:
print("No relevant context found.")
return
result = answer_with_llm(query, ranked)
print("n=== ANSWER ===n", result["answer"])
print("n=== SOURCES ===")
for c in result["citations"]:
print(c)
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--pdf", required=True, help="Path to PDF file")
args = parser.parse_args()
main(args.pdf)
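To run it, install the dependencies listed in the docstring (pdf2image additionally needs the poppler system package) and point the script at a PDF — the path below is just a placeholder:
export OPENAI_API_KEY=sk-...
python rag_pdf_pipeline.py --pdf ./docs/product_spec.pdf
The script builds the FAISS index, answers the sample query hard-coded in main, and prints the answer with its cited chunk IDs.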
Prompt pattern that consistently reduces hallucinations
System:
“You are a strict summarizer. Use ONLY the provided context blocks. If the answer is not present, reply NOT FOUND. Include source IDs and exact quotes when used.”
User:
Provide the question plus the top 3–5 re-ranked chunks (IDs + excerpts).
Assistant:
Concise answer with inline citations; include exact quotes in triple backticks when referencing sources.
Why: Forcing citation IDs and verbatim quotes makes answers auditable — stakeholders can verify claims in seconds.
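Because quotes must appear verbatim in the supplied context, you can check them mechanically. A small sketch, assuming the triple-backtick quoting convention from the prompt above (the verify_quotes helper is illustrative and only catches fabricated quotes, not paraphrased claims):
import re
def verify_quotes(answer: str, context: str) -> dict:
    # Pull every ```...``` quote out of the answer and confirm it appears
    # verbatim in the context the model was shown.
    quotes = [q.strip() for q in re.findall(r"```(.*?)```", answer, flags=re.DOTALL)]
    unverified = [q for q in quotes if q and q not in context]
    return {"quotes": len(quotes), "unverified": unverified}
In the script above, answer_with_llm assembles the context internally; return it alongside the answer (or rebuild it with format_context(ranked)) and flag any non-empty unverified list as a likely hallucination.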
What most guides miss (these matter more than model choice)
- OCR fallback — essential for real-world PDFs.
- Heading- and table-aware chunking — preserves semantic units.
- Deduplication — reduces redundant signals and cost.
- Metadata filters — apply date/source/type filters before reranking.
- Evaluation as a habit — track precision@k, hallucination rate, and p95 latency.
Bold insight: When chunking quality is low, rerankers can only choose the best of the bad. Fix ingestion first — this investment often beats a larger reranker.
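The script above does not implement the metadata-filter item; a simple way to add it is to filter the bi-encoder candidates in Python before they reach the reranker. A sketch — the filter arguments are illustrative, and only page is attached at ingest in the script, so source or date fields would need to be added there:
def filter_candidates(candidates, allowed_pages=None, source=None):
    # Drop candidates whose metadata fails the filters before reranking.
    def keep(doc):
        if allowed_pages is not None and doc.metadata.get("page") not in allowed_pages:
            return False
        if source is not None and doc.metadata.get("source") != source:
            return False
        return True
    return [d for d in candidates if keep(d)]
Call it between similarity_search and reranker.predict inside retrieve_and_rerank; filtering first keeps the expensive cross-encoder pass small.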
Evaluation recipe (run this now)
- Pick three doc types: policy, spec, slides.
- Create 20 gold queries per type with exact ground-truth snippets.
- Baseline: bi-encoder top-5 → LLM answers. Record precision@5 and hallucination flags.
- Experiment: bi-encoder top-50 → rerank → top-5 → LLM. Compare metrics.
- Track p50/p95 latency and cost.
Success signal: precision@k improves by >10% and hallucinations drop meaningfully.
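A minimal harness for this recipe, reusing retrieve_and_rerank from the script — note that it computes hit@k (did the exact gold snippet land in the top k), the practical stand-in for precision@k when each query has a single ground-truth passage:
from typing import List, Tuple
def evaluate(vectorstore, reranker, gold: List[Tuple[str, str]], k: int = 5) -> dict:
    # gold: [(query, exact ground-truth snippet), ...]
    hits = 0
    for query, snippet in gold:
        ranked = retrieve_and_rerank(vectorstore, query, reranker)
        top_k = [doc.page_content for doc, _ in ranked[:k]]
        hits += any(snippet in chunk for chunk in top_k)
    return {"hit_at_k": hits / max(len(gold), 1), "k": k, "queries": len(gold)}
For the baseline, score vectorstore.similarity_search(query, k=5) directly; for the experiment, keep the full top-50 → rerank flow, then compare the two numbers alongside manual hallucination flags on the generated answers.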
Deployment checklist
- Batch reranker calls; cache scores by (query, doc_id).
- Quantize rerankers for CPU or run via ONNX.
- Autoscale GPUs for spikes; keep a warm CPU fallback.
- Re-embed on embedding upgrades; version and backfill indexes.
- Mask or redact PII before third-party calls.
- Monitor precision@k, hallucination rate, p95 latency, and cost per 1K queries.
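The caching item is only a few lines. A sketch of an in-memory score cache wrapping the cross-encoder (the CachedReranker name is mine; swap the dict for Redis or similar in production):
import hashlib
from typing import Dict, List, Tuple
from sentence_transformers import CrossEncoder
class CachedReranker:
    """Wraps a CrossEncoder so repeated (query, chunk) pairs are scored only once."""
    def __init__(self, reranker: CrossEncoder):
        self.reranker = reranker
        self._cache: Dict[Tuple[str, str], float] = {}
    def predict(self, pairs: List[Tuple[str, str]], batch_size: int = 16) -> List[float]:
        def key(query: str, text: str) -> Tuple[str, str]:
            return (query, hashlib.sha1(text.encode("utf-8")).hexdigest())
        keys = [key(q, t) for q, t in pairs]
        missing = [(p, k) for p, k in zip(pairs, keys) if k not in self._cache]
        if missing:
            scores = self.reranker.predict([p for p, _ in missing], batch_size=batch_size)
            for (_, k), score in zip(missing, scores):
                self._cache[k] = float(score)
        return [self._cache[k] for k in keys]
Because it exposes the same predict signature, it drops into retrieve_and_rerank unchanged.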
What actually matters
Many teams treat rerankers as a silver bullet. They are not. If chunking slices sentences or tables, the reranker will surface the best garbage. Fix ingestion (OCR, chunk boundaries, metadata) before investing in larger reranker models.
Quick model & tool cheat-sheet
- Extraction: PyPDF2, PyMuPDF, Unstructured.
- OCR: pdf2image + pytesseract.
- Embeddings: sentence-transformers/all-MiniLM-L6-v2 (fast baseline).
- Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2.
- Vector DBs: FAISS (local), Chroma, Milvus, Weaviate, Pinecone.
- LLM: Use strict prompting and require citations.
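Swapping the vector store is mostly a one-function change. A sketch using Chroma with local persistence (requires pip install chromadb; the directory name is arbitrary), written against the same older LangChain import style as the script above:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
def build_chroma_store(docs, persist_directory: str = "chroma_index"):
    # Same documents and embedding model as the FAISS version; only the store changes.
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    return Chroma.from_documents(docs, embeddings, persist_directory=persist_directory)
Chroma exposes the same similarity_search interface, so retrieve_and_rerank works without modification.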
The core idea
Make your pipeline accountable before you make it bigger. Reliable search over messy PDFs is not a model trick — it’s an engineering discipline: extract cleanly, chunk meaningfully, shortlist at scale, then re-rank precisely. Implement the two-stage retrieval + rerank flow, run the evaluation recipe, and you’ll convert frustrated searches into trustworthy answers.
❤️ If this shifted how you think about RAG Architecture, drop a clap 👏 and follow for more practical, battle-tested AI engineering insights.