How to Retrieve Anything. Fast.

digitado ⋅ June 7, 2026

A Production Architecture for Financial Document Search at Scale

Assumptions going in: You know what RAG is. You know its failure modes — context window pollution, semantic gap, retrieval drift, stale embeddings. You’ve hit the wall where naive vector search works in a demo and breaks in production. This article is about what comes after that wall, specifically in the context of the financial industry, where the data is messier, the stakes are higher, and the retrieval requirements are more demanding than most other domains.

The Problem Space: Financial Data Is Not Clean Text

A financial services company — asset manager, bank, insurance firm, fintech — has a corpus that looks roughly like this:

PDF-heavy: 10-K/10-Q/8-K filings, audit reports, research notes, policy documents, contracts, compliance manuals
Spreadsheets: Excel models, CSV exports, account statements, reconciliation sheets
Presentations: PowerPoint earnings decks, investor briefings with charts, embedded commentary
Receipts and invoices: Scanned PDFs, sometimes photographed, sometimes multi-page with stamps, handwriting, and low DPI
Unstructured text: Email threads, analyst notes, internal memos in DOCX or plain text

The challenge isn’t just retrieval. It’s everything upstream of retrieval. A scanned invoice that gets fed through a naive pypdf extractor becomes corrupted text — table rows collapse into comma-separated garbage, column headers merge with values, and multi-line fields lose their structure. If that’s what you’re embedding, your vector index is poisoned before a single query is run.

So the architecture has two distinct problems, and they need to be solved independently:

Ingestion quality — parsing, extracting structure, and chunking in a way that preserves semantic integrity
Retrieval speed and accuracy — finding the right chunk fast, with provenance, at scale

This article covers both, in order, because you cannot have a fast and accurate retrieval system if your index contains garbage.

Part 1: Ingestion — Parse Before You Index

The Parser Decision

Most teams default to pypdf or PDFMiner. Both are fine for clean, text-native PDFs with linear reading order. They both fail badly on:

Multi-column layouts
Tables with merged cells or missing borders
Scanned/image-based PDFs with any OCR requirement
PDFs with embedded charts where the caption is spatially separated from the figure

For a financial corpus, these failure cases are not edge cases. They are the norm.

Docling (IBM Research, now Linux Foundation, MIT license) is the current open-source ceiling for document parsing in RAG pipelines. It uses computer vision models to understand page layout before doing text extraction — it identifies element types (title, paragraph, table, figure, list, caption) and their spatial relationships, then reconstructs reading order and structure. The output is a DoclingDocument object — a Pydantic model that keeps text, tables, images, headings, and layout metadata together rather than flattening everything into a text blob.

Critically, it does not treat a table as text. A table is extracted as a structured object with row/column relationships intact.

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("q3_earnings_report.pdf")

doc = result.document
# Tables come out as structured objects, not linearized text
for table in doc.tables:
    df = table.export_to_dataframe()  # pandas DataFrame, structure preserved
    print(df)

# Export full doc to markdown for text chunks
markdown_text = doc.export_to_markdown()

IBM also released Granite-Docling (January 2026) — a 258M parameter VLM that does the full parse in one shot, including OCR within identified elements and linking captions to their figures. It’s worth evaluating for environments where running Docling’s ensemble pipeline (multiple models) is too slow for batch ingestion.

For scanned receipts and invoices with low quality or non-standard layouts, the preprocessing step matters: de-skewing, contrast enhancement, and resolution normalization before any parser sees the image. Confidence scoring on extracted fields is essential — you want corrupted extractions flagged for human review rather than silently poisoning your index.
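
The confidence-gating idea above can be sketched as follows. This is a minimal illustration, not part of any parser's API: the `ExtractedField` structure, the `route_extraction` helper, and the 0.85 threshold are all assumptions, and the confidence values are expected to come from whatever OCR or extraction step you run.

```python
from dataclasses import dataclass

# Hypothetical threshold; tune it against a labeled sample of your own documents.
REVIEW_THRESHOLD = 0.85


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0-1.0, as reported by the OCR/extraction step


def route_extraction(fields: list[ExtractedField], threshold: float = REVIEW_THRESHOLD) -> str:
    """Return "index" if every field clears the threshold, else "human_review"."""
    if not fields:
        return "human_review"  # an empty extraction is itself a red flag
    lowest = min(field.confidence for field in fields)
    return "index" if lowest >= threshold else "human_review"


# Illustrative values only.
invoice = [
    ExtractedField("vendor", "Acme Supplies", 0.97),
    ExtractedField("amount", "1,284.50", 0.62),  # smudged digits, low confidence
]
decision = route_extraction(invoice)
```

Gating on the weakest field rather than the average is a deliberate choice here: one misread amount is enough to make an invoice chunk misleading.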

The Chunking Decision

This is where most production systems make mistakes that are invisible until retrieval fails.

The wrong approach: fixed-size token chunking applied uniformly across all document types.

Fixed-size chunking at 256 or 512 tokens is fine for homogeneous prose. Financial documents are not homogeneous. The same PDF can have:

Dense narrative text (risk factor disclosures) where 512-token chunks work well
Financial tables (income statement, balance sheet) where table rows need to be kept together with their headers
Charts with captions that need to be stored paired together
Section headers that provide critical context for the paragraphs below them

What the research says: A 2025 NVIDIA study on financial document retrieval (using FinanceBench as the evaluation corpus) found that factoid queries perform best with 256–512 token chunks, while complex analytical queries benefit from page-level or 1024-token chunks. This isn’t a reason to pick one — it’s a reason to implement per-document-type chunking strategies.

The practical approach:

Document type → Chunking strategy

SEC filings (10-K, 10-Q)     → Section-level (split on headings), ~512-1024 tokens
                                Tables chunked separately as structured objects
                                Table header repeated in every table chunk

Earnings presentations (PPT) → Slide-level chunks
                                Each slide = one chunk, title + body + any chart caption

Receipts / invoices          → Whole-document chunk (receipts are usually short)
                                Key fields extracted as metadata: vendor, date, amount, currency

Spreadsheets (Excel/CSV)     → Row-group chunking for tabular data
                                Schema stored as metadata per chunk
                                Numeric columns normalized to strings with units preserved

Contracts (DOCX)             → Clause-level chunking
                                Clause header + clause body kept together
                                Cross-references stored as metadata links

Research notes (free text)   → Semantic chunking (sentence transformer-based)
                                ~256-512 tokens, 20% overlap
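
One row of the mapping above, "table header repeated in every table chunk", is easy to get wrong, so here is a minimal sketch of it. The `chunk_table_rows` helper and the sample figures are hypothetical; the input is assumed to be a header list plus row lists that you have already pulled out of the parsed table.

```python
def chunk_table_rows(header: list[str], rows: list[list[str]], rows_per_chunk: int = 20) -> list[str]:
    """Split a table into Markdown chunks, repeating the header in each one."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    header_line = "| " + " | ".join(header) + " |"
    divider = "| " + " | ".join("---" for _ in header) + " |"
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        body = ["| " + " | ".join(row) + " |" for row in rows[start:start + rows_per_chunk]]
        chunks.append("\n".join([header_line, divider, *body]))
    return chunks


# Illustrative values only.
header = ["Line item", "Q2 2024", "Q3 2024"]
rows = [
    ["Revenue", "410", "486"],
    ["Cost of revenue", "170", "190"],
    ["Gross profit", "240", "296"],
]
table_chunks = chunk_table_rows(header, rows, rows_per_chunk=2)
```

Every chunk stays interpretable on its own, because a row like "Gross profit | 240 | 296" never appears without the column labels that give the numbers meaning.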

For tables specifically: don’t flatten them to text. Store the structured representation (Markdown table or JSON object) as the chunk content. A table embedded in a 10-K that shows quarterly revenue should be retrievable by “Q3 revenue” even if those words don’t appear verbatim in the table — which means the table’s semantic embedding needs to capture its content, not garbled linearized text.

One effective approach: generate a natural language summary of each table using a small LLM (or Granite-Docling’s built-in captioning), store the summary as the chunk text for embedding, and keep the raw structured table as a reference artifact that gets returned to the LLM at generation time.

# Table handling pattern
table_summary = llm.generate(
    f"Summarize this financial table in 2-3 sentences:\n{table.export_to_markdown()}"
)

chunk = {
    "chunk_id": f"{doc_id}_table_{idx}",
    "content": table_summary,          # What gets embedded for retrieval
    "raw_content": table.export_to_markdown(),  # What gets sent to LLM
    "metadata": {
        "doc_type": "10-K",
        "element_type": "table",
        "page": table.prov[0].page_no if table.prov else None,  # page number comes from the item's provenance
        "section": parent_section_heading,
        "source_file": filename,
        "date": doc_date,
        "company": company_name,
        "fiscal_period": fiscal_period
    }
}

The metadata schema is not optional. It’s the foundation for pre-filtering, which is covered in Part 2.

Part 2: Retrieval Architecture

With a correctly parsed and structured index, the retrieval problem becomes solvable. The architecture choices below are ordered by impact on latency and accuracy.

Layer 1: Hybrid Search (Dense + Sparse) — Non-Negotiable

Pure vector search has a known failure mode: it’s good at semantic similarity but bad at exact term matching. In financial retrieval, exact terms matter enormously. “IFRS 17,” “Basel III Pillar 2,” “Q3 FY2024,” “Article 23(1)(a)” — these are not paraphraseable. If your query contains a specific regulatory reference and the exact document mentioning it ranks 15th because it has lower semantic similarity to the query embedding than 14 other documents, your retrieval has failed.

The fix is hybrid search: run dense vector search and sparse BM25 in parallel, then fuse results using Reciprocal Rank Fusion (RRF).

RRF formula:

RRF_score(d) = Σ 1 / (k + rank_i(d))

The sum runs over each result list i (here, the dense list and the sparse list). k is a smoothing constant (typically 60) and rank_i(d) is the document’s rank in list i. RRF is preferred over weighted linear combination because it doesn’t require calibrating scores between two differently-scaled systems.
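
To make the formula concrete, here is a small pure-Python sketch of RRF over two ranked lists. The `rrf_fuse` function and the document IDs are illustrative; in production the vector database performs this fusion for you.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based; a document missing from a list contributes nothing for it.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense = ["doc_a", "doc_b", "doc_c"]   # best-first from the dense leg
sparse = ["doc_c", "doc_a", "doc_d"]  # best-first from the sparse leg
fused = rrf_fuse([dense, sparse])
```

Here `doc_a` (ranks 1 and 2) edges out `doc_c` (ranks 3 and 1), and both beat the documents that appear in only one list. Only ranks enter the calculation, so the raw BM25 and cosine scores never need to share a scale.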

Empirically (from the SIGIR 2025 LiveRAG Challenge), hybrid retrieval achieves the highest recall at depth — specifically at R@50 and beyond, which matters because you’re feeding a reranker downstream, not taking the raw top-5.

A 2026 benchmark (aimultiple.com, April 2026) measured hybrid search (BM25 + dense) against dense-only retrieval and found MRR improved from 0.410 to 0.486, an 18.5% relative gain in mean reciprocal rank. In practice that means the first relevant result lands closer to the top of the list on average.
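
For readers who want to reproduce this kind of comparison on their own data, here is a minimal sketch of how MRR is computed, plus the arithmetic behind the 18.5% figure. The `mean_reciprocal_rank` helper and its input convention (1-based rank of the first relevant result, or `None` when nothing relevant was returned) are assumptions for illustration.

```python
from typing import Optional


def mean_reciprocal_rank(first_relevant_ranks: list[Optional[int]]) -> float:
    """Average of 1/rank over queries; a query with no relevant hit scores 0."""
    if not first_relevant_ranks:
        raise ValueError("need at least one query")
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank is not None)
    return total / len(first_relevant_ranks)


# Four toy queries: relevant result at rank 1, rank 2, not found, rank 4.
toy_mrr = mean_reciprocal_rank([1, 2, None, 4])

# Relative gain implied by the two MRR values quoted above.
relative_gain = (0.486 - 0.410) / 0.410
```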

For financial queries, a 2025 paper evaluating RAG on SEC filings (arXiv:2511.18177) confirmed this: vector-based agentic RAG with hybrid search and metadata filtering achieved a 68% win rate over hierarchical node-based systems that don’t use embeddings at all.

Implementation with Qdrant (supports native hybrid search):

from qdrant_client import QdrantClient, models  # `models` provides Prefetch, SparseVector, FusionQuery

client = QdrantClient(host="localhost", port=6333)

# At query time: run hybrid search
results = client.query_points(
    collection_name="financial_docs",
    prefetch=[
        # Dense retrieval leg
        models.Prefetch(
            query=dense_query_vector,  # from your embedding model
            using="dense",
            limit=50
        ),
        # Sparse retrieval leg (BM25 / SPLADE)
        models.Prefetch(
            query=models.SparseVector(indices=sparse_indices, values=sparse_values),
            using="sparse",
            limit=50
        ),
    ],
    # RRF fusion
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=50  # Top 50 for the reranker
)

Elasticsearch (with ELSER for sparse) and Weaviate (native hybrid) are alternatives. Qdrant’s RRF implementation in a single query call is cleaner operationally.

Layer 2: Metadata Pre-Filtering — Reduce the Search Space Before ANN

Pre-filtering applies metadata constraints before the ANN search executes. This narrows the candidate vector pool, which reduces both latency and noise. The key is that pre-filtering only works if your metadata is indexed and queryable efficiently.

For financial data, the natural pre-filters are:

# Example: Find Q3 2024 revenue figures for TechCorp from their 10-K filing
filters = {
    "must": [
        {"key": "company", "match": {"value": "TechCorp"}},
        {"key": "fiscal_period", "match": {"value": "Q3_2024"}},
        {"key": "doc_type", "match": {"any": ["10-K", "10-Q"]}}
    ],
    "should": [
        {"key": "element_type", "match": {"value": "table"}}
    ]
}

The tradeoff: pre-filtering reduces the candidate pool, which can hurt recall if filters are too aggressive or if your metadata schema is inconsistent. The failure mode is when a highly relevant chunk is excluded because of a metadata mismatch — e.g., a table stored with fiscal_period: “2024-Q3” instead of “Q3_2024”. Normalize your metadata schema at ingestion time, not at query time.
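
A minimal sketch of ingestion-time normalization for the fiscal period field, using the "Q3_2024" form from the filter example as the canonical format. The `normalize_fiscal_period` helper, the accepted input variants, and the assumption that two-digit years mean 20xx are all illustrative choices, not a standard.

```python
import re

# Accepted spellings: "Q3_2024", "Q3 FY24", "q3-2024", "2024-Q3", "FY2024 Q3", ...
_PATTERNS = [
    re.compile(r"^Q(?P<q>[1-4])[\s_\-]*(?:FY)?(?P<y>\d{4}|\d{2})$", re.IGNORECASE),
    re.compile(r"^(?:FY)?(?P<y>\d{4}|\d{2})[\s_\-]*Q(?P<q>[1-4])$", re.IGNORECASE),
]


def normalize_fiscal_period(raw: str) -> str:
    """Map common quarter spellings to the canonical 'Q<n>_<yyyy>' form."""
    text = raw.strip()
    for pattern in _PATTERNS:
        match = pattern.match(text)
        if match:
            year = match.group("y")
            if len(year) == 2:
                year = "20" + year  # assumption: two-digit years are 20xx
            return f"Q{match.group('q')}_{year}"
    # Fail loudly so unrecognized values get reviewed instead of silently indexed.
    raise ValueError(f"unrecognized fiscal period: {raw!r}")


normalized = [normalize_fiscal_period(v) for v in ("2024-Q3", "Q3_2024", "Q3 FY24", "FY2024 Q3")]
```

Running the same function over values extracted from user queries keeps the filter and the index on one spelling.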

Post-filtering (retrieve top-k from full index, then filter by metadata) avoids this but is more expensive and can leave you with fewer than k results after filtering. The practical compromise: pre-filter only on reliable, consistently populated metadata (document type, company, date range), and leave the rest to the reranker.

Layer 3: Vector Index Selection — HNSW at This Scale

For a financial firm with tens of millions of document chunks (a reasonable number when you’re processing decades of filings, contracts, and research), the indexing choice matters.

HNSW is the right default for datasets under roughly 100M vectors where you can afford the memory. It operates in O(log n) query time, handles high-dimensional vectors better than IVF (which relies on good cluster structure that high-dimensional data often lacks), and tolerates data distribution drift — relevant for a financial corpus where new document types get added regularly.

Benchmarks show HNSW delivering sub-10ms latency on millions of vectors with 95%+ recall at ef_search values tuned for your workload. Going from 0.80 to 0.95 recall increases HNSW latency by roughly 31% — a generally acceptable tradeoff.

When to deviate:

100M+ vectors: HNSW keeps the full graph in RAM. At 100M 1536-dim vectors (float32), that’s ~600GB. IVF-PQ with product quantization compresses this to ~75–150GB at 4–8x compression while maintaining 95%+ recall. Use IVF-PQ at billion scale.
Tiered data: Hot data (current quarter filings, recent news) stays in HNSW in memory for fast access. Cold data (archived historical filings) moves to IVF-PQ or DiskANN (SSD-backed).
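
The memory figures above follow from simple arithmetic, sketched here. The helper name is hypothetical, and the estimate covers raw float32 vectors only; the HNSW graph links add further overhead that is not counted.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the raw vectors (float32 by default), excluding index overhead."""
    return num_vectors * dim * bytes_per_value


# 100M vectors x 1536 dimensions x 4 bytes, in decimal gigabytes.
raw_gb = raw_vector_bytes(100_000_000, 1536) / 1e9

# The 4-8x compression range quoted above, applied to the raw footprint.
compressed_gb_at_8x = raw_gb / 8
compressed_gb_at_4x = raw_gb / 4
```

That works out to about 614 GB uncompressed and roughly 77 to 154 GB after quantization, which is where the rounded ~600GB and ~75–150GB figures come from.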

Dataset size     Recommended index
< 1M vectors     HNSW (ef=200, M=16)
1M - 100M        HNSW (ef=128-256, M=16-32), tune ef_search per latency budget
100M - 1B        IVF-PQ or DiskANN
> 1B             Distributed Milvus or Qdrant cluster with IVF-PQ shards

Layer 4: Cross-Encoder Reranking — The Accuracy Multiplier

After hybrid search returns top-50 candidates (not top-5 — always retrieve more than you need), a cross-encoder reranker scores each query-chunk pair. Unlike bi-encoders that produce independent embeddings, a cross-encoder sees the full concatenated query + chunk text and scores their relevance jointly. This is significantly more accurate at the cost of compute.

From the financial domain specifically (arXiv:2511.18177): cross-encoder reranking achieves a 59% absolute improvement in MRR@5 at optimal parameters on SEC filing retrieval. That’s not a marginal improvement — it’s the difference between a system that sometimes finds the right document and one that reliably does.

The latency cost is real: on CPU, scoring 50 candidates with ms-marco-MiniLM-L6-v2 takes 100-300ms depending on chunk length. On a single GPU, under 50ms. This is the right tradeoff for financial queries where correctness matters more than shaving 200ms.

Options in 2026:

Reranker                  ELO (Agentset, Feb 2026)   Latency          Cost               Notes
Zerank-2                  1638                       ~200ms           API, pay-per-use   Current accuracy leader
Cohere Rerank v4.0 Pro    1629                       ~600ms           API, pay-per-use   32K context window, strong on finance
BGE-reranker-v2-m3        –                          50–100ms GPU     Free (self-host)   Apache 2.0, multilingual, no data leaves your infra
ms-marco-MiniLM-L6-v2     –                          100–300ms CPU    Free (self-host)   Lightest option, good baseline

For a financial firm with data residency or air-gap requirements, BGE-reranker-v2-m3 self-hosted on a single GPU is the right choice. For external-facing products where accuracy ceiling matters most, Zerank-2 or Cohere Rerank v4 Pro.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

# Input: top-50 from hybrid search
candidates = [(query, chunk["content"]) for chunk in hybrid_results[:50]]
scores = reranker.predict(candidates)

# Re-sort by reranker score
reranked = sorted(zip(hybrid_results, scores), key=lambda x: x[1], reverse=True)
top_k = [item[0] for item in reranked[:5]]  # Top-5 for LLM context

Layer 5: Semantic Caching — The Cheapest Query Is One That Doesn’t Run

In financial workflows, query patterns repeat. “What is the revenue for Q3 2024?” and “Show me Q3 FY24 revenues” are semantically identical. Running the full pipeline — hybrid search, reranker, LLM generation — for both is wasteful.

Semantic caching works by embedding incoming queries and checking cosine similarity against a cache of past queries. If similarity exceeds a threshold (typically 0.90–0.95 for financial queries where precision matters), return the cached response.

Measured results from production pipelines (Catchpoint benchmarks, 2025–2026): retrieval time drops from ~6500ms to ~1900ms on cache hits, with exact matches returning in 53ms. Redis LangCache reports up to 73% cost reduction in high-repetition workloads. For financial dashboards or portals where the same 20% of questions make up 80% of volume, this is significant.

Threshold tuning for finance: Financial queries are precise. “What is Q3 revenue?” and “What is Q3 profit?” are semantically close but not equivalent. Set your similarity threshold at 0.92–0.95 and measure false hit rates on a validation set before deploying.

import json

import numpy as np
import redis
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
cache_client = redis.Redis(host="localhost", port=6379)

def get_cached_response(query: str, threshold: float = 0.93):
    query_emb = encoder.encode(query)

    # Linear scan over a Redis list: fine for a small cache. At scale, use Redis vector search or a Qdrant collection.
    cached_queries = cache_client.lrange("query_cache", 0, -1)
    for entry in cached_queries:
        cached = json.loads(entry)
        similarity = np.dot(query_emb, cached["embedding"]) / (
            np.linalg.norm(query_emb) * np.linalg.norm(cached["embedding"])
        )
        if similarity >= threshold:
            return cached["response"]  # Cache hit
    return None  # Cache miss, run full pipeline
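
Before deploying a threshold, the advice above is to measure false hits on a validation set. A minimal sketch of that measurement follows. The `false_hit_rate` helper is hypothetical, and the similarity values are made up for illustration; in practice each pair would be two real queries, their embedding similarity, and a human label saying whether one cached answer is valid for both.

```python
def false_hit_rate(pairs: list[tuple[float, bool]], threshold: float) -> float:
    """Fraction of would-be cache hits whose cached answer would be wrong.

    Each pair is (cosine similarity, same_answer), where same_answer is a human
    label: True if the two queries can share a cached response.
    """
    hits = [same_answer for similarity, same_answer in pairs if similarity >= threshold]
    if not hits:
        return 0.0
    return sum(1 for same_answer in hits if not same_answer) / len(hits)


# Illustrative validation set, not measured data.
validation_pairs = [
    (0.97, True),
    (0.94, True),
    (0.93, False),  # e.g. "What is Q3 revenue?" vs "What is Q3 profit?"
    (0.91, False),
    (0.89, True),
]
rates = {t: false_hit_rate(validation_pairs, t) for t in (0.90, 0.92, 0.94)}
```

On this toy set, raising the threshold from 0.90 to 0.94 removes the false hits, at the cost of missing the legitimate 0.89 pair. Real data will show the same tension, so track hit rate alongside false hit rate.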

The Full Architecture End-to-End

INGESTION PIPELINE (async, batch)
────────────────────────────────
Raw files (PDF, DOCX, PPTX, XLSX, CSV)
        │
        ▼
[Docling / Granite-Docling]
  - Layout detection
  - OCR (scanned PDFs)
  - Table structure extraction → DataFrame
  - Figure caption linking
        │
        ▼
[Document-type router]
  - SEC filing → section-level chunking
  - Presentation → slide-level chunking
  - Spreadsheet → row-group chunking
  - Receipt/Invoice → whole-doc + metadata extraction
  - Contract → clause-level chunking
        │
        ▼
[Chunk + metadata tagging]
  - Tables: LLM-generated summary as embedding text, raw table as reference
  - All chunks: company, doc_type, date, fiscal_period, section, element_type, page
        │
        ▼
[Embedding: BGE-M3 or OpenAI text-embedding-3-large]
[Sparse: SPLADE or BM25 via BM25S]
        │
        ▼
[Qdrant — HNSW dense collection + sparse collection]
  - Named vectors: "dense" (dimension must match the embedding model, e.g. 1024 for BGE-M3), "sparse" (SPLADE)
  - Payload: full metadata schema
  - Index: payload index on company, doc_type, date

QUERY PIPELINE (real-time, <500ms P95 target)
─────────────────────────────────────────────
User query
        │
        ▼
[Semantic cache check] ──── Hit ──→ Return cached response (~53ms)
        │ Miss
        ▼
[Query rewriting / HyDE optional]
  - For ambiguous queries: generate hypothetical document excerpt (HyDE)
  - For compound queries: decompose into sub-queries
        │
        ▼
[Metadata filter construction]
  - LLM or rules extract: company, date range, doc_type from query
  - Build Qdrant filter payload
        │
        ▼
[Parallel hybrid search]
  - Dense ANN (HNSW) with metadata pre-filter → top-50
  - Sparse BM25/SPLADE with metadata pre-filter → top-50
  - RRF fusion → top-50 candidates
        │
        ▼
[Cross-encoder reranker]
  - BGE-reranker-v2-m3 (self-hosted GPU) or Zerank-2 (API)
  - Input: top-50 candidates
  - Output: top-5, reranked by relevance
        │
        ▼
[Context assembly]
  - For table chunks: swap embedding text for raw structured table
  - Attach source metadata: filename, page, section, date
  - Format: "Source: [10-K, TechCorp, Q3 2024, p.47]\n{content}"
        │
        ▼
[LLM generation with grounded context]
  - GPT-4o / Claude 3.5 Sonnet / Gemini 1.5 Pro
  - Structured output: answer + citations
        │
        ▼
[Cache write] — store query embedding + response
        │
        ▼
Response with sources
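
The context assembly step in the query pipeline can be sketched as below. It assumes the chunk schema from the table handling pattern in Part 1 (`content`, optional `raw_content`, and a `metadata` dict); the `assemble_context` helper and the sample chunks are illustrative.

```python
def assemble_context(chunks: list[dict]) -> str:
    """Build the generator prompt context from reranked chunks, best-first."""
    blocks = []
    for chunk in chunks:
        meta = chunk["metadata"]
        # Tables were embedded via their summary; the generator should see the raw table.
        body = chunk.get("raw_content") or chunk["content"]
        source = (
            f"Source: [{meta['doc_type']}, {meta['company']}, "
            f"{meta['fiscal_period']}, p.{meta['page']}]"
        )
        blocks.append(f"{source}\n{body}")
    return "\n\n".join(blocks)


# Illustrative chunks only.
chunks = [
    {
        "content": "Quarterly revenue table summary.",
        "raw_content": "| Quarter | Revenue |\n| --- | --- |\n| Q3 2024 | (value) |",
        "metadata": {"doc_type": "10-K", "company": "TechCorp", "fiscal_period": "Q3 2024", "page": 47},
    },
    {
        "content": "Risk factor narrative text.",
        "metadata": {"doc_type": "10-K", "company": "TechCorp", "fiscal_period": "Q3 2024", "page": 12},
    },
]
context = assemble_context(chunks)
```

Putting the source line directly above each block gives the generator what it needs to emit citations in its structured output.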

Latency Budget: Where Time Goes

For a target P95 of under 500ms end-to-end (excluding LLM generation):

Step                                        Typical latency   Optimization handle
Semantic cache lookup                       5–53ms            Redis in-memory, FAISS for cache index
Query embedding                             10–30ms           Small model (BGE-small), GPU, batching
Metadata filter construction                20–50ms           Rules > LLM extraction for known patterns
Hybrid search (dense + sparse, parallel)    20–80ms           HNSW ef_search tuning, pre-filter
RRF fusion                                  <5ms              In-memory operation
Cross-encoder reranking (50 candidates)     50–300ms          GPU deployment, smaller model for speed
Context assembly                            <10ms
Total (excluding LLM)                       ~150–500ms

The reranker is your biggest variable. On CPU with MiniLM-L6-v2 scoring 50 short candidates, you’re looking at 100–200ms. On a GPU, under 50ms. For latency-critical applications (sub-200ms), use FlashRank (CPU-optimized, lightweight models) or BGE-reranker on GPU. Accept that the accuracy ceiling is lower.

What the Research Shows for Financial Domains Specifically

Three data points worth anchoring on:

1. Hybrid search over pure dense for financial QA (arXiv:2511.18177, 2025): Vector-based agentic RAG with hybrid search and metadata filtering achieved 68% win rate over hierarchical non-embedding approaches on a 150-question benchmark across 1,200 SEC filings. Cross-encoder reranking alone achieved a 59% absolute improvement in MRR@5.

2. Small-to-big retrieval (same paper): Retrieving small chunks (for precision) then expanding to the surrounding larger context (for completeness) before generation achieved a 65% win rate over standard chunking with only 0.2 seconds of additional latency. Implementation: store child chunks for retrieval, parent chunks for generation.

3. Section-level vs. page-level chunking on financial PDFs (NVIDIA, 2025): Section-level chunking using Docling’s structure-aware segmentation outperformed token-based chunking across FinanceBench and earnings report datasets. The key variable was whether table headers were preserved in their table chunks — pipelines that separated table headers from their rows consistently underperformed.
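
The small-to-big pattern from point 2 ("store child chunks for retrieval, parent chunks for generation") can be sketched as follows. This is an illustration of the idea rather than the paper's implementation; the `expand_to_parents` helper, the `parent_id` field, and the sample IDs are assumptions.

```python
def expand_to_parents(child_hits: list[dict], parents: dict[str, str], max_parents: int = 3) -> list[str]:
    """Map best-first child hits to their parent texts, deduplicated, order preserved."""
    if max_parents < 1:
        return []
    ordered_parent_ids = []
    for hit in child_hits:
        parent_id = hit["parent_id"]
        if parent_id not in ordered_parent_ids:
            ordered_parent_ids.append(parent_id)
        if len(ordered_parent_ids) == max_parents:
            break
    return [parents[parent_id] for parent_id in ordered_parent_ids]


# Parent store: section-level text keyed by ID (placeholder strings here).
parents = {
    "sec_7": "Full text of the MD&A section ...",
    "sec_1a": "Full text of the risk factors section ...",
}
# Child hits as returned by retrieval and reranking, best-first.
child_hits = [
    {"chunk_id": "sec_7_c3", "parent_id": "sec_7"},
    {"chunk_id": "sec_7_c4", "parent_id": "sec_7"},
    {"chunk_id": "sec_1a_c1", "parent_id": "sec_1a"},
]
generation_context = expand_to_parents(child_hits, parents, max_parents=2)
```

Deduplicating matters because several high-scoring children often share one parent, and sending that parent twice wastes context window.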

The Honest Tradeoffs

Accuracy vs. latency: Every layer you add improves accuracy and adds latency. The reranker is the biggest lever on both. If you’re building a real-time autocomplete, skip the reranker and compensate with better chunking and a higher retrieval k. If you’re building a compliance Q&A system where wrong answers have legal consequences, run the full stack.

Self-hosted vs. managed: Self-hosted Qdrant + BGE-reranker gives you data residency, no per-query API costs, and full control. It also requires you to manage infrastructure, handle index rebuilds, and tune parameters. Managed Pinecone/Weaviate + Cohere Rerank reduces operational overhead but puts sensitive financial data through third-party APIs and adds per-query costs that compound quickly at scale.

HNSW memory vs. IVF-PQ recall: If your corpus is under 50M vectors and you can afford the RAM, HNSW is the right call — better recall, better latency, simpler to operate. If you’re budget-constrained or at billion scale, IVF-PQ with careful nlist/nprobe tuning (a typical config: search 10 of 1000 clusters, ~1% of vectors, maintaining 90%+ recall) is viable but requires more careful benchmarking on your actual data distribution.

Semantic cache precision: In financial retrieval, a false cache hit — returning an answer for “Q3 FY2023 revenue” when the query was “Q3 FY2024 revenue” — is worse than a cache miss. Set your similarity threshold conservatively (0.93+), validate on a holdout set, and build a cache invalidation mechanism for when underlying documents are updated.
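
One possible shape for the invalidation mechanism mentioned above: record which source documents each cached response cited, and drop those responses when a document is re-ingested. This is a sketch with an in-memory stand-in for the cache store; the `ResponseCache` class, its method names, and the sample keys are hypothetical.

```python
from collections import defaultdict


class ResponseCache:
    """In-memory stand-in for a cache store that supports per-document invalidation."""

    def __init__(self) -> None:
        self._responses = {}                   # cache key -> cached response
        self._keys_by_doc = defaultdict(set)   # source document ID -> cache keys citing it

    def put(self, key: str, response: str, source_doc_ids: list[str]) -> None:
        self._responses[key] = response
        for doc_id in source_doc_ids:
            self._keys_by_doc[doc_id].add(key)

    def get(self, key: str):
        return self._responses.get(key)

    def invalidate_document(self, doc_id: str) -> int:
        """Drop every cached response that cited doc_id; return how many were removed."""
        removed = 0
        for key in self._keys_by_doc.pop(doc_id, set()):
            if self._responses.pop(key, None) is not None:
                removed += 1
        return removed


cache = ResponseCache()
cache.put("q3-fy2024-revenue", "answer text", ["techcorp_10q_q3_2024.pdf"])
cache.put("fy2023-risk-factors", "answer text", ["techcorp_10k_2023.pdf"])
removed = cache.invalidate_document("techcorp_10q_q3_2024.pdf")
```

The citations returned by the generator already identify the source documents, so the reverse index can be populated at cache-write time with no extra retrieval work.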

Stack Summary

Component                 Recommended (self-hosted, air-gap safe)    Recommended (managed, simplicity priority)
Document parser           Docling (MIT) / Granite-Docling            Unstructured.io API
Table extraction          Docling → DataFrame                        Unstructured.io hi-res mode
Sparse index              BM25S (pure Python, fast) or SPLADE        Elasticsearch ELSER
Dense index + vector DB   Qdrant (HNSW, native hybrid)               Pinecone or Weaviate
Embedding model           BGE-M3 (1024-dim, multi-lingual)           OpenAI text-embedding-3-large
Sparse embedding          SPLADE-v3 or BM25S                         ELSER (Elastic)
Reranker                  BGE-reranker-v2-m3 (GPU)                   Cohere Rerank v4 Pro or Zerank-2
Semantic cache            GPTCache + Redis                           Redis LangCache
Orchestration             LangGraph (agentic flows) or LlamaIndex    LlamaIndex
Generator                 GPT-4o / Claude 3.5 Sonnet                 Same

Resources (All Verified, Primary Sources)

Docling paper: https://arxiv.org/abs/2501.17887
Granite-Docling announcement: https://www.ibm.com/new/announcements/granite-docling-end-to-end-document-conversion
Financial RAG evaluation (SEC filings benchmark): https://arxiv.org/abs/2511.18177
NVIDIA chunking study (FinanceBench): https://developer.nvidia.com/blog/finding-the-best-chunking-strategy-for-accurate-ai-responses/
SIGIR 2025 LiveRAG Challenge (hybrid retrieval benchmarks): https://arxiv.org/pdf/2506.15246
GPT Semantic Cache paper (arXiv:2411.05276): https://arxiv.org/abs/2411.05276
Qdrant hybrid search docs: https://qdrant.tech/documentation/concepts/hybrid-queries/
BGE-reranker-v2-m3: https://huggingface.co/BAAI/bge-reranker-v2-m3
Agentset reranker benchmark (Feb 2026): https://agentset.ai/rerankers
ColPali for visually-rich PDFs: https://arxiv.org/abs/2407.01449

How to Retrieve Anything. Fast. was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
