[D] Designing a crawler that produces RAG-ready markdown instead of raw HTML
When building RAG pipelines and agent systems, I kept running into the same issue:
most web crawlers return raw HTML or noisy text that still requires significant post-processing before it’s usable for embeddings.
I’ve been experimenting with a crawler design that targets AI ingestion specifically rather than generic scraping. The key design choices (sketched in code below) are:
- isolating main content on docs-heavy sites (removing nav, footers, TOCs)
- converting pages into structure-preserving markdown
- chunking by document hierarchy (headings) instead of fixed token windows
- generating stable content hashes to support incremental updates
- emitting an internal link graph alongside the content
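To make that concrete, here is a stripped-down sketch of the extraction, chunking, and link-graph steps. This is not the actual crawler code: BeautifulSoup and markdownify stand in for the real conversion logic, and the boilerplate selector list is illustrative only.

```python
# Rough sketch of the extract -> markdown -> chunk path (simplified).
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from markdownify import markdownify as to_markdown

# Illustrative placeholders; real docs sites need site-specific selectors.
BOILERPLATE_SELECTORS = ["nav", "footer", "header", "aside", ".toc", ".sidebar"]

def page_to_markdown(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for selector in BOILERPLATE_SELECTORS:
        for tag in soup.select(selector):
            tag.decompose()  # strip nav, footers, TOCs before conversion
    main = soup.find("main") or soup.body or soup
    return to_markdown(str(main), heading_style="ATX")

def chunk_by_headings(markdown: str, max_level: int = 3) -> list[dict]:
    # Split on headings up to max_level so chunk boundaries follow the
    # document hierarchy rather than a fixed token window.
    pattern = re.compile(rf"^#{{1,{max_level}}}\s+(.+)$", re.MULTILINE)
    chunks, heading, last_pos = [], "(preamble)", 0
    for match in pattern.finditer(markdown):
        body = markdown[last_pos:match.start()].strip()
        if body:
            chunks.append({"heading": heading, "text": body})
        heading, last_pos = match.group(1).strip(), match.end()
    tail = markdown[last_pos:].strip()
    if tail:
        chunks.append({"heading": heading, "text": tail})
    return chunks

def internal_links(html: str, base_url: str) -> list[str]:
    # Collect same-site links to emit alongside the content as a link graph.
    soup = BeautifulSoup(html, "html.parser")
    resolved = {urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)}
    return sorted(link for link in resolved if link.startswith(base_url))
```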
The goal is to reduce downstream cleanup in RAG pipelines and make website ingestion more deterministic.
I’m curious how others here are handling:
- content deduplication across large docs sites
- chunking strategies that preserve semantic boundaries
- change detection for continuously updated documentation
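For reference, my current take on the last point is to compare normalised per-chunk hashes between crawls, roughly as below. This is a simplified sketch; the {chunk_id: hash} index layout is hypothetical, not a fixed format.

```python
# Simplified sketch of hash-based change detection between crawls.
import hashlib

def stable_hash(text: str) -> str:
    # Normalise whitespace so purely cosmetic edits don't change the hash.
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()

def diff_index(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    # Compare two {chunk_id: content_hash} maps and report what to re-embed.
    added = [cid for cid in current if cid not in previous]
    removed = [cid for cid in previous if cid not in current]
    changed = [cid for cid in current if cid in previous and previous[cid] != current[cid]]
    return {"added": added, "removed": removed, "changed": changed}
```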
Happy to share implementation details or benchmarks if useful — mostly looking for critique or alternative approaches from people working on similar systems.
– https://apify.com/devwithbobby/docs-markdown-rag-ready-crawler