[D] Designing a crawler that produces RAG-ready markdown instead of raw HTML
When building RAG pipelines and agent systems, I kept running into the same issue:
most web crawlers return raw HTML or noisy text that still requires significant post-processing before it’s usable for embeddings.
I’ve been experimenting with a crawler design that targets AI ingestion specifically rather than generic scraping. The key design choices (sketched in code below) are:
- isolating main content on docs-heavy sites (removing nav, footers, TOCs)
- converting pages into structure-preserving markdown
- chunking by document hierarchy (headings) instead of fixed token windows
- generating stable content hashes to support incremental updates
- emitting an internal link graph alongside the content
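To make that concrete, here is a stripped-down sketch of the extraction, chunking, and link-graph steps. This is not the actual crawler code: BeautifulSoup and markdownify stand in for the real conversion logic, and the boilerplate selector list is illustrative only.

```python
# Rough sketch of the extract -> markdown -> chunk path (simplified).
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from markdownify import markdownify as to_markdown

# Illustrative placeholders; real docs sites need site-specific selectors.
BOILERPLATE_SELECTORS = ["nav", "footer", "header", "aside", ".toc", ".sidebar"]

def page_to_markdown(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for selector in BOILERPLATE_SELECTORS:
        for tag in soup.select(selector):
            tag.decompose()  # strip nav, footers, TOCs before conversion
    main = soup.find("main") or soup.body or soup
    return to_markdown(str(main), heading_style="ATX")

def chunk_by_headings(markdown: str, max_level: int = 3) -> list[dict]:
    # Split on headings up to max_level so chunk boundaries follow the
    # document hierarchy rather than a fixed token window.
    pattern = re.compile(rf"^#{{1,{max_level}}}\s+(.+)$", re.MULTILINE)
    chunks, heading, last_pos = [], "(preamble)", 0
    for match in pattern.finditer(markdown):
        body = markdown[last_pos:match.start()].strip()
        if body:
            chunks.append({"heading": heading, "text": body})
        heading, last_pos = match.group(1).strip(), match.end()
    tail = markdown[last_pos:].strip()
    if tail:
        chunks.append({"heading": heading, "text": tail})
    return chunks

def internal_links(html: str, base_url: str) -> list[str]:
    # Collect same-site links to emit alongside the content as a link graph.
    soup = BeautifulSoup(html, "html.parser")
    resolved = {urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)}
    return sorted(link for link in resolved if link.startswith(base_url))
```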
The goal is to reduce downstream cleanup in RAG pipelines and make website ingestion more deterministic.
I’m curious how others here are handling:
- content deduplication across large docs sites
- chunking strategies that preserve semantic boundaries
- change detection for continuously updated documentation
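For reference, my current take on the last point is to compare normalised per-chunk hashes between crawls, roughly as below. This is a simplified sketch; the {chunk_id: hash} index layout is hypothetical, not a fixed format.

```python
# Simplified sketch of hash-based change detection between crawls.
import hashlib

def stable_hash(text: str) -> str:
    # Normalise whitespace so purely cosmetic edits don't change the hash.
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()

def diff_index(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    # Compare two {chunk_id: content_hash} maps and report what to re-embed.
    added = [cid for cid in current if cid not in previous]
    removed = [cid for cid in previous if cid not in current]
    changed = [cid for cid in current if cid in previous and previous[cid] != current[cid]]
    return {"added": added, "removed": removed, "changed": changed}
```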
Happy to share implementation details or benchmarks if useful — mostly looking for critique or alternative approaches from people working on similar systems.
– https://apify.com/devwithbobby/docs-markdown-rag-ready-crawler