[P] Using YouTube as a data source (lessons from building a coffee domain dataset)
I started working on a small coffee coaching app recently – something that could answer questions about brew methods, grind size, extraction, etc. I was looking for good data and realized most written sources are either shallow or scattered. YouTube, on the other hand, has insanely high-quality content (James Hoffmann, Lance Hedrick, etc.), but it’s not usable out of the box for RAG: transcripts are messy, chunking is inconsistent, and getting everything into a usable format took way more effort than expected. So I made a small CLI tool that:
It basically became the data layer for my app and, funnily enough, ended up getting way more traction than my actual coffee coaching app!

Repo: youtube-rag-scraper

submitted by /u/ravann4