[P] Dataset creation tool with intelligent quality filtering for LLM fine-tuning [Open Source]

I’ve been working on improving fine-tuning workflows and realized data collection is where most people struggle. Created a tool to automate this.

Web scraping is easy. Getting **useful** training data is hard. Most scraped content is navigation, ads, boilerplate, or just low-quality writing.

Built a scoring system that evaluates content on 6 factors (rough sketch after the list):

– Information density (tutorials, explanations vs fluff)

– Educational value (technical depth)

– Structure quality (proper formatting, headers, lists)

– Noise filtering (removes ads, navigation)

– Length optimization (sweet spot is 800-5000 chars)

– URL patterns (blog posts, articles vs home pages)
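
To give a sense of what scoring like this can look like, here's a minimal sketch. This is not the tool's actual implementation; the factor weights, regexes, and noise phrases are my own assumptions, and the real scorer is presumably more involved:

```python
import re

def score_page(text: str, url: str) -> float:
    """Rough composite quality score in [0, 100] — illustrative only."""
    score = 0.0

    # Length optimization: favor the 800-5000 char sweet spot
    n = len(text)
    score += 20 if 800 <= n <= 5000 else 5

    # Structure quality: headers and list markers suggest organized content
    if re.search(r"^#{1,3}\s|^[-*]\s", text, re.MULTILINE):
        score += 15

    # Information density: inline code and numbers per sentence as a crude proxy
    sentences = max(text.count("."), 1)
    density = len(re.findall(r"`[^`]+`|\d+", text)) / sentences
    score += min(20, density * 10)

    # Educational value: keyword proxy for technical depth
    if re.search(r"\b(example|parameter|function|install|tutorial)\b", text, re.I):
        score += 15

    # Noise filtering: penalize phrases typical of navigation/ads
    noise = sum(text.lower().count(w) for w in ("cookie policy", "subscribe", "sign in"))
    score -= min(15, noise * 5)

    # URL patterns: prefer article-like paths over home pages
    if re.search(r"/(blog|docs|articles?|tutorials?)/", url):
        score += 15

    return max(0.0, min(100.0, score))
```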

Additional features:

– Content-type specific extraction (recipes have different structure than docs)

– Multi-threaded crawling with rate limiting

– Configurable depth (crawl seed pages only vs follow links 2-3 levels deep)

– Chat template formatting for popular model families (see the sketch after this list)

– Can process GitHub repos and local codebases
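
On the chat template side, the idea is just mapping each filtered example into the target family's prompt format. A minimal sketch for a Llama-3-style template (the wrapper function and sample text are my own, not the tool's code; the other families just swap in their own special tokens):

```python
def to_llama3_chat(instruction: str, response: str) -> str:
    """Wrap one (instruction, response) pair in the Llama 3 chat format."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{instruction}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{response}<|eot_id|>"
    )

# Example: one filtered page turned into a training sample
sample = to_llama3_chat(
    "How do I read a file line by line in Python?",
    "Open it with a with-block and iterate over the file object.",
)
```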

Use case: Scraped Python documentation, set quality threshold to 75, got ~2,000 high-quality examples. Fine-tuned Llama 3.2 3B with LoRA, ended up with a model that’s surprisingly good at Python-specific questions.
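
For anyone wanting to reproduce that last step, the fine-tune itself can be a standard PEFT LoRA setup. This is generic peft/transformers usage, not something the tool does for you; the hyperparameters are typical placeholder values:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.2-3B-Instruct"  # model family from the example above
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Typical LoRA config: low-rank adapters on the attention projections
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# From here, train on the ~2,000 filtered examples with your trainer of choice
# (e.g. trl's SFTTrainer) and save the adapter with model.save_pretrained(...).
```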

Repo: https://github.com/noosed/NTCompanion

Built with Python, uses DearPyGUI for the interface. Supports Llama, Mistral, Qwen, Phi, and Gemma chat templates out of the box. Entirely Open-Source and will stay that way!

submitted by /u/Muted_Impact_9281