[D] I’m building a synthetic data engine for Hinglish (Hindi+English) LLMs — but I’m stuck at a 0.69 quality score. Thoughts?
Hey,

We talk about the “Data Wall,” but for Indian languages it’s a data abyss. Hinglish corpora are small, scraped and toxic, or lose their Indian flavor after translation. I’m working on a pipeline that generates privacy-preserving synthetic Hinglish conversational data.

Pipeline:
- Seed: 35k real Hinglish conversations (quality: 98.67)
- Architecture: GaussianCopula + custom speaker oversampling

Goal: scale minority dialects while maintaining code-mix patterns

Reality check (10k rows):
Privacy: AUC 0.95 (membership inference)
Quality: 0.6897 (target ≥ 0.75)

Word […]
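For anyone who wants something concrete to poke at, here is a minimal sketch of what I mean by speaker oversampling. It is a simplified stand-in, not the actual pipeline code: the `oversample_by_group` helper, the `dialect` field, and the toy rows are all hypothetical, and it uses only the Python standard library. The idea is to resample whole utterances from minority groups with replacement, so the code-mixing inside each utterance is left untouched.

```python
import random
from collections import Counter, defaultdict


def oversample_by_group(rows, group_key, target_per_group=None, seed=0):
    """Resample rows with replacement so every group reaches the same count.

    Hypothetical helper for illustration; whole rows are copied, so the text
    of each utterance (and its Hindi/English mix) is never modified.
    """
    rng = random.Random(seed)

    # Bucket the rows by their group label (e.g. dialect or speaker).
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[group_key]].append(row)

    # Default target: the size of the largest group.
    target = target_per_group or max(len(g) for g in by_group.values())

    balanced = []
    for group_rows in by_group.values():
        balanced.extend(group_rows)
        deficit = target - len(group_rows)
        if deficit > 0:
            # Draw extra rows from the same group, with replacement.
            balanced.extend(rng.choices(group_rows, k=deficit))

    rng.shuffle(balanced)
    return balanced


# Toy data: the dialect labels and texts are made up for this example.
rows = (
    [{"dialect": "delhi", "text": "kal meeting hai, don't be late"}] * 6
    + [{"dialect": "mumbai", "text": "apun ko lagta hai it's fine"}] * 3
    + [{"dialect": "bangalore", "text": "yaar traffic is too much"}] * 1
)

balanced = oversample_by_group(rows, "dialect")
counts = Counter(r["dialect"] for r in balanced)
print(counts)
```

With the toy data above, every dialect ends up with 6 rows (18 in total). Plain duplication like this keeps code-mix patterns intact but adds no new variety, which is one reason a quality score might plateau when the balanced table is then fed to a tabular synthesizer.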