[D] I’m building a synthetic data engine for Hinglish (Hindi+English) LLMs — but I’m stuck at a 0.69 quality score. Thoughts?
Hey
We talk about the “Data Wall,” but for Indian languages it’s a data abyss. Hinglish corpora are small, full of toxic scraped content, or lose the Indian flavor after translation.
I’m building a pipeline to generate privacy-preserving synthetic Hinglish conversational data.
Pipeline:

- Seed: 35k real Hinglish conversations (quality: 98.67)
- Architecture: GaussianCopula + custom speaker oversampling (rough sketch below)
- Goal: scale minority dialects while maintaining code-mix patterns
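For concreteness, here is roughly what that setup looks like, assuming SDV 1.x. Column names like `speaker_dialect` and the row counts are placeholders, the “custom speaker oversampling” is approximated with SDV’s conditional sampling, and it assumes each conversation is already featurized into tabular columns, since that’s all a GaussianCopula model actually sees:

```python
# Rough pipeline sketch, assuming SDV 1.x. Column names ("speaker_dialect", ...)
# and row counts are placeholders; "custom speaker oversampling" is approximated
# here with SDV's conditional sampling over a dialect column.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.sampling import Condition
from sdv.single_table import GaussianCopulaSynthesizer

seed = pd.read_csv("hinglish_seed_35k.csv")  # hypothetical path to the 35k seed rows

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=seed)

synth = GaussianCopulaSynthesizer(metadata)
synth.fit(seed)

# Oversample minority dialects by conditioning on the (assumed) dialect column
# instead of letting the copula reproduce the majority-heavy marginal.
minority_conditions = [
    Condition(num_rows=2_000, column_values={"speaker_dialect": "bhojpuri"}),
    Condition(num_rows=2_000, column_values={"speaker_dialect": "haryanvi"}),
]
minority_rows = synth.sample_from_conditions(conditions=minority_conditions)

majority_rows = synth.sample(num_rows=6_000)
synthetic = pd.concat([majority_rows, minority_rows], ignore_index=True)
```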
Reality check (10k rows):

- Privacy: AUC 0.95 (membership inference)
- Quality: 0.6897 (target ≥ 0.75)
- Word-count distributions hold up, but the code-mix patterns fall apart once I oversample the minority speakers (evaluation sketch below)
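And here is roughly how I’m scoring it, continuing with the `seed`, `synthetic`, and `metadata` objects from the sketch above. The quality number is SDV’s built-in report score; the membership-inference AUC shown is just one simple distance-based formulation, not necessarily the only way to run the attack, and it assumes numerically featurized rows:

```python
# Evaluation sketch, reusing `seed`, `synthetic`, and `metadata` from above.
import numpy as np
from sdv.evaluation.single_table import evaluate_quality
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

# Column-shape / column-pair quality score in [0, 1], same scale as the 0.6897 above.
report = evaluate_quality(real_data=seed, synthetic_data=synthetic, metadata=metadata)
print("quality score:", report.get_score())

def mia_auc(members: np.ndarray, non_members: np.ndarray, synth_rows: np.ndarray) -> float:
    """Distance-based membership inference: rows that sit closer to the synthetic
    set are scored as more likely to have been in the training data."""
    nn = NearestNeighbors(n_neighbors=1).fit(synth_rows)
    d_members, _ = nn.kneighbors(members)
    d_non_members, _ = nn.kneighbors(non_members)
    scores = np.concatenate([-d_members.ravel(), -d_non_members.ravel()])  # closer => higher score
    labels = np.concatenate([np.ones(len(members)), np.zeros(len(non_members))])
    return roc_auc_score(labels, scores)
```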
Questions:

- For 7B-14B models, is ~0.69 similarity sufficient if the domain logic is sound?
- Are statistical synthesizers adequate for Hinglish conversation data, or does only an LLM-in-the-loop approach work?
- Would startups be interested in data certificates (quality, privacy, diversity), or just pure volume?
Building this under Forge to minimize dependence on Western-centric corpora.
Frankly, is it worth improving, or is statistical synthesis a dead end for conversational LLM data?