[D] I’m building a synthetic data engine for Hinglish (Hindi+English) LLMs — but I’m stuck at a 0.69 quality score. Thoughts?

Hey

We speak of the “Data Wall,” but for Indian languages it’s a data abyss. Hinglish corpora are small, riddled with toxic scraped content, or lose the Indian flavor in translation.

I’m building a pipeline that generates privacy-preserving synthetic Hinglish conversational data.

Pipeline:

- Seed: 35k real Hinglish conversations (quality: 98.67)
- Architecture: GaussianCopula + custom speaker oversampling (minimal sketch below)

Goal: scale minority dialects while maintaining code-mix patterns
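
For concreteness, the synthesis step looks roughly like this. It’s a minimal sketch with SDV’s GaussianCopulaSynthesizer plus conditional sampling; the CSV path and the speaker_dialect column/value are placeholders, not my actual schema, and the real oversampling logic is a bit more involved.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.sampling import Condition

# Seed data: 35k real Hinglish conversations (placeholder path and columns).
seed = pd.read_csv("hinglish_seed_35k.csv")

# Let SDV infer column types, then fit the copula on the seed.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(seed)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(seed)

# Base synthetic sample.
synthetic = synthesizer.sample(num_rows=10_000)

# "Speaker oversampling": conditionally sample extra rows for a minority dialect.
minority = Condition(column_values={"speaker_dialect": "bhojpuri"}, num_rows=2_000)
synthetic_minority = synthesizer.sample_from_conditions(conditions=[minority])
```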

Reality check (10k rows):

- Privacy: AUC 0.95 (membership inference)
- Quality: 0.6897 (target ≥ 0.75)
- Word counts stay consistent, but the code-mix pattern falls apart once I oversample the minority speakers
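
In case the metric matters: the quality score is the usual 0-to-1 report-style score. A rough sketch of how I’m trying to localize the drop per column, using SDV’s evaluate_quality and reusing the frames and metadata from the snippet above:

```python
from sdv.evaluation.single_table import evaluate_quality

# Compare the seed against the synthetic sample on marginals and pairwise trends.
report = evaluate_quality(real_data=seed, synthetic_data=synthetic, metadata=metadata)

print(report.get_score())                                      # overall score in [0, 1]
print(report.get_details(property_name="Column Shapes"))       # per-column marginal fit
print(report.get_details(property_name="Column Pair Trends"))  # pairwise correlations
```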

Questions

  1. For 7B-14B models, is ~0.69 similarity sufficient if the domain logic is sound?

  2. Are statistical synthesizers adequate for Hinglish conversational data, or does only an LLM-in-the-loop approach work? (rough sketch of what I mean below)

  3. Would startups be interested in data certificates (quality, privacy, diversity), or just pure volume?
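
For question 2, the LLM-in-the-loop variant I have in mind keeps the copula for structure and has an LLM rewrite the utterance text against a few real seed examples so the code-mix style survives. A minimal sketch (the OpenAI client and model name are placeholders; any chat-capable model would do):

```python
from openai import OpenAI  # placeholder provider; swap in any chat-capable LLM

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def rewrite_utterance(utterance: str, seed_examples: list[str],
                      model: str = "gpt-4o-mini") -> str:
    """Rephrase a statistically generated utterance so it matches the
    Hinglish code-mix style of a few real seed examples (few-shot prompt)."""
    examples = "\n".join(f"- {ex}" for ex in seed_examples)
    prompt = (
        "Rewrite the utterance below in natural Hinglish, keeping the meaning "
        "and roughly the same Hindi/English mixing ratio as these examples:\n"
        f"{examples}\n\nUtterance: {utterance}\nRewritten:"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content.strip()
```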

Building this under Forge to minimize dependence on Western-centric corpora.

Frankly, is it worth improving, or is statistical synthesis a dead end for conversational LLM data?
