[P] Structured Prompting for Extremely Low-Resource Languages: 80% → 5% Vocabulary Contamination, No Fine-Tuning
Most low-resource language research assumes you can fine-tune. But what happens when a language has ~2M speakers, no official script standardization, near-zero web presence, and you’re working with a frozen model?
We ran into this with Tulu, a Dravidian language from coastal Karnataka, India. The core failure mode is consistent across models: prompt in Tulu, get Kannada back. The models aren't hallucinating randomly; they're collapsing to the nearest high-probability neighbor in the training distribution. Vocabulary contamination in baseline outputs sat at ~80%.
Our approach: a 5-layer structured prompt
Rather than treating this as a retrieval or fine-tuning problem, we decomposed the prompt into explicit layers:
- Phonological grounding: Tulu’s retroflex consonants and vowel length distinctions injected directly
- Morphological rules: agglutinative verb structure, case markers, with contrastive Kannada examples
- Negative constraints: explicitly suppressing high-frequency Kannada lexical bleed (e.g., ಇದೆ → ಉಂಡು)
- Romanization standardization: since Tulu has no dominant script, we needed a consistent transliteration anchor
- Self-play synthetic examples: quality-controlled in-context demonstrations generated via iterative model critique
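A minimal sketch of how the five layers above might be assembled into a single prompt. The layer texts here are illustrative placeholders, not the paper's actual prompt material:

```python
# Sketch: assembling a 5-layer structured prompt.
# All layer contents below are stand-ins, not the real prompt from the paper.
LAYERS = {
    "phonological": "Tulu distinguishes retroflex consonants and vowel length: ...",
    "morphological": "Tulu verbs are agglutinative; case markers differ from Kannada: ...",
    "negative": "Do NOT use these Kannada forms: ಇದೆ (use ಉಂಡು instead), ...",
    "romanization": "Use this transliteration scheme consistently: ...",
    "examples": "Quality-controlled in-context demonstrations: ...",
}

def build_prompt(user_query: str, layers: dict = LAYERS) -> str:
    """Concatenate the layers in a fixed order, then append the task."""
    order = ["phonological", "morphological", "negative", "romanization", "examples"]
    sections = [f"### {name.capitalize()}\n{layers[name]}" for name in order]
    return "\n\n".join(sections) + f"\n\n### Task\nRespond in Tulu:\n{user_query}"
```

The fixed ordering (grounding first, constraints before demonstrations) is one plausible arrangement; the paper may order the layers differently.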
Results (validated by native speakers):
- Vocabulary contamination: 80% → 5%
- Grammatical accuracy: 85%
- Tested across GPT-4o, Gemini 2.0 Flash, Llama 3.1 70B
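For readers wondering what "vocabulary contamination" means operationally, here is one rough way such a rate could be computed, given a reference lexicon of Kannada-only forms. This is a guess at the metric's shape, not the paper's exact procedure:

```python
def contamination_rate(output_tokens: list[str], kannada_only_lexicon: set[str]) -> float:
    """Fraction of output tokens that are Kannada lexical items with no
    attested Tulu counterpart. Hypothetical sketch of the metric, not the
    paper's actual evaluation code."""
    if not output_tokens:
        return 0.0
    bleed = [t for t in output_tokens if t in kannada_only_lexicon]
    return len(bleed) / len(output_tokens)
```

In practice the paper's numbers come from native-speaker validation, so any automated lexicon check like this would at best approximate it.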
What’s interesting (and unresolved):
The negative-constraint layer did more work than we expected: more than the grammar documentation alone. This raises a question we don't fully answer: is the model actually "learning" Tulu grammar from the prompt, or is it primarily doing constrained Kannada generation with lexical substitution? Native speaker evals suggest real grammar is being respected, but we can't rule out the latter cleanly.
Also worth noting: the self-play loop was surprisingly sensitive to the critique prompt. Small changes in the evaluator instruction shifted output quality significantly, which suggests synthetic data quality is bottlenecked by how well you can specify "correct Tulu" to a model that doesn't natively know it, which is something of a bootstrapping problem.
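The self-play loop might look roughly like the following, with the generator and critic stubbed out as plain callables (in the real system both are model calls, and the wording of the critique prompt matters a lot, as noted above):

```python
from typing import Callable

def self_play_examples(
    seed_pairs: list[str],
    generate: Callable[[str], str],
    critique: Callable[[str], float],
    rounds: int = 3,
    threshold: float = 0.8,
) -> list[str]:
    """Iteratively generate candidate demonstrations and keep only those
    the critic scores above a threshold. `generate` and `critique` are
    stand-ins for LLM calls; this is a sketch, not the paper's pipeline."""
    kept = list(seed_pairs)
    for _ in range(rounds):
        candidates = [generate(ex) for ex in kept]
        for cand in candidates:
            # The critic's score is sensitive to the critique-prompt wording,
            # which is where the bootstrapping problem shows up.
            if critique(cand) >= threshold:
                kept.append(cand)
    return kept
```

With a lenient critic the pool doubles each round; a strict critic filters it, which is exactly why the evaluator instruction becomes the quality bottleneck.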
Open questions for discussion:
- Does the negative-constraint approach generalize to other language pairs with similar asymmetric resource distributions (e.g., Maithili/Hindi, Scots/English)?
- Is there a principled way to measure “prompt-induced grammar acquisition” vs. constrained generation from a related language?
- At what point does structured prompting hit a ceiling where fine-tuning on even a small curated corpus would dominate?
Paper: https://arxiv.org/abs/2602.15378v1
Blog (more accessible writeup): https://letters.lossfunk.com/p/making-large-language-models-speak
submitted by /u/GrowthExciting1126