Towards High-Quality Machine Translation for Kokborok: A Low-Resource Tibeto-Burman Language of Northeast India

digitado ⋅ 31 March 2026

We present KokborokMT, a high-quality neural machine translation (NMT) system for Kokborok (ISO 639-3: trp), a Tibeto-Burman language spoken primarily in Tripura, India, by approximately 1.5 million people. Despite its status as an official language of Tripura, Kokborok has remained severely under-resourced in the NLP community: prior machine translation attempts were limited to systems trained on small Bible-derived corpora, which achieved BLEU scores below 7. We fine-tune the NLLB-200-distilled-600M model on a multi-source parallel corpus of 36,052 sentence pairs: 9,284 professionally translated sentences from the SMOL dataset, 1,769 Bible-domain sentences from WMT shared task data, and 24,999 synthetic back-translated pairs generated via Gemini Flash from Tatoeba English source sentences. We introduce trp_Latn as a new language token for Kokborok in the NLLB framework. Our best system achieves BLEU scores of 17.30 (en→trp) and 38.56 (trp→en) on held-out test sets, a substantial improvement over prior published results. Human evaluation by three annotators yields a mean adequacy of 3.74/5 and a mean fluency of 3.70/5, with substantial agreement between trained evaluators (κ = 0.67). We will release the model and code publicly under CC-BY-4.0 upon acceptance.
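To make the corpus composition concrete, the short Python sketch below recomputes the total and the share of each source from the figures quoted in the abstract. The dictionary keys and variable names are illustrative labels chosen here, not identifiers from the authors' code, and the sketch says nothing about how the data was actually sampled or weighted during training.

```python
# Sentence-pair counts as quoted in the abstract.
# The key names are illustrative labels, not the authors' identifiers.
sources = {
    "smol_professional": 9_284,         # professionally translated (SMOL)
    "wmt_bible": 1_769,                 # Bible-domain (WMT shared task data)
    "synthetic_backtranslated": 24_999, # Gemini Flash, from Tatoeba English
}

# Total corpus size; should match the 36,052 pairs reported in the abstract.
total = sum(sources.values())

# Fraction of the corpus contributed by each source.
shares = {name: count / total for name, count in sources.items()}

for name, share in shares.items():
    print(f"{name}: {sources[name]:>6} pairs ({share:.1%})")
print(f"total: {total} pairs")
```

On these figures, synthetic back-translated pairs make up roughly 69% of the corpus, professionally translated SMOL sentences about 26%, and Bible-domain data about 5%.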
