I built an open, from-scratch MT pipeline + parallel corpus for Tunisian Darija (Arabizi): an early baseline that I’m growing into a curated community corpus [P]

I’m an 18-year-old independent student from Tunisia. I built, and now lead, an open, from-scratch machine-translation pipeline and parallel corpus for Tunisian Darija. I’m sharing it for feedback.

Why: Tunisian Darija, written in Arabizi (Latin letters + numerals like 3/7/9/5 for Arabic phonemes), has almost no open NLP resources. Existing Arabic tools route it through MSA and mishandle the orthography. To the best of my knowledge, there was no open parallel corpus or from-scratch baseline for it.

What I built (all open):

– Arabizi-aware SentencePiece BPE tokenizer (3/7/9/5 as protected symbols), shared 16k vocab.

– ~15.6M-param encoder–decoder Transformer, from scratch (no pretrained LM): transfer-learned from cleaned Moroccan Darija, then fine-tuned on hand-crafted Tunisian pairs.

– Full cleaning / training / eval pipeline.
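To illustrate what "Arabizi-aware" means in practice, here is a minimal, standard-library-only sketch of a pre-tokenization step that keeps words like "3aslema" whole and tells phoneme digits apart from real numbers. It is not the repo's code: the function names, the regexes, and the example words are assumptions for illustration. In SentencePiece itself, protection of this kind is usually expressed through the trainer's `user_defined_symbols` option; whether the project does it that way is also an assumption.

```python
import re
from typing import List

# Numerals that Arabizi uses for Arabic phonemes, as listed in the post.
ARABIZI_DIGITS = "3795"

# A word that mixes Latin letters with at least one of those digits.
_ARABIZI_WORD = re.compile(
    rf"^(?=.*[A-Za-z])(?=.*[{ARABIZI_DIGITS}])[A-Za-z{ARABIZI_DIGITS}']+$"
)

def split_tokens(text: str) -> List[str]:
    """Split on whitespace and punctuation, keeping letter+digit words whole."""
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def is_arabizi_word(token: str) -> bool:
    """True if the token contains letters plus a phoneme digit (3/7/9/5)."""
    return bool(_ARABIZI_WORD.match(token))

# Illustrative usage: "5" on its own is a number, "sa7a" is a word.
tokens = split_tokens("3aslema, sa7a 5 dinars")
flags = [is_arabizi_word(t) for t in tokens]
```

A generic tokenizer that splits digits from letters would break "sa7a" into three pieces, which is the kind of orthography mishandling the post describes.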

Honest results & limitations: v1 BLEU is 3.89 on a small locked test set. That is low, and I’ll be upfront about it. The corpus is ~553 hand-crafted pairs, so data is the bottleneck, not architecture. I treat 3.89 as a first honest baseline to beat as the corpus grows.
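For context on what a BLEU number measures, here is a minimal sketch of unsmoothed, single-reference corpus BLEU on a 0-100 scale. The post does not say which BLEU implementation or tokenization produced 3.89, so this is an illustration of the metric under those assumptions, not a way to reproduce that score; `corpus_bleu` and `ngrams` are names chosen here.

```python
import math
from collections import Counter
from typing import List

def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps: List[List[str]], refs: List[List[str]], max_n: int = 4) -> float:
    """Unsmoothed corpus-level BLEU with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: a hypothesis n-gram scores at most as often
            # as it appears in the reference.
            matches[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += sum(h.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)

perfect = corpus_bleu([["how", "are", "you", "today"]], [["how", "are", "you", "today"]])
disjoint = corpus_bleu([["a", "b", "c", "d"]], [["w", "x", "y", "z"]])
```

Because the score is a geometric mean over 1- to 4-gram precisions, a small test set with few 4-gram matches can swing it a lot, which is one reason to keep the test set locked when comparing versions.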

Where I’m taking it: I’m expanding this into a larger, ethically collected Darija corpus that I curate and validate: consent-documented field collection, with every pair provenance-tagged. I’m looking for contributors to help grow it, with every contribution reviewed to keep quality and consent standards.
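As a rough idea of what "provenance-tagged" and "consent-documented" could look like at the record level, here is a minimal sketch. The schema is hypothetical: the field names, the allowed source labels, and the validation rules are assumptions for discussion, not the format used in the dataset.

```python
import json
from dataclasses import dataclass, asdict
from typing import List

# Hypothetical source labels; the real corpus may use different ones.
ALLOWED_SOURCES = {"hand-crafted", "field-collection"}

@dataclass(frozen=True)
class PairRecord:
    darija: str               # Arabizi source sentence
    english: str              # English translation
    source: str               # where the pair came from (provenance tag)
    contributor_id: str       # pseudonymous contributor identifier
    consent_documented: bool  # True if consent is on record for this pair
    reviewed: bool = False    # set by the curator after review

def validate(rec: PairRecord) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if not rec.darija.strip() or not rec.english.strip():
        problems.append("empty text")
    if rec.source not in ALLOWED_SOURCES:
        problems.append("unknown source")
    if rec.source == "field-collection" and not rec.consent_documented:
        problems.append("field-collected pair without documented consent")
    return problems

good = PairRecord("3aslema", "hello", "hand-crafted", "c001", True, reviewed=True)
bad = PairRecord("sa7a", "thanks", "field-collection", "c002", False)
line = json.dumps(asdict(good), ensure_ascii=False)  # one JSONL row
```

Rejecting records at contribution time, rather than filtering later, keeps every pair in the published corpus traceable to a source and a consent status.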

Looking for: technical feedback/critique, and anyone interested in contributing data or collaborating on low-resource / dialectal Arabic MT.

Links:

GitHub repo: https://github.com/Dhiadev-tn/darija-translator

Hugging Face dataset: https://huggingface.co/datasets/Dhiadev-tn/tunisian-darija-english

Hugging Face model: https://huggingface.co/Dhiadev-tn/darija-translator

submitted by /u/Dhiadev-tn