MINT: A Multilingual Indic Neural Transformer for Abstractive Summarization Under Memory Constraints

We present MINT (Multilingual Indic Neural Transformer), a compact 14.7M-parameter encoder-decoder architecture for abstractive summarization across seven Indic languages. MINT is designed specifically to operate within the memory envelope of a single commodity NVIDIA T4 GPU (15 GB VRAM), addressing the paradox in which models serving the most resource-constrained communities are themselves the most resource-intensive to deploy. The architecture incorporates Rotary Position Embeddings (RoPE), SiLU feed-forward activations, DropPath regularization, weight tying, and a custom 32,000-token SentencePiece Unigram tokenizer trained over balanced Indic corpora. Training proceeds in two phases on the XL-Sum BBC dataset across Hindi, Bengali, Marathi, Tamil, Telugu, Punjabi, and Urdu: a fluency phase (epochs 1-15) using linear warmup with cosine decay, followed by a refinement phase (epochs 16-25) with a flat low learning rate and a combined coverage-attention entropy loss that jointly penalizes repetition and hallucination.

We conduct the first identical-regime comparison in Indic summarization, fine-tuning both IndicBART (440M parameters) and mT5-small (556M parameters) under the same loss function, optimizer, decoding strategy, and data pipeline as MINT's refinement phase. On the XL-Sum test set, MINT achieves an average ROUGE-1 of 0.1187 at epoch 15, rising to 0.1302 on validation after full refinement, reaching approximately 84.8% of IndicBART's ROUGE-1 (0.1409) on the six overlapping languages while using only 3.3% of its parameters. A critical methodological contribution of this work is the demonstration that the standard Google rouge_score library returns zero for all Indic scripts due to English-centric tokenization; we implement and advocate for whitespace-based ROUGE evaluation as the correct approach.
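The whitespace-based ROUGE evaluation advocated above can be sketched as a minimal ROUGE-1 F1 over whitespace tokens, which avoids the English-centric tokenization that drops Indic-script tokens entirely. The function name and exact formulation below are illustrative, not the paper's released implementation:

```python
from collections import Counter

def rouge1_whitespace(reference: str, hypothesis: str) -> float:
    """ROUGE-1 F1 with plain whitespace tokenization.

    Works for Indic scripts (Devanagari, Bengali, Tamil, etc.) where
    English-centric regex tokenizers discard every token, yielding a
    spurious score of zero. Function name is hypothetical.
    """
    ref_tokens = Counter(reference.split())
    hyp_tokens = Counter(hypothesis.split())
    # Clipped unigram overlap: each token counted at most as often
    # as it appears in the reference.
    overlap = sum((ref_tokens & hyp_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)
```

For example, comparing the Hindi reference "मैं घर जा रहा हूँ" (5 tokens) against the hypothesis "मैं घर जा रहा" (4 tokens, all overlapping) gives precision 1.0, recall 0.8, and F1 of about 0.889, whereas an English-centric tokenizer would score the pair as zero.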
MINT additionally achieves a BERTScore-F1 of 0.8497 (via XLM-RoBERTa-Large) and a LaBSE embedding cosine similarity of 0.4306, confirming that generated summaries carry semantic meaning even when surface overlap metrics are modest. All code and checkpoints are publicly released.
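The combined coverage-attention entropy loss described for the refinement phase can be sketched along the lines of the standard coverage penalty of See et al. (2017) plus an entropy term on the cross-attention distribution. The weights, tensor shapes, and exact formulation below are assumptions for illustration, not MINT's published loss:

```python
import numpy as np

def coverage_entropy_loss(attn, cov_weight=1.0, ent_weight=0.1, eps=1e-9):
    """Sketch of a combined coverage + attention-entropy penalty.

    attn: array of shape (batch, tgt_len, src_len) holding decoder
    cross-attention weights (each row sums to 1 over src_len).
    The coverage term penalizes re-attending to source positions that
    accumulated attention in earlier steps (a repetition signal); the
    entropy term penalizes diffuse attention distributions (a proxy
    for hallucination). Both weights are hypothetical defaults.
    """
    # Coverage at step t = total attention mass before step t.
    coverage = np.cumsum(attn, axis=1) - attn
    cov_loss = np.minimum(attn, coverage).sum(-1).mean()
    # Shannon entropy of each step's attention distribution.
    entropy = -(attn * np.log(attn + eps)).sum(-1).mean()
    return cov_weight * cov_loss + ent_weight * entropy
```

In this formulation the coverage term vanishes when each decoding step attends to previously unattended source positions, and the entropy term is minimized when attention is sharply peaked, so both failure modes named in the abstract are penalized jointly.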
