BHRE-RAG: A Benchmark and Retrieval-Augmented Framework for Advancing Comprehension-Based Question Answering in Bangla

Large language models excel in English but struggle with low-resource languages such as Bengali due to limited training data and complex linguistic structures. This paper addresses this gap through two key contributions, presenting a system that applies natural language processing techniques to examine subject-specific chapters and generate questions with solutions of varying lengths. First, we introduce the Bangla Holistic Reasoning Evaluation (BHRE), a comprehensive zero-shot and few-shot assessment of LLMs (GPT-4, Llama-3.1, Mixtral-8x, Qwen2.5, Mistral, Gemma) on the challenging BanglaRQA dataset. Second, we propose a Retrieval-Augmented Generation (RAG) framework within BHRE that enhances LLM performance by retrieving precise, contextual evidence before generating answers. Using the BanglaRQA question-answering dataset, comprising 3,000 context passages and 14,889 question-answer pairs, we benchmark these LLMs with Exact Match (EM) and F1 metrics against BanglaT5, a fine-tuned state-of-the-art model. Our results show that Llama-3 emerges as the top-performing model in both F1 and EM, and that our RAG-based approach improves its performance well beyond the zero-shot and few-shot settings, surpassing the previous fine-tuned SOTA (BanglaT5). This work demonstrates that prompt engineering can enable LLMs to rival fine-tuned models without any fine-tuning, shows the effectiveness of RAG systems for low-resource languages, and provides a reproducible framework for future research on enhancing the capabilities of language models.
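The retrieve-then-answer loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the token-overlap retriever, the function names, and the toy passages are all assumptions standing in for the framework's actual retriever and prompting pipeline.

```python
# Illustrative sketch of a RAG-style retrieve-then-answer pipeline.
# The retriever here is a toy token-overlap scorer, used only to show
# the control flow: retrieve evidence first, then build a grounded prompt.
from collections import Counter

def overlap_score(query_tokens, passage_tokens):
    """Count shared tokens between query and passage (stand-in for a real retriever)."""
    shared = Counter(query_tokens) & Counter(passage_tokens)
    return sum(shared.values())

def retrieve(question, passages, k=1):
    """Return the k passages with the highest token overlap with the question."""
    q = question.lower().split()
    ranked = sorted(passages,
                    key=lambda p: overlap_score(q, p.lower().split()),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, evidence):
    """Concatenate retrieved evidence into a context-grounded QA prompt for an LLM."""
    context = "\n".join(evidence)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Toy corpus; BanglaRQA passages would be used in practice.
passages = [
    "The Padma is a major river of Bangladesh.",
    "Dhaka is the capital city of Bangladesh.",
]
question = "What is the capital of Bangladesh?"
prompt = build_prompt(question, retrieve(question, passages))
# `prompt` would then be sent to the LLM instead of the bare question.
```

The design point is simply that the model answers from retrieved evidence rather than from its parametric knowledge alone, which is what lets prompted LLMs close the gap with fine-tuned models on low-resource QA.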
