Comparing Single-Agent and Multi-Agent Strategies in LLM-Based Title-Abstract Screening

Title-and-abstract screening remains labour-intensive, especially in interdisciplinary domains where shared terminology increases misclassification risk. This study compared five LLM coordination strategies (single-agent baseline, majority voting, recall-focused ensemble, confidence-weighted aggregation, and two-stage debate) using four 4-bit quantised open-source models (Mistral 7B, LLaMA 3.1 8B, Granite 3.3 8B, Qwen 2.5 7B) in zero-shot and few-shot configurations. The evaluation was conducted on a gold standard of 200 papers drawn from a corpus of 2,036 records on blockchain-based e-voting. The best-performing configuration, a single-agent strategy with Qwen 2.5 7B in few-shot mode, achieved recall of 100%, precision of 70.4%, F1 of 82.6%, and a 43.4% reduction in manual screening effort, outperforming all multi-agent alternatives. Confidence-weighted aggregation produced results identical to majority voting, indicating that self-reported confidence from 7-8B-parameter models added no discriminative value. All decisions were recorded on a private Antelope blockchain with OpenTimestamps anchoring and Zenodo archival. These results suggest that, for domain-specific screening tasks, careful model selection matters more than the overhead of multi-agent coordination, and that few-shot prompting with a well-matched model can achieve human-level recall while substantially reducing manual effort.
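To make the confidence-weighting result concrete, the sketch below is an illustrative assumption rather than the paper's implementation: the function names and the uniform-confidence example are invented for exposition. It shows why weighting votes by self-reported confidence reduces to plain majority voting when agents report indistinguishable confidences, and checks the headline F1 against the stated precision and recall.

```python
from typing import List, Tuple

def majority_vote(votes: List[bool]) -> bool:
    """Include a record if more than half of the agents vote to include it."""
    return sum(votes) > len(votes) / 2

def confidence_weighted(votes: List[Tuple[bool, float]]) -> bool:
    """Weight each include/exclude vote by the agent's self-reported confidence.

    When all agents report (near-)identical confidence, the weighted masses
    are proportional to the raw vote counts, so the decision coincides with
    majority_vote -- consistent with the finding that self-reported
    confidence from 7-8B models added no discriminative value.
    """
    include_mass = sum(c for vote, c in votes if vote)
    exclude_mass = sum(c for vote, c in votes if not vote)
    return include_mass > exclude_mass

# Four agents reporting the same confidence: weighted and unweighted agree.
raw = [(True, 0.9), (True, 0.9), (False, 0.9), (True, 0.9)]
assert confidence_weighted(raw) == majority_vote([v for v, _ in raw])

# Sanity check of the reported metrics: with recall R = 1.0 and
# precision P = 0.704, F1 = 2PR / (P + R) ≈ 0.826 (the reported 82.6%).
p, r = 0.704, 1.0
assert abs(2 * p * r / (p + r) - 0.826) < 1e-3
```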
