[P] I built an Open-Source Ensemble for Fast, Calibrated Prompt Injection Detection
I’m working on a project called PromptForest, an open-source system for detecting prompt injections aimed at LLMs. The goal is to flag adversarial prompts before they reach the model, while keeping latency low and the output probabilities well-calibrated.
The key insight is that ensemble members are not equally good on every case. So instead of just averaging their outputs, we:
- Benchmark each candidate model first to see what it actually contributes.
- Remove models that don’t improve the ensemble (e.g., ProtectAI’s DeBERTa finetune was dropped because it worsened calibration).
- Weight predictions by each model’s accuracy, letting models specialize in what they’re good at (see the sketch right after this list).
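To make the weighting step concrete, here’s a minimal sketch of accuracy-weighted probability averaging. This illustrates the general idea rather than PromptForest’s actual code; the function name and all numbers are hypothetical.

```python
import numpy as np

def weighted_ensemble(probs, accuracies):
    """Combine per-model injection probabilities, weighting each model
    by its benchmarked accuracy. A simplified sketch of the idea; the
    real PromptForest weighting scheme may differ.

    probs:      per-model P(injection) for one prompt, shape (n_models,)
    accuracies: each model's held-out benchmark accuracy, shape (n_models,)
    """
    w = np.asarray(accuracies, dtype=float)
    w = w / w.sum()                   # normalize weights to sum to 1
    return float(np.dot(w, probs))   # accuracy-weighted average probability

# Hypothetical numbers for illustration only
model_probs = [0.91, 0.40, 0.75]
model_accs  = [0.94, 0.81, 0.89]
print(weighted_ensemble(model_probs, model_accs))
```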
With this approach, the ensemble is smaller (~237M parameters vs. ~600M for the leading baseline), faster, and better calibrated (lower Expected Calibration Error), while still achieving competitive accuracy. Because it assigns lower confidence to its wrong predictions, it is safer as the first stage of a “human-in-the-loop” fallback system.
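For context, Expected Calibration Error bins predictions by confidence and measures the gap between average confidence and actual accuracy in each bin. Here’s a minimal sketch of the standard binned estimator; PromptForest’s evaluation code may compute it differently, and the example numbers are made up.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: size-weighted average of |accuracy - confidence|
    over equal-width confidence bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    # Confidence in the predicted class (max of p and 1-p for binary)
    confidences = np.where(probs >= 0.5, probs, 1.0 - probs)
    predictions = (probs >= 0.5).astype(int)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            bin_acc = (predictions[mask] == labels[mask]).mean()
            bin_conf = confidences[mask].mean()
            ece += mask.mean() * abs(bin_acc - bin_conf)
    return ece

# Hypothetical predictions on a tiny eval set, for illustration
probs  = [0.95, 0.10, 0.70, 0.40]   # P(injection) per prompt
labels = [1, 0, 1, 1]               # ground truth
print(expected_calibration_error(probs, labels))
```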
You can check it out here: https://github.com/appleroll-research/promptforest
I’d love to hear feedback from the ML community—especially on ideas to further improve calibration, robustness, or ensemble design.