[P] I built an Open-Source Ensemble for Fast, Calibrated Prompt Injection Detection

I’m working on a project called PromptForest, an open-source system for detecting prompt injections in LLMs. The goal is to flag adversarial prompts before they reach a model, while keeping latency low and probabilities well-calibrated.

The main insight came from ensembles: no single model is good at every case. Instead of just averaging outputs, we:

  1. Benchmark each candidate model first to see what it actually contributes.
  2. Remove models that don’t improve the ensemble (e.g., ProtectAI’s DeBERTa fine-tune was dropped because it hurt calibration).
  3. Weight predictions by each model’s accuracy, letting models specialize in what they’re good at.
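The weighting step above can be sketched roughly as follows. This is a hypothetical illustration, not PromptForest’s actual code; the function name and the accuracy numbers are made up for the example:

```python
import numpy as np

def accuracy_weighted_ensemble(probs, accuracies):
    """Combine per-model injection probabilities into one score,
    weighting each model by its benchmarked accuracy.
    (Illustrative sketch, not the PromptForest implementation.)"""
    weights = np.asarray(accuracies, dtype=float)
    weights = weights / weights.sum()       # normalize weights to sum to 1
    probs = np.asarray(probs, dtype=float)  # one injection probability per model
    return float(np.dot(weights, probs))    # weighted mean probability

# Example: three detectors score a prompt; the more accurate models
# pull the ensemble toward their prediction.
score = accuracy_weighted_ensemble([0.9, 0.2, 0.8],
                                   accuracies=[0.95, 0.70, 0.90])
```

A natural refinement is to learn per-class or per-domain weights instead of a single global accuracy weight, which is one way models can “specialize” in what they’re good at.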

With this approach, the ensemble is smaller (~237M parameters vs ~600M for the leading baseline), faster, and better calibrated (lower Expected Calibration Error) while still achieving competitive accuracy. Lower confidence on wrong predictions makes it safer for “human-in-the-loop” fallback systems.
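For anyone unfamiliar with the metric: Expected Calibration Error bins predictions by confidence and averages the gap between confidence and actual accuracy in each bin, weighted by bin size. A minimal sketch of the standard binned ECE (illustrative only, not the project’s evaluation code):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: bin predictions by confidence, then take the
    population-weighted average of |bin accuracy - bin confidence|.
    (Illustrative sketch of the standard definition.)"""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)  # 1 if prediction was right
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# A well-calibrated detector: when it says 75% confident,
# it is right about 75% of the time, so the gap (and ECE) is ~0.
```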

You can check it out here: https://github.com/appleroll-research/promptforest

I’d love to hear feedback from the ML community—especially on ideas to further improve calibration, robustness, or ensemble design.

submitted by /u/Valuable-Constant-54
