[P] I built an open-source benchmark to test if LLMs are actually as confident as they claim to be (Spoiler: They often aren’t)

Hey everyone,

When building systems around modern open-source LLMs, one of the biggest issues is that they can confidently hallucinate, assigning 95%+ confidence to an incorrect answer. This makes it really hard to deploy them reliably in the real world if we don’t understand their “overconfidence gaps.”

To dig into this, I built the LLM Confidence Calibration Benchmark.

My goal was to analyze whether their stated output confidence mathematically aligns with their true correctness across different task types.

What it tests: I evaluated several leading models (Llama-3, Qwen, Gemma, Mistral, etc.) across 4 distinct task types:

  1. Mathematical reasoning (GSM8K)
  2. Binary decisions (BoolQ)
  3. Factual knowledge (TruthfulQA)
  4. Common sense (CommonSenseQA)

The Output: The pipeline parses each model’s stated confidences, measures semantic correctness, and generates Expected Calibration Error (ECE) metrics, combined reliability diagrams, and a per-dataset accuracy heatmap.
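For anyone unfamiliar with ECE: it bins predictions by stated confidence and takes the weighted average gap between each bin’s average confidence and its actual accuracy. Here’s a minimal illustrative sketch (not the repo’s actual code; function and argument names are my own):

```python
# Illustrative ECE sketch -- bins predictions by confidence and
# averages the |confidence - accuracy| gap, weighted by bin size.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: stated probabilities in [0, 1]; correct: 0/1 outcomes."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open bins (lo, hi]; a confidence of exactly 0.0 is rare in practice
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()  # mean stated confidence in bin
        accuracy = correct[mask].mean()      # fraction actually correct in bin
        ece += mask.mean() * abs(avg_conf - accuracy)
    return ece

# A model claiming 95% on four answers but getting only one right
# has a large gap: |0.95 - 0.25| = 0.70
print(expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 0, 0, 0]))
```

A perfectly calibrated model would score 0; the overconfidence cases described above show up as bins where stated confidence far exceeds accuracy.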

This makes it easy to see exactly where a model is dangerously overconfident and where it excels, which can save a lot of headaches when selecting a reliable model for a specific use case or RAG pipeline.

The entire project is open source and fully reproducible locally (via Python) or on Kaggle.

If you are interested in checking out the code, the generated charts, or running evaluations yourself, you can find it here:

GitHub Repo: https://git.new/UlnWBA1

I’d love to hear your thoughts on this!

submitted by /u/ChallengingForce
