[R] paper on Evaluative Fingerprints: Stable and Systematic Differences in LLM Evaluator Behavior
TL;DR
A lot of LLM eval pipelines treat “LLM-as-judge” as a rough but usable proxy for quality. I kept running into something that felt off: different judges would give very different scores, yet each judge was weirdly consistent with itself. This paper tries to measure that effect and show it’s not random noise.
What I did:
I set up a simple multi-judge pipeline and ran the same items through multiple “judge” models, multiple times, using the same rubric and strict JSON output.
Dataset 1: YouTube → SEO content packs
– 30 YouTube videos, 15 categories
– 4 generated “content packs” per video
– 120 video×pack pairs
– 3 runs × 9 judges = 3,240 total evaluations
Judges:
Claude-Opus-4.5, Claude-Sonnet-4.5, GPT-5.2, GPT-4.1, Gemini-3-Pro-Preview, Grok-3, DeepSeek-R1, Llama-405B, Mistral-v3-Large
Rubric:
Five 1–5 dimensions: Intent/Angle, Coverage, Faithfulness + receipts, Readability, and SEO mechanics. Judges also had to include quoted “receipts” from the source.
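For concreteness, here is roughly what one evaluation call looks like. This is a minimal sketch, not the exact pipeline code: the field names, the call_judge helper, and the retry policy are placeholders; the parts taken from the setup above are the shared rubric, the five 1–5 dimensions, quoted receipts, and strict JSON output.

```python
# Minimal sketch of the judging loop. Field names, `call_judge`, and the retry policy
# are assumptions; only the rubric dimensions, receipts, and strict-JSON requirement
# come from the setup described above.
import json

DIMENSIONS = ["intent_angle", "coverage", "faithfulness", "readability", "seo_mechanics"]

JUDGE_INSTRUCTIONS = """Score the content pack against the source transcript.
Return ONLY a JSON object of the form:
{
  "scores": {"intent_angle": 1-5, "coverage": 1-5, "faithfulness": 1-5,
             "readability": 1-5, "seo_mechanics": 1-5},
  "receipts": ["verbatim quote from the source", ...],
  "justification": "short explanation tied to the receipts"
}"""

def evaluate(judge_model, source_text, content_pack, call_judge, max_retries=2):
    """Run one judge on one (video, pack) pair and return a parsed, validated dict."""
    prompt = f"{JUDGE_INSTRUCTIONS}\n\nSOURCE:\n{source_text}\n\nCONTENT PACK:\n{content_pack}"
    for _ in range(max_retries + 1):
        raw = call_judge(judge_model, prompt)   # provider-specific API call (placeholder)
        try:
            result = json.loads(raw)
            assert set(result["scores"]) == set(DIMENSIONS)
            assert all(1 <= v <= 5 for v in result["scores"].values())
            return result
        except (json.JSONDecodeError, KeyError, AssertionError):
            continue                            # re-ask on malformed output
    return None                                 # logged as a parse failure
```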
What fell out of it:
Across judges, agreement is basically near zero:
– Krippendorff’s α (overall) ≈ 0.042
– A couple of dimensions even go negative (systematic disagreement), especially Readability and SEO mechanics.
But many judges are stable with themselves: across three runs, within-judge reliability (ICC(3,1)) ranges from about -0.04 up to 0.87, with several judges above 0.8. So the same judge will usually make the same call, even when other judges disagree.
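If you want to reproduce these two numbers, here is a minimal sketch of how they can be computed from a long-format results table (column names and the choice of the krippendorff and pingouin packages are illustrative, not necessarily what the paper uses):

```python
# Sketch of the reliability metrics, assuming a long-format table `df` with one row per
# evaluation and columns: item_id, judge, run, overall (mean of the five dimension scores).
import krippendorff
import pandas as pd
import pingouin as pg

def inter_judge_alpha(df: pd.DataFrame) -> float:
    # Krippendorff's alpha across judges: rows = judges, columns = items,
    # averaging the three runs; "ordinal" because scores are 1-5 ratings.
    wide = df.groupby(["judge", "item_id"])["overall"].mean().unstack()
    return krippendorff.alpha(reliability_data=wide.to_numpy(),
                              level_of_measurement="ordinal")

def within_judge_icc3(df: pd.DataFrame, judge: str) -> float:
    # ICC(3,1) for one judge: items are the "targets", the three runs act as "raters".
    sub = df[df["judge"] == judge]
    icc = pg.intraclass_corr(data=sub, targets="item_id", raters="run", ratings="overall")
    return icc.loc[icc["Type"] == "ICC3", "ICC"].item()
```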
You can often tell which judge produced the eval
If you treat “which judge wrote this evaluation row?” as a classification task:
• Scores only: 77.1% accuracy (9-way)
• Evidence/disposition features only: 71.5%
• Combined: 89.9%
Even within a single provider, the signal is strong:
• GPT-4.1 vs GPT-5.2: 99.6%
This isn’t just “who’s harsher.” The shape of the scores across dimensions and the way receipts are used are both informative.
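The probe itself is simple; something like the following is enough to run it (the classifier choice and feature columns are illustrative):

```python
# Sketch of the judge-identification probe. The post doesn't say which classifier was used,
# so logistic regression with cross-validation is an assumption; the feature columns
# (five dimension scores plus simple receipt/disposition stats) mirror the description.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SCORE_COLS = ["intent_angle", "coverage", "faithfulness", "readability", "seo_mechanics"]
EVIDENCE_COLS = ["n_receipts", "mean_receipt_len", "receipt_found_rate", "entailment_rate"]

def judge_id_accuracy(df: pd.DataFrame, feature_cols: list[str]) -> float:
    """Cross-validated accuracy at predicting the `judge` label from per-row features."""
    X, y = df[feature_cols].to_numpy(), df["judge"].to_numpy()
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
    # In practice, grouped CV (e.g., GroupKFold on item_id) avoids the same item
    # leaking into both train and test folds.
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

# Usage: the scores-only / evidence-only / combined comparison
# acc_scores   = judge_id_accuracy(df, SCORE_COLS)
# acc_evidence = judge_id_accuracy(df, EVIDENCE_COLS)
# acc_combined = judge_id_accuracy(df, SCORE_COLS + EVIDENCE_COLS)
# For the within-provider pair, restrict df to the two GPT judges before calling.
```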
Receipts behave differently too:
I also checked whether each quoted receipt actually exists in the source text, and whether it really supports the justification under a conservative entailment-style check. Some judges cite a lot but with weaker linkage; others cite less but more tightly.
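A minimal sketch of what “exists” and “supports” mean here, with an off-the-shelf NLI cross-encoder standing in for the entailment check (the matching rules, model choice, and threshold are all illustrative):

```python
# Sketch of the two receipt checks: existence (near-verbatim match in the source) and
# support (conservative entailment from receipt to justification).
import re
from sentence_transformers import CrossEncoder

# Assumed label order for this model: (contradiction, entailment, neutral); verify against the model card.
nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")

def _normalize(text: str) -> str:
    # Lowercase and strip punctuation/extra whitespace so small quoting differences don't count as misses.
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9\s]", "", text.lower())).strip()

def receipt_exists(receipt: str, source: str) -> bool:
    """Does the quoted receipt appear (near-)verbatim in the source text?"""
    return _normalize(receipt) in _normalize(source)

def receipt_supports(receipt: str, justification: str, threshold: float = 0.9) -> bool:
    """Conservative support check: the receipt must entail the justification with high confidence."""
    contradiction, entailment, neutral = nli.predict([(receipt, justification)], apply_softmax=True)[0]
    return entailment >= threshold
```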
Second domain (to see if this was a fluke)
I repeated the idea on a different setup:
• 15 Wikipedia articles
• A structured “briefing pack” output format
• Controlled variants: clean, hallucination-poisoned, coverage-poisoned, structure-poisoned
The fingerprints carry over:
• Combined judge-ID accuracy is about 90%
• GPT-4.1 vs GPT-5.2 hits 100% in this regime
Hallucination detection also varies a lot by judge: some judges reliably penalize poisoned content, while others barely move.
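A simple way to quantify that per judge is a clean-vs-poisoned score delta, roughly like this (the column and variant names are illustrative):

```python
# Sketch of the per-judge hallucination-sensitivity measure implied above: how much each
# judge's faithfulness score drops on hallucination-poisoned variants relative to clean ones.
import pandas as pd

def poison_penalty(df: pd.DataFrame, dimension: str = "faithfulness") -> pd.Series:
    """Mean score on clean variants minus mean score on hallucination-poisoned variants, per judge.
    Large positive values = the judge reliably penalizes poisoned content; near zero = it barely moves."""
    means = df.groupby(["judge", "variant"])[dimension].mean().unstack()
    return (means["clean"] - means["hallucination-poisoned"]).sort_values(ascending=False)
```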
I’d love your feedback. My follow-up work will cover temporal deltas and new regimes/domains with different eval rubrics.