[D] Peer matrix evaluation: 10 frontier models judge each other’s responses to reduce single-evaluator bias. Results from async debugging and probability reasoning tasks.
Methodology:
- 10 frontier models (Claude Opus/Sonnet 4.5, o1, GPT-4o, Gemini 3 Pro, Grok 4, DeepSeek V3.2, Llama 4 Scout, Mistral Large, Command A)
- Each model answers the identical prompt blind, with no access to the other responses
- All 10 models judge all 10 responses (100 judgments)
- Self-judgments excluded from final scores, leaving 9 peer judgments per response
- 5 criteria: Correctness (30%), Completeness (20%), Clarity (20%), Depth (15%), Usefulness (15%); a sketch of the weighted aggregation follows this list
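For concreteness, here is a minimal Python sketch of the aggregation described above, assuming each judgment is stored as a simple record with judge, model, and per-criterion scores. The `WEIGHTS` table and the `weighted_score`/`aggregate` names are illustrative assumptions, not the project's actual code.

```python
from collections import defaultdict

# Rubric weights from the methodology above (sum to 1.0).
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "clarity": 0.20,
    "depth": 0.15,
    "usefulness": 0.15,
}

def weighted_score(criteria):
    """Collapse the five 0-10 criterion scores into a single weighted score."""
    return sum(WEIGHTS[c] * criteria[c] for c in WEIGHTS)

def aggregate(judgments):
    """Average each model's weighted scores, excluding self-judgments.

    `judgments` is a list of records like:
      {"judge": "o1", "model": "GPT-4o",
       "criteria": {"correctness": 9, "completeness": 8, "clarity": 9,
                    "depth": 8, "usefulness": 9}}
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for j in judgments:
        if j["judge"] == j["model"]:   # self-judgments are dropped
            continue
        totals[j["model"]] += weighted_score(j["criteria"])
        counts[j["model"]] += 1
    # With 10 models this averages 9 peer judgments per response.
    return {model: totals[model] / counts[model] for model in totals}
```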
CODE-001 Results (Async Python Debugging):
- Claude Opus 4.5: 9.49
- o1: 9.48
- Claude Sonnet 4.5: 9.41
- DeepSeek V3.2: 9.39
- Grok 4: 9.37
- Command A: 9.23
- Gemini 3 Pro: 9.19
- Mistral Large: 9.10
- GPT-4o: 8.79
- Llama 4 Scout: 8.04
REASON-001 Results (Two Envelope Paradox):
- Claude Opus 4.5: 9.24
- o1: 9.23
- Claude Sonnet 4.5: 9.09
- DeepSeek V3.2: 8.93
- Grok 4: 8.88
- GPT-4o: 8.75
- Gemini 3 Pro: 8.68
- Mistral Large: 8.64
- Command A: 8.38
- Llama 4 Scout: 7.92
Judge Bias Patterns:
- Strictest judge: Claude Opus (average score given 7.10-8.76 depending on task)
- Most lenient judge: Mistral Large (average score given 9.22-9.73)
- Correlation: stricter judges tend to receive higher peer scores for their own responses (see the correlation sketch after this list)
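A quick way to check that pattern is to correlate each judge's mean score given with the mean score its own responses receive. The numbers below are placeholders for illustration, not the run's data.

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sum((x - mx) ** 2 for x in xs) *
                  sum((y - my) ** 2 for y in ys)) ** 0.5

# Hypothetical per-judge numbers, one entry per judge:
mean_given    = [7.9, 8.3, 8.6, 9.0, 9.5]   # lower = stricter judge
mean_received = [9.4, 9.2, 8.9, 8.7, 8.1]   # peer score of that judge's own answers

# A negative correlation would support "strict judges score higher themselves".
print(pearson(mean_given, mean_received))
```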
Open questions for feedback:
- Is the current five-criterion rubric weighting optimal across different task types?
- Should we normalize for judge harshness before aggregating? (a per-judge normalization sketch follows this list)
- Are 9 judgments per response sufficient for statistical validity?
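On the normalization question, one common approach is to z-score each judge's ratings before averaging, so lenient and strict judges contribute on the same scale. This is a sketch under assumed data shapes, not the project's pipeline.

```python
from collections import defaultdict
from statistics import mean, stdev

def normalize_by_judge(scores):
    """Z-score each judge's ratings.

    `scores` maps (judge, model) -> weighted score; each judge rates 9 peers,
    so per-judge mean/stdev are well defined.
    """
    by_judge = defaultdict(list)
    for (judge, _model), s in scores.items():
        by_judge[judge].append(s)
    stats = {j: (mean(v), stdev(v)) for j, v in by_judge.items()}
    return {
        (judge, model): (s - stats[judge][0]) / (stats[judge][1] or 1.0)
        for (judge, model), s in scores.items()
    }

# Rankings would then come from averaging the normalized scores, which removes
# a judge's overall harshness while keeping its relative ordering of models.
```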
Full data + prompts: https://themultivac.substack.com
Daily evals at themultivac.com — currently in Phase 2 (peer matrix format).
submitted by /u/Silver_Raspberry_811