[D] Peer matrix evaluation: 10 frontier models judge each other’s responses to eliminate single-evaluator bias. Results from async debugging and probability reasoning tasks.

Methodology:

  • 10 frontier models (Claude Opus/Sonnet 4.5, o1, GPT-4o, Gemini 3 Pro, Grok 4, DeepSeek V3.2, Llama 4 Scout, Mistral Large, Command A)
  • Each model answers the identical prompt blind (without seeing the others' responses)
  • All 10 judge all 10 responses (100 judgments)
  • Self-judgments excluded from final scores
  • 5 weighted criteria: Correctness (30%), Completeness (20%), Clarity (20%), Depth (15%), Usefulness (15%); see the aggregation sketch below
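
A minimal sketch of how that aggregation could work, assuming a judges x responses x criteria score array; the function, the weight ordering, and the fake data are illustrative only, not the actual pipeline:

    import numpy as np

    # Rubric weights from the post; everything else here is assumed for illustration.
    WEIGHTS = np.array([0.30, 0.20, 0.20, 0.15, 0.15])  # correctness, completeness, clarity, depth, usefulness

    def peer_matrix_scores(scores: np.ndarray) -> np.ndarray:
        """scores[j, r, c] = score judge j gave response r on criterion c.
        Returns one peer score per response, excluding self-judgments."""
        weighted = scores @ WEIGHTS                    # collapse criteria -> (n_judges, n_responses)
        n = weighted.shape[0]
        off_diag = ~np.eye(n, dtype=bool)              # False on the diagonal (judge == author)
        return np.array([weighted[off_diag[:, r], r].mean() for r in range(n)])

    # Fake 10 x 10 x 5 scores just to show the shapes (not real data).
    rng = np.random.default_rng(0)
    fake = rng.uniform(7.0, 10.0, size=(10, 10, 5))
    print(peer_matrix_scores(fake).round(2))           # 10 scores, each averaged over 9 judgments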

CODE-001 Results (Async Python Debugging):

  1. Claude Opus 4.5: 9.49
  2. o1: 9.48
  3. Claude Sonnet 4.5: 9.41
  4. DeepSeek V3.2: 9.39
  5. Grok 4: 9.37
  6. Command A: 9.23
  7. Gemini 3 Pro: 9.19
  8. Mistral Large: 9.10
  9. GPT-4o: 8.79
  10. Llama 4 Scout: 8.04

REASON-001 Results (Two Envelope Paradox):

  1. Claude Opus 4.5: 9.24
  2. o1: 9.23
  3. Claude Sonnet 4.5: 9.09
  4. DeepSeek V3.2: 8.93
  5. Grok 4: 8.88
  6. GPT-4o: 8.75
  7. Gemini 3 Pro: 8.68
  8. Mistral Large: 8.64
  9. Command A: 8.38
  10. Llama 4 Scout: 7.92

Judge Bias Patterns:

  • Strictest: Claude Opus (avg 7.10-8.76 depending on task)
  • Most lenient: Mistral Large (9.22-9.73)
  • Correlation: models that judge strictly tend to receive higher peer scores for their own responses (a sketch of this check follows)
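
A hedged sketch of how that correlation could be checked from the judges x responses matrix of weighted scores used above; the function name and return values are assumptions, not the post's actual analysis:

    import numpy as np

    def judge_bias(weighted: np.ndarray):
        """weighted[j, r] = weighted score judge j gave response r (diagonal = self)."""
        n = weighted.shape[0]
        off_diag = ~np.eye(n, dtype=bool)
        given = np.array([weighted[j, off_diag[j]].mean() for j in range(n)])        # how lenient each judge is
        received = np.array([weighted[off_diag[:, r], r].mean() for r in range(n)])  # peer score each model gets
        # A negative Pearson r means stricter judges receive higher peer scores,
        # which is the pattern described above.
        return given, received, np.corrcoef(given, received)[0, 1]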

Open questions for feedback:

  1. Is 5-point rubric weighting optimal for different task types?
  2. Should we normalize for judge harshness before aggregating? (One option is sketched after this list.)
  3. Are 9 judgments per response sufficient for statistical validity?
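
On question 2, one common option (an assumption here, not the project's current method) is to z-score each judge's given scores before averaging, so that harsh and lenient judges contribute on the same scale:

    import numpy as np

    def normalize_per_judge(weighted: np.ndarray) -> np.ndarray:
        """Z-score each judge's off-diagonal scores, then average per response."""
        n = weighted.shape[0]
        off_diag = ~np.eye(n, dtype=bool)
        z = np.full_like(weighted, np.nan, dtype=float)
        for j in range(n):
            row = weighted[j, off_diag[j]]             # the 9 scores judge j handed out
            z[j, off_diag[j]] = (row - row.mean()) / row.std()  # assumes judge j's scores are not all identical
        return np.nanmean(z, axis=0)                   # normalized peer score per response

Re-ranking on these normalized scores would show whether a lenient judge such as Mistral Large is shifting the raw leaderboard.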

Full data + prompts: https://themultivac.substack.com

Daily evals at themultivac.com — currently in Phase 2 (peer matrix format).

submitted by /u/Silver_Raspberry_811