[D] Peer matrix evaluation: 10 frontier models judge each other’s responses to eliminate single-evaluator bias. Results from async debugging and probability reasoning tasks.

Methodology:

  • 10 frontier models (Claude Opus/Sonnet 4.5, o1, GPT-4o, Gemini 3 Pro, Grok 4, DeepSeek V3.2, Llama 4 Scout, Mistral Large, Command A)
  • Each model answers the identical prompt blind (without seeing the others' responses)
  • All 10 judge all 10 responses (100 judgments)
  • Self-judgments excluded from final scores
  • 5 weighted criteria: Correctness (30%), Completeness (20%), Clarity (20%), Depth (15%), Usefulness (15%); see the aggregation sketch below
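
A minimal sketch of how that aggregation could work, assuming a judges x responses x criteria score array; the function, the weight ordering, and the fake data are illustrative only, not the actual pipeline:

    import numpy as np

    # Rubric weights from the post; everything else here is assumed for illustration.
    WEIGHTS = np.array([0.30, 0.20, 0.20, 0.15, 0.15])  # correctness, completeness, clarity, depth, usefulness

    def peer_matrix_scores(scores: np.ndarray) -> np.ndarray:
        """scores[j, r, c] = score judge j gave response r on criterion c.
        Returns one peer score per response, excluding self-judgments."""
        weighted = scores @ WEIGHTS                    # collapse criteria -> (n_judges, n_responses)
        n = weighted.shape[0]
        off_diag = ~np.eye(n, dtype=bool)              # False on the diagonal (judge == author)
        return np.array([weighted[off_diag[:, r], r].mean() for r in range(n)])

    # Fake 10 x 10 x 5 scores just to show the shapes (not real data).
    rng = np.random.default_rng(0)
    fake = rng.uniform(7.0, 10.0, size=(10, 10, 5))
    print(peer_matrix_scores(fake).round(2))           # 10 scores, each averaged over 9 judgments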

CODE-001 Results (Async Python Debugging):

  1. Claude Opus 4.5: 9.49
  2. o1: 9.48
  3. Claude Sonnet 4.5: 9.41
  4. DeepSeek V3.2: 9.39
  5. Grok 4: 9.37
  6. Command A: 9.23
  7. Gemini 3 Pro: 9.19
  8. Mistral Large: 9.10
  9. GPT-4o: 8.79
  10. Llama 4 Scout: 8.04

REASON-001 Results (Two Envelope Paradox):

  1. Claude Opus 4.5: 9.24
  2. o1: 9.23
  3. Claude Sonnet 4.5: 9.09
  4. DeepSeek V3.2: 8.93
  5. Grok 4: 8.88
  6. GPT-4o: 8.75
  7. Gemini 3 Pro: 8.68
  8. Mistral Large: 8.64
  9. Command A: 8.38
  10. Llama 4 Scout: 7.92

Judge Bias Patterns:

  • Strictest: Claude Opus (avg 7.10-8.76 depending on task)
  • Most lenient: Mistral Large (9.22-9.73)
  • Correlation: models that judge strictly tend to receive higher peer scores for their own responses (a sketch of this check follows)
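
A hedged sketch of how that correlation could be checked from the judges x responses matrix of weighted scores used above; the function name and return values are assumptions, not the post's actual analysis:

    import numpy as np

    def judge_bias(weighted: np.ndarray):
        """weighted[j, r] = weighted score judge j gave response r (diagonal = self)."""
        n = weighted.shape[0]
        off_diag = ~np.eye(n, dtype=bool)
        given = np.array([weighted[j, off_diag[j]].mean() for j in range(n)])        # how lenient each judge is
        received = np.array([weighted[off_diag[:, r], r].mean() for r in range(n)])  # peer score each model gets
        # A negative Pearson r means stricter judges receive higher peer scores,
        # which is the pattern described above.
        return given, received, np.corrcoef(given, received)[0, 1]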

Open questions for feedback:

  1. Is 5-point rubric weighting optimal for different task types?
  2. Should we normalize for judge harshness before aggregating? (One option is sketched after this list.)
  3. Are 9 judgments per response sufficient for statistical validity?
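
On question 2, one common option (an assumption here, not the project's current method) is to z-score each judge's given scores before averaging, so that harsh and lenient judges contribute on the same scale:

    import numpy as np

    def normalize_per_judge(weighted: np.ndarray) -> np.ndarray:
        """Z-score each judge's off-diagonal scores, then average per response."""
        n = weighted.shape[0]
        off_diag = ~np.eye(n, dtype=bool)
        z = np.full_like(weighted, np.nan, dtype=float)
        for j in range(n):
            row = weighted[j, off_diag[j]]             # the 9 scores judge j handed out
            z[j, off_diag[j]] = (row - row.mean()) / row.std()  # assumes judge j's scores are not all identical
        return np.nanmean(z, axis=0)                   # normalized peer score per response

Re-ranking on these normalized scores would show whether a lenient judge such as Mistral Large is shifting the raw leaderboard.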

Full data + prompts: https://themultivac.substack.com

Daily evals at themultivac.com — currently in Phase 2 (peer matrix format).

submitted by /u/Silver_Raspberry_811