[R] Mixture-of-Models routing beats single LLMs on SWE-Bench via task specialization
I’ve been looking at per-task results on SWE-Bench Verified and noticed something that leaderboard averages hide: different models consistently solve different subsets of tasks.
Even the top overall model on the leaderboard fails a non-trivial number of tasks that other models reliably solve, and the reverse is also true. This suggests strong task-level specialization rather than one model being strictly better.
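For intuition, here’s a back-of-the-envelope way to see the gap from per-task results. This is a minimal sketch: the `results` dict, model names, and task IDs are made up for illustration, not actual SWE-Bench numbers.

```python
# Sketch: quantify complementarity from per-task pass/fail records.
# The results dict is illustrative only, not real SWE-Bench data.
results = {
    "model_a": {"task-1": True,  "task-2": False, "task-3": True},
    "model_b": {"task-1": False, "task-2": True,  "task-3": True},
}

tasks = sorted(next(iter(results.values())).keys())

# Best single model: highest individual solve rate.
best_single = max(results, key=lambda m: sum(results[m][t] for t in tasks))
best_single_rate = sum(results[best_single][t] for t in tasks) / len(tasks)

# Oracle-routing ceiling: a task counts as solved if *any* model solves it.
oracle_rate = sum(any(results[m][t] for m in results) for t in tasks) / len(tasks)

print(f"best single model: {best_single} at {best_single_rate:.1%}")
print(f"oracle routing ceiling: {oracle_rate:.1%}")
```

If the oracle ceiling sits well above the best single model, a router only needs to recover part of that gap to beat the leaderboard leader.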
To test this, I built a Mixture-of-Models architecture. It differs from traditional routing, which tends to default to the strongest aggregate model most of the time: the goal isn’t to send as many tasks as possible to a single model, but to exploit complementary strengths across models.
Concretely (see the sketch after this list):
- The problem description is embedded
- It’s assigned to a semantic cluster (learned from general coding data, not SWE-Bench)
- Each cluster has learned per-model success statistics
- The task is routed to the historically strongest model for that type of problem
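Here’s a minimal sketch of what that pipeline could look like. The embedding function, cluster centroids, and per-cluster success table are placeholders standing in for whatever the framework actually uses; this is not the Nordlys implementation itself.

```python
import numpy as np

def route(problem_text, embed, centroids, success_stats):
    """Pick a model for one task.

    embed         : callable mapping text -> np.ndarray (any embedding model)
    centroids     : (n_clusters, dim) array of cluster centers, learned from
                    general coding data rather than SWE-Bench
    success_stats : dict cluster_id -> {model_name: historical solve rate}
    """
    # 1. Embed the problem description.
    v = embed(problem_text)

    # 2. Assign it to the nearest semantic cluster (cosine similarity).
    sims = centroids @ v / (
        np.linalg.norm(centroids, axis=1) * np.linalg.norm(v) + 1e-9
    )
    cluster_id = int(np.argmax(sims))

    # 3/4. Route to the model with the best historical success rate
    #      for that cluster.
    stats = success_stats[cluster_id]
    return max(stats, key=stats.get)
```

At inference time this adds one embedding call and an argmax over a small lookup table, which is why it behaves like a lightweight gating mechanism rather than test-time search.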
Importantly, this does not route the majority of tasks to the top aggregate model. Several clusters consistently route to other models that outperform it on that type of problem, even though it has the highest overall score.
There’s no new foundation model, no test-time search, and no repo execution: just a lightweight gating mechanism over multiple models.
Using this Mixture-of-Models setup, the system reaches 75.6% on SWE-Bench Verified, exceeding single-model baselines (~74%). The takeaway isn’t the absolute number but the mechanism: leaderboard aggregates hide complementary strengths, and mixture architectures can capture a higher ceiling than any single model.
Blog with details and methodology here: https://nordlyslabs.com/blog/hypernova
GitHub: the framework is open source! https://github.com/Nordlys-Labs/nordlys
submitted by /u/botirkhaltaev