METR’s Time Horizon benchmark (TH1 / TH1.1) estimates the length of task, measured in human-expert minutes, that a model can complete with 50% reliability.
Most people look at p50_horizon_length.
However, the raw TH1.1 YAML also includes working_time: total wall-clock seconds the agent spent across the full suite (including failed attempts). This is not FLOPs or dollars, but it’s still a useful “how much runtime did the eval consume?” signal.
Links:
- Methodology / TH1 baseline: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
- TH1.1 update: https://metr.org/blog/2026-1-29-time-horizon-1-1/
- Raw YAML: https://metr.org/assets/benchmark_results_1_1.yaml
- Analysis repo: https://github.com/METR/eval-analysis-public
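If you want to poke at the numbers yourself, here's a minimal sketch of pulling p50_horizon_length and working_time out of the raw YAML. The field names come from the file as described above; the surrounding structure (a mapping of per-agent entries) is an assumption, so adjust to whatever the actual schema looks like:

```python
# Sketch: pull p50 horizon and wall-clock working_time from METR's TH1.1 YAML.
# Assumes per-agent entries exposing "p50_horizon_length" (minutes) and
# "working_time" (seconds) -- the exact nesting may differ in the real file.
import urllib.request
import yaml  # pip install pyyaml

URL = "https://metr.org/assets/benchmark_results_1_1.yaml"

with urllib.request.urlopen(URL) as resp:
    data = yaml.safe_load(resp.read())

# Hypothetical iteration -- the real file may key or nest entries differently.
for name, entry in data.items():
    horizon_min = entry.get("p50_horizon_length")
    runtime_hr = entry.get("working_time", 0) / 3600  # seconds -> hours
    print(f"{name}: p50 horizon {horizon_min} min, runtime {runtime_hr:.1f} h")
```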
What jumped out
At the top end:
- GPT-5.2: ~142.4 hours working_time, p50 horizon 394 min
- Claude Opus 4.5: ~5.5 hours working_time, p50 horizon 320 min
That’s roughly 26× more total runtime for about 23% higher horizon.
If you normalize horizon per runtime-hour (very rough efficiency proxy):
- Claude Opus 4.5: ~58 min horizon / runtime-hour
- GPT-5.2: ~2.8 min horizon / runtime-hour
(check out the raw YAML for full results)
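For transparency, here's the back-of-envelope arithmetic behind that proxy as a tiny sketch. The inputs are the working_time and p50 figures quoted above in this post, not re-derived from the YAML:

```python
# Efficiency proxy: p50 horizon (minutes) per hour of wall-clock working_time,
# using the figures quoted in this post.
results = {
    "GPT-5.2":         {"horizon_min": 394, "working_time_hr": 142.4},
    "Claude Opus 4.5": {"horizon_min": 320, "working_time_hr": 5.5},
}

for name, r in results.items():
    efficiency = r["horizon_min"] / r["working_time_hr"]
    print(f"{name}: ~{efficiency:.1f} min horizon per runtime-hour")
# GPT-5.2: ~2.8, Claude Opus 4.5: ~58.2 -- roughly a 21x gap on this crude proxy.
```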
Big confounder (important)
Different models use different scaffolds in the YAML (e.g. OpenAI entries reference triframe_* scaffolding, others reference metr_agents/react). That can change tool-calling style, retries, and how “expensive” the eval is in wall-clock time. So I’m treating working_time as a signal, not a clean apples-to-apples efficiency metric.
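One crude way to sanity-check how much of the runtime gap tracks scaffold choice is to bucket total working_time by whatever scaffold identifier each YAML entry carries. The "scaffold" key below is hypothetical; the YAML references scaffolds like triframe_* and metr_agents/react, but I haven't confirmed the exact field name:

```python
# Crude scaffold-vs-model check: bucket total working_time by scaffold.
# "scaffold" is a hypothetical key -- adjust to the real YAML schema.
from collections import defaultdict

def runtime_hours_by_scaffold(entries):
    """entries: mapping of agent name -> YAML entry, as loaded in the earlier sketch."""
    buckets = defaultdict(float)
    for entry in entries.values():
        scaffold = entry.get("scaffold", "unknown")                # hypothetical key
        buckets[scaffold] += entry.get("working_time", 0) / 3600   # seconds -> hours
    return dict(buckets)
```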
Questions for the sub
- Should METR publish a secondary leaderboard that’s explicit about runtime/attempt budget (or normalize by it)?
- How much of this gap do you think is scaffold behavior vs model behavior?
- Is there a better “efficiency” denominator than working_time that METR could realistically publish (token counts, tool-call counts, etc.)?
Btw, I’m starting a new home for discussions of how AI models compare across domains and evals; if you’re interested, consider joining us at r/CompetitiveAI