“Summary of METR’s predeployment evaluation of GPT-5.6 Sol”, METR (“71hrs (95% CI: 13–11,400hrs)”; now so reward-hack-prone + eval-aware that its capabilities are nearly untestable)

Submitted by /u/gwern