“Summary of METR’s predeployment evaluation of GPT-5.6 Sol”, METR (“71hrs (95% CI: 13–11,400hrs)”; now so reward-hack-prone + eval-aware that its capabilities are nearly untestable)
submitted by /u/gwern