“Summary of METR’s predeployment evaluation of GPT-5.6 Sol”, METR (“71hrs (95% CI: 13–11,400hrs)”; now so reward-hack-prone and eval-aware that its capabilities are nearly untestable)