The famous METR AI time horizons graph contains numerous severe errors [D]
Nathan Witkin, a research writer at NYU Stern’s Tech and Society Lab, writes damningly about the famous METR AI time horizons graph in the Substack publication Transformer:
It is impossible to draw meaningful conclusions from METR’s Long Tasks benchmark — in particular once one realizes that its numerous flaws are probably compounding in unpredictable ways. The appropriate response to a study of this kind is not to assume it can be saved via back-of-the-envelope adjustments, or to comfort oneself that other anecdotal evidence implies that it is probably correct anyway. It is to cut one’s losses and move on in search of higher-quality information.
… The METR graph cannot be saved. For all its sleekness and complexity, it contains far too many compounding errors to excuse. Among them is generalizing to the entire species data collected from a small group of the authors’ peers. Coming up with ever more dramatic ways to make this mistake has become a kind of sport among AI researchers. If the field has a central pathology, it is to aggressively overindex on a mix of anecdotal data from power-users, alongside a long list of benchmarks even more compromised than METR’s. One hopes that as the field matures, its participants will learn to stop making these mistakes.
The errors include:
- Some of the human baseline data was not measured or collected from any empirical source; the authors simply guesstimated it
- A key variable in the data is how long it takes humans to complete certain tasks, but when METR did actually measure this, it paid its human benchmarkers hourly, meaning they were incentivized with cash to take longer
- The sample of human benchmarkers was biased toward METR employees’ friends, acquaintances, and former colleagues (who are likely unrepresentative and possibly biased)
- Humans familiar with a codebase and a specific coding task were 5-18x faster at completing it, but METR used data from humans who were much slower because they had to spend time familiarizing themselves with the codebase and the task at hand
- Test-training data contamination occurred because some of the tasks had published solutions online, which most likely would have been included in LLMs’ training datasets
- And many more
Please read the full post. It’s not too long and it’s accessible to a general audience. It’s worth reading in full to see how many errors were made in the creation of the METR graph and just how bad they are.
If you want to read about even more errors in the METR graph not covered in Nathan Witkin’s post, read this post by the AI researchers Gary Marcus and Ernest Davis.
The METR graph is a great example of why scientific standards and best practices are so important, and why enforcing them through processes like peer review is necessary to prevent us from drowning in bad information. It’s extremely dangerous to rely on research that only superficially appears scientific but wasn’t actually conducted with the rigour normally required of scientific work.
submitted by /u/common_yarrow