[R] AIRS-Bench: A Benchmark for AI Agents on the Full ML Research Lifecycle
We’re releasing AIRS-Bench, a new benchmark from FAIR at Meta that tests whether AI agents can carry out ML research from scratch.
Our goal was to evaluate the full research lifecycle, not just coding. The 20 tasks in AIRS-Bench require agents to handle everything from ideation and experiment design to iterative refinement, with no baseline code provided. The tasks are sourced from recent ML papers, so agent performance is measured against the published state of the art.
Key Observations:
- We tested 14 agent configurations (models such as GPT-4o and o3-mini) across scaffolds such as ReAct and Greedy Search (see the agent-loop sketch after this list for what a scaffold looks like).
- Agents beat the human SOTA on 4 of the 20 tasks, sometimes with novel solutions not found in the original paper (e.g., a two-level stacked ensemble).
- However, agents failed to match SOTA on the other 16 tasks, and the benchmark is far from saturated: the average normalized score is 23.4% (see the scoring sketch after this list).
- Even producing a valid submission is a major challenge: only 58.8% of agent attempts yielded one.
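For readers who haven't worked with agent scaffolds, here is a minimal sketch of a ReAct-style loop in Python. The `llm()` and `run_shell()` helpers are hypothetical stand-ins, and this shows the general thought/action/observation pattern, not the actual harness used in the paper.

```python
# Minimal sketch of a ReAct-style scaffold: the model alternates between a
# free-form "thought", a tool call ("action"), and the tool's output
# ("observation") until it decides to submit. llm() and run_shell() are
# hypothetical placeholders, not AIRS-Bench internals.
import subprocess

def llm(prompt: str) -> str:
    """Placeholder for a call to whatever model backs the agent."""
    raise NotImplementedError

def run_shell(command: str) -> str:
    """Run a shell command in the task workspace and capture its output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=600)
    return result.stdout + result.stderr

def react_loop(task_description: str, max_steps: int = 20) -> str:
    transcript = f"Task:\n{task_description}\n"
    for _ in range(max_steps):
        # Ask the model for its next thought and action given the history so far.
        step = llm(transcript + "\nRespond with 'Thought: ...' then 'Action: <shell command>' "
                   "or 'Action: SUBMIT <path to solution>'.")
        transcript += "\n" + step
        action = step.split("Action:", 1)[-1].strip()
        if action.startswith("SUBMIT"):
            return action  # the agent believes its solution is ready
        # Execute the proposed command and feed the result back as an observation.
        observation = run_shell(action)
        transcript += f"\nObservation:\n{observation}"
    return "NO_SUBMISSION"
```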
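On the 23.4% figure: the exact normalization is defined in the paper, but a common way to score agents against a human SOTA reference looks roughly like the sketch below (my own assumption about the scheme, not necessarily the AIRS-Bench formula).

```python
# Sketch of score normalization against a human SOTA reference. This assumes a
# "fraction of the baseline-to-SOTA gap closed" scheme; the formula actually
# used by AIRS-Bench is the one in the paper.
def normalized_score(agent: float, baseline: float, sota: float) -> float:
    """Map a raw task metric onto [0, 1+], where 1.0 means matching human SOTA."""
    return max(0.0, (agent - baseline) / (sota - baseline))

def benchmark_score(per_task: list[tuple[float, float, float]]) -> float:
    """Average the normalized scores over all tasks (e.g., the 20 AIRS-Bench tasks)."""
    return sum(normalized_score(a, b, s) for a, b, s in per_task) / len(per_task)
```

Under a scheme like this, a 23.4% average would mean a typical run closes roughly a quarter of the gap to human SOTA, which is consistent with the "far from saturated" framing above.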
We believe this provides a grounded look at the current state of AI research agents and a useful tool for the community to measure progress.
Paper (arXiv): https://arxiv.org/abs/2602.06855
Code & Tasks: https://github.com/facebookresearch/airs-bench
Here’s a Twitter thread with a quick summary (happy to remove it from the post if that’s against the guidelines): https://x.com/BhavulGauri/status/2020938358982394332?s=20
submitted by /u/little_by_little_24