Frameworks For Supporting LLM/Agentic Benchmarking [P]

I think the way we are approaching benchmarking is a bit problematic. From reading about how frontier labs benchmark their models, they essentially create a new model, configure a harness, and then run a massive benchmarking suite just to demonstrate marginal gains.

I have several problems with this approach. I worry that we are wasting a significant amount of resources iterating on models, effectively trading carbon for confidence. The latest Gemini benchmarking, for instance, ran 30,000 prompts. While there is a case to be made for ensuring robust results, won’t they simply run those same benchmarks again as they iterate, consuming resources all over again?

It is also concerning to think that other organizations might emulate these habits in their own MLOps. It feels like, as a community, we keep consuming resources just to create a perceived sense of confidence in models. However, I am not entirely sold on what these benchmarks actually discern. pass@k is the usual metric, but it doesn’t really inspire confidence in a model’s abilities or communicate improvements effectively; the point is essentially measuring how likely the model is to succeed within some number of attempts.
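For context on what that metric actually computes: the standard pass@k number is usually the unbiased estimator popularized by the Codex/HumanEval evaluation — given n sampled completions of which c pass, it estimates the probability that at least one of k draws passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    completions sampled (without replacement) from n, of which c are
    correct, passes."""
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples, 3 correct: pass@1 is just the empirical pass rate
print(round(pass_at_k(10, 3, 1), 6))  # 0.3
```

Note that this is a point estimate with no notion of uncertainty attached, which is part of why I find it unsatisfying for comparing iterations.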

With these considerations in mind, I started thinking through different frameworks for more principled benchmarks. I thought Bayesian techniques could be useful for modeling the confidence of results in common use cases — for instance, determining whether “Iteration A” is truly better than “Iteration B.” Ideally, you would need fewer samples to reach the required confidence level than you would by running an entire battery of benchmarks.
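To make the “A vs. B” idea concrete, here is a toy sketch of the kind of comparison I mean (not the bayesbench API, just the underlying idea): model each iteration’s pass rate with a Beta-Bernoulli posterior and estimate P(rate_A > rate_B) by Monte Carlo sampling.

```python
import random

def prob_a_beats_b(succ_a: int, n_a: int, succ_b: int, n_b: int,
                   draws: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(pass rate A > pass rate B) under
    independent Beta(1, 1) priors on each system's pass rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a Bernoulli rate with a uniform prior is
        # Beta(1 + successes, 1 + failures).
        p_a = rng.betavariate(1 + succ_a, 1 + n_a - succ_a)
        p_b = rng.betavariate(1 + succ_b, 1 + n_b - succ_b)
        wins += p_a > p_b
    return wins / draws

# e.g. A passes 42/60 tasks, B passes 31/60 — how sure are we A is better?
print(prob_a_beats_b(42, 60, 31, 60))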

To explore some potential solutions, I have been building a Python package, bayesbench, and creating adapters to hook into popular toolchains.

I imagine this could be particularly useful for evaluating agents without needing to collect massive amounts of data, helping to determine performance trajectories early on. I built the demo on Hugging Face to help people play around with the ideas and the package. It does highlight some limitations: if models perform too similarly, it is difficult to extract a signal. But if the models are different enough, you can save significant resources.
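As a sketch of where the resource savings come from (again, toy code rather than the actual package): evaluate both systems in small batches and stop as soon as the posterior probability that one beats the other clears a confidence threshold. The `run_a`/`run_b` callables and their pass rates below are made up for illustration.

```python
import random

def posterior_prob_a_better(sa, na, sb, nb, rng, draws=20_000):
    """Monte Carlo P(rate_A > rate_B) under Beta(1, 1) priors."""
    wins = 0
    for _ in range(draws):
        if rng.betavariate(1 + sa, 1 + na - sa) > rng.betavariate(1 + sb, 1 + nb - sb):
            wins += 1
    return wins / draws

def sequential_compare(run_a, run_b, threshold=0.99, batch=25,
                       max_n=1000, seed=0):
    """Run tasks in batches; stop once the posterior is confident
    either way, instead of burning through a fixed benchmark suite."""
    rng = random.Random(seed)
    sa = na = sb = nb = 0
    while na < max_n:
        for _ in range(batch):
            sa += run_a(); na += 1
            sb += run_b(); nb += 1
        p = posterior_prob_a_better(sa, na, sb, nb, rng)
        if p >= threshold or p <= 1 - threshold:
            return p, na  # decision reached after na tasks per system
    return p, na  # budget exhausted, still inconclusive

# Toy "models" with clearly different true pass rates (made up).
task_rng = random.Random(1)
model_a = lambda: task_rng.random() < 0.80
model_b = lambda: task_rng.random() < 0.50
p, n = sequential_compare(model_a, model_b)
print(f"P(A > B) = {p:.3f} after {n} tasks per system")
```

With clearly separated models, this typically stops after a few batches; with near-identical models it exhausts the budget — which matches the limitation the demo surfaces.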

I’m curious how others are thinking about benchmarking. I am familiar with tinyBenchmarks, but how do you think evaluation will shift as models become more intensive to evaluate and costly to maintain? Also, if anyone is interested in helping to build out the package or the adapters, it would be great to work with some of the folks here.

submitted by /u/NarutoLLN