AI Evaluation is Becoming an Exciting Standalone Discipline
Introduction
Throughout my academic career, I worked on various problems surrounding the robustness of deep learning models. Until a few years ago, these topics occupied their own niches within the academic and industrial communities. While some of this work was adopted in specific domains, e.g., corruption robustness benchmarks such as ImageNet-C in computer vision, much of it stayed separate from more mainstream developments such as foundation models.
Today, in contrast, work on evaluation metrics and benchmarks is becoming an exciting standalone discipline in academia and industry, fueled by big foundation models that are often inaccessible and irreproducible. Researchers across application domains are learning that good evaluation of AI systems hinges on many details previously ignored: controlling randomness, ensuring data quality, and mathematically understanding metrics and their approximations. As a result, I like to think of my academic career as getting a PhD in AI evaluation rather than in the specific robustness problems I worked on. In this article, I want to share some thoughts looking back at the last few years and explain why I believe AI evaluation will become an even more important problem going forward.
From adversarial examples to jailbreaks
Adversarial examples have received considerable attention since their discovery and popularization around 10 years ago. They were studied in various contexts: first, from the perspective of understanding deep neural networks and when and why they make mistakes; second, from the perspective of security (what is now known as ML or AI security), where adversarial examples are understood as adversarial attacks on models. In my perception, throughout the past 10 years, adversarial robustness and related problems were increasingly perceived as a security problem rather than a machine learning problem.
Today, adversarial examples are back, more prominent than ever, in the form of jailbreak attacks, prompt injection attacks, prompt extraction attacks, and many other security problems surrounding LLMs and AI agents. The large variety of new attacks and threat models is also due to the generative nature of these models. Compared to classification models, where adversarial attacks mainly targeted misclassification, attacks on generative models can have all sorts of goals, ranging from unlocking some “hidden” ability such as toxic or inappropriate generations to extracting secrets from prompts or AI agents. Lately, these attacks are also being combined with more classical adversarial examples for multimodal models.
At the same time, these threat models are not just the subject of academic research but are becoming increasingly relevant for industry as LLMs and LLM applications become the target of hackers. Users try to get LLMs to do things they are not supposed to do, and soon there will be various incentives to mislead LLMs.
The 0.01% case is suddenly relevant
In academia, many researchers broadened their focus, looking beyond standard adversarial examples. This included me, as I started to believe that adversarial examples are too narrow a problem and often too unrealistic, both in terms of the assumed access to models and in that they address worst-case scenarios, targeting the 0.1% or 0.01% of cases or less.
With the huge interest in and large user bases of recent LLM applications, however, the 0.01% becomes relevant. A few individual users can inflict large damage by adversarially misusing the system or extracting secret information such as prompts or other system parameters. As a result, prompt injections, jailbreaks, and similar attacks are becoming a serious problem for LLM providers. Essentially, robustness, and in particular worst-case robustness, of AI systems is becoming a mainstream problem.
Similar considerations apply to out-of-distribution examples: often unintentional, but sometimes also intentionally weird changes or sequences of tokens. Many jailbreaks or prompt injections make use of weird token sequences with some hidden code or small logical puzzles to confuse LLMs. Anticipating and properly evaluating these cases is key for monitoring usage and understanding vulnerabilities.
Many evaluation problems have robustness characteristics
The first step towards improving robustness and avoiding such vulnerabilities is evaluation. As a researcher working on adversarial robustness, I spent most of my time thinking about how to properly evaluate robustness in different settings. Different threat models lead to different results; one can look at the average case, the worst case, or anything in between; and distribution shifts or training data contamination easily lead to misleading results.
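To make the average- versus worst-case distinction concrete, here is a minimal sketch in Python. The layout of the `correct` matrix is an assumption for illustration, not a specific benchmark format:

```python
import numpy as np

def robustness_metrics(correct: np.ndarray) -> dict:
    """Aggregate per-example correctness under a set of perturbations.

    correct: boolean array of shape (n_examples, n_perturbations),
    where correct[i, j] says whether the model was right on example i
    under perturbation (or attack) j.
    """
    return {
        # Average case: accuracy over all examples and perturbations.
        "average_case": float(correct.mean()),
        # Worst case: an example only counts if the model survives
        # *every* perturbation; a lower bound on robust accuracy.
        "worst_case": float(correct.all(axis=1).mean()),
        # Anything in between, e.g., surviving at least 90% of them.
        "quantile_case": float((correct.mean(axis=1) >= 0.9).mean()),
    }
```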
Moreover, in the age of multimodal LLMs, many evaluation problems actually boil down to robustness problems:
- Factuality is essentially a robustness problem and is often evaluated as such. This involves trying to find cases and domains where LLMs do not respond factually, or figuring out why LLMs respond factually under one way of prompting but not another.
- A large part of safety and red teaming work is about probing the model in a manipulative or adversarial way to understand or provoke edge cases that can then be used for fine-tuning or evaluation.
- Prompting itself is often about robustness. We all came to understand that the language we use in a prompt has a huge impact on the quality of the outcome, even if the semantic content of the prompt stays the same. To some extent this is intended; if I use casual language, I expect a chatbot to respond in casual language. However, in many settings it is unintended. For example, I want to be able to prompt my coding agent with typos and still expect the highest quality output (see the sketch after this list).
- …
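As a toy illustration of that last point, one can measure how stable an output metric is under typo-like perturbations of the same prompt. This is a hedged sketch: `model` and `metric` are placeholders for whatever system and quality score are being evaluated, not a real API:

```python
import random

def add_typos(prompt: str, rate: float = 0.05, seed: int = 0) -> str:
    """Simulate typos by randomly swapping adjacent letters."""
    rng = random.Random(seed)
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def prompt_robustness(model, prompt: str, metric, n_variants: int = 10) -> float:
    """Average quality over typo-perturbed variants of the same prompt.

    model:  callable, prompt -> output (placeholder for the system).
    metric: callable, output -> score in [0, 1] (placeholder).
    """
    scores = [metric(model(add_typos(prompt, seed=s))) for s in range(n_variants)]
    return sum(scores) / len(scores)
```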
What makes these problems hard
Evaluation problems with robustness characteristics can become very difficult in various ways. For example, benchmarks and test sets change rapidly. In safety, this is due to attackers constantly adapting to new models. But even in reasoning or coding tasks, benchmarks are intentionally adapted to avoid leakage into training sets or to better reflect real usage. This quickly introduces unwanted distribution shifts that need to be handled carefully when comparing results. Tracking performance becomes tricky as not only the models and agents change, but also the test sets and metrics.
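One simple mitigation, sketched below under the assumption that examples carry stable IDs across benchmark versions, is to compare runs only on the overlapping examples, so that score changes reflect the model rather than the test-set shift:

```python
# Hypothetical per-example results (example_id -> correct?) from two
# evaluation runs on different versions of the same benchmark.
run_v1 = {"ex1": 1, "ex2": 0, "ex3": 1, "ex5": 1}
run_v2 = {"ex2": 1, "ex3": 1, "ex4": 0, "ex5": 0}

# Restrict the comparison to examples present in both versions.
shared = sorted(run_v1.keys() & run_v2.keys())
acc_v1 = sum(run_v1[k] for k in shared) / len(shared)
acc_v2 = sum(run_v2[k] for k in shared) / len(shared)
print(f"on {len(shared)} shared examples: v1={acc_v1:.2f} v2={acc_v2:.2f}")
```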
Then, we start to consider tens if not hundreds of benchmarks simultaneously. This means metrics need to be aggregated appropriately across test sets of different sizes, difficulties, tasks, languages, and modalities. In classical adversarial robustness research, this was a common problem because we typically looked at multiple attacks. Some of our work on watermarking also considers tens of transformations that we want the watermark to be robust against. However, in academic research it was common to focus on one task at a time, including at most a handful of benchmarks in a single project or paper.
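Even the simplest aggregation choice matters here. Below is a toy sketch with made-up numbers, just to show how pooling all examples (micro averaging) and averaging per benchmark (macro averaging) can disagree:

```python
import numpy as np

# Hypothetical results: (benchmark name, number of examples, accuracy).
results = [("bench_a", 5000, 0.81), ("bench_b", 200, 0.55), ("bench_c", 1200, 0.73)]

sizes = np.array([n for _, n, _ in results], dtype=float)
accs = np.array([a for _, _, a in results])

# Micro average: pool all examples, so large benchmarks dominate.
micro = float((sizes * accs).sum() / sizes.sum())
# Macro average: every benchmark counts equally, regardless of size.
macro = float(accs.mean())
print(f"micro={micro:.3f} macro={macro:.3f}")  # 0.787 vs. 0.697 here.
```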
Evaluations also include more and more sources of randomness. From a statistical standpoint, we typically assumed the model to be deterministic, so that evaluation is about estimating the expectation of a metric over the distribution of input examples. However, in state-of-the-art LLM applications, models quickly become non-deterministic simply due to size and hardware. Then, we need to sample to generate outputs. Inputs also become more complex, often including system instructions, task instructions, few-shot examples, and the actual input example. Finally, in agentic tasks, trajectories need to be unrolled. To estimate a metric, strictly speaking, we need to account for all of these different sources of randomness.
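One way to account for at least two of these sources, assuming we have sampled several generations per input, is a two-level bootstrap over inputs and generations. This is a sketch under that assumption; the shape of the `scores` array is my own illustrative convention:

```python
import numpy as np

def estimate_metric(scores: np.ndarray, n_boot: int = 1000, seed: int = 0):
    """Point estimate plus a 95% bootstrap interval for a mean metric.

    scores: array of shape (n_inputs, n_samples), holding the metric
    for each of n_samples stochastic generations per input example.
    """
    rng = np.random.default_rng(seed)
    n_inputs, n_samples = scores.shape
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Resample input examples (data randomness) ...
        rows = rng.integers(0, n_inputs, size=n_inputs)
        # ... and resample generations per input (decoding randomness).
        cols = rng.integers(0, n_samples, size=(n_inputs, n_samples))
        estimates[b] = scores[rows[:, None], cols].mean()
    return float(scores.mean()), np.percentile(estimates, [2.5, 97.5])
```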
Finally, obtaining dense ground truth is becoming more difficult. As a result, we resort to auto-raters more often. This, however, introduces biases that are often difficult to understand but may be problematic at large scale. The ground truth we do have is becoming more nuanced and subjective as we tackle more challenging tasks. Moreover, ground truth is becoming more sparse, such as only the final result of a reasoning or agent task rather than ground-truth trajectories.
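A common sanity check, sketched here under the assumption that a small human-labeled audit set is available, is to measure how an auto-rater deviates from human judgments before trusting it at scale:

```python
import numpy as np

def audit_auto_rater(auto: np.ndarray, human: np.ndarray) -> dict:
    """Compare auto-rater scores to human labels on a small audit set.

    auto, human: arrays of scores in [0, 1] for the same examples.
    """
    diff = auto - human
    return {
        # Systematic over- or under-rating relative to humans.
        "mean_bias": float(diff.mean()),
        # Spread of disagreement, beyond the systematic bias.
        "std_of_diff": float(diff.std()),
        # Agreement when both are thresholded into pass/fail.
        "binary_agreement": float(((auto >= 0.5) == (human >= 0.5)).mean()),
    }
```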
It will get more difficult
With agentic applications, especially agentic coding, on the rise, I expect evaluation to become the main bottleneck. This will not only be true for modeling (pre-, mid-, and post-training) but also for developing the right agentic harnesses. Evaluations will become more integrated, depending on sandboxes for tool calls and function calling, and evaluating such systems will be more similar to testing complex software. This means larger teams will need to work on evaluations, and a bigger portion of the academic community will focus on more and more complex benchmarks.