How do you analyze the relative “strength” of probes? [R]

digitado ⋅ 17 June 2026

This question is related to topics like language+ models (including multimodal) and things like “circuit” analyses. I think something related might come up in my work (factuality guarantees for model outputs) and I’m trying to orient to the SoTA.

I found this old post that tries to deduce, for instance, whether a Transformer-based model “knows” which word a token is in. Even in this simple example, I noticed some meaningful problems (detailed in a footnote¹ so as not to derail my question), and I’ve heard that circuit research is pretty fraught.

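The post says it trained a logistic regression classifier as the probe. What I’m curious about is: how do you balance the capacity of the probe against that of the underlying network, so that good probe accuracy reflects what the network encodes rather than what the probe learned on its own?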
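To make the "capacity" worry concrete, here is a minimal sketch of the kind of check I have in mind: fit the same linear probe once on real labels and once on random labels, then compare. This is a random-label baseline in the spirit of control tasks, not the method from the post. Everything here is an assumption for illustration: the activations are synthetic Gaussians standing in for frozen hidden states, and `fit_linear_probe`, `accuracy`, and the sizes are hypothetical names and values.

```python
import numpy as np

def fit_linear_probe(X, y, steps=500, lr=0.5, l2=1e-3):
    """Logistic regression trained by plain gradient descent; returns (w, b)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(y = 1)
        g = p - y                               # gradient of the log-loss w.r.t. the logits
        w -= lr * (X.T @ g / n + l2 * w)
        b -= lr * g.mean()
    return w, b

def accuracy(X, y, w, b):
    return float((((X @ w + b) > 0).astype(int) == y).mean())

rng = np.random.default_rng(0)
n, d = 600, 16
X = rng.normal(size=(n, d))                # ASSUMPTION: stand-in for frozen activations
true_dir = rng.normal(size=d)
y_real = (X @ true_dir > 0).astype(int)    # a property that is linearly decodable by construction
y_ctrl = rng.integers(0, 2, size=n)        # random labels: nothing to decode

train, test = slice(0, 400), slice(400, n)

w, b = fit_linear_probe(X[train], y_real[train])
real_acc = accuracy(X[test], y_real[test], w, b)

w, b = fit_linear_probe(X[train], y_ctrl[train])
ctrl_train_acc = accuracy(X[train], y_ctrl[train], w, b)  # how much the probe can memorize
ctrl_test_acc = accuracy(X[test], y_ctrl[test], w, b)     # should sit near chance

print(f"real held-out: {real_acc:.2f}")
print(f"control train: {ctrl_train_acc:.2f}, control held-out: {ctrl_test_acc:.2f}")
```

The gap between `real_acc` and the control numbers is what I would want to stay large as the probe gets more expressive. If a bigger probe pushes `ctrl_train_acc` toward 1.0, its accuracy says more about the probe than about the network.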

Specifically, I would like to know:

- Is there theory that grounds inquiries of “what you can learn” in concrete terms? (Perhaps in terms of provable guarantees about overfitting? Or are there Nyquist-type guarantees available about sampling based on frequencies of patterns in language corpora, i.e., can we say we’ve “seen enough data” to know the network can reliably do something in all cases?)
- Has any of the existing work tried to label the “difficulty” of examples? (Perhaps by training an ensemble of models and looking at how many of them get each example right. I realize bootstrapping is insanely expensive for language models due to training costs.)
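For the second question, a cheaper variant than bootstrapping the language model would be to bootstrap only the probe on frozen activations and score each example by how often out-of-bag probes get it wrong. The sketch below is a rough illustration under loud assumptions: the data is synthetic, the "difficulty" score is my own ad hoc definition, and `fit_linear_probe` and `n_probes` are hypothetical names, not something taken from existing work.

```python
import numpy as np

def fit_linear_probe(X, y, steps=300, lr=0.5, l2=1e-3):
    """Logistic regression trained by plain gradient descent; returns (w, b)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * (X.T @ g / n + l2 * w)
        b -= lr * g.mean()
    return w, b

rng = np.random.default_rng(0)
n, d = 300, 8
X = rng.normal(size=(n, d))          # ASSUMPTION: stand-in for frozen activations
margin = X @ rng.normal(size=d)      # signed distance-like score to the true boundary
y = (margin > 0).astype(int)

n_probes = 30
correct = np.zeros(n)                # out-of-bag correct predictions per example
counted = np.zeros(n)                # out-of-bag evaluations per example

for _ in range(n_probes):
    idx = rng.integers(0, n, size=n)            # bootstrap resample (with replacement)
    oob = np.setdiff1d(np.arange(n), idx)       # examples this probe never saw
    w, b = fit_linear_probe(X[idx], y[idx])
    pred = ((X[oob] @ w + b) > 0).astype(int)
    correct[oob] += (pred == y[oob])
    counted[oob] += 1

seen = counted > 0                               # examples scored at least once
difficulty = 1.0 - correct[seen] / counted[seen] # fraction of out-of-bag probes that were wrong

# Sanity check: examples near the true boundary should look harder on average.
abs_margin = np.abs(margin[seen])
lo, hi = np.quantile(abs_margin, [0.25, 0.75])
near_mean = float(difficulty[abs_margin <= lo].mean())
far_mean = float(difficulty[abs_margin >= hi].mean())
print(f"mean difficulty near boundary: {near_mean:.3f}, far from boundary: {far_mean:.3f}")
```

This only measures difficulty for the probe given fixed activations, not difficulty for the underlying network, so it sidesteps the training cost rather than answering the original question.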

¹ Problems: first of all, the number of possible words is so small that I suspect performance looks unrepresentatively good. The classifier seems to gain in performance for words 5/6 after weakening, but that might just be learning “all sufficiently ‘extreme’ tokens should be words 5 or 6.” For another, despite the claim advanced in the article (Nanda concludes the network essentially does learn positions), I happen to have screenshots from recently playing with Google Gemini and asking it how many “r”s and other letters are in “Google”. Not only did it answer incorrectly (it claimed 1), but more worryingly, it spelled out G-o-o-g-l-e in answering. That undercuts the hypothesis that “it’s incapable of learning exactly how to decompose tokens, so this question was unfair from a model capacity standpoint”: it decomposed the word correctly and *still* gave an incorrect answer!

submitted by /u/RepresentativeBee600