How do you analyze the relative “strength” of probes? [R]

This question concerns probing and “circuit” analyses of language models (including multimodal ones). Something related may come up in my work (factuality guarantees for model outputs), and I’m trying to orient myself to the SoTA.

I found this old post on trying to deduce, for instance, whether a Transformer-based model “knows” which word a token is in. Even in this simple example, I noticed some meaningful problems (detailed in footnote 1 so as not to derail my question), and I’ve heard that circuit research is pretty fraught.

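The post describes training a logistic regression classifier as the probe. What I’m curious about is how you balance the capacity of the probe against that of the underlying network: if the probe is too expressive, high accuracy may reflect what the probe learned rather than what the network represents.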
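To make the capacity question concrete, here is a minimal sketch of the kind of comparison I have in mind: fit the same logistic regression probe once on the real labels and once on shuffled labels, then compare accuracies. Everything here is an assumption for illustration: the "activations" are synthetic Gaussian vectors rather than real model activations, the label is linearly decodable by construction, and `make_data`, `fit_linear_probe`, and `probe_vs_control` are hypothetical helper names. The shuffled-label baseline is a deliberately crude control, not a method taken from the post.

```python
import math
import random


def make_data(n, dim, rng):
    """Synthetic stand-in for (activation, label) pairs, NOT real model activations."""
    w_true = [rng.gauss(0, 1) for _ in range(dim)]
    xs, ys = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(dim)]
        xs.append(x)
        # The label is linearly decodable from x by construction.
        ys.append(1 if sum(w * v for w, v in zip(w_true, x)) > 0 else 0)
    return xs, ys


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def fit_linear_probe(xs, ys, epochs=100, lr=0.5):
    """Logistic regression fitted by full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, xs, ys):
    hits = 0
    for x, y in zip(xs, ys):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        hits += pred == y
    return hits / len(xs)


def probe_vs_control(seed=0, n=600, dim=8):
    """Fit the same probe on real labels and on shuffled (control) labels."""
    rng = random.Random(seed)
    xs, ys = make_data(n, dim, rng)
    control = ys[:]
    rng.shuffle(control)  # destroys any relationship between x and the label
    split = (2 * n) // 3
    results = {}
    for name, labels in (("real", ys), ("control", control)):
        w, b = fit_linear_probe(xs[:split], labels[:split])
        results[name] = {
            "train": accuracy(w, b, xs[:split], labels[:split]),
            "test": accuracy(w, b, xs[split:], labels[split:]),
        }
    return results


if __name__ == "__main__":
    for name, scores in probe_vs_control().items():
        print(name, scores)
```

The idea is that training accuracy on the shuffled labels gives a rough sense of how much the probe can fit on its own, so a more expressive probe would be expected to narrow the gap between the "real" and "control" rows. Whether that gap can be turned into an actual guarantee is exactly what I'm asking.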

Specifically, I would like to know:

  • Is there theory that grounds inquiries into “what you can learn” in concrete terms? (Perhaps in terms of provable guarantees about overfitting? Or are there Nyquist-type guarantees about sampling, based on the frequencies of patterns in language corpora – i.e., can we say we’ve “seen enough data” to know the network can reliably do something in all cases?)
  • Has any existing work tried to label the “difficulty” of examples? (Perhaps by training an ensemble of models and looking at per-example accuracy across them. I realize bootstrapping is insanely expensive for language models due to training costs.)
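For the second bullet, this is roughly the ensembling idea I mean, as a minimal sketch under loud assumptions: the data are synthetic 2-D points from two overlapping Gaussian blobs, a nearest-centroid classifier stands in for an actual trained model, and `make_examples`, `fit_centroids`, and `difficulty_scores` are hypothetical names. Each ensemble member is fitted on a bootstrap resample, and an example's "difficulty" is the fraction of members that misclassify it.

```python
import random


def make_examples(n, rng):
    """Hypothetical 2-D features with binary labels from two overlapping Gaussian blobs."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        center = 1.0 if y == 1 else -1.0
        data.append(((rng.gauss(center, 1.0), rng.gauss(0.0, 1.0)), y))
    return data


def fit_centroids(sample):
    """Nearest-centroid 'model': one mean vector per class present in the sample."""
    sums = {}
    for (x0, x1), y in sample:
        s = sums.setdefault(y, [0.0, 0.0, 0])
        s[0] += x0
        s[1] += x1
        s[2] += 1
    return {y: (s[0] / s[2], s[1] / s[2]) for y, s in sums.items()}


def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda y: (x[0] - centroids[y][0]) ** 2 + (x[1] - centroids[y][1]) ** 2,
    )


def difficulty_scores(train, heldout, n_models=25, seed=0):
    """Difficulty of each held-out example = fraction of ensemble members that get it wrong."""
    rng = random.Random(seed)
    wrong = [0] * len(heldout)
    for _ in range(n_models):
        resample = [rng.choice(train) for _ in train]  # bootstrap resample
        centroids = fit_centroids(resample)
        for i, (x, y) in enumerate(heldout):
            wrong[i] += predict(centroids, x) != y
    return [w / n_models for w in wrong]


if __name__ == "__main__":
    rng = random.Random(0)
    train, heldout = make_examples(300, rng), make_examples(200, rng)
    scores = difficulty_scores(train, heldout)
    # Show the five held-out examples the ensemble most often gets wrong.
    for score, (x, y) in sorted(zip(scores, heldout), reverse=True)[:5]:
        print(round(score, 2), x, y)
```

With a cheap model this is trivial to run; the open question is whether anything comparable is feasible, or has been approximated, when each ensemble member is a language model.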

  1. Problems – first of all, the number of possible words is so small that I suspect performance looks unrepresentatively good. The classifier seems to gain in performance for words 5/6 after weakening, but that might just be learning “all sufficiently ‘extreme’ tokens should be words 5 or 6.” For another, despite the claim advanced in the article (Nanda concludes the network essentially does learn positions), I happen to have screenshots from recently playing with Google Gemini and asking it how many “r”s and other letters are in “Google”. Not only did it answer incorrectly – it claimed 1 – but more worryingly, it spelled out G-o-o-g-l-e while answering. This undercuts the hypothesis that “it’s incapable of learning exactly how to decompose tokens, so this question was unfair from a model capacity standpoint”: the model decomposed the token correctly and *still* gave an incorrect answer!

submitted by /u/RepresentativeBee600