What is the scientific value of administering the standard Rorschach test to LLMs when the training data is almost certainly contaminated? (R) + [D]
A recent paper published in JMIR Mental Health (Csigó & Cserey, 2026) caught my attention. The researchers administered the 10 standard Rorschach inkblot cards to three multimodal LLMs (GPT-4o, Grok 3, Gemini 2.0) and coded their responses using the Exner Comprehensive System. They analyzed the models’ “perceptual styles,” determinants (like human movement vs. color), and human-related content themes.
However, I am seriously struggling to understand the methodological validity of this setup, and I’m curious what the scientific community thinks. My main concerns are:
Massive Data Contamination: The 10 standard Rorschach cards, along with decades of psychological literature, scoring manuals (like the Exner system), and typical human responses, are widely available on the internet. It is highly probable that this material was in the models’ training data and is therefore already reflected in their weights.
Testing Retrieval, Not Perception: Because they used the standard, century-old inkblots instead of novel, AI-generated, or strictly controlled ambiguous images, aren’t they just testing the models’ ability to retrieve the most statistically probable lexical associations for those specific images from their training data?
Lack of Controls: As I understand it from the paper, the researchers used the public web interfaces with default settings (no API, no temperature control) and seemingly ran the test only once per model, which yields a tiny sample.
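On the novel-stimuli point, a control set seems cheap to build. Below is a minimal sketch of my own (not anything from the paper) that generates bilaterally symmetric, inkblot-like binary images from a seeded random walk and serializes them as plain-text PBM. The function names, grid size, and step count are illustrative assumptions, and a random walk is only one of many ways to get an ambiguous symmetric shape.

```python
import random


def make_inkblot(size=32, steps=400, seed=0):
    """Return a size x size grid of 0/1 values with left-right mirror symmetry.

    Hypothetical helper: a seeded random walk fills cells on the left half,
    and every filled cell is mirrored onto the right half.
    """
    rng = random.Random(seed)
    half = size // 2
    grid = [[0] * size for _ in range(size)]
    r, c = size // 2, half - 1  # start next to the vertical midline
    for _ in range(steps):
        grid[r][c] = 1
        grid[r][size - 1 - c] = 1  # mirrored cell
        dr, dc = rng.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
        # Clamp the walk to the grid rows and to the left half's columns.
        r = min(max(r + dr, 0), size - 1)
        c = min(max(c + dc, 0), half - 1)
    return grid


def is_symmetric(grid):
    """True if every row reads the same left-to-right and right-to-left."""
    return all(row == row[::-1] for row in grid)


def to_pbm(grid):
    """Serialize the grid as a plain-text PBM (P1) image; 1 means black."""
    rows = [" ".join(str(v) for v in row) for row in grid]
    return "P1\n{} {}\n{}\n".format(len(grid[0]), len(grid), "\n".join(rows))
```

Calling `to_pbm(make_inkblot(seed=k))` for a range of seeds gives reproducible stimuli that are very unlikely to appear in any training corpus, so responses to them could be compared with responses to the ten standard cards, ideally over repeated runs per image.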
Ironically, the authors explicitly admit in their “Limitations” section that the models likely encountered the stimuli and scoring concepts during training, which could influence outputs independently of any image understanding.
So, methodologically, what is the actual scientific value of running projective psychological tests on LLMs without using novel stimuli to at least try to rule out data contamination? Given how LLMs work, does a study like this tell us anything meaningful about how AI processes visual ambiguity, or is it merely demonstrating advanced pattern matching and text completion based on widely known psychometric data? And how do studies with such glaring methodological loopholes around training data contamination make it through peer review in decent journals?
Maybe I’m being a little too critical here; I just wanted to be a bit provocative. Here is the study: https://mental.jmir.org/2026/1/e88186?fbclid=IwY2xjawRd27dleHRuA2FlbQIxMQBzcnRjBmFwcF9pZBAyMjIwMzkxNzg4MjAwODkyAAEe-wkKP6fKZRmAAuNvtN6BjknolIGcfTGu0-cLFs6CC49kZ1gcR6ccdcaRiWA_aem_7hHg5G96xjDZ-04YlSs1Ew
submitted by /u/Impossible_Echo4029