[R] I probed 6 open-weight LLMs (7B-9B) for “personality” using hidden states — instruct fine-tuning is associated with measurable behavioral constraints
LLMs have consistent response styles even without a system prompt. I measure these “behavioral fingerprints” by projecting hidden states onto contrastive axes and find that instruct fine-tuning is associated with reduced steerability on specific axes. (“Personality” = stable response style, not human-like inner states.)
Code: github.com/yunoshev/mood-axis | Which models should I test next? Currently limited to 7-9B. Details below.

Extended discussion on r/LocalLLaMA: original post

# Key Results

## 1. Distinct fingerprints

Each model’s default profile across 7 axes. No system prompt. Values = hidden-state projections normalized by calibration IQR.
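The core measurement can be sketched in a few lines of numpy. This is synthetic data, and details like mean-pooling, median-centering, and the exact normalization are my assumptions — the repo's actual pipeline may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for pooled hidden states (hidden size 64).
# pos/neg: responses generated under opposite calibration prompts
# (e.g. "be warm" vs "be cold"); synthetic here.
pos = rng.normal(0.5, 1.0, size=(30, 64))
neg = rng.normal(-0.5, 1.0, size=(30, 64))

# Contrastive axis: difference of pole means, unit-normalized.
axis = pos.mean(0) - neg.mean(0)
axis /= np.linalg.norm(axis)

# Project calibration states to get a per-axis center and scale.
cal_proj = np.concatenate([pos, neg]) @ axis
q25, q75 = np.percentile(cal_proj, [25, 75])
iqr = q75 - q25
center = np.median(cal_proj)

def fingerprint(hidden_state: np.ndarray) -> float:
    """IQR-normalized projection of one response's pooled hidden state."""
    return float((hidden_state @ axis - center) / iqr)

# A baseline response (no system prompt) typically lands between the poles.
print(round(fingerprint(rng.normal(0.0, 1.0, 64)), 2))
```

Positive values indicate the first pole, negative the second; the IQR normalization makes values comparable across axes with different raw projection scales.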
## 2. Instruct models show reduced behavioral dimensionality

**Observation.** PCA on baseline projection matrices reveals a spectrum of behavioral dimensionality. Gemma 2 9B IT shows the highest concentration (PC1 = 87.9%), likely driven by variable response length rather than behavioral collapse. Axis vectors are geometrically near-orthogonal (low |cos|) but projections are behaviorally correlated (higher |r|).

**Interpretation.** This gap is consistent with fine-tuning constraining how models utilize their representation capacity — but alternative explanations exist: inherent semantic correlations between axes, SFT data distribution, chat template effects, or decoding strategy could all contribute. We observe the pattern across 6 models from 5 organizations, but cannot isolate which component of the instruct pipeline drives it.

**Length confound control.** Response length could drive spurious axis correlations. I computed per-model Pearson r between n_tokens and each axis projection across 30 baseline questions. Result: 6/7 axes are clean (mean |r| < 0.3 across models). Only verbose/concise is partially confounded (mean r = 0.50), which is expected — longer responses literally are more verbose. Cross-axis correlations drop only −7.7% after regressing out length, confirming behavioral bundling is not a length artifact.
Base versions of 5 models (Llama, Yi, Qwen, Mistral, Gemma) show higher variability on most axes than their instruct counterparts. Most extreme: verbose/concise std ratio = 0.13 (87% lower in instruct). All 5 organizations show the same direction, though this is observational — base and instruct models differ in many ways beyond alignment. Gemma base can’t distinguish empathetic/analytical or formal/casual at all (50% accuracy = chance), but the instruct version does — suggesting these particular axes may reflect distinctions introduced during fine-tuning rather than suppressed by it.

[IMAGE: pca_calibration_contrast — PCA scatter, Qwen vs Yi] PCA of calibration hidden states. Left: Qwen 2.5 7B (d’ = 5.0–12.0) — diverse axis directions, poles clearly separated. Right: Yi 1.5 9B (d’ = 2.2–5.4) — lower separability but all axes still discriminate.

## 3. Dead zones and the ICC dissociation

I introduce a composite Dead Zone Severity metric (0 = healthy, 1 = dead) combining calibration accuracy (30%), d’ (30%), stability cosine (20%), and baseline SNR (20%). The weights are heuristic — I chose them to balance discrimination, stability, and effect size, but other weightings could shift individual model rankings. Three dead zone types: hard (fine-tuning suppresses differentiation), soft (unstable across calibration sets), and asymmetric (model follows instructions in only one direction — e.g., Llama achieves 100% for “be concise” but 0% for “be verbose”).

An interesting pattern is the dissociation between reliability and validity: mean ICC (test-retest, 5 seeds) is 0.91–0.99 across models, and all 42 model-axis pairs exceed 0.75 — but Llama’s benchmark pass rate is 60%. This is partly expected (a model that always outputs neutral will have high ICC and low benchmark scores), but the degree of dissociation varies across models, suggesting it captures something beyond trivial low-variance cases.
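A toy implementation of the severity composite: the post specifies only the weights (30/30/20/20), so the mapping of each component to a [0, 1] health score — the clipping thresholds below — is entirely my guess:

```python
import numpy as np

def dead_zone_severity(acc: float, d_prime: float,
                       stability_cos: float, baseline_snr: float,
                       weights=(0.3, 0.3, 0.2, 0.2)) -> float:
    """Composite severity in [0, 1]; 0 = healthy axis, 1 = dead.

    Each component is squashed to a 'health' score in [0, 1] and the
    weighted health is inverted. Thresholds are illustrative.
    """
    acc_h = np.clip((acc - 0.5) / 0.5, 0, 1)   # chance (0.5) maps to 0
    d_h = np.clip(d_prime / 5.0, 0, 1)         # d' >= 5 counts as fully healthy
    stab_h = np.clip(stability_cos, 0, 1)      # cosine across calibration sets
    snr_h = np.clip(baseline_snr / 3.0, 0, 1)  # baseline signal-to-noise
    health = float(np.dot(weights, [acc_h, d_h, stab_h, snr_h]))
    return 1.0 - health

# Healthy axis (Qwen-like numbers): severity near 0.
print(round(dead_zone_severity(1.0, 6.0, 0.98, 3.5), 2))
# Hard dead zone (chance accuracy, low d'): severity near 1.
print(round(dead_zone_severity(0.5, 0.5, 0.3, 0.4), 2))
```

Because the squashing is monotone in every component, changing the weights reorders borderline axes but leaves clearly healthy and clearly dead axes at the extremes — consistent with the caveat that other weightings could shift individual rankings.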
**Text-level validation.** I computed text-level compliance metrics (token count, hedging markers, emotion words) between opposite calibration poles across all 6 models × 7 axes. Spearman correlation between calibration accuracy and text-level effect size (Cohen’s d): r = 0.47, p = 0.002 (n = 42). Caveat: text metrics and hidden states are not fully independent — both are derived from the same generated text, so this correlation partly reflects consistency between two views of the same data rather than independent validation. Still, it confirms dead zones manifest in observable text, not just internal representations.

**External validation (Claude Opus 4.6 as independent judge).** To address the circularity concern above, I had Claude Opus rate 48 baseline responses (8 per model, no system prompt) on all 7 axes using a −2 to +2 scale, based only on text — no access to hidden states or knowledge of our measurement method. Per-axis Spearman correlations with hidden-state projections:
3/7 axes reach p < 0.05, with 2 robust under bootstrap (warm/cold and formal/casual: 95% CI excludes 0). Pooled r = 0.38 [0.29, 0.47 bootstrap 95% CI]. Leave-one-model-out: pooled r ranges from +0.30 to +0.58 — no single model drives the result.

The negative correlation on proactive_reluctant is informative: it’s driven by Llama (a dead zone — hidden states say “reluctant” while the text is structured and proactive) and DeepSeek (a ceiling — projections saturate at +1.00 while Claude sees neutral text). This is exactly the dead zone phenomenon: hidden-state projections and observable text diverge on constrained axes. verbose_concise shows no correlation — Claude rates “verbosity” qualitatively, while our projection tracks length-correlated hidden-state variation.

A prompt robustness test (5 formulations × 3 models × 3 axes) confirms dead zones persist across phrasings.

# Method (4 steps)
The config was chosen for cross-model robustness via a 150+ configuration ablation (layer selection × token aggregation × weighting). It is not optimal per-model, but it is the only config that achieves 85–100% on all 5 ablated models.
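One of the ablated choices, decay-weighted token aggregation, is referenced again in the thinking-mode follow-up ("only the last ~50 tokens of each segment contribute"). A minimal numpy sketch — the half-life parameter and exponential form are my assumptions, chosen so the last ~50 tokens carry most of the mass:

```python
import numpy as np

def decay_weighted_pool(hidden_states: np.ndarray,
                        half_life: float = 15.0) -> np.ndarray:
    """Pool per-token hidden states with exponential recency weighting.

    hidden_states: (n_tokens, hidden_dim). Later tokens dominate: with
    half_life ~15, roughly the last 50 tokens carry ~90% of the weight.
    The exact weighting scheme in the repo may differ.
    """
    n = hidden_states.shape[0]
    ages = np.arange(n - 1, -1, -1)        # final token has age 0
    w = 0.5 ** (ages / half_life)          # halve weight every half_life tokens
    w /= w.sum()
    return w @ hidden_states

rng = np.random.default_rng(2)
h = rng.normal(size=(200, 8))              # toy per-token states
pooled = decay_weighted_pool(h)
print(pooled.shape)                        # (8,)
```

This kind of recency weighting is one plausible reason the verbose/concise axis tracks length: the pooled vector reflects how a response ends, which differs systematically between long and short generations.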
# Limitations
More details in the repo README: conflict drift (20 scenarios × 12 turns), cross-axis correlations, full methodology.

# Follow-up: Phi-4, Qwen3, and Thinking Mode

After posting this work on r/LocalLLaMA, several people asked about newer models. I ran a shortened pipeline (calibration + baseline + benchmark, no drift/stability) on two additional models in ~30 min on 2×H100 (~$6).

## Phi-4 (Microsoft, 14B) — first model outside the 7–9B range

The most extreme cautious/reluctant profile in the entire set: cold (−0.51), highly cautious (−0.85), strongly reluctant (−0.93). Polar opposite of DeepSeek on the confidence and proactivity axes. Verbose/concise is in a dead zone (+0.01). Benchmark: 3/9 — Phi-4 can only decrease along axes (be cold, be cautious, be concise) but fails to shift in the positive direction, suggesting a strong “conservative” alignment prior.

## Qwen3-8B vs Qwen 2.5 7B — generational fingerprint shift

Same family, one generation apart. Two axes invert: confident/cautious flips from −0.36 to +0.38 (Δ = +0.74), and formal/casual flips from +0.42 to −0.26 (Δ = −0.67). Proactive/reluctant stays identical (+0.47 → +0.45). Qwen3 achieves the highest benchmark pass rate in the full set (7/9). Behavioral fingerprints are not stable across model generations, but some axes are more persistent than others within a family.

## Thinking vs non-thinking mode (Qwen3-8B)

Same weights, same calibration axes — the only difference is whether thinking mode is enabled. Control experiment (max_new_tokens=4096, n=10, 100% visible responses): comparing the visible response after thinking vs the non-thinking response on the same questions.
The original confidence drop reverses sign when properly controlled — thinking mode makes the model more confident, not less. The largest genuine shifts are on proactivity (less proactive) and verbosity (less verbose after thinking). This demonstrates the importance of separating the thinking segment from the visible response when measuring such shifts. Caveats: n=10 (PoC subset), single model, and decay-weighted aggregation means only the last ~50 tokens of each segment contribute to projections.

# Reproducing
Pre-computed axes included — measure any model’s fingerprint without re-running calibration.

What I’d love feedback on:
P.S. I have a full paper version (LaTeX, ~20 pages with methodology, ablations, reproducibility details). Do you think this is worth putting on arXiv? If so, I’d be grateful for an endorsement for cs.CL or cs.LG — happy to share the draft via DM.

submitted by /u/yunoshev