Contrastive Decoding Diffing (CDD): recovering verbatim finetuning data from logits alone, no weight access needed[R]
We built a model diffing method that recovers verbatim content from narrowly finetuned LLMs using only grey-box logit access (no weights, no activations, no probe corpus).
Recent work (Minder, Dumas et al., “Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences”) showed that finetuning leaves detectable traces in activation differences between base and finetuned models. Their method, Activation Difference Lens (ADL), steers generation using these differences, but it’s whitebox (needs full weight access) and only recovers a vague, domain-level description of what the finetuning was about.
We introduce Contrastive Decoding Diffing (CDD), the output-level analog. Instead of steering with activation differences, we contrast the base and finetuned model’s logits directly. A single default configuration, no per-organism calibration, no layer selection, achieves a verbatim recovery score of 4+/5 on 19/20 organism x model pairs across four model families (1B to 32B params) on the SDF benchmark. ADL never exceeds 3/5 on the same benchmark, despite requiring full weight access.
One unplanned finding: across four semantically unrelated finetuning domains (fake FDA drug approval, fake baking protocols, fake Roman concrete research), the same fictional persona kept showing up in the recovered text: “Dr. Elena Rodriguez.” Turns out this is a name Claude Sonnet 3.6 disproportionately favors when asked to generate a fictional scientist for synthetic data generation, so it got baked into every finetune that used LLM-generated training data, and CDD pulled it back out. We wrote up this specific finding on its own a few weeks back if you want the more accessible version first: ghost couple
Paper: paper
Code: code
submitted by /u/CebulkaZapiekana
[link] [comments]