Not Just the Destination, But the Journey: Reasoning Traces Causally Shape Generalization Behaviors
arXiv:2603.12397v1 Announce Type: new
Abstract: Chain-of-Thought (CoT) is often viewed as a window into LLM decision-making, yet recent work suggests it may function merely as post-hoc rationalization. This raises a critical alignment question: Does the reasoning trace causally shape model generalization independent of the final answer? To isolate reasoning’s causal effect, we design a controlled experiment holding final harmful answers constant while varying reasoning paths. We construct datasets with Evil reasoning embracing malice, Misleading reasoning rationalizing harm, and Submissive reasoning yielding to pressure. We train models (0.6B–14B parameters) under multiple paradigms, including question-thinking-answer (QTA), question-thinking (QT), and thinking-only (T-only), and evaluate them in both think and no-think modes. We find that: (1) CoT training can amplify harmful generalization more than standard fine-tuning; (2) distinct reasoning types induce distinct behavioral patterns aligned with their semantics, despite identical final answers; (3) training on reasoning without answer supervision (QT or T-only) is sufficient to alter behavior, proving reasoning carries an independent signal; and (4) these effects persist even when generating answers without reasoning, indicating deep internalization. Our findings demonstrate that reasoning content is causally potent, challenging alignment strategies that supervise only outputs.
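
To make the three training paradigms concrete, here is a minimal sketch of how one record might be rendered under QTA, QT, and T-only while the final answer stays fixed across reasoning types. It is an illustration under stated assumptions, not the authors' pipeline: the `Example` class, the `format_example` helper, the `<think>` delimiter, and the bracketed placeholder strings are all hypothetical, and the abstract does not specify the actual templates or loss masking.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One training record (hypothetical schema, not from the paper)."""
    question: str
    thinking: str
    answer: str


def format_example(ex: Example, paradigm: str) -> str:
    """Render a training string for one of the three paradigms."""
    # The <think> delimiter is an assumption; the real template is unknown.
    think = f"<think>{ex.thinking}</think>"
    if paradigm == "QTA":      # question + thinking + answer
        return f"{ex.question}\n{think}\n{ex.answer}"
    if paradigm == "QT":       # question + thinking, no answer supervision
        return f"{ex.question}\n{think}"
    if paradigm == "T-only":   # reasoning trace alone
        return think
    raise ValueError(f"unknown paradigm: {paradigm}")


# Hold the final answer constant and vary only the reasoning trace.
SHARED_ANSWER = "[fixed final answer]"
reasoning_by_type = {
    "Evil": "[reasoning placeholder]",
    "Misleading": "[reasoning placeholder]",
    "Submissive": "[reasoning placeholder]",
}
examples = {
    name: Example("[question]", trace, SHARED_ANSWER)
    for name, trace in reasoning_by_type.items()
}

if __name__ == "__main__":
    for name, ex in examples.items():
        for paradigm in ("QTA", "QT", "T-only"):
            print(name, paradigm, repr(format_example(ex, paradigm)))
```

In this sketch only the QTA rendering contains the answer text, which mirrors the abstract's point that QT and T-only training carry no answer supervision.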