Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
arXiv:2603.03332v1 Announce Type: new
Abstract: Chain-of-Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of 5 CoT perturbation types: textit{MathError, UnitConversion, Sycophancy, SkippedSteps,} and textit{ExtraSteps}. We evaluate 13 models spanning three orders of magnitude in parameter count (3B to 1.5Tfootnote{Assumed parameter count of closed models}), testing their ability to complete mathematical reasoning tasks despite perturbations injected at different points in the reasoning chain. Our key findings reveal heterogeneous vulnerability patterns: MathError perturbations produce the most severe degradation in small models (50-60% accuracy loss) but show strong scaling benefits; UnitConversion remains challenging across all scales (20-30% loss even for largest models); ExtraSteps incur minimal accuracy degradation (0-6%) regardless of scale; Sycophancy produces modest effects (7% loss for small models); and SkippedSteps cause intermediate damage (15% loss). Scaling relationships follow power-law patterns, with model size serving as a protective factor against some perturbations but offering limited defense against dimensional reasoning tasks. These findings have direct implications for deploying LLMs in multi-stage reasoning pipelines and underscore the necessity of task-specific robustness assessments and mitigation strategies. The code and results are available href{https://github.com/Mystic-Slice/CoTPerturbation}{here}.