Large Language Models Predict Functional Outcomes after Acute Ischemic Stroke
arXiv:2602.10119v1 Announce Type: new
Abstract: Accurate prediction of functional outcomes after acute ischemic stroke can inform clinical decision-making and resource allocation. Prior work on modified Rankin Scale (mRS) prediction has relied primarily on structured variables (e.g., age, NIHSS) and conventional machine learning. The ability of large language models (LLMs) to infer future mRS scores directly from routine admission notes remains largely unexplored. We evaluated encoder-based (BERT, NYUTron) and generative (Llama-3.1-8B, MedGemma-4B) language models, in both frozen and fine-tuned settings, for discharge and 90-day mRS prediction using a large, real-world stroke registry. The discharge outcome dataset included 9,485 History and Physical notes and the 90-day outcome dataset included 1,898 notes, both drawn from the NYU Langone Get With The Guidelines-Stroke registry (2016-2025). Data were split temporally, with the most recent 12 months held out for testing. Performance was assessed using exact (7-class) mRS accuracy and binary functional-outcome (mRS 0-2 vs. 3-6) accuracy, and was compared against established structured-data baselines incorporating NIHSS and age. Fine-tuned Llama achieved the highest performance, with 90-day exact mRS accuracy of 33.9% [95% CI, 27.9-39.9%] and binary accuracy of 76.3% [95% CI, 70.7-81.9%]. Discharge performance reached 42.0% [95% CI, 39.0-45.0%] exact accuracy and 75.0% [95% CI, 72.4-77.6%] binary accuracy. For 90-day prediction, Llama performed comparably to structured-data baselines. Fine-tuned LLMs can predict post-stroke functional outcomes from admission notes alone, achieving performance comparable to models that require structured variable abstraction. Our findings support the development of text-based prognostic tools that integrate seamlessly into clinical workflows without manual data extraction.
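
As a concrete illustration of the evaluation metrics described in the abstract, the following Python sketch computes exact (7-class) mRS accuracy and binary functional-outcome (mRS 0-2 vs. 3-6) accuracy over held-out notes, with percentile-bootstrap 95% confidence intervals. The bootstrap resampling scheme, function names, and parameters are illustrative assumptions; the abstract does not state how the reported confidence intervals were derived.

```python
# Minimal sketch (assumed setup): predicted and true mRS scores are integers 0-6,
# one per held-out test note. The percentile bootstrap below is an illustrative
# choice, not necessarily the paper's CI method.
import numpy as np

def accuracies(y_true, y_pred):
    """Return (exact 7-class accuracy, binary mRS 0-2 vs. 3-6 accuracy)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    exact = float(np.mean(y_true == y_pred))
    binary = float(np.mean((y_true <= 2) == (y_pred <= 2)))
    return exact, binary

def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CIs over test notes for both accuracy metrics."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    stats = np.array([accuracies(y_true[idx], y_pred[idx])
                      for idx in (rng.integers(0, n, n) for _ in range(n_boot))])
    lo, hi = 100 * alpha / 2, 100 * (1 - alpha / 2)
    return {name: (np.percentile(stats[:, i], lo), np.percentile(stats[:, i], hi))
            for i, name in enumerate(["exact", "binary"])}
```

Under these assumptions, calling accuracies() on a model's test-set predictions yields the exact and binary accuracies, and bootstrap_ci() yields interval estimates analogous to the 95% CIs quoted in the abstract.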