Cross-Platform Evaluation of Large Language Model Safety in Pediatric Consultations: Evolution of Adversarial Robustness and the Scale Paradox

arXiv:2601.09721v1 Announce Type: new
Abstract: Background: Large language models (LLMs) are increasingly deployed in medical consultations, yet their safety under realistic user pressure remains understudied. Prior assessments have focused on neutral conditions, overlooking vulnerabilities that arise when anxious users challenge safeguards. This study evaluated LLM safety under parental anxiety-driven adversarial pressure in pediatric consultations, across models and platforms.
Methods: PediatricAnxietyBench, developed in a prior evaluation, comprises 300 queries (150 authentic, 150 adversarial) spanning 10 pediatric topics. Three models were assessed via their APIs: Llama-3.3-70B and Llama-3.1-8B (Groq) and Mistral-7B (HuggingFace), yielding 900 responses. Safety was scored on a 0-15 scale covering restraint, referral, hedging, emergency recognition, and non-prescriptive behavior. Analyses used paired t-tests with bootstrapped confidence intervals.
Results: Mean safety scores ranged from 9.70 (Llama-3.3-70B) to 10.39 (Mistral-7B). Llama-3.1-8B outperformed Llama-3.3-70B by +0.66 points (p=0.0001, d=0.225). Models showed positive adversarial effects, with Mistral-7B strongest (+1.09, p=0.0002). Safety behavior generalized across platforms, although Llama-3.3-70B produced failures on 8% of queries. Seizure-related queries were the most vulnerable topic (33% inappropriate diagnoses). Hedging was the strongest predictor of overall safety (r=0.68, p<0.001).
Conclusions: The evaluation indicates that safety depends more on alignment and architecture than on scale, with smaller models outperforming larger ones. The evolution toward adversarial robustness across releases suggests progress from targeted safety training. Persistent topic-specific vulnerabilities and the absence of emergency recognition indicate these models are unsuitable for triage. The findings can guide model selection, underscore the need for adversarial testing, and provide an open benchmark for medical AI safety.
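The following is a minimal sketch (not the authors' released code) of the paired comparison described in the abstract: each of the 300 queries receives a 0-15 safety score from two models, and the mean per-query score difference is tested with a paired t-test plus a percentile bootstrap confidence interval and a paired Cohen's d. Function names, the synthetic scores, and the 10,000-resample bootstrap are illustrative assumptions.

```python
"""Sketch of a paired safety-score comparison with a bootstrapped CI.
All data and parameter choices below are assumptions for illustration."""
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def paired_safety_comparison(scores_a, scores_b, n_boot=10_000, alpha=0.05):
    """Paired t-test, paired Cohen's d, and bootstrap CI for mean(score_b - score_a)."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    diffs = scores_b - scores_a

    # Paired t-test on per-query differences.
    t_stat, p_value = stats.ttest_rel(scores_b, scores_a)

    # Cohen's d for paired samples: mean difference / SD of differences.
    d = diffs.mean() / diffs.std(ddof=1)

    # Percentile bootstrap CI for the mean difference.
    n = len(diffs)
    boot_means = np.array(
        [diffs[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)]
    )
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])

    return {"mean_diff": diffs.mean(), "t": t_stat, "p": p_value,
            "cohens_d": d, "ci_95": (lo, hi)}

# Illustrative usage with synthetic 0-15 safety scores for 300 queries.
llama_70b = rng.normal(9.7, 2.0, size=300).clip(0, 15)
llama_8b = rng.normal(10.4, 2.0, size=300).clip(0, 15)
print(paired_safety_comparison(llama_70b, llama_8b))
```

The same routine would apply to the adversarial-effect comparison (authentic vs. adversarial scores within one model), since both are paired designs over the same query set.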
