Evaluating On-Device Gemma 3 and GPT-OSS as Practical Alternatives to Cloud-Based Language Models on a National Medical Board Examination
Cloud-based large language models (LLMs) have demonstrated near-human performance in medical applications; however, their clinical deployment is constrained by concerns regarding patient privacy, data security, and network dependence. Locally deployable, open-weight LLMs may provide a privacy-preserving alternative for resource-limited or security-sensitive environments. We evaluated two families of locally deployed models, Google Gemma 3 (1B, 4B, 12B, and 27B parameters; vision-enabled in variants of 4B parameters and larger) and GPT-OSS-20B, using 1,200 multiple-choice questions from the Taiwan Pulmonary Specialist Board Examinations (2013–2024), comprising 1,156 text-only and 44 text-and-image items across 26 categories. A cloud-based GPT-4 Turbo model served as a reference. Models were queried locally via Ollama. Accuracy was analyzed by year and category using repeated-measures ANOVA with Tukey-adjusted pairwise comparisons. GPT-OSS-20B achieved the highest overall accuracy (58–78 correct answers per 100 questions per year) and significantly outperformed all Gemma 3 variants (p < 0.001), while Gemma 3 27B ranked second. No statistically significant difference was observed between GPT-OSS-20B and GPT-4 Turbo after Tukey adjustment. Larger models achieved higher accuracy but required longer inference times. These findings suggest that selected open-weight LLMs deployed on-device can approach the performance of cloud-based models in structured medical examinations, with trade-offs among accuracy, modality support, and computational efficiency.
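The abstract notes that models were queried locally via Ollama. A minimal sketch of such a query, assuming the default Ollama REST endpoint and illustrative model tags (`gemma3:27b`, `gpt-oss:20b`) rather than the authors' exact setup, might look like:

```python
import json
import urllib.request

# Default endpoint of a locally running Ollama server (an assumption;
# the study's actual serving configuration is not specified).
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, question: str) -> dict:
    """Build a non-streaming generation request for a locally served model."""
    return {
        "model": model,                  # e.g. "gemma3:27b" or "gpt-oss:20b"
        "prompt": question,              # one multiple-choice item as plain text
        "stream": False,                 # return the full completion at once
        "options": {"temperature": 0},   # deterministic output for grading
    }

def ask(model: str, question: str) -> str:
    """Send one board-exam question to the local Ollama server and return the answer text."""
    payload = json.dumps(build_request(model, question)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Iterating `ask()` over the 1,200 items and comparing each response against the answer key would yield the per-year accuracies analyzed in the study; prompt wording and answer-extraction rules here are placeholders, not the authors' protocol.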