MedCalc-Bench Doesn’t Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation
arXiv:2603.02222v1 Announce Type: new
Abstract: MedCalc-Bench is a widely used benchmark for evaluating LLM performance on clinical calculator
tasks, with state-of-the-art direct prompting scores plateauing around 35% on the Verified split
(HELM MedHELM leaderboard) and the best published approach, RL with verifiable rewards, reaching 74%.
We present three contributions that challenge the benchmark’s current framing. First, we conduct a
systematic audit of the benchmark's calculator implementations, identifying and fixing over 20
errors in this NeurIPS-published dataset, ranging from critical formula inaccuracies to runtime bugs.
Second, we show that a simple intervention, providing the model with the calculator specification at
inference time ("open-book" prompting), raises accuracy from ~52% to 81-85% on GLM-4.6V and GLM-4.7,
surpassing all published results including RL-trained systems, without any fine-tuning. Third, we
establish an upper bound of 95-97% using GPT-5.2-Thinking, with residual errors attributable
primarily to ground-truth issues and dataset ambiguities. Our findings suggest that MedCalc-Bench
predominantly measures formula memorization and arithmetic precision rather than clinical
reasoning, and would be better framed as a tool-use evaluation.