MedPI: Evaluating AI Systems in Medical Patient-facing Interactions
arXiv:2601.04195v1 Announce Type: new Abstract: We present MedPI, a high-dimensional benchmark for evaluating large language models (LLMs) in patient-clinician conversations. Unlike single-turn question-answer (QA) benchmarks, MedPI evaluates the medical dialogue across 105 dimensions comprising the medical process, treatment safety, treatment outcomes and doctor-patient communication across a granular, accreditation-aligned rubric. MedPI comprises five layers: (1) Patient Packets (synthetic EHR-like ground truth); (2) an AI Patient instantiated through an LLM with memory and affect; (3) a Task Matrix spanning encounter […]