Musing 105: MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
ICML paper out of Shanghai Artificial Intelligence Laboratory and Tsinghua
Today’s paper: MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding. Zuo et al., 30 Jan 2025. https://arxiv.org/pdf/2501.18362
With all the excitement around DeepSeek, o1-mini, [take your pick], we should not lose sight of properly evaluating these models. Already, news is starting to appear that some model is not as good as originally proclaimed, or fails some test, or costs more than originally claimed…and so on. I am being deliberately vague here because I don’t want to diss any particular model or company. The point is, if we really want to know how good a model is on some dimension or metric, we need to evaluate it systematically.
Evaluations are especially important for medical applications, for obvious reasons. So I’m always excited when a new medical LLM benchmark comes out. In my view, there is no such thing as too many benchmarks, especially when it’s so easy to run tests on LLMs. Today’s paper covers such a benchmark. It’s not the first, nor will it be the last, but it is certainly a welcome addition. The paper was also recently accepted at a top machine learning conference, which gives it some extra street cred. Bonus: it cites results from DeepSeek-R1, so it’s not behind the times.
This benchmark is called MedXpertQA and includes 4,460 questions spanning 17 specialties and 11 body systems. It comprises two subsets: Text, for text-only evaluation, and MM, for multimodal evaluation. The latter is novel: it introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks whose simple QA pairs are generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness.
The authors begin the paper by describing some challenges with traditional multimodal medical benchmarks, which are not as applicable to real-world clinical scenarios:
Limited Scope and Insufficient Difficulty. Most current medical benchmarks solely evaluate basic visual perception and medical knowledge, neglecting the complexity of real-world medical tasks across different stages of the diagnosis process. They fail to assess the expert-level knowledge and reasoning ability required for diagnostic decision-making and treatment planning.
Lack of Authenticity and Clinical Relevance. Current benchmarks lack detailed clinical information and rely on automatically generated simple QA paired with isolated medical images, diverging considerably from realistic clinical scenarios. Medical exam questions used in existing text benchmarks present a promising solution, but the field still lacks a systematic and high-quality benchmark.
The authors’ solution is meant to address both challenges. Figure 2 below shows an overview:
MedXpertQA collects questions from 17 of the 25 member board exams (specialties) of the American Board of Medical Specialties to enable evaluation of highly specialized medical scenarios. It also includes structured data, such as tables, in its questions and answer choices, as well as semi-structured documents. MedXpertQA MM images likewise demonstrate high diversity and wide coverage, illustrated below in Figure 3.
Previous benchmarks primarily relied on USMLE questions for training and evaluation, but the authors of MedXpertQA expand the scope by including questions from COMLEX, another major medical licensing examination in the U.S., to capture the distinct emphases of osteopathic practice. To further evaluate multimodal capabilities and medical image interpretation, they also incorporate questions from the American College of Radiology (ACR) DXIT and TXIT exams, the European Board of Radiology (EDiR) exams, and the New England Journal of Medicine (NEJM) Image Challenge.
They then conduct AI Expert Filtering and Human Expert Filtering to identify questions that challenge both humans and AI. Subsequent Similarity Filtering further enhances robustness:
AI Expert Filtering. The authors employ 8 models, divided into basic and advanced, as AI experts to vote on and filter questions. First, each basic AI expert performs 4 sampling attempts for each question. If any expert answers a question correctly in all attempts, the question is deemed too simple and removed. Second, questions that are answered incorrectly by all AI experts are retained. This approach minimizes randomness and effectively differentiates between questions that current AI systems can solve and those that remain challenging.
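To make these voting rules concrete, here is a minimal Python sketch of the filtering logic as I read it; the model names, data layout, and the "undecided" fall-through for questions that are neither trivially easy nor uniformly missed are my own assumptions, not the authors’ code.

```python
# Minimal sketch of the AI-expert filtering rules described above.
# Data layout and expert names are illustrative assumptions.
from typing import Dict, List, Set

def filter_question(attempts: Dict[str, List[bool]], basic_experts: Set[str]) -> str:
    """attempts maps a model name to its per-attempt correctness flags (4 attempts each)."""
    # Rule 1: if any basic expert answers correctly in ALL attempts,
    # the question is deemed too simple and removed.
    if any(all(results) for model, results in attempts.items() if model in basic_experts):
        return "remove"
    # Rule 2: if every AI expert answers incorrectly in every attempt,
    # the question is clearly challenging and is retained.
    if all(not any(results) for results in attempts.values()):
        return "retain"
    # Remaining questions pass on to the later filtering stages.
    return "undecided"

# Toy usage with made-up experts and sampling results.
attempts = {
    "basic_expert_1": [True, True, False, True],
    "basic_expert_2": [False, False, False, False],
    "advanced_expert_1": [True, False, False, False],
}
print(filter_question(attempts, basic_experts={"basic_expert_1", "basic_expert_2"}))  # "undecided"
```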
Human Expert Filtering. The authors use prior and posterior human expert annotations to identify questions that pose challenges to humans. They first assess each question’s posterior difficulty by calculating its Brier score, a widely used measure of prediction accuracy. A lower Brier score indicates a more accurate overall prediction distribution, suggesting the question is easier. Compared to plain accuracy, the Brier score accounts for the response rates of all options, providing a more precise difficulty measurement. They then normalize the prior difficulty ratings annotated by medical experts and categorize questions into 3 levels, each associated with an adaptive Brier score threshold for stratified sampling. Higher-rated questions are assigned higher Brier score thresholds, with the maximum threshold set at the 25th percentile and the minimum at the 3rd percentile of all scores. Approximately 16.78% of questions lack these annotations and thus have not undergone human expert filtering.
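For intuition, here is a small sketch of a per-question Brier score computed from the option-level response rates; the exact formulation in the paper may differ, and treating the response rates as a predicted probability distribution over the options is my reading, not the authors’ code.

```python
# Per-question Brier score from option-level response rates (my interpretation).
def brier_score(response_rates: dict, correct_option: str) -> float:
    """response_rates maps an option label to the fraction of test-takers choosing it."""
    return sum(
        (rate - (1.0 if option == correct_option else 0.0)) ** 2
        for option, rate in response_rates.items()
    )

# Easy question: most respondents choose the correct answer -> low score.
print(brier_score({"A": 0.85, "B": 0.05, "C": 0.05, "D": 0.05}, "A"))  # 0.03
# Hard question: responses spread across options -> higher score.
print(brier_score({"A": 0.30, "B": 0.30, "C": 0.25, "D": 0.15}, "A"))  # 0.665
```

Under this measure, lower scores indicate easier questions, and the adaptive thresholds (from the 25th down to the 3rd percentile of all scores) implement the stratified sampling across the three prior-difficulty levels.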
Similarity Filtering. A key factor in achieving robust evaluation is ensuring high diversity and avoiding repetitive assessments. Therefore, the authors filter data by identifying outlier question pairs with extremely high semantic and edit distance similarities.
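As a rough illustration, the sketch below flags near-duplicate question pairs using cosine similarity over precomputed sentence embeddings together with difflib’s edit-distance-style ratio; the thresholds, the brute-force pairwise loop, and the OR-combination of the two signals are my simplifications, not the paper’s pipeline.

```python
# Rough sketch of similarity filtering: flag question pairs that look like
# near-duplicates either semantically or at the surface-string level.
# Thresholds and the naive O(n^2) loop are illustrative only.
from difflib import SequenceMatcher
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def flag_near_duplicates(questions, embeddings, sem_thresh=0.95, edit_thresh=0.90):
    """questions: list of strings; embeddings: one vector per question (any embedding model)."""
    flagged = []
    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            semantic_sim = cosine(embeddings[i], embeddings[j])
            surface_sim = SequenceMatcher(None, questions[i], questions[j]).ratio()
            if semantic_sim >= sem_thresh or surface_sim >= edit_thresh:
                flagged.append((i, j))
    return flagged
```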
Tables 1 and 2 below compare the two subsets of MedXpertQA with existing benchmarks. Traditional multimodal benchmarks show notable discrepancies from real-world clinical tasks, reflected in the limited number of image types, low image-to-question ratios, and automatically generated questions and annotations. Meanwhile, the MMMU (H & M) series, primarily based on university-level subject exams, falls short in scope, difficulty, and specificity to the medical domain. In contrast, MedXpertQA MM demonstrates advantages in question length and image richness. MedXpertQA Text is the first text-based medical benchmark to purposefully account for medical specialty assessment, supporting evaluation of highly specialized medical scenarios.
Let’s move on to the experiments, i.e., benchmarking existing models on MedXpertQA. Earlier in the musing, I showed a visualization summarizing what to expect, but Tables 3 and 4 below show the main results in more detail. Overall, the low accuracies of the evaluated models demonstrate MedXpertQA’s ability to pose challenges to state-of-the-art models.
Among vanilla LMMs, GPT-4o consistently performs best across all subsets. Gemini-2.0-Flash is the highest-scoring vanilla LMM after GPT-4o, with an impressive performance on MedXpertQA MM that highlights its advantage in multimodal tasks. As expected, DeepSeek-R1 shows the strongest performance among LLMs on the reasoning subset, substantially outperforming other models. Still, none of the models does particularly well, or at least none performs at the level we would expect of a model deployed in a clinical setting.
In closing, MedXpertQA is clearly a challenging benchmark, even for the 16 established models the authors evaluate on it. It addresses critical gaps in current benchmarks, including limited coverage of medical specialties, insufficient difficulty, and lack of clinical relevance. By incorporating expert-level medical examination questions rooted in comprehensive clinical data, the MM subset of the dataset marks an advancement in multimodal medical benchmarking. The authors mitigate data leakage risk through data synthesis and engage experts to ensure accuracy and validity. A very promising addition to the growing field of medical benchmarking for LLMs.