Landeshauptstadt Magdeburg, Germany
Städteregion Aachen, Germany
Background and Objectives: Large language models (LLMs) have demonstrated high performance on knowledge-based medical examinations, but their capabilities on cognitive aptitude tests emphasizing reasoning and abstraction remain underexplored. The Test for Medical Studies (TMS), a German medical school admission test, provides a standardized framework for examining these capabilities. This study aimed to evaluate the performance and consistency of multiple LLMs on text-based and visual-analytic TMS items.

Materials and Methods: Eight contemporary LLMs, comprising proprietary and open-source systems, were evaluated in a multi-run design on standardized TMS items spanning text-based and visual-analytic cognitive domains.

Results: Mean accuracy remained substantially below the levels typically reported for knowledge-based medical examinations, with marked performance differences between text-based and visual-analytic subtests. Open-source models performed competitively with proprietary systems. Inter-run reliability was heterogeneous, indicating notable variability across repeated evaluations.

Conclusions: Current LLMs show limited and domain-dependent performance on cognitive aptitude tasks relevant to medical school admission. High accuracy on knowledge-based examinations does not translate into stable performance on aptitude tests emphasizing fluid intelligence. The observed modality-dependent performance patterns and inter-run variability underscore the need for differentiated, multi-run evaluation strategies when assessing LLMs for applications in medical education.
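The abstract refers to a multi-run design and inter-run reliability without specifying how such scores are computed. As an illustrative sketch only, the Python snippet below shows one plausible way to derive mean accuracy and a simple consistency score from repeated runs; the item IDs, answers, and the pairwise exact-match agreement metric are hypothetical assumptions, not the study's actual protocol.

    from itertools import combinations

    # Hypothetical multi-run results: for each TMS item, the answer the
    # model produced in each of five independent runs.
    runs = {
        "item_01": ["B", "B", "B", "B", "B"],
        "item_02": ["A", "C", "A", "D", "A"],
        "item_03": ["E", "E", "C", "E", "E"],
    }
    answer_key = {"item_01": "B", "item_02": "A", "item_03": "E"}

    def mean_accuracy(runs, key):
        """Accuracy averaged over all runs and all items."""
        correct = total = 0
        for item, answers in runs.items():
            for ans in answers:
                correct += ans == key[item]
                total += 1
        return correct / total

    def inter_run_agreement(runs):
        """Mean pairwise exact-match agreement between runs; low or
        heterogeneous values signal unstable output across repetitions."""
        n_runs = len(next(iter(runs.values())))
        pair_scores = []
        for i, j in combinations(range(n_runs), 2):
            matches = sum(answers[i] == answers[j] for answers in runs.values())
            pair_scores.append(matches / len(runs))
        return sum(pair_scores) / len(pair_scores)

    print(f"mean accuracy:       {mean_accuracy(runs, answer_key):.2f}")
    print(f"inter-run agreement: {inter_run_agreement(runs):.2f}")

Pairwise exact-match agreement is only one possible consistency measure; a chance-corrected statistic such as Fleiss' kappa would be a natural alternative for multiple-choice items.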