Can Machines Think? Assessing the Accuracy of GenAI Chatbots in a Physics University Entrance Exam
Keywords: Chatbots; ChatGPT; Gemini; Generative Artificial Intelligence; Webb’s Depth of Knowledge

Abstract
The rapid advancement of artificial intelligence (AI) in recent years has led to the development of Generative AI (GenAI) tools with enhanced capabilities, including multimodal functionality, reduced susceptibility to hallucinations, and real-time access to internet resources. Previous studies have shown that GenAI tools are used in daily life across various fields, including education, healthcare, engineering, and software development, and school learners increasingly rely on them for their academic activities. However, there is a paucity of empirical research on the accuracy of these tools' responses, particularly in physics education. This mixed-methods case study evaluated the accuracy of responses from the ChatGPT and Google Gemini chatbots to a physics university entrance exam in South Africa. Technological Pedagogical Content Knowledge and Webb’s Depth of Knowledge were used to construct the theoretical framework. The research instrument was the 2024 South African university entrance physics exam paper. The question paper was uploaded to each chatbot, which was then prompted to answer the questions. Two expert examiners assessed the chatbots' responses, and the performance of each chatbot was compared with that of the learners who took the exam. Both chatbots outperformed the learners. These findings suggest that the chatbots can serve as teaching assistants to support learners in exam preparation and formative assessment tasks; however, learners should apply critical thinking when evaluating the responses they receive from chatbots.
https://doi.org/10.26803/ijlter.25.1.17
References
Ahmed, J., Nadeem, G., Majeed, M. K., Ghaffar, R., Baig, A. K. K., Shah, S. R., Razzaq, R. A., & Irfan, T. (2025). The rise of multimodal AI: A quick review of GPT-4V and Gemini. Spectrum of Engineering Sciences, 3(6), 778-786. https://thesesjournal.com/index.php/1/article/view/506/452
Al-Thani, S. N., Anjum, S., Bhutta, Z. A., Bashir, S., Majeed, M. A., Khan, A. S., & Bashir, K. (2025). Comparative performance of ChatGPT, Gemini, and final-year emergency medicine clerkship students in answering multiple-choice questions: implications for the use of AI in medical education. International Journal of Emergency Medicine, 18, 146. https://doi.org/10.1186/s12245-025-00949-6
Chang, D. H., Lin, M. P.-C., Hajian, S., & Wang, Q. Q. (2023). Educational design principles of using AI chatbot that supports self-regulated learning in education: Goal setting, feedback, and personalization. Sustainability, 15(17), 12921. https://doi.org/10.3390/su151712921
Chapagain, P., Malakar, N., & Rimal, D. (2024). Can AI solve physics problems? Evaluating efficacy of AI models in solving higher secondary physics exam problems: A comparative study. Journal of Nepal Physical Society, 10(1), 58-64. https://doi.org/10.3126/jnphyssoc.v10i1.72836
Chen, L., Chen, P., & Lin, Z. (2020). Artificial intelligence in education: A review. IEEE Access, 8, 75264-75278. https://doi.org/10.1109/access.2020.2988510
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46. https://doi.org/10.1177/001316446002000104
Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., & Rosen, E. (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261. https://arxiv.org/pdf/2507.06261
Crowe, S., Cresswell, K., Robertson, A., Huby, G., Avery, A., & Sheikh, A. (2011). The case study approach. BMC Medical Research Methodology, 11(1). https://doi.org/10.1186/1471-2288-11-100
Demirci, N. (2025). How successful are artificial intelligence chatbots on higher education entrance physics exams in Turkey. TOJET: The Turkish Online Journal of Educational Technology, 24(2). https://www.researchgate.net/profile/Neset-Demirci/publication/392059590_How_Successful_are_Artificial_Intelligence_Chatbots_on_Higher_Education_Entrance_Physics_Exams_in_Turkey/links/68319a696b5a287c304450a3/How-Successful-are-Artificial-Intelligence-Chatbots-on-Higher-Education-Entrance-Physics-Exams-in-Turkey.pdf
Department of Basic Education. (2024). Previous exam papers (Gr 10, 11 & 12). Pretoria. https://www.education.gov.za/Portals/0/CD/2024%20November%20past%20papers/Physical%20Sciences%20P1%20Nov%202024%20Eng.pdf?ver=2025-03-04-112701-620
Jere, S. (2025). Evaluating artificial intelligence large language models’ performances in a South African high school chemistry exam. EURASIA Journal of Mathematics, Science and Technology Education, 21(2), em2582. https://doi.org/10.29333/ejmste/15932
Jere, S., & Mpeta, M. (2025). Integrating generative artificial intelligence chatbots into chemistry teaching: Impact of affective factors on engagement and conceptual understanding. Eurasia Journal of Mathematics, Science and Technology Education, 21(10), em2713. https://doi.org/10.29333/ejmste/17077
Jere, S., Bessong, R., Mpeta, M., & Litshani, N. F. (2024). Exploring Pre-Service Teachers’ Perceptions of ChatGPT Integration into Physical Sciences Teaching: A Case Study at a Rural South African University. International Journal of Learning, Teaching and Educational Research, 23(11), 464-486. https://doi.org/10.26803/ijlter.23.11.24
Khlaif, Z. N., Alkouk, W. A., Salama, N., & Abu Eideh, B. (2025). Redesigning assessments for AI-enhanced learning: A framework for educators in the generative AI era. Education Sciences, 15(2), 174. https://doi.org/10.3390/educsci15020174
Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155-163. https://doi.org/10.1016/j.jcm.2016.02.012
Kooli, C. (2023). Chatbots in education and research: A critical examination of ethical implications and solutions. Sustainability, 15(7), 5614. https://doi.org/10.3390/su15075614
Kuhail, M. A., Alturki, N., Alramlawi, S., & Alhejori, K. (2023). Interacting with educational chatbots: A systematic review. Education and Information Technologies, 28(1), 973-1018. https://doi.org/10.1007/s10639-022-11177-3
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174. https://doi.org/10.2307/2529310
Liu, M., Okuhara, T., Dai, Z., Zhao, M., Yin, W., Okada, H., Furukawa, E., & Kiuchi, T. (2025). Large language models (GPT-5, Grok-4, Claude Opus 4.1, Gemini 2.5 Pro) achieved textbook-level accuracy on the Japanese medical licensing examination by 2025: A comparative study. medRxiv, 2025.09.10.25335398. https://www.medrxiv.org/content/10.1101/2025.09.10.25335398v1.full.pdf
López-Simó, V., & Rezende, M. F. (2024). Challenging ChatGPT with different types of physics education questions. The Physics Teacher, 62(4), 290-294. https://doi.org/10.1119/5.0160160
Marzano, R. J., & Kendall, J. S. (2006). The new taxonomy of educational objectives. Corwin Press. https://ifeet.org/files/The-New-taxonomy-of-Educational-Objectives.pdf
Matejak Cvenic, K., Planinic, M., Susac, A., Ivanjek, L., Jelicic, K., & Hopf, M. (2022). Development and validation of the Conceptual Survey on Wave Optics. Physical Review Physics Education Research, 18(1), 010103. https://doi.org/10.1103/physrevphyseducres.18.010103
Mishra, P., & Koehler, M. J. (2006). Technological pedagogical content knowledge: A framework for teacher knowledge. Teachers College Record, 108(6), 1017-1054. https://doi.org/10.1177/016146810610800610
Newton, P. M., Summers, C. J., Zaheer, U., Xiromeriti, M., Stokes, J. R., Bhangu, J. S., Roome, E. G., Roberts-Phillips, A., Mazaheri-Asadi, D., & Jones, C. D. (2025). Can ChatGPT-4o really pass medical science exams? A pragmatic analysis using novel questions. Medical Science Educator, 35(2), 721-729. https://doi.org/10.1007/s40670-025-02293-z
OpenAI. (2025a). GPT-5 System Card. OpenAI. Retrieved 11 August 2025 from https://cdn.openai.com/pdf/8124a3ce-ab78-4f06-96eb-49ea29ffb52f/gpt5-system-card-aug7.pdf
OpenAI. (2025b). Introducing GPT-5. OpenAI. Retrieved 11 August 2025 from https://openai.com/
Plevris, V., Papazafeiropoulos, G., & Jiménez Rios, A. (2023). Chatbots put to the test in math and logic problems: A comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard. AI, 4(4), 949-969. https://doi.org/10.3390/ai4040048
Polverini, G., & Gregorcic, B. (2024). How understanding large language models can inform the use of ChatGPT in physics education. European Journal of Physics, 45(2), 025701. https://doi.org/10.1088/1361-6404/ad1420
Rane, N., Choudhary, S., & Rane, J. (2024). Gemini versus ChatGPT: applications, performance, architecture, capabilities, and implementation. Journal of Applied Artificial Intelligence, 5(1), 69-93. https://doi.org/10.48185/jaai.v5i1.1052
Seufert, S., Guggemos, J., & Sailer, M. (2021). Technology-related knowledge, skills, and attitudes of pre-and in-service teachers: The current situation and emerging trends. Computers in Human Behavior, 115, 106552. https://doi.org/10.1016/j.chb.2020.106552
Tang, K.-S., Cooper, G., Rappa, N., Cooper, M., Sims, C., & Nonis, K. (2024). A dialogic approach to transform teaching, learning & assessment with generative AI in secondary education: A proof of concept. Pedagogies: An International Journal, 19(3), 493-503. https://doi.org/10.1080/1554480x.2024.2379774
Tong, D., Tao, Y., Zhang, K., Dong, X., Hu, Y., Pan, S., & Liu, Q. (2024). Investigating ChatGPT-4’s performance in solving physics problems and its potential implications for education. Asia Pacific Education Review, 25(5), 1379-1389. https://doi.org/10.1007/s12564-023-09913-6
Tschisgale, P., Maus, H., Kieser, F., Kroehs, B., Petersen, S., & Wulff, P. (2025). Evaluating GPT- and reasoning-based large language models on Physics Olympiad problems: Surpassing human performance and implications for educational assessment. Physical Review Physics Education Research, 21(2), 020115. https://doi.org/10.1103/6fmx-bsnl
Turing, A. M. (2009). Computing machinery and intelligence. In Parsing the Turing test: Philosophical and methodological issues in the quest for the thinking computer (pp. 23-65). Springer. https://doi.org/10.1007/978-1-4020-6710-5_3
Webb, N. L. (2002). Depth-of-knowledge levels for four content areas. Language Arts. https://ossucurr.pbworks.com/w/file/fetch/49691156/Norm%20web%20dok%20by%20subject%20area.pdf
Woitkowski, D. (2020). Tracing physics content knowledge gains using content complexity levels. International Journal of Science Education, 42(10), 1585-1608. https://doi.org/10.1080/09500693.2020.1772520
Xuan-Quy, D., Ngoc-Bich, L., Xuan-Dung, P., Bac-Bien, N., & The-Duy, V. (2023). Evaluation of ChatGPT and Microsoft Bing AI chat performances on physics exams of Vietnamese national high school graduation examination. arXiv preprint arXiv:2306.04538. https://arxiv.org/pdf/2306.04538
Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education–where are the educators? International Journal of Educational Technology in Higher Education, 16(1), 1-27. https://doi.org/10.1186/s41239-019-0171-0
Zhao, J., Chapman, E., & Sabet, P. G. (2024). Generative AI and educational assessments: A systematic review. Education Research and Perspectives, 51, 124-155. https://doi.org/10.70953/erpv51.2412006
License
Copyright (c) 2026 Samuel Jere

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
All articles published by IJLTER are licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).