Can Machines Think? Assessing the Accuracy of GenAI Chatbots in a Physics University Entrance Exam
Keywords: Chatbots; ChatGPT; Gemini; Generative Artificial Intelligence; Webb’s Depth of Knowledge

Abstract
The rapid advancement of artificial intelligence (AI) in recent years has led to the development of Generative AI (GenAI) tools with enhanced capabilities, including multimodal functionality, reduced susceptibility to hallucinations, and real-time access to internet resources. Previous studies have shown that GenAI tools are used in daily life across various fields, including education, healthcare, engineering, and software development, and school learners increasingly rely on them for their academic activities. However, there is a paucity of empirical research on the accuracy of these tools' responses, particularly in physics education. This mixed-methods case study evaluated the accuracy of responses from the ChatGPT and Google Gemini chatbots to a physics university entrance exam in South Africa. Technological Pedagogical Content Knowledge and Webb’s Depth of Knowledge were used to construct the theoretical framework. The research instrument was the 2024 South African university entrance physics exam paper. The question paper was uploaded to each chatbot, which was then prompted to answer the questions. Two expert examiners assessed the chatbots' responses, and the performance of each chatbot was compared with that of the learners who took the exam. Both chatbots outperformed the learners. These findings suggest that the chatbots can serve as teaching assistants to support learners in exam preparation and formative assessment tasks; however, learners should apply critical thinking when evaluating the responses they receive from chatbots.
https://doi.org/10.26803/ijlter.25.1.17
References
Ahmed, J., Nadeem, G., Majeed, M. K., Ghaffar, R., Baig, A. K. K., Shah, S. R., Razzaq, R. A., & Irfan, T. (2025). The rise of multimodal AI: A quick review of GPT-4V and Gemini. Spectrum of Engineering Sciences, 3(6), 778-786. https://thesesjournal.com/index.php/1/article/view/506/452
Al-Thani, S. N., Anjum, S., Bhutta, Z. A., Bashir, S., Majeed, M. A., Khan, A. S., & Bashir, K. (2025). Comparative performance of ChatGPT, Gemini, and final-year emergency medicine clerkship students in answering multiple-choice questions: implications for the use of AI in medical education. International Journal of Emergency Medicine, 18, 146. https://doi.org/10.1186/s12245-025-00949-6
Chang, D. H., Lin, M. P.-C., Hajian, S., & Wang, Q. Q. (2023). Educational design principles of using AI chatbot that supports self-regulated learning in education: Goal setting, feedback, and personalization. Sustainability, 15(17), 12921. https://doi.org/10.3390/su151712921
Chapagain, P., Malakar, N., & Rimal, D. (2024). Can AI solve physics problems? Evaluating efficacy of AI models in solving higher secondary physics exam problems: A comparative study. Journal of Nepal Physical Society, 10(1), 58-64. https://doi.org/10.3126/jnphyssoc.v10i1.72836
Chen, L., Chen, P., & Lin, Z. (2020). Artificial intelligence in education: A review. IEEE Access, 8, 75264-75278. https://doi.org/10.1109/access.2020.2988510
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46. https://doi.org/10.1177/001316446002000104
Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., & Rosen, E. (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261. https://arxiv.org/pdf/2507.06261
Crowe, S., Cresswell, K., Robertson, A., Huby, G., Avery, A., & Sheikh, A. (2011). The case study approach. BMC Medical Research Methodology, 11(1). https://doi.org/10.1186/1471-2288-11-100
Demirci, N. (2025). How successful are artificial intelligence chatbots on higher education entrance physics exams in Turkey. TOJET: The Turkish Online Journal of Educational Technology, 24(2). https://www.researchgate.net/profile/Neset-Demirci/publication/392059590_How_Successful_are_Artificial_Intelligence_Chatbots_on_Higher_Education_Entrance_Physics_Exams_in_Turkey/links/68319a696b5a287c304450a3/How-Successful-are-Artificial-Intelligence-Chatbots-on-Higher-Education-Entrance-Physics-Exams-in-Turkey.pdf
Department of Basic Education. (2024). Previous exam papers (Gr 10, 11 & 12). Pretoria. https://www.education.gov.za/Portals/0/CD/2024%20November%20past%20papers/Physical%20Sciences%20P1%20Nov%202024%20Eng.pdf?ver=2025-03-04-112701-620
Jere, S. (2025). Evaluating artificial intelligence large language models’ performances in a South African high school chemistry exam. EURASIA Journal of Mathematics, Science and Technology Education, 21(2), em2582. https://doi.org/10.29333/ejmste/15932
Jere, S., & Mpeta, M. (2025). Integrating generative artificial intelligence chatbots into chemistry teaching: Impact of affective factors on engagement and conceptual understanding. Eurasia Journal of Mathematics, Science and Technology Education, 21(10), em2713. https://doi.org/10.29333/ejmste/17077
Jere, S., Bessong, R., Mpeta, M., & Litshani, N. F. (2024). Exploring Pre-Service Teachers’ Perceptions of ChatGPT Integration into Physical Sciences Teaching: A Case Study at a Rural South African University. International Journal of Learning, Teaching and Educational Research, 23(11), 464-486. https://doi.org/10.26803/ijlter.23.11.24
Khlaif, Z. N., Alkouk, W. A., Salama, N., & Abu Eideh, B. (2025). Redesigning assessments for AI-enhanced learning: A framework for educators in the generative AI era. Education Sciences, 15(2), 174. https://doi.org/10.3390/educsci15020174
Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155-163. https://doi.org/10.1016/j.jcm.2016.02.012
Kooli, C. (2023). Chatbots in education and research: A critical examination of ethical implications and solutions. Sustainability, 15(7), 5614. https://doi.org/10.3390/su15075614
Kuhail, M. A., Alturki, N., Alramlawi, S., & Alhejori, K. (2023). Interacting with educational chatbots: A systematic review. Education and Information Technologies, 28(1), 973-1018. https://doi.org/10.1007/s10639-022-11177-3
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174. https://doi.org/10.2307/2529310
Liu, M., Okuhara, T., Dai, Z., Zhao, M., Yin, W., Okada, H., Furukawa, E., & Kiuchi, T. (2025). Large language models (GPT-5, Grok-4, Claude Opus 4.1, Gemini 2.5 Pro) achieved textbook-level accuracy on the Japanese medical licensing examination by 2025: A comparative study. medRxiv, 2025.09.10.25335398. https://www.medrxiv.org/content/10.1101/2025.09.10.25335398v1.full.pdf
López-Simó, V., & Rezende, M. F. (2024). Challenging ChatGPT with different types of physics education questions. The Physics Teacher, 62(4), 290-294. https://doi.org/10.1119/5.0160160
Marzano, R. J., & Kendall, J. S. (2006). The new taxonomy of educational objectives. Corwin Press. https://ifeet.org/files/The-New-taxonomy-of-Educational-Objectives.pdf
Matejak Cvenic, K., Planinic, M., Susac, A., Ivanjek, L., Jelicic, K., & Hopf, M. (2022). Development and validation of the Conceptual Survey on Wave Optics. Physical Review Physics Education Research, 18(1), 010103. https://doi.org/10.1103/physrevphyseducres.18.010103
Mishra, P., & Koehler, M. J. (2006). Technological pedagogical content knowledge: A framework for teacher knowledge. Teachers College Record, 108(6), 1017-1054. https://doi.org/10.1177/016146810610800610
Newton, P. M., Summers, C. J., Zaheer, U., Xiromeriti, M., Stokes, J. R., Bhangu, J. S., Roome, E. G., Roberts-Phillips, A., Mazaheri-Asadi, D., & Jones, C. D. (2025). Can ChatGPT-4o really pass medical science exams? A pragmatic analysis using novel questions. Medical Science Educator, 35(2), 721-729. https://doi.org/10.1007/s40670-025-02293-z
OpenAI. (2025a). GPT-5 System Card. OpenAI. Retrieved 11 August 2025 from https://cdn.openai.com/pdf/8124a3ce-ab78-4f06-96eb-49ea29ffb52f/gpt5-system-card-aug7.pdf
OpenAI. (2025b). Introducing GPT-5. OpenAI. Retrieved 11 August 2025 from https://openai.com/
Plevris, V., Papazafeiropoulos, G., & Jiménez Rios, A. (2023). Chatbots put to the test in math and logic problems: A comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard. AI, 4(4), 949-969. https://doi.org/10.3390/ai4040048
Polverini, G., & Gregorcic, B. (2024). How understanding large language models can inform the use of ChatGPT in physics education. European Journal of Physics, 45(2), 025701. https://doi.org/10.1088/1361-6404/ad1420
Rane, N., Choudhary, S., & Rane, J. (2024). Gemini versus ChatGPT: applications, performance, architecture, capabilities, and implementation. Journal of Applied Artificial Intelligence, 5(1), 69-93. https://doi.org/10.48185/jaai.v5i1.1052
Seufert, S., Guggemos, J., & Sailer, M. (2021). Technology-related knowledge, skills, and attitudes of pre-and in-service teachers: The current situation and emerging trends. Computers in Human Behavior, 115, 106552. https://doi.org/10.1016/j.chb.2020.106552
Tang, K.-S., Cooper, G., Rappa, N., Cooper, M., Sims, C., & Nonis, K. (2024). A dialogic approach to transform teaching, learning & assessment with generative AI in secondary education: A proof of concept. Pedagogies: An International Journal, 19(3), 493-503. https://doi.org/10.1080/1554480x.2024.2379774
Tong, D., Tao, Y., Zhang, K., Dong, X., Hu, Y., Pan, S., & Liu, Q. (2024). Investigating ChatGPT-4’s performance in solving physics problems and its potential implications for education. Asia Pacific Education Review, 25(5), 1379-1389. https://doi.org/10.1007/s12564-023-09913-6
Tschisgale, P., Maus, H., Kieser, F., Kroehs, B., Petersen, S., & Wulff, P. (2025). Evaluating GPT- and reasoning-based large language models on Physics Olympiad problems: Surpassing human performance and implications for educational assessment. Physical Review Physics Education Research, 21(2), 020115. https://doi.org/10.1103/6fmx-bsnl
Turing, A. M. (2009). Computing machinery and intelligence. In Parsing the Turing test: Philosophical and methodological issues in the quest for the thinking computer (pp. 23-65). Springer. https://doi.org/10.1007/978-1-4020-6710-5_3
Webb, N. L. (2002). Depth-of-knowledge levels for four content areas. Language Arts. https://ossucurr.pbworks.com/w/file/fetch/49691156/Norm%20web%20dok%20by%20subject%20area.pdf
Woitkowski, D. (2020). Tracing physics content knowledge gains using content complexity levels. International Journal of Science Education, 42(10), 1585-1608. https://doi.org/10.1080/09500693.2020.1772520
Xuan-Quy, D., Ngoc-Bich, L., Xuan-Dung, P., Bac-Bien, N., & The-Duy, V. (2023). Evaluation of ChatGPT and Microsoft Bing AI chat performances on physics exams of Vietnamese national high school graduation examination. arXiv preprint arXiv:2306.04538. https://arxiv.org/pdf/2306.04538
Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education–where are the educators? International Journal of Educational Technology in Higher Education, 16(1), 1-27. https://doi.org/10.1186/s41239-019-0171-0
Zhao, J., Chapman, E., & Sabet, P. G. (2024). Generative AI and educational assessments: A systematic review. Education Research and Perspectives, 51, 124-155. https://doi.org/10.70953/erpv51.2412006
License
Copyright (c) 2026 Samuel Jere

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
All articles published by IJLTER are licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).