This study investigates the performance of successive OpenAI GPT model versions (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, o1-preview, and GPT-5 at different reasoning-effort settings) in solving chemistry problems across a range of educational levels and conceptual difficulties. The evaluation, aimed at assessing both accuracy and reasoning quality, was based on a dataset comprising 150 multiple-choice questions and 10 open-ended questions at the high school level, as well as 75 multiple-choice questions, 10 open-ended questions, and 100 stoichiometry exercises at the university level. The results reveal a clear trend of improvement in both accuracy and consistency across successive GPT model versions, with o1-preview and GPT-5 achieving the highest overall performance owing to their reasoning capabilities. Error analysis shows that, while conceptual understanding is generally strong, computational mistakes remain frequent, particularly in chemical-equilibrium exercises and redox-reaction balancing, although GPT-5 markedly reduced these errors compared with earlier models. In addition, misinterpretation of questions requiring judgment or historical context emerged as a recurring issue. Although prompt formulation influences performance in specific contexts, such as redox balancing, the overall sophistication of the model appears to be the primary determinant of outcomes. These findings suggest that recent advances in large language models have substantially enhanced their potential for chemistry education, although careful oversight remains necessary to address numerical inaccuracies and interpretative limitations.