As ChatGPT continues to reshape student engagement and instructional design, it is crucial to examine its practical implications. This study aims to evaluate the effectiveness of ChatGPT3.5 and ChatGPT4 as potential automated essay scoring (AES) systems. Fifty authentic, student-written annotated bibliographies were evaluated by three human raters (HRs), ChatGPT3.5, and ChatGPT4, each performing three rounds of grading. Statistical analyses were conducted to determine whether the AI evaluations were comparable to those of the HRs in terms of accuracy, reliability, and consistency. The findings reveal that although AI-generated evaluations occasionally aligned with the more lenient evaluations of certain individual HRs, the overall performance of the GPT models did not align with that of the HRs.