Session 1: Generative AI for Learning & Teaching


Speakers

Bruce McLaren
Carnegie Mellon University, USA
Yongchao Wu
Stockholm University, Sweden
Steven James Moore
Carnegie Mellon University, USA
Ishari Amarasinghe
Radboud University, The Netherlands
Universitat Pompeu Fabra, Spain

Start

06 September 2023, 11:00 WEST

End

06 September 2023, 13:00 WEST

Address

Auditorium


Session 1: Generative AI for Learning & Teaching

Chair: Marco Kalz

11:00-11:30 WEST
Evaluating ChatGPT’s Decimal Skills and Feedback Generation to Students’ Self-explanations in a Digital Learning Game

Hayden Stec, Huy Nguyen, Xinying Hou, Sarah Di and Bruce McLaren

Abstract: While open-ended self-explanations have been shown to promote robust learning in multiple studies, they pose significant challenges to automated grading and feedback in technology-enhanced learning due to the unconstrained nature of students’ input. Our work investigates whether recent advances in Large Language Models, particularly ChatGPT, can address this issue. Using decimal exercises and student data from a prior study of the learning game Decimal Point, with more than 5,000 open-ended self-explanation responses, we investigate ChatGPT’s capability in (1) solving the in-game exercises, (2) determining the correctness of students’ answers, and (3) providing meaningful feedback on incorrect answers. Our results showed that ChatGPT responds well to conceptual questions but struggles with decimal place values and number line problems. In addition, it accurately assessed the correctness of 75% of the students’ answers and generated generally high-quality feedback, similar to that of human instructors. We conclude with a discussion of ChatGPT’s strengths and weaknesses and suggest several avenues for extending its use in digital teaching and learning.

📄 Read More: https://link.springer.com/chapter/10.1007/978-3-031-42682-7_19
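
The abstract does not specify the prompts, model version, or grading criteria the authors used. As a rough illustration of how such correctness judgments and feedback can be automated, the following minimal Python sketch queries the OpenAI chat-completions API with a hypothetical prompt and model name; none of these choices should be read as the study’s actual setup.

```python
# Illustrative sketch only: the prompt wording, model name, and grading criteria
# are assumptions, not the setup used in the Decimal Point study.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def grade_self_explanation(question: str, student_answer: str) -> str:
    """Ask a chat model to judge a student's self-explanation and draft feedback."""
    prompt = (
        "You are a math tutor for a decimal-number learning game.\n"
        f"Exercise: {question}\n"
        f"Student's self-explanation: {student_answer}\n"
        "First state whether the explanation is CORRECT or INCORRECT, "
        "then give one short, encouraging feedback sentence."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # assumed model; the paper evaluated ChatGPT
        messages=[{"role": "user", "content": prompt}],
        temperature=0,           # deterministic output for grading
    )
    return response.choices[0].message.content

print(grade_self_explanation(
    "Which is larger, 0.25 or 0.3?",
    "0.25 is larger because it has more digits.",
))
```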


11:30-12:00 WEST
Towards Improving the Reliability and Transparency of ChatGPT for Educational Question Answering

Yongchao Wu, Aron Henriksson, Martin Duneld and Jalal Nouri

Abstract: Large language models (LLMs), such as ChatGPT, have shown remarkable performance on various natural language processing (NLP) tasks, including educational question answering (EQA). However, LLMs generate text entirely from knowledge acquired during pre-training, which means they struggle with recent information and domain-specific knowledge bases. Moreover, providing only answers to the questions posed, without any grounding material, makes it difficult for students to judge their validity. We therefore propose a method for integrating information retrieval systems with LLMs when developing EQA systems, which, in addition to improving EQA performance, grounds the answers in the educational context. Our experiments show that the proposed system outperforms vanilla ChatGPT by large margins of 110.9%, 67.8%, and 43.3% on BLEU, ROUGE, and METEOR scores, respectively. In addition, we argue that using the retrieved educational context enhances the transparency and reliability of the EQA process, making it easier to determine the correctness of the answers.

📄 Read More: https://link.springer.com/chapter/10.1007/978-3-031-42682-7_32
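
The abstract reports relative improvements on BLEU, ROUGE, and METEOR but does not state which implementations or variants were used. As a hedged sketch, the snippet below scores one generated answer against a reference answer with NLTK’s BLEU and METEOR and the rouge_score package’s ROUGE-L; the whitespace tokenization and metric variants are assumptions made purely for illustration.

```python
# Minimal sketch: metric variants (e.g., ROUGE-L, whitespace tokenization) are
# assumptions; the paper does not describe its exact evaluation setup here.
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)  # WordNet data is required by METEOR

def score_answer(reference: str, prediction: str) -> dict:
    """Compute BLEU, ROUGE-L, and METEOR for a single reference/prediction pair."""
    ref_tokens, pred_tokens = reference.split(), prediction.split()
    bleu = sentence_bleu([ref_tokens], pred_tokens,
                         smoothing_function=SmoothingFunction().method1)
    meteor = meteor_score([ref_tokens], pred_tokens)
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
        reference, prediction)["rougeL"].fmeasure
    return {"BLEU": bleu, "ROUGE-L": rouge_l, "METEOR": meteor}

print(score_answer(
    "Photosynthesis converts light energy into chemical energy in plants.",
    "Plants use photosynthesis to turn light into chemical energy.",
))
```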


12:00-12:30 WEST
Assessing the Quality of Multiple-Choice Questions: Automated Methods for Identifying Item-Writing Flaws

Steven Moore, Huy Nguyen, John Stamper and Tianying Chen

Abstract: Multiple-choice questions with item-writing flaws can negatively impact student learning and skew analytics. These flaws are often present in student-generated questions, making it difficult to assess their quality and suitability for classroom use. Existing methods for evaluating multiple-choice questions often focus on machine-readability metrics, without considering their intended use within course materials or their pedagogical implications. In this study, we compared the performance of a rule-based method we developed against a machine-learning-based method utilizing GPT-4 for the task of automatically assessing multiple-choice questions with respect to 19 common item-writing flaws. Analyzing 200 student-generated questions from four different subject areas, we found that the rule-based method correctly detected 91% of the flaws identified by human annotators, compared to 79% for GPT-4. We demonstrated the effectiveness of the two methods in identifying common item-writing flaws in student-generated questions across different subject areas. The rule-based method can accurately and efficiently evaluate multiple-choice questions from multiple domains, outperforming GPT-4 and going beyond existing metrics that do not account for the educational use of such questions. Finally, we discuss the potential for using these automated methods to improve question quality based on the identified flaws.

📄 Read More: https://link.springer.com/chapter/10.1007/978-3-031-42682-7_16
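
The paper’s rule-based method covers 19 item-writing flaws, which the abstract does not enumerate. The sketch below illustrates the general idea with three commonly cited flaws (an “all/none of the above” option, a negatively worded stem, and a “longest option is correct” cue); these flaw names and thresholds are assumptions for demonstration, not the authors’ actual rules.

```python
# Illustrative rule-based checks for a few common item-writing flaws.
# The specific flaws and thresholds are assumptions for demonstration;
# the paper defines 19 flaws with its own rules.

def find_flaws(stem: str, options: list[str], correct_index: int) -> list[str]:
    """Return a list of flaw descriptions detected in one multiple-choice item."""
    flaws = []
    if any(opt.strip().lower() in {"all of the above", "none of the above"}
           for opt in options):
        flaws.append("uses an 'all/none of the above' option")
    if any(word in stem.lower().split() for word in ("not", "except")):
        flaws.append("negatively worded stem")
    correct = options[correct_index]
    distractors = [o for i, o in enumerate(options) if i != correct_index]
    if distractors and len(correct) > 1.5 * max(len(d) for d in distractors):
        flaws.append("correct option is noticeably longer than the distractors")
    return flaws

print(find_flaws(
    "Which of the following is NOT a greenhouse gas?",
    ["Oxygen", "Carbon dioxide", "Methane", "All of the above"],
    correct_index=0,
))
```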


12:30-13:00 WEST
Generative Pre-trained Transformers for Coding Text Data? An Analysis with Classroom Orchestration Data

Ishari Amarasinghe, Francielle Marques, Ariel Ortiz-Beltran and Davinia Hernández-Leo

Abstract: Content analysis is important for researchers in technology-enhanced learning. Text transcripts, for example those obtained from video recordings, enable the application of a coding scheme that groups the text into categories highlighting the key themes. However, manually coding text is demanding and requires considerable time and effort from human annotators. This study therefore explores the possibility of using Generative Pre-trained Transformer 3 (GPT-3) models to automate the coding of text data, compared to baseline classical machine learning approaches, using a dataset manually coded for the orchestration actions of six teachers in classroom collaborative learning sessions. The findings of our study show that a fine-tuned GPT-3 (curie) model outperformed the classical approaches (F1 score of 0.87) and reached a Cohen’s kappa of 0.77, indicating moderate agreement between manual and machine coding. The study also highlights the limitations of our text transcripts and underlines the importance of multimodal observations that capture the context of orchestration actions.

📄 Read More: https://link.springer.com/chapter/10.1007/978-3-031-42682-7_3
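
The reported F1 score and Cohen’s kappa compare machine-assigned codes against the manual codes; the abstract does not say how F1 was averaged across categories or what the coding labels were. A minimal sketch of that comparison, assuming weighted-average F1, scikit-learn’s implementations, and hypothetical orchestration code labels:

```python
# Sketch of comparing manual vs. machine-assigned codes. The label names and the
# "weighted" F1 averaging scheme are assumptions for illustration only.
from sklearn.metrics import cohen_kappa_score, f1_score

manual_codes  = ["explaining", "monitoring", "explaining", "questioning", "monitoring"]
machine_codes = ["explaining", "monitoring", "questioning", "questioning", "monitoring"]

kappa = cohen_kappa_score(manual_codes, machine_codes)
f1 = f1_score(manual_codes, machine_codes, average="weighted")
print(f"Cohen's kappa: {kappa:.2f}, weighted F1: {f1:.2f}")
```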