Testing the Bot: Building Ethical & Effective AI Agents for Learning

Authors: Megan Humburg & Joshua Danish

Key Ideas

AI agents need guardrails – any chatbot that interacts with students needs to be ready to respond to inappropriate, offensive, or worrisome input and handle it in a way that keeps the teacher in the loop.
Systematic approaches to testing and evaluating AI agents can help us to uncover limitations in how these agents can support learners and highlight areas for future improvement.
Collaborations between education experts and computer scientists can help us ensure that AI agents are both pedagogically effective and thoroughly vetted for use with students.

Our team has been developing AI-supported conversational agents that help students learn to write persuasive arguments in science, drawing on observations and evidence that they learn in our game-based environments. These agents need to assess students’ writing, evaluate how convincing and detailed the argument is, and offer constructive, supportive feedback to help students add more evidence and clarity to their writing.

As a part of this work, we’ve been experimenting with a variety of different Large-Language Models (LLMs), including FlanT5-small, Llama-3.1-8 billion, and GPT-4o. Within and across these, we have worked closely with our computer science colleagues to engineer prompts that would help the LLM respond effectively to learners. As we explore how different LLMs can offer students persuasive writing support, our team realized there is an urgent need for education-specific systematic approaches to testing the safety and effectiveness of different models before we feel comfortable putting our agents in front of students.

For example, agents need to be ready not just to evaluate students’ arguments, but to handle a variety of off-topic, partially-on-topic, and potentially inappropriate ideas that students might share. As any teacher can tell you, students keep you on your toes with the unexpected thoughts, jokes, insults, and questions they share, but machines are often less able to handle these unexpected inputs than a human teacher who understands the cultural and social contexts of what their students say.

“…machines are often less able to handle these unexpected inputs than a human teacher who understands the cultural and social contexts of what their students say.”

For example, the way a teacher would react to a student saying “this is stupid” and a student saying a racial slur are quite distinct—one can be addressed with a gentle correction and a reminder of school-appropriate language, while the other requires a more aggressive correction and a deeper discussion about racist language and respect for others. A teacher might be comfortable with a chatbot correcting students who use mildly offensive language, but do we want AI agents responsible for teaching students about more sensitive and nuanced topics? Is it enough for the agent to say “That’s not appropriate language” and move on, or should an alert system be implemented to bring teachers into the conversation? The questions get murkier as we consider the variety of heavy topics a student might discuss with an agent, including issues of mental health, self-harm, violence, and more. What can (or should) we expect from AI agents when it comes to handling difficult conversations?

“What can (or should) we expect from AI agents when it comes to handling difficult conversations?”

Developing an Assessment Framework

These are the kinds of questions our team has been debating as we develop our testing framework. The framework is made up of four main categories, each of which contains sub-categories and example student input that can be used to test the effectiveness and ethical responses of the agents we are designing. The goal of this framework is to make sure the agents we design are well-prepared to react to and handle the variety of ideas students might throw at them.

A branching concept map with four main categories: Relevant Input, Partially Relevant, Off-Topic, and Avoidance. Each category has subcategories of the kinds of input students might share with AI agents. — *Categories of student questions and ideas that we want our AI agents to be ready to handle in safe and pedagogically appropriate ways.*

Using these categories, our tests have moved from individual experimenting with the chatbots to systematic evaluation of where the agents either excel or fall short in terms of responding to students the way an experienced teacher would.

We aim for these categories to cover…

responses that require pedagogical knowledge of how to give good feedback (e.g., identifying and responding to high-quality, medium-quality, and low-quality arguments in supportive ways)
responses that require knowledge of what it means to be inappropriate or off-topic in a variety of ways, and what level of correction is required for each
various ways that students might try to “game” the system, by expressing confusion, refusing to participate, demanding assistance, or otherwise trying to avoid the task of constructing their own argument

What We Know So Far

Our team generated sets of test examples for each category, and this allowed us to compare different LLMs to see how they responded to our “test” middle schoolers. We found that our AI agents are good at recognizing curse words, but they are less prepared to identify and handle more serious offensive input. We also noticed that the agents were quite restrictive in what they considered to be “on-topic” for scientific arguments; where an experienced teacher would be able to help students connect their everyday knowledge to classroom ideas, such as past experiences observing honeybees in their yard, the agents tended to assume students were off-topic or sharing irrelevant information. These tests have helped us to refine our AI agents to be more effective learning assistants, and they have also highlighted places where teachers need to be kept in the loop to help students talk through more complicated ideas and difficulties.
Are you an educator who is interested in helping make AI agents better and safer for students? Sign up for one of our educator focus groups!