Automated Grading Done Right

The Khan Academy dashboard is meant to give students and teachers information that can help them pinpoint where a student is struggling, and improve. Unfortunately, the data it provides isn’t what’s useful to teachers, just what’s easy for computers to measure: time spent watching and rewinding videos, time spent on different topics (broken down by videos and quizzes), and proficiency levels in exercises. But as one of the teachers I follow on Twitter points out, these programs don’t let him “SEE my students’ work so I can know HOW/WHY they got questions wrong.” One site attempts to counteract that: math teachers send in anonymized samples of student errors that they find telling, common, or inscrutable, and other teachers comment on what they think the student is misunderstanding and how to fix it. In the process, newer teachers get to see the thought process of their more experienced colleagues. There are patterns to these mistakes, so can a computer be programmed to recognize them?

The most obvious example, which I doubt I’m the first to come up with, is to anticipate patterns of wrong answers. Let’s say you’re testing a physics problem where students need to plug values into an equation. (Yes, this is a naive view of physics, but go with it.) Have experienced teachers compile the mistakes students are likely to make: forgetting to square something, leaving off a constant, dividing instead of multiplying, using a different formula, and so forth. Then the grading software picks new values to plug into the formula and calculates the wrong answer each mistake would produce, choosing values so that no two mistake patterns lead to the same numeric answer. Now, if the student gives any of the anticipated wrong answers, the program knows exactly which mistake they made and can correct it. Hopefully finding a mechanistic error will give the human teacher a window into finding and fixing a qualitative misconception.
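To make this concrete, here is a minimal sketch of the idea, using kinetic energy (E = mv²/2) as a stand-in problem. The formula, the mistake patterns, and all the names are my own illustration, not anything from an existing grading system.

```python
# Each mistake pattern is a formula variant; "correct" is the real one.
MISTAKES = {
    "correct": lambda m, v: m * v**2 / 2,
    "forgot to square v": lambda m, v: m * v / 2,
    "left off the 1/2": lambda m, v: m * v**2,
    "divided instead of multiplied": lambda m, v: m / v**2 / 2,
}

def pick_values(candidates):
    """Pick (m, v) so that every pattern yields a distinct answer."""
    for m, v in candidates:
        answers = [f(m, v) for f in MISTAKES.values()]
        if len(set(answers)) == len(answers):
            return m, v
    raise ValueError("no collision-free values found")

def diagnose(m, v, student_answer):
    """Match a student's answer against the anticipated patterns."""
    for label, f in MISTAKES.items():
        if abs(f(m, v) - student_answer) < 1e-9:
            return label
    return "unrecognized error"

# m=2, v=1 makes "correct" and "forgot to square v" collide (both give 1),
# so the picker falls through to (3, 4), where all four answers differ.
m, v = pick_values([(2, 1), (3, 4)])
print(diagnose(m, v, 48))  # a student who left off the 1/2
```

The collision check matters: if two mistakes produce the same number, the diagnosis is ambiguous and the values should be rejected.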

Let’s take a more complex, real-world example. In some computer science classes at Tufts, the programs written for homework are subjected to a battery of tests, written by both the professor and the students (and their predecessors). In one case, the assignment is to create a programming language interpreter that determines the “type” of pieces of code. For example, it needs to know that true is a boolean, 7 is an integer, and asking whether true equals 7 is an error. To clarify, there’s the code the students write (called an interpreter) and then the code that it tries to type, as a test. An interpreter can fail in a number of ways: it can find the wrong type, find a type when it should raise an error, raise an error when it should find a type, stop unexpectedly with an exception, or never stop at all.
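A grading harness that buckets results into those failure modes might look something like this sketch. Everything here is my guess at the shape of such a harness, not the actual Tufts test suite: `infer_type` stands in for a student’s submission, and `ExpectedTypeError` for the error the interpreter is supposed to raise.

```python
class ExpectedTypeError(Exception):
    """The error a correct interpreter raises on ill-typed code."""

def classify(infer_type, program, expected):
    """Run one test; `expected` is a type name like 'bool', or 'error'.

    Non-termination is the one mode this sketch can't catch; a real
    harness would run the interpreter under a timeout.
    """
    try:
        result = infer_type(program)
    except ExpectedTypeError:
        return "pass" if expected == "error" else "raised error, expected a type"
    except Exception:
        return "crashed with an exception"
    if expected == "error":
        return "found a type, expected an error"
    return "pass" if result == expected else "found the wrong type"

# A toy (deliberately buggy) student interpreter for illustration:
def toy_infer(program):
    if program == "true":
        return "bool"
    if program == "true = 7":
        raise ExpectedTypeError()
    if program == "7":
        return "bool"       # bug: mistypes integers
    raise RuntimeError("unhandled syntax")

print(classify(toy_infer, "7", "int"))  # found the wrong type
```

The point is that each bucket is mechanically distinguishable, which is exactly what makes the error data classifiable later on.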

I know that’s a bit much to wrap your head around, but (1) that’s the sort of complexity we’re up against and (2) it’s not just an example, it’s a case study. I have a visualization for this data already made, as a class project. My group wanted to take this data and provide actionable recommendations for the professor, to be able to say, “you’re not handling this properly” or “you don’t understand this very specific detail.” So we hand-built an automated classifier using what we knew about the errors. Here’s part of the visualization we came up with.

Circle errors

The size of each circle represents the number of students who failed at least one test with that error. The vertical position of the circle corresponds to the average number of tests passed by the students who got that error. Colors encode categories, and the horizontal spread means nothing (just a way to prevent overlap). Click on a circle and you’ll get:

Error bars by student

Each bar is a student, identified by an anonymized hash. A student’s errors are grouped together, and the taller bars mark the currently selected error. On the real thing (not these images), you can click any other bar to jump to that error, and hovering over a bar lists, below the circles, each test that student failed with that error. This highly specific information lets the user look at the individual tests and hypothesize the underlying cause of the error.
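The circle encodings described above are straightforward aggregations. As a sketch, assuming the raw results look like `{student: {"passed": tests_passed, "errors": set_of_error_names}}` (my assumed format, not the project’s actual one), they could be recomputed like this:

```python
from collections import defaultdict

def circle_stats(results):
    """Compute, per error, the circle size and vertical position."""
    by_error = defaultdict(list)
    for student, rec in results.items():
        for err in rec["errors"]:
            by_error[err].append(rec["passed"])
    # size -> number of students who hit the error (circle area);
    # y    -> average tests passed by those students (vertical position)
    return {err: {"size": len(ps), "y": sum(ps) / len(ps)}
            for err, ps in by_error.items()}

stats = circle_stats({
    "a1f3": {"passed": 10, "errors": {"wrong type"}},
    "b7c9": {"passed": 4,  "errors": {"wrong type", "exception"}},
})
print(stats["wrong type"])  # {'size': 2, 'y': 7.0}
```

Errors that sit low on the chart with big circles are the ones many struggling students share, which is where a teacher’s attention pays off most.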

You can play with the interactive visualization online here.

Education isn’t a no-computers-allowed clubhouse, but software developers must be smarter about how we approach these tools. Programmers need to work with educators and meet their needs, not just offer up whatever statistics are easy to collect. We have powerful tools like machine learning and visualization, and teachers with decades of experience. We can make useful automated systems, if we stop acting as if it’s a trivial job.

And yet… all of this views education as a series of questions with right and wrong answers. That’s largely true in the STEM fields and even in the humanities, but not in the arts. There really is no good way to automate grading of the arts (or to grade the arts, for that matter). We need to nurture our future artists, but more importantly, we need to teach the skills necessary to appreciate the arts, and to dabble in them. As part of the human condition, we all find ourselves with emotions and ideas that we need to express, through music, painting, or writing (writes the blogger). The fact that very few people will appreciate these works is fine; what matters is the catharsis they give their creator. That’s something no machine could ever understand.

