Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), working with a colleague at Microsoft Research, have developed a new software system that can automatically identify errors in students’ programming assignments and recommend corrections.
Teaching assistants at MIT have already begun using the software. But some variation on it could help solve one of the biggest problems faced by massive open online courses (MOOCs) like those offered through edX, the online learning initiative created by MIT and Harvard University: how to automate grading.
The system grew out of work on program synthesis — the automatic generation of computer programs that meet a programmer’s specifications — at CSAIL’s Computer-Aided Programming Group, which is led by Armando Solar-Lezama, the NBX Career Development Assistant Professor of Computer Science and Engineering. A paper describing the work will be presented this month at the Association for Computing Machinery’s Programming Language Design and Implementation conference. Joining Solar-Lezama on the paper are first author Rishabh Singh, a graduate student in his group, and Sumit Gulwani of Microsoft Research.
“One challenge, when TAs grade these assignments, is that there are many different ways to solve the same problem,” Singh says. “For a TA, it can be quite hard to figure out what type of solution the student is trying to do and what’s wrong with it.” One advantage of the new software is that it will identify the minimum number of corrections necessary to get a program working, no matter how unorthodox the programmer’s approach.
Pursuing alternatives
The new system does depend on a catalogue of the types of errors that student programmers tend to make. One such error is to begin counting from zero on one pass through a series of data items and from one in another; another is to forget to add the condition of equality to a comparison, writing “If a is greater than b, do x” where “If a is greater than or equal to b, do x” was intended.
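To make the two catalogued errors concrete, here is a minimal sketch in Python. The function names and the counting task are hypothetical, chosen only for illustration; they do not come from the researchers' paper, and the off-by-one slip is simplified to a single loop that starts at the wrong index.

```python
def count_at_least_buggy(xs, t):
    """Hypothetical student version containing both slips."""
    count = 0
    for i in range(1, len(xs)):   # slip 1: starts at index 1, skipping xs[0]
        if xs[i] > t:             # slip 2: '>' where '>=' was intended
            count += 1
    return count


def count_at_least(xs, t):
    """Intended behaviour: count the items that are greater than or equal to t."""
    count = 0
    for i in range(0, len(xs)):
        if xs[i] >= t:
            count += 1
    return count
```

On the input `[3, 1, 3]` with threshold `3`, the intended version counts two items while the buggy one counts none, so a single test case already exposes the difference.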
The first step for the researchers’ automated-grading algorithm is to identify all the spots in a student’s program where any of the common errors might have occurred. At each of those spots, the possible error establishes a range of variations in the program’s output: one output if counting begins at zero, for instance, another if it begins at one. Every possible combination of variations represents a different candidate for the corrected version of the student’s program.
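The way candidates multiply can be sketched as follows. This is an illustration under stated assumptions, not the researchers' implementation: the site names (`start`, `compare`) and the `SITES` table are hypothetical, and each site simply lists the alternatives that the error catalogue would allow at that spot.

```python
from itertools import product

# Hypothetical correction sites; the first option at each site is what the student wrote.
SITES = {
    "start": [0, 1],          # loop begins counting at 0 or at 1
    "compare": [">", ">="],   # strict or non-strict comparison
}


def candidates(sites):
    """Yield every combination of options, taking one choice per site."""
    names = list(sites)
    for combo in product(*(sites[name] for name in names)):
        yield dict(zip(names, combo))


def count_candidates(sites):
    """Size of the search space: the product of the option counts."""
    total = 1
    for options in sites.values():
        total *= len(options)
    return total


all_candidates = list(candidates(SITES))
```

Two sites with two options each give four candidates. The count grows multiplicatively: a program with 50 two-option sites (an illustrative figure, not one taken from the paper) would already have 2**50 candidates, a little over 10^15.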
“The search space is quite big,” Singh says. “You typically get 10^15, 10^20 possible student solutions after doing these corrections. This is where the work on synthesis that we’ve been doing comes in. We can efficiently search this space within seconds.”
One key insight from program synthesis is that the relationships between a program’s inputs and outputs can be described by a set of equations, which can be generated automatically. And solving equations is generally more efficient than running lots of candidate programs to see what answers they give.
But wherever a possible error has established a range of outputs in the original program, the corresponding equation has a “parameter” — a variable that can take on a limited range of values. Finding values for those variables that yield working programs is itself a daunting search problem.
Limiting options
The CSAIL researchers’ algorithm solves this search problem by first selecting a single target input that should elicit a specific output from a properly working program. That requirement in itself wipes out a large number of candidate programs: Many of them will give the wrong answer even for that one input. Various candidates remain, however, and the algorithm next selects one of them at random. For that program, it then finds an input that yields an incorrect output. That input becomes a new target input for all the remaining candidate programs, and so forth, iterating back and forth between fixed inputs and fixed programs.
This process converges on a working program with surprising speed. “Most of the corrections that are wrong are going to be really wrong,” Solar-Lezama explains. “They’re going to be wrong for most inputs. So by getting rid of the things that are wrong on even a small number of inputs, you’ve already gotten rid of most of the wrong things. It’s actually hard to write a wrong thing that is going to be wrong only on one carefully selected input. But if that’s the case, then once you have that one carefully selected input, that’s that.”
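The back-and-forth between fixed inputs and fixed programs can be sketched in a few lines of Python. This is a toy under loud assumptions, not the researchers' system: the task, the names (`reference`, `run_candidate`, `STUDENT`, `OPTIONS`, `INPUTS`, `repair`) and the two correction sites are hypothetical; plain enumeration over a small bounded input space stands in for the equation solving described earlier; and the candidate is picked by fewest changes from the student's code rather than at random, so that the result is deterministic and reflects the goal of a minimal correction.

```python
from itertools import product


def reference(xs, t):
    """Instructor's solution: count the items that are >= t."""
    return sum(1 for x in xs if x >= t)


def run_candidate(params, xs, t):
    """Student's program with two correction sites exposed as parameters."""
    count = 0
    for i in range(params["start"], len(xs)):
        hit = xs[i] >= t if params["inclusive"] else xs[i] > t
        if hit:
            count += 1
    return count


STUDENT = {"start": 1, "inclusive": False}             # what the student wrote
OPTIONS = {"start": [0, 1], "inclusive": [False, True]}

# Small bounded input space, used here in place of a real verifier.
INPUTS = [(list(xs), t)
          for n in range(4)
          for xs in product(range(3), repeat=n)
          for t in (1, 2)]


def cost(params):
    """Number of sites where a candidate differs from the student's code."""
    return sum(params[k] != STUDENT[k] for k in STUDENT)


def find_counterexample(params):
    """Return an input on which the candidate disagrees with the reference, or None."""
    for xs, t in INPUTS:
        if run_candidate(params, xs, t) != reference(xs, t):
            return xs, t
    return None


def repair(first_target):
    names = list(OPTIONS)
    pool = [dict(zip(names, combo))
            for combo in product(*(OPTIONS[n] for n in names))]
    targets = [first_target]
    while True:
        # Fixed inputs: keep only candidates that are right on every target so far.
        pool = [p for p in pool
                if all(run_candidate(p, xs, t) == reference(xs, t)
                       for xs, t in targets)]
        if not pool:
            return None, targets
        # Fixed program: pick one survivor and look for an input that breaks it.
        chosen = min(pool, key=cost)
        bad_input = find_counterexample(chosen)
        if bad_input is None:
            return chosen, targets
        targets.append(bad_input)


fix, targets_used = repair(([2, 0], 1))
```

Starting from the single target `([2, 0], 1)`, the first filter discards both candidates that start at index 1. The cheapest survivor still uses the strict comparison, the input `([1], 1)` exposes it, and that input becomes the second target. Only the fully corrected candidate passes both, so the loop stops after two targets, which mirrors Solar-Lezama's point that a few well-chosen inputs eliminate most wrong corrections.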
The researchers are evaluating how their system might be used to grade homework assignments in programming MOOCs. In some sense, the system works too well: As a tool to help TAs grade homework assignments, it currently provides specific feedback, including the line numbers of errors and suggested corrections. But for online students, part of the learning process may very well involve discovering errors themselves. The researchers are experimenting with variations on the software that indicate the location and nature of errors with different degrees of specificity, and are talking with the edX team about how the program could best be used as a pedagogic tool.
“I think that the programming-languages community has a lot to offer the broader world,” says David Walker, an associate professor of computer science at Princeton University. “Armando is looking at using these synthesis techniques to try to help in education, and I think that’s a fantastic application of this kind of programming-languages technology.”
“The kind of thing that they’re doing here is definitely just the beginning,” Walker cautions. “It will be a big challenge to scale this type of technology up so that you can use it not just in the context of the very small introductory programming examples that they cover in their paper, but in larger-scale second- or third-year problems. But it’s a very exciting area.”