Learning Program Embeddings to Propagate Feedback on Student Code
Chris Piech, Jonathan Huang, Andy Nguyen, Mike Phulsuksombati, Mehran Sahami, Leonidas Guibas
Chris Piech (piech@cs.stanford.edu)
Jonathan Huang (jonathanhuang@google.com)
Andy Nguyen (tanonev@cs.stanford.edu)
Mike Phulsuksombati (mikep@cs.stanford.edu)
Mehran Sahami (sahami@cs.stanford.edu)
Leonidas Guibas (guibas@cs.stanford.edu)
Abstract
Providing feedback, both assessing final work and giving hints to stuck students, is difficult for open-ended assignments in massive online classes, which can range from thousands to millions of students. We introduce a neural network method to encode programs as a linear mapping from an embedded precondition space to an embedded postcondition space and propose an algorithm for feedback at scale using these linear maps as features. We apply our algorithm to assessments from the Code.org Hour of Code and Stanford University's CS1 course, where we propagate human comments on student assignments to orders of magnitude more submissions.
1. Introduction
Online computer science courses can be massive, with numbers ranging from thousands to even millions of students. Though technology has increased our ability to provide content to students at scale, assessing and providing feedback (both for final work and partial solutions) remains difficult. Currently, giving personalized feedback, a staple of quality education, is costly for small, in-person classrooms and prohibitively expensive for massive classes. Autonomously providing feedback is therefore a central challenge for at-scale computer science education.

It can be difficult to apply machine learning directly to data in the form of programs. Program representations such as the Abstract Syntax Tree (AST) are not directly conducive to standard statistical methods, and the edit distance metric between such trees is not discriminative enough to be used to share feedback accurately, since programs with similar ASTs can behave quite differently and require different comments. Moreover, though unit tests are a useful way to
Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

test if final solutions are correct, they are not well suited for giving help to students with an intermediate solution, and they are not able to give feedback on stylistic elements.

There are two major goals of our paper. The first is to automatically learn a feature embedding of student-submitted programs that captures functional and stylistic elements and can be easily used in typical supervised machine learning systems. The second is to use these features to learn how to give automatic feedback to students. Inspired by recent successes of deep learning for learning features in other domains like NLP and vision, we formulate a novel neural network architecture that allows us to jointly optimize an embedding of programs and memory-state in a feature space. See Figure 1 for an example program and corresponding matrix embeddings.

To gather data, we exploit the fact that programs are executable: we can evaluate any piece of code on an arbitrary input (i.e., the precondition) and observe the state after (the postcondition). For a program and its constituent parts we can thus collect arbitrarily many such precondition/postcondition mappings. This data provides the training set from which we can learn a shared representation for programs. To evaluate our program embeddings we test our ability to amplify teacher feedback. We use real student data from the Code.org Hour of Code, which has been attempted by over 27 million learners, making it, to the best of our knowledge, the largest online course to date. We then show how the same approach can be used for submissions in Stanford University's Programming Methodologies course, which has thousands of students and assignments that are substantially more complex.
The programs we analyze are written in a Turing-complete language but do not allow for user-defined variables.

Our main contributions are as follows. First, we present a method for computing features of code that capture both functional and stylistic elements. Our model works by simultaneously embedding precondition and postcondition spaces of a set of programs into a feature space where programs can be viewed as linear maps on this space. Second,

public class Program extends Karel {
    // Execution starts here
    public void run() {
        // Robot method
        putBeeper();
        placeRow();
        putBeeper();
    }

    // User-defined method
    private void placeRow() {
        while (isClear()) {
            putBeeper();
            move();
        }
        putBeeper();
    }
}
Figure 1.
We learn matrices which capture functionality. Left: a student partial solution. Right: learned matrices for the syntax trees rooted at each node of placeRow.

we show how our code features can be useful for automatically propagating instructor feedback to students in a massive course. Finally, we demonstrate the effectiveness of our methods on large-scale datasets. Learning embeddings of programs is fertile ground for machine learning research, and if such embeddings can be useful for the propagation of teacher feedback, this line of investigation will have a sizable impact on the future of computer science education.
2. Related Work
The advent of massive online computer science courses has made automated reasoning over large code collections an important problem. There have been a number of recent papers (Huang et al., 2013; Basu et al., 2013; Nguyen et al., 2014; Brooks et al., 2014; Lan et al., 2015; Piech et al., 2015) on using large homework submission datasets to improve student feedback. The volume of work speaks to the importance of this problem. Despite the research efforts, however, providing quality feedback at scale remains an open problem.

A central challenge that a number of papers address is that of measuring similarity between source code. Some authors have done this without an explicit featurization of the code; for example, the AST edit distance has been a popular choice (Huang et al., 2013; Rogers et al., 2014). Mokbel et al. (2013) explicitly hand-engineered a small collection of features on ASTs that are meant to be domain-independent.

To incorporate functionality, Nguyen et al. (2014) proposed a method that discovers program modifications that do not appear to change the semantic meaning of code. The embedded representations of programs used in this paper also capture semantic similarities and are more amenable to prediction tasks such as propagating feedback. We ran feedback propagation on student data using methods from Nguyen et al. (2014) and observe that embeddings enabled notable improvement (see Section 6.3).

Embedding programs has many crossovers with embedding natural language artifacts, given the similarity between the AST representation and parse trees. Our models are related to recent work from the NLP and deep learning communities on recursive neural networks, particularly for modeling semantics in sentences or symbolic expressions (Socher et al., 2013; 2011; Zaremba et al., 2014; Bowman, 2013).

Finally, representing a potentially complicated function (which in our case is a program) as a linear operator acting on a nonlinear feature space has also been explored in different communities. The computer graphics community has represented pairings of nonlinear geometric shapes as linear maps between shape features, called functional maps (Ovsjanikov et al., 2012; 2013). From the kernel methods literature, there has also been recent work on representations of conditional probability distributions as operators on a Hilbert space (Song et al., 2013; 2009). From this point of view, our work is novel in that it focuses on the joint optimization of feature embeddings together with a collection of maps so that the maps simultaneously "look linear" with respect to the feature space.
3. Embedding Hoare Triples
Our core problem is to represent a program as a point in a fixed-dimension real-valued space that can then be used directly as input for typical supervised learning algorithms. While there are many dimensions that "characterize" a program, including aspects such as style or time/space complexity, we begin by first focusing on capturing the most basic aspect of a program: its function. While capturing the function of the program ignores aspects that can be useful in application (such as giving stylistic feedback in CS education), we discuss in later sections how elements of style can be recaptured by modeling the function of subprograms that correspond to each subtree of an AST. Given a program A (where we consider a program to generally be any executable code, whether a full submission or a subtree of a submission) and a precondition P, we thus would like to learn features of A that are useful for predicting the outcome of running A when P holds. In other words, we want to predict a postcondition Q out of some space of possible postconditions. Without loss of generality we let P and Q be real-valued vectors encapsulating the "state" of the program (i.e., the values of all program variables) at a particular time. For example, in a grid world, this vector would contain the location of the agent, the direction the agent is facing, the status of the board and whether the program has crashed. Figure 2 visualizes two preconditions, and the corresponding postconditions, for a simple program.

We propose to learn program features using a training set of

[Figure 2: encoder/decoder diagram for a program A: method step() { putBeeper(); move(); }]

Figure 2.
Diagram of the model for a program A implementing a simple "step forward" behavior in a small 1-dimensional gridworld. Two of the k Hoare triples that correspond with A are shown. Typical worlds are larger and programs are more complex.

(P, A, Q)-triples, so-called Hoare triples (Hoare, 1969), obtained via historical runs of a collection of programs on a collection of preconditions. We discuss the process by which such a dataset can be obtained in Section 5. The main approach that we espouse in this paper is to simultaneously find an embedding of states and programs into feature space where pre- and postconditions are points in this space and programs are mappings between them.

The simple way that we propose to relate preconditions to postconditions is through a linear transformation. Explicitly, given a (P, A, Q)-triple, if f_P and f_Q are m-dimensional nonlinear feature representations of the pre- and postconditions P and Q, respectively, then we relate the embeddings via the equation

    f_Q = M_A · f_P.    (1)

We then take the m × m matrix of coefficients M_A as our feature representation of the program A and refer to it as the program embedding matrix. We will want to learn the mapping into feature space f as well as the linear map M_A such that this equality holds for all observed triples and can generalize to predict postcondition Q given P and A.

At first blush, this linear relationship may seem too limiting, as programs are in general neither linear nor continuous. By learning a nonlinear embedding function f for the pre- and postcondition spaces, however, we can capture a rich family of nonlinear relationships, much in the same way that kernel methods allow for nonlinear decision boundaries.

As described so far, there remain a number of modeling choices to be made. In the following, we elaborate further on how we model the feature embeddings f_P and f_Q of the pre- and postconditions, and how to model the program embedding matrix M_A.
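To make Equation 1 concrete, consider a toy sketch (our illustration, not code from the paper): in a 1-D gridworld with the agent's position given a one-hot encoding, a Karel-style move() command really is a linear map, namely a shift (permutation) matrix.

```python
import numpy as np

# Toy illustration: in a 1-D gridworld of n cells, encode the agent's
# position as a one-hot vector. Under that encoding, the program "move()"
# is exactly a linear map: a shift (permutation-like) matrix M_move.
n = 5
M_move = np.eye(n, k=-1)        # shifts the one-hot entry one cell right
M_move[-1, -1] = 1.0            # agent stays put at the wall

def encode(pos):                 # f_P: state -> one-hot feature vector
    f = np.zeros(n)
    f[pos] = 1.0
    return f

f_P = encode(2)                  # precondition: agent at cell 2
f_Q = M_move @ f_P               # postcondition embedding via Eq. (1)
print(int(np.argmax(f_Q)))       # decoded postcondition: agent at cell 3
```

For realistic programs no such exact linear map exists over the raw state, which is why the paper learns the nonlinear encoding jointly with the matrices.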
We assume that preconditions have some base encoding as a d-dimensional vector, which we refer to as P. For example, in image processing courses, the state space could simply be the pixel encoding of an image, whereas in the discrete gridworld-type programming problems that we use in our experiments, we might choose to encode the (x, y)-coordinate and discretized heading of a robot using a concatenation of one-hot encodings. Similarly, we assume that there is a base encoding Q of the postcondition.

We will focus our exposition in the remainder of our paper on the case where the precondition space and postcondition space share a common base encoding. This is particularly appropriate to our experimental setting, in which both the preconditions and postconditions are representations of a gridworld. In this case, we can use the same decoder parameters (i.e., W_dec and b_dec) to decode both from precondition space and postcondition space, a fact that we will exploit in the following section.

Inspired by nonlinear autoencoders, we parameterize a mapping, called the encoder, from precondition P to a nonlinear m-dimensional feature representation f_P. As with traditional autoencoders, we use an affine mapping composed with an elementwise nonlinearity:

    f_P = φ(W_enc · P + b_enc),    (2)

where W_enc ∈ R^{m×d}, b_enc ∈ R^m, and φ is an elementwise nonlinear function (such as tanh). At this point, we can use the representation f_P to decode or reconstruct the original precondition as a traditional autoencoder would do, using:

    P̂ = ψ(W_dec · f_P + b_dec),    (3)

where W_dec ∈ R^{d×m}, b_dec ∈ R^d, and ψ is some (potentially different) elementwise nonlinear function. Moreover, we can push the precondition embedding f_P through Equation 1 and decode the postcondition embedding f_Q = M_A · f_P. This mapping, which reconstructs the postcondition Q, the decoder, takes the form:

    Q̂ = ψ(W_dec · f_Q + b_dec)    (4)
       = ψ(W_dec · M_A · f_P + b_dec).    (5)

Figure 2 diagrams our model on a simple program. Note that it is possible to swap in alternative feature representations. We have experimented with using a deep, stacked autoencoder; however, our results have not shown these to help much in the context of our datasets.
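The forward pass of Equations 2 through 5 can be sketched in a few lines of NumPy. The dimensions and randomly initialized parameters below are placeholders, since the paper learns W_enc, W_dec, the biases and each M_A jointly by gradient descent:

```python
import numpy as np

# Sketch of the forward pass in Eqs. (2)-(5); parameters are random
# stand-ins for the learned ones.
rng = np.random.default_rng(0)
d, m = 20, 8                          # base state dim d, embedding dim m

W_enc, b_enc = rng.normal(size=(m, d)), np.zeros(m)
W_dec, b_dec = rng.normal(size=(d, m)), np.zeros(d)
M_A = rng.normal(size=(m, m))         # program embedding matrix for A

def softmax(z):                       # a choice of psi with probability outputs
    e = np.exp(z - z.max())
    return e / e.sum()

P = rng.normal(size=d)                # base encoding of a precondition
f_P = np.tanh(W_enc @ P + b_enc)      # Eq. (2): encode
f_Q = M_A @ f_P                       # Eq. (1): program acts as a linear map
P_hat = softmax(W_dec @ f_P + b_dec)  # Eq. (3): autoencoding branch
Q_hat = softmax(W_dec @ f_Q + b_dec)  # Eqs. (4)-(5): predicted postcondition

assert Q_hat.shape == (d,) and abs(Q_hat.sum() - 1.0) < 1e-9
```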
To encode the program embedding matrix, we propose a simple nonparametric model in which each program in the training set is associated with its own embedding matrix. Specifically, if the collection of unique programs is {A_1, ..., A_m}, then for each A_i we associate a matrix M_i. The entire parameter set for our nonparametric matrix model (henceforth abbreviated NPM) is thus:

    Θ = {W_dec, W_enc, b_enc, b_dec} ∪ {M_i : i = 1, ..., m}.

To learn the parameters, we minimize a sum of three terms: (1) a prediction loss ℓ_pred, which quantifies how well we can predict the postcondition of a program given a precondition; (2) an autoencoding loss ℓ_auto, which quantifies how good the encoder and decoder parameters are for reconstructing given preconditions; and (3) a regularization term R. Formally, given training triples {(P_i, A_i, Q_i)}_{i=1}^{n}, we can minimize the following objective function:

    L(Θ) = (1/n) Σ_{i=1}^{n} ℓ_pred(Q_i, Q̂_i(P_i, A_i; Θ))
         + (1/n) Σ_{i=1}^{n} ℓ_auto(P_i, P̂_i(P_i; Θ)) + λ R(Θ),    (6)

where R is a regularization term on the parameters and λ a regularization parameter. In our experiments, we use R to penalize the sum of the L2 norms of the weight matrices (excluding the bias terms b_enc and b_dec).

Any differentiable loss can conceptually be used for ℓ_pred and ℓ_auto. For example, when the top-level predictions, P̂ or Q̂, can be interpreted as probabilities (e.g., when φ is the softmax function), we use a cross-entropy loss function.

Informally speaking, one can think of our optimization problem (Equation 6) as trying to find a good shared representation of the state space: shared in the sense that even though programs are clearly not linear maps over the original state space, the hope is that we can discover some nonlinear encoding of the pre- and postconditions such that most programs simultaneously "look" linear in this new projected feature space. As we empirically show in Section 6, such a representation is indeed discoverable.

We run joint optimization using minibatch stochastic gradient descent without momentum, using ordinary backpropagation to calculate the gradient. We use random search (Bergstra & Bengio, 2012) to optimize over hyperparameters (e.g., regularization parameters, matrix dimensions, and minibatch size).
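As a rough sketch of evaluating Equation 6 on a minibatch (our illustration: `encode`, `decode`, and `program_mats` are hypothetical stand-ins for the learned parameters, and the regularizer here covers only the program matrices rather than all weight matrices):

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-12):
    # Differentiable loss usable for both l_pred and l_auto.
    return -np.sum(target * np.log(pred + eps))

def objective(triples, encode, decode, program_mats, lam):
    """triples: list of (P, program_id, Q) with P, Q as probability-style
    base encodings; program_mats: dict mapping id -> m x m matrix."""
    n = len(triples)
    loss = 0.0
    for P, prog_id, Q in triples:
        f_P = encode(P)
        Q_hat = decode(program_mats[prog_id] @ f_P)   # prediction branch
        P_hat = decode(f_P)                            # autoencoding branch
        loss += (cross_entropy(Q, Q_hat) + cross_entropy(P, P_hat)) / n
    reg = sum(np.sum(M ** 2) for M in program_mats.values())
    return loss + lam * reg
```

The paper minimizes this kind of objective with minibatch SGD and ordinary backpropagation over all parameters at once.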
Learning rates are set using Adagrad (Duchi et al., 2011). We seed our parameters using a "smart" initialization in which we first learn an autoencoder on the state space and perform a vector-valued ridge regression for each unique program to extract a matrix mapping the features of the precondition to the features of the postcondition. The encoder and decoder parameters and the program matrices are then jointly optimized.

For a given program S we extract Hoare triples by executing it on an exemplar set of unit tests. These tests span a variety of reasonable starting conditions. We instrument the execution of the program such that each time a subtree A ⊂ S is executed, we record the value, P, of all variables before execution and the value, Q, of all variables after execution, and save the triple (P, A, Q). We run all programs on unit tests, collecting triples for all subtrees. Doing so results in a large dataset {(P_i, A_i, Q_i)}_{i=1}^{n}, from which we collapse equivalent triples. In practice, some subtrees, especially the bodies of loops, generate a large (potentially infinite) number of triples. To prevent any subtree from having undue influence on our model, we limit the number of triples for any subtree.

Collecting triples on subtrees, as opposed to just collecting triples on complete programs, is critical since it allows us to learn embeddings not just for the root of a program AST but also for its constituent parts. As a result, we retain data on how a program was implemented, and not just on its overall functionality, which is important for student feedback, as we discuss in the next section. Collecting triples on subtrees also means we are able to optimize our embeddings with substantially more data.
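The instrumentation step can be illustrated with a toy interpreter (our example, not the paper's tooling): execute a tiny nested-block AST over a dictionary-valued state, recording a Hoare triple for every subtree executed.

```python
import copy

def run(node, state, triples):
    """node: ('seq', [children]) or ('cmd', fn), where fn mutates state.
    Records a (state-before, subtree, state-after) triple per subtree."""
    before = copy.deepcopy(state)
    kind, payload = node
    if kind == 'cmd':
        payload(state)
    else:  # 'seq': run children in order; each records its own triple
        for child in payload:
            run(child, state, triples)
    triples.append((before, node, copy.deepcopy(state)))

# A two-command program in a 1-D world: move, then put down a beeper.
move = ('cmd', lambda s: s.update(pos=s['pos'] + 1))
put  = ('cmd', lambda s: s.update(beepers=s['beepers'] + 1))
program = ('seq', [move, put])

triples = []
run(program, {'pos': 0, 'beepers': 0}, triples)
print(len(triples))                    # -> 3 (move, put, whole program)
print(triples[-1][0], triples[-1][2])  # -> {'pos': 0, 'beepers': 0} {'pos': 1, 'beepers': 1}
```

Running such instrumented execution over many starting states, and collapsing duplicate triples, yields the dataset described above.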
4. Feedback Propagation
The result of jointly learning to embed states and a corpus of programs is a fixed-dimensional, real-valued matrix M_A for each subtree A of any program in our corpus. These matrices can be used with machine learning algorithms to perform tasks beyond predicting what a program does. The central application in this paper is the force multiplication of teacher-provided feedback, where an active learning algorithm interacts with human graders such that feedback is given to many more assignments than the grader annotates. We propose a two-phase interaction. In the first phase, the algorithm selects a subset of exemplar programs to which graders apply a finite set of annotations. In the second phase, the algorithm uses the human-provided annotations as supervised labels with which it can learn to predict feedback for unlabelled submissions. Each program is annotated with a set H ⊆ L, where L is a discrete collection of N possible annotations. The annotations are meant to cover the range of comments a grader could apply, including feedback on style, strategy and functionality. For each ungraded submission, we must then decide which of the N labels to apply. As such, we view feedback propagation as N binary classification tasks.

One way of propagating feedback would be to use the elements of the embedding matrix of the root of a program as features and then train a classifier to predict appropriate feedback for a given program. However, the matrices we have learned for programs and their subtrees have been trained only to predict functionality. Consequently, any two programs that are functionally indistinguishable would be given the same instructor feedback under this approach, ignoring any strategic or stylistic differences between the programs.
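The "N binary classification tasks" view can be sketched as follows (our illustration: we use a trivial nearest-prototype classifier over flattened root embeddings as a stand-in; the paper's actual classifiers, built on the NPM-RNN model, are richer):

```python
import numpy as np

# Treat propagation of N annotations as N independent binary classifiers
# over flattened root embedding matrices.
def fit_prototypes(X, Y):
    """X: (n, m*m) flattened embeddings; Y: (n, N) binary label matrix.
    Returns, per label, the mean embedding of positive and of negative
    training examples."""
    protos = []
    for j in range(Y.shape[1]):
        pos, neg = X[Y[:, j] == 1], X[Y[:, j] == 0]
        protos.append((pos.mean(axis=0), neg.mean(axis=0)))
    return protos

def predict(protos, x):
    """Apply label j iff x is closer to label j's positive prototype."""
    return [int(np.linalg.norm(x - p) < np.linalg.norm(x - q))
            for p, q in protos]

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))                   # 6 programs, toy embeddings
Y = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]])
protos = fit_prototypes(X, Y)
print(predict(protos, X[0]))                  # one 0/1 decision per annotation
```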
To recapture the elements of program structure and style that are critical for student feedback, our approach to predicting feedback uses the embedding matrices learned for the NPM model, but incorporates all constituent subtrees of a given AST. Specifically, using the embedding matrices learned in the NPM model (which we henceforth denote as M_A^NPM for a subtree A), we now propose a new model based on recursive neural networks (called the NPM-RNN model) in which we parametrize a matrix M_A in this new model with an RNN whose architecture follows the abstract syntax tree (similar to the way in which RNN architectures might take the form of a parse tree in an NLP setting (Socher et al., 2013)).

In our RNN-based model, a subtree of the AST rooted at node j is represented by a matrix which is computed by combining (1) representations of subtrees rooted at the children of j, and (2) the embedding matrix of the subtree rooted at node j learned via the NPM model. By incorporating the embedding matrix from the NPM model, we are able to capture the function of every subtree in the AST.

Formally, we will assume each node is associated with some type in a set T = {ω_1, ω_2, ...}. Concretely, the type set might be the collection of keywords or built-in functions that can be called from a program in the dataset, e.g., T = {repeat, while, if, ...}. A node with type ω is assumed to have a fixed number, a_ω, of children in the AST; for example, a repeat node has two children, with one child holding the body of the repeat loop and the second representing the number of times the body is to be repeated. The representation of node j with type ω is then recursively computed in the NPM-RNN model via:

    a(j) = φ( Σ_{i=1}^{a_ω} W_{ω,i} · a(c_i[j]) + b_ω + μ M_j^NPM ),    (7)

where φ is a nonlinearity (such as tanh), c_i[j] indexes over the a_ω children of node j, and M_j^NPM is the program embedding matrix learned in the NPM model for the subtree rooted at node j.
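A sketch of the recursion in Equation 7 on a two-node AST, with random stand-in parameters (in the paper the W and b matrices are learned and the NPM matrices come from the trained NPM model; leaves here reuse the same recursion with no children, a simplification of the paper's leaf handling):

```python
import numpy as np

m = 4
rng = np.random.default_rng(2)

def rep(node, params, M_npm, mu=0.5):
    """node: (type, [children]); returns the m x m activation a(j)."""
    ntype, children = node
    W_list, b = params[ntype]                 # one W per child slot, plus bias
    total = b + mu * M_npm[id(node)]          # mu-weighted NPM embedding matrix
    for W, child in zip(W_list, children):
        total = total + W @ rep(child, params, M_npm, mu)
    return np.tanh(total)

# A while-loop node with a single body child (a leaf of type 'move').
leaf = ('move', [])
root = ('while', [leaf])
params = {
    'move':  ([], rng.normal(size=(m, m))),
    'while': ([rng.normal(size=(m, m))], rng.normal(size=(m, m))),
}
M_npm = {id(leaf): rng.normal(size=(m, m)),
         id(root): rng.normal(size=(m, m))}
a_root = rep(root, params, M_npm)
assert a_root.shape == (m, m)   # activations are m x m matrices, as in the text
```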
We remind the reader that the activation a(j) at each node is an m × m matrix. Leaf nodes of type ω are simply associated with a single parameter matrix W_ω. In the NPM-RNN model, we have parameter matrices

Statistic            Ω_1          Ω_2       Ω_3
Num Students         >11 million  2,710     2,710
Unique Programs      210,918      6,674     63,820
Unique Subtrees      311,198      15,550    198,918
Unique Triples       5,334,452    476,502   4,211,150
Unique States        149          1,399     114,704
Unique Annotations   15           12        14

Table 1.
Dataset summary. Programs are considered identical if they have equal ASTs. Unique states are different configurations of the gridworld which occur in student programs.

W_ω, b_ω ∈ R^{m×m} for each possible type ω ∈ T. To train the parameters, we first use the NPM model to compute the embedding matrix M_j^NPM for each subtree. After fixing M_j, we optimize (as with the NPM model) with minibatch stochastic gradient descent, using backpropagation through structure (Goller & Kuchler, 1996) to compute gradients. Instead of optimizing for predicting the postcondition, for NPM-RNN we optimize for each of the binary prediction tasks that are used for feedback propagation, given the vector embedding at the root of a program. We used hyperparameters learned in the RNN model optimization, since feedback optimization is performed over few examples and without a holdout set.

Finally, feedback propagation has a natural active learning component: intelligently selecting submissions for human annotation can potentially save instructors significant time. We find that in practice, running k-means on the learned embeddings and selecting the cluster centroids as the set of submissions to be annotated works well and leads to significant improvements in feedback propagation over random subset selection. Surprisingly, having humans annotate the most common programs performs worse than the alternatives, which we observe to be due to the fact that the most common submissions are all quite similar to one another.
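The active-learning step can be sketched as follows (our illustration: a minimal Lloyd's-algorithm k-means over flattened program embeddings, returning the real submission nearest each centroid as the set to send to graders):

```python
import numpy as np

def kmeans_exemplars(X, k, iters=20, seed=0):
    """X: (n, f) flattened program embeddings. Returns indices of the
    submissions closest to the k cluster centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each embedding to its nearest center, then recompute means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.unique(d.argmin(axis=0))   # nearest real submission per centroid

X = np.vstack([np.zeros((10, 3)), np.ones((10, 3)) * 5])  # two clear clusters
picks = kmeans_exemplars(X, k=2)
print(len(picks))  # one exemplar per cluster -> 2
```

Annotating centroid exemplars rather than the most frequent submissions avoids wasting grader effort on many near-duplicate programs.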
5. Datasets
We evaluate our model on three assignments from two different courses: Code.org's Hour of Code (HOC), which has submissions from over 27 million students, and Stanford's Programming Methodology course, a first-term introductory programming course, which has collected submissions over many years from almost three thousand students. From these two classes, we look at three different assignments. As in many introductory programming courses, the first assignments have the students write standard programming control flow (if/else statements, loops, methods) but do not introduce user-defined variables. The programs for these assignments operate in maze worlds where an agent can move, turn, and test for conditions of its current location. In the Stanford assignments, agents can also put down and pick up beepers, making the language Turing complete.
Specifically, we study the following three problems:

Ω_1: The 18th problem in the Hour of Code (HOC). Students solve a task which requires an if/else block inside of a while loop, the most difficult concept in the Hour of Code.

Ω_2: The first assignment in Stanford's course. Students program an agent to retrieve a beeper in a fixed world.

Ω_3: The fourth assignment in Stanford's course. Students program an agent to find the midpoint of a world with unknown dimension. There are multiple strategies for this problem, and many require O(n) operations, where n is the size of the world. The task is challenging even for those who already know how to program.

In addition to the final submission to any problem, from each student we also collect partial solutions as they progress from starter code to final answer. Table 1 summarizes the sizes of each of the datasets. For all three assignments studied, students take multiple steps to reach their final answer, and as a result most programs in our datasets are intermediate solutions that are not responsive to unit tests that simply evaluate correctness. The Code.org dataset is available at code.org/research.

For all assignments we have both functional and stylistic feedback based on class rubrics, which range from observations of solution strategy, to notes on code decomposition, to tests for correctness. The feedback is generated for all submissions (including partial solutions) via a complex script. The script analyzes both the program trees and the series of steps a student took in order to assign annotations. In general, a script, no matter how complex, does not provide perfect feedback. However, the ability to recreate these complex annotations allows us to rigorously evaluate our methods. An algorithm that is able to propagate such feedback should also be able to propagate human-quality labels.
6. Results
We rely on a few baselines against which to evaluate our methods, but the main baseline that we compare to is a simplification of the NPM-RNN model (which we will call, simply, RNN) in which we drop the program embedding terms M_j from each node (cf. Eqn. 7).

The RNN model can be trained to predict postconditions as well as to propagate feedback. Being a strictly parametric model, it has far fewer parameters than the NPM (and thus NPM-RNN) model and is thus expected to have an advantage in smaller training set regimes. On the other hand, it is also a strictly less expressive model, and so the question is: how much does the expressive power of the NPM and NPM-RNN models actually help in practice? We address this question, amongst others, using two tasks: predicting postconditions and propagating feedback.

Algorithm   Ω_1         Ω_2         Ω_3
NPM         95% (98%)   87% (98%)   81% (94%)
RNN         96% (97%)   94% (95%)   46% (45%)
Common      58%         51%         42%
Table 2.
Test set postcondition prediction accuracy on the three programming problems. Training set results in parentheses.
To understand how much functionality of a program is captured in our embeddings, we evaluate the accuracy with which we can use the program embedding matrices learned by the NPM model to predict postconditions; note, however, that we are not proposing to use the embeddings to predict postconditions in practice. We split our observed Hoare triples into training and test sets and learn our NPM model using the training set. Then for each triple (P, A, Q) in the test set we measure how well we can predict the postcondition Q given the corresponding program A and precondition P. We evaluate accuracy as the average number of state variables (e.g., row, column, orientation and location of beepers) that are correctly predicted per triple, and in addition to the RNN model, we compare against the baseline method "Common", where we select the most common postcondition for a given precondition observed in the training set. As our results in Table 2 show, the NPM model achieves the best training accuracy (with 98%, 98% and 94% accuracy, respectively, for the three problems). For the two simpler problems, the parametric (RNN) model achieves slightly better test accuracy, especially for problem Ω_2, where the training set is much smaller. For the most complex programming problem, Ω_3, however, the NPM model substantially outperforms the other approaches.

If we are to represent programs as matrices that act on a feature space, then a natural desideratum is that they "compose well". That is, if program C is functionally equivalent to running program A followed by program B, then it should be the case that M_C ≈ M_B · M_A. To evaluate the extent to which our program embedding matrices are composable, we use a corpus of 5,000 programs that are composed of a subprogram A followed by another subprogram B (Compose-2).
We then compare the accuracy of postcondition prediction using the embedding of the entire program, M_C, against the product of embeddings M_B · M_A. As Table 3 shows, the accuracy using the NPM model for predicting postconditions is 94% when using the matrix for the root embedding. Using the product of two embedding matrices, we see that accuracy does not fall dramatically, with a decoding accuracy of 92%. When we test programs that are composed of three subprograms, A followed by B, then C (Compose-3), we see accuracy drop only to 83%.

Test        Direct   NPM    NPM-0   RNN    Common
Compose-2   94%      92%    87%     42%    39%
Compose-3   94%      83%    72%     28%    39%
Table 3.
Evaluation of composability of embedding matrices: accuracy on 5k random triples with ASTs rooted at block nodes. NPM-0 does not jointly optimize.
By comparison, the embeddings computed using the RNN, a more constrained model, do not seem to satisfy composability. We also compare against NPM-0, which is the NPM model using just the weights set by the smart initialization (see Section 3.2). While NPM-0 outperforms the RNN, the full nonparametric model (NPM) performs much better, suggesting that the joint optimization (of state and program embeddings) allows us to learn an embedding of the state space that is more amenable to composition.
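The composability check itself amounts to comparing a whole-program matrix against a product of subprogram matrices. A sketch (our illustration, with a fabricated near-ideal pair; learned matrices would agree only approximately):

```python
import numpy as np

# If subprogram A runs before subprogram B, the whole-program matrix M_C
# should be close to the product M_B @ M_A.
m = 6
rng = np.random.default_rng(3)
M_A = rng.normal(size=(m, m))
M_B = rng.normal(size=(m, m))
M_C = M_B @ M_A + 0.01 * rng.normal(size=(m, m))  # near-perfect composition

def relative_error(M_direct, M_composed):
    # Frobenius-norm relative discrepancy between the two representations
    return np.linalg.norm(M_direct - M_composed) / np.linalg.norm(M_direct)

err = relative_error(M_C, M_B @ M_A)
assert err < 0.05   # small error: these embeddings "compose well"
```

In the paper the comparison is done in terms of postcondition decoding accuracy (Table 3) rather than matrix norms, but the idea is the same.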
We now use our program embedding matrices in the feedback propagation application described in Section 4. The central question is: given a budget of K human-annotated programs (we set K = 500), what fraction of unannotated programs can we propagate these annotations to using the labelled programs, and at what precision? Alternatively, we are interested in the "force multiplication factor": the ratio of students who receive feedback via propagation to students who receive human feedback.

Figure 3 visualizes recall and precision of our experiment on each of the three problems. The results translate to 214×, 12× and 45× force multiplication factors of teacher effort for Ω_1, Ω_2 and Ω_3 respectively, while maintaining 90% precision. The amount by which we can force multiply feedback depends both on the recall of our model and on the size of the corpus to which we are propagating feedback. For example, though Ω_1 had substantially higher recall than Ω_2, in Ω_2 the grading task was much smaller: there were only 6,700 unique programs to propagate feedback to, compared to over 210,000 for Ω_1. As with the previous experiment, we observe that for both Ω_1 and Ω_2, the NPM-RNN and RNN models perform similarly. However, for Ω_3, the NPM-RNN model substantially outperforms all alternatives.

In addition to the RNN, we compare our results to three other baselines: (1) running unit tests, (2) a "Bag-of-Trees" approach and (3) k-nearest neighbor (KNN) with AST edit distances. The unit tests, unsurprisingly, are perfect at recognizing correct solutions. However, since our dataset is largely composed of intermediate solutions and not final submissions (especially for Ω_1 and Ω_3), unit tests are not a particularly effective way to propagate annotations. The Bag-of-Trees approach, where we trained a Naïve Bayes model to predict feedback conditioned on the set of subtrees in a program, is useful for feedback propagation, but we observe that it underperforms the embedding solutions on each problem.
Moreover, we extended this baseline by amalgamating functionally equivalent code (Nguyen et al., 2014). Using equivalences found with a similar amount of effort as in previous work, we are able to achieve 90% precision with recalls of 39%, 48%, and 13% for the three problems respectively. While this improves the baseline, NPM-RNN obtains almost twice as much recall on all problems. Finally, we find KNN with AST edit distances to be computationally expensive to run and highly ineffective at propagating feedback: calculating edit distance between all trees requires 20 billion comparisons for Ω and 1.5 billion comparisons for Ω . Moreover, the highest precision achieved by KNN for Ω is only 43% (note that the cut-off for the x-axis in Figure 3 is 80%), and at that precision it only has a recall of 1.3%.

The feedback that we propagate covers a range of stylistic and functional annotations. To further understand the strengths and weaknesses of our solution, we explore the performance of the NPM-RNN model on each of the nine possible annotations for Ω . As we see in Figure 4(c), our model performs best on functional feedback with an average 44% recall at 90% precision, followed by strategic feedback, and performs worst at propagating purely stylistic annotations, with averages of 31% and 8% respectively. Overall propagation for Ω is 33% recall at 90% precision.

The results from the above experiments suggest that the nonparametric models perform better on more complex code while the parametric (RNN) model performs better on simpler code. To dig deeper, we now look specifically into how our performance depends on the complexity of programs in our corpus, a question that is also central to understanding how our models might apply to other assignments. We focus on submissions for Ω , which cover a range of complexities, from simple programs to ones with over 50 decision points (loops and if statements).
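The comparison counts quoted above for the KNN baseline follow from the quadratic growth of all-pairs distance computation; a quick sanity check (using the round corpus sizes mentioned in the text):

```python
# Quick check on the cost of the KNN baseline: all-pairs AST edit
# distance requires n*(n-1)/2 comparisons, which grows quadratically.
# The 210,000 figure below is the round corpus size quoted in the text;
# it yields ~22 billion pairs, the same order as the quoted 20 billion.

def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n programs (n choose 2)."""
    return n * (n - 1) // 2

print(pairwise_comparisons(6_700))    # 22,441,650 for the small corpus
print(pairwise_comparisons(210_000))  # 22,049,895,000 for the large one
```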
The distribution of cyclomatic complexity (McCabe, 1976), a measure of code structure, reflects this wide range (shown in gray in Figures 4(a),(b)). We first sort and bin all submissions to Ω by cyclomatic complexity into ten groups of equal size. Figures 4(a),(b) plot the results of the postcondition prediction and force multiplication experiments run individually on these smaller bins (still using a holdout set, and a budget of 500 graded submissions). While the RNN model performs better for simple programs (those with low cyclomatic complexity), both train and test accuracies for the RNN degrade dramatically as programs become more complicated. On the other hand, while the NPM model overfits, it maintains steady (and better) performance in test accuracy as complexity increases. This pattern may help to explain our observation that the RNN is more accurate for force multiplying feedback on simple problems.

Figure 3. Recall of feedback propagation as a function of precision for three programming problems: (a) Ω , (b) Ω , and (c) Ω . On each, we compare our NPM-RNN against the RNN method and two other baselines (Bag-of-Trees and unit tests).

Figure 4. (a) NPM and RNN postcondition prediction accuracy as a function of cyclomatic complexity of submitted programs; (b) NPM-RNN and RNN feedback propagation recall (at 90% precision). Note that the ratio of human-graded assignments to the number of programs is much higher in this experiment than in Figure 3; (c) A breakdown of the accuracy of the nonparametric model by feedback type for Ω (black dots); the feedback categories are Redefines Method, Decomposed, Line Strategy, Diagonal Strategy, Corners Strategy, Milestone Reached, One by One Correct, Odd World Correct, and Even World Correct. The gray bars histogram the feedback types by frequency.
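Cyclomatic complexity, the measure used above to bin submissions, is one plus the number of decision points in a program's control flow. A rough sketch for Python source follows; note that the student programs in the paper are not Python, and counting each BoolOp node once is only an approximation of McCabe's definition:

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    decision_nodes = (ast.If, ast.For, ast.While, ast.IfExp,
                      ast.ExceptHandler, ast.BoolOp)
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, decision_nodes)
                   for node in ast.walk(tree))

straight_line = "x = 1\ny = x + 2\n"
branchy = "for i in range(3):\n    if i % 2 == 0:\n        print(i)\n"

print(cyclomatic_complexity(straight_line))  # 1: no branches or loops
print(cyclomatic_complexity(branchy))        # 3: one loop plus one if
```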
7. Discussion
In this paper we have presented a method for finding simultaneous embeddings of preconditions and postconditions as points in a shared Euclidean space, where a program can be viewed as a linear mapping between these points. These embeddings are predictive of the function of a program and, as we have shown, can be applied to the task of propagating teacher feedback. The courses we evaluate our model on are compelling case studies for different reasons. Tens of millions of students are expected to use Code.org next year, meaning that the ability to autonomously provide feedback could impact an enormous number of people. The Stanford course, though much smaller, highlights the complexity of the code that our method can handle.

There remains much work towards making these embeddings more generally applicable, particularly for domains where we do not have tens of thousands of submissions per problem or where the programs are more complex. For settings where users can define their own variables, it would be necessary to find a novel method for mapping program memory into vector space. An interesting future direction might be to jointly find embeddings across multiple homeworks from the same course, and ultimately, to learn using arbitrary code outside of a classroom environment. To do so may require more expressive models. From the standpoint of purely predicting program output, the approaches described in this paper are not capable of representing arbitrary computation in the sense of the Church-Turing thesis. However, there has been recent progress in the deep learning community towards models capable of simulating Turing machines (Graves et al., 2014).
While this “Neural Turing Machines” line of work approaches quite a different problem than our own, we remark that such expressive representations may indeed be important for statistical reasoning with arbitrary code databases.

For the time being, feature embeddings of code can at least be learned using the massive online education datasets that have only recently become available. We believe that these features will be useful in a variety of ways: not just in propagating feedback, but also in tasks such as predicting future struggles and even student dropout.
Acknowledgments
We would like to thank Kevin Murphy, John Mitchell, Vova Kim, Roland Angst, Steve Cooper and Justin Solomon for their critical feedback and useful discussions. We appreciate the generosity of the Code.org team, especially Nan Li and Ellen Spertus, who provided data and support. Chris is supported by NSF-GRFP grant number DGE-114747.
References
Basu, Sumit, Jacobs, Chuck, and Vanderwende, Lucy. Powergrading: a clustering approach to amplify human effort for short answer grading. Transactions of the Association for Computational Linguistics, 1:391–402, 2013.

Bergstra, James and Bengio, Yoshua. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281–305, 2012.

Bowman, Samuel R. Can recursive neural tensor networks learn logical reasoning? arXiv preprint arXiv:1312.6192, 2013.

Brooks, Michael, Basu, Sumit, Jacobs, Charles, and Vanderwende, Lucy. Divide and correct: Using clusters to grade short answers at scale. In Proceedings of the First ACM Conference on Learning @ Scale, pp. 89–98. ACM, 2014.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

Goller, Christoph and Kuchler, Andreas. Learning task-dependent distributed representations by backpropagation through structure. In Neural Networks, 1996, IEEE International Conference on, volume 1, pp. 347–352. IEEE, 1996.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

Hoare, Charles Antony Richard. An axiomatic basis for computer programming. Communications of the ACM, 12(10):576–580, 1969.

Huang, Jonathan, Piech, Chris, Nguyen, Andy, and Guibas, Leonidas J. Syntactic and functional variability of a million code submissions in a machine learning MOOC. In The 16th International Conference on Artificial Intelligence in Education (AIED 2013) Workshop on Massive Open Online Courses (MOOCshop), 2013.

Lan, Andrew S, Vats, Divyanshu, Waters, Andrew E, and Baraniuk, Richard G. Mathematical language processing: Automatic grading and feedback for open response mathematical questions. arXiv preprint arXiv:1501.04346, 2015.

McCabe, Thomas J. A complexity measure. Software Engineering, IEEE Transactions on, (4):308–320, 1976.

Mokbel, Bassam, Gross, Sebastian, Paassen, Benjamin, Pinkwart, Niels, and Hammer, Barbara. Domain-independent proximity measures in intelligent tutoring systems. In Proceedings of the 6th International Conference on Educational Data Mining (EDM), 2013.

Nguyen, Andy, Piech, Christopher, Huang, Jonathan, and Guibas, Leonidas. Codewebs: Scalable homework search for massive open online programming courses. In Proceedings of the 23rd International World Wide Web Conference (WWW 2014), Seoul, Korea, 2014.

Ovsjanikov, Maks, Ben-Chen, Mirela, Solomon, Justin, Butscher, Adrian, and Guibas, Leonidas. Functional maps: a flexible representation of maps between shapes. ACM Transactions on Graphics (TOG), 31(4):30, 2012.

Ovsjanikov, Maks, Ben-Chen, Mirela, Chazal, Frederic, and Guibas, Leonidas. Analysis and visualization of maps between shapes. In Computer Graphics Forum, volume 32, pp. 135–145. Wiley Online Library, 2013.

Piech, Chris, Sahami, Mehran, Huang, Jonathan, and Guibas, Leonidas. Autonomously generating hints by inferring problem solving policies. In Proceedings of the Second (2015) ACM Conference on Learning @ Scale, L@S '15, pp. 195–204. ACM, 2015.

Rogers, Stephanie, Garcia, Dan, Canny, John F, Tang, Steven, and Kang, Daniel. ACES: Automatic evaluation of coding style. Master's thesis, EECS Department, University of California, Berkeley, 2014.

Socher, Richard, Pennington, Jeffrey, Huang, Eric H, Ng, Andrew Y, and Manning, Christopher D. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 151–161. Association for Computational Linguistics, 2011.

Socher, Richard, Perelygin, Alex, Wu, Jean Y, Chuang, Jason, Manning, Christopher D, Ng, Andrew Y, and Potts, Christopher. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1631–1642. Citeseer, 2013.

Song, Le, Huang, Jonathan, Smola, Alex, and Fukumizu, Kenji. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 961–968. ACM, 2009.

Song, Le, Fukumizu, Kenji, and Gretton, Arthur. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. Signal Processing Magazine, IEEE, 30(4):98–111, 2013.

Zaremba, Wojciech, Kurach, Karol, and Fergus, Rob. Learning to discover efficient mathematical identities. In