Automating Program Structure Classification
Will Crichton
Stanford University
Georgia Gabriela Sampaio
Stanford University
Pat Hanrahan
Stanford University
ABSTRACT
When students write programs, their program structure provides insight into their learning process. However, analyzing program structure by hand is time-consuming, and teachers need better tools for computer-assisted exploration of student solutions. As a first step towards an education-oriented program analysis toolkit, we show how supervised machine learning methods can automatically classify student programs into a predetermined set of high-level structures. We evaluate two models on classifying student solutions to the Rainfall problem: a nearest-neighbors classifier using syntax tree edit distance and a recurrent neural network. We demonstrate that these models can achieve 91% classification accuracy when trained on 108 programs. We further explore the generality, trade-offs, and failure cases of each model.
CCS CONCEPTS
• Social and professional topics → Computing education; • Computing methodologies → Supervised learning by classification.
KEYWORDS
Program classification, machine learning, neural networks
ACM Reference Format:
Will Crichton, Georgia Gabriela Sampaio, and Pat Hanrahan. 2021. Automating Program Structure Classification. In Proceedings of the 52nd ACM Technical Symposium on Computer Science Education (SIGCSE ’21), March 13–20, 2021, Virtual Event, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3408877.3432358
1 INTRODUCTION

When a teacher creates a new programming assignment, they often wonder: what different kinds of solutions did my students come up with, and why? The strategies that students use, the way they organize their code — this program structure can reveal what students have (or haven't) learned. Classifying the structure of student solutions has been used to identify misconceptions [27], success predictors [25], and problem solving milestones [28]. Studies on plan composition — how students combine code templates to solve programming problems — have long used program structure to analyze student problem solving. For example, Fisler's 2014 study of plan composition in functional languages showed that certain high-level program structures correlated with increased error rates [7].
However, analyzing program structure is challenging, time-intensive work. In personal correspondence, Fisler estimated that hand-coding program structure for her 2014 study took 1–2 minutes per program. This estimate is consistent with Wu et al., who reported that hand-labeling student misconceptions in Code.org programs took an average of 2 minutes per program [27]. For CS1 courses with hundreds of students, such a per-program cost is prohibitive, motivating the use of automation to alleviate the burden of manual inspection.

Our vision for the future is that automatic program structure classification should be a tool in every CS teacher's toolbox. This analysis augments traditional forms of feedback (grades, office hours, etc.) with a new channel for understanding how students approach problems. With this tool, a teacher could explore the structural variation in student programs, assisted by data-driven technology to avoid a purely manual inspection process.

However, such a tool does not exist today, so our goal is to lay the foundations for its development. That goal starts with the questions: what technologies could be used, and how well do they work? Ethical use of algorithms to analyze students requires a deep understanding of their accuracy and failure modes. Hence, in this paper, we perform a thorough evaluation of multiple tools for classifying student program structure. We address the following research questions:
RQ1. How much training does this tool need to be accurate?
RQ2. How accurate is the tool on different languages?
RQ3. When and why does this tool make errors?

We answer these questions for a well-studied programming problem: Rainfall, a simple list-processing task. We evaluate two supervised machine learning methods, nearest-neighbors and recurrent neural networks, on an existing dataset of student solutions to Rainfall in the OCaml and Pyret languages. We demonstrate that these approaches can classify Rainfall program structures with up to 91% accuracy when trained on 108 examples.
2 RELATED WORK

2.1 Program Classification

Many kinds of high-level program analysis can be viewed as program classification. Plagiarism detection systems like MOSS [4] take a given student's program, and classify other programs as "plagiarized" or "different". We do not consider this task as program structure classification, as plagiarism systems predominantly compare low-level syntax differences, e.g. whether two programs are the same modulo renamed variables.

Other prior works attempt to classify the kind of problem being addressed in a program, or what algorithm is being used. Taherkhani and Malmi [21] classify sorting algorithms from source code using decision trees on hand-engineered, problem-specific code features. For example, their features included "whether or not the algorithm implementation uses extra memory" and "whether from the two nested loops used in the implementation of the algorithm, the outer loop is incrementing and the inner decrementing." Moving away from hand-engineering program features, the software engineering community has also applied deep learning techniques for similar tasks. Mou et al. [18] defined a task of classifying which of 104 programming competition problems a program is attempting to solve, and they apply a novel tree-based neural network for this task. Bui et al. [6] introduce bilateral neural networks to solve the same problem in a language-independent manner. In this work, we focus on classifying how a student solved a problem, as opposed to what problem they were solving.

Closer to our application domain, Ahmed et al. [1] use program structure classification to suggest repairs to students. Wu et al. [27] evaluate a recurrent neural network (RNN) and multimodal variational autoencoder on classifying misconceptions in Code.org programs. In their problem formulation, a student can have one of 20 misconceptions about geometric concepts, and the goal is to classify which misconceptions a student has from their program. Malik and Wu et al. [17] introduce a method for neural approximate parsing of probabilistic grammars, achieving human-level accuracy on misconception classification. We adapt their RNN model in Section 4.2.
2.2 Program Clustering

In a classification problem, a fixed set of categories is given up front. The role of a model is to classify data into one of these predetermined categories. Clustering methods attempt to solve a more challenging problem by simultaneously discovering the categories and the mapping from data to category. In the education community, prior work in program clustering has used classical heuristics such as computing edit distance between abstract syntax trees [12] and control-flow ASTs [11], or finding exact matches on canonicalized source code [8], control-flow graphs [15], or simulation relations [9]. Edit distance has also been applied to program repair [2] and code clone detection [26].

We do not attempt to solve the problem of discovering program categories. In practice, one of the major challenges for clustering-based analyses is that the generated categories can range from hard to interpret to nonsensical. In our problem setting, we assume a teacher or CSE researcher has already identified categories of interest, and wants to label them at scale on a dataset of programs. However, we do use the notion of program distance via syntax-tree edits for our nearest-neighbors classifier.

Recent work has also applied machine learning techniques to learn program comparison metrics from data. Tufano et al. [23] use a recursive autoencoder on identifiers, syntax trees, control flow graphs, and bytecode to build a semantic embedding space for programs, then use embedding space distance for clone detection. Raychev et al. used decision trees [19] and conditional random fields [20] to learn associations between code fragments for predicting the values and types of holes in programs. While these metrics are likely more robust than tree edit distance, they are challenging to adapt for niche teaching languages like Pyret. For example, Raychev et al. used 150,000 JavaScript files to learn an embedding for JavaScript, and we strongly suspect we cannot find that much Pyret code out in the world.
3 PROBLEM SETUP

Our goal is to evaluate program structure classification methods on programs and classes relevant to CSE researchers, i.e. classifying strategies on student programs, not classifying the kind of problem solved in LeetCode solutions. We chose to replicate the hand-labeled program structures used in Fisler's study of plan composition in functional solutions to the Rainfall problem [7]. In that study, students were prompted with:

    Design a program called rainfall that consumes a list of numbers representing daily rainfall amounts as entered by a user. The list may contain the number -999 indicating the end of the data of interest. Produce the average of the non-negative values in the list up to the first -999 (if it shows up). There may be negative numbers other than -999 in the list.

Fisler's dataset contains student solutions from different introductory programming classes in three functional languages: OCaml, Pyret, and Racket. (We exclude Racket from our analysis because Fisler's Racket data were PDFs of hand-written exam solutions, which we cannot automatically analyze like a text file.) Across these languages, Fisler identified three high-level structures ("Single Loop", "Clean First", and "Clean Multiple") that accounted for a large majority of student solutions. Each category indicates a different choice of when and how to filter the input list for valid rainfall data. Single Loop fuses summing/counting with filtering, Clean First filters the list then sums and counts the clean data, and Clean Multiple separately filters in the summing and counting logic. Figure 1 shows examples of two Clean First OCaml programs, and Section 4 of Fisler's paper contains further discussion.

    let rec help (alon : float list) : float list =
      match alon with
      | [] -> []
      | hd :: tl ->
          if hd = (-999.) then []
          else if hd >= 0. then hd :: (help tl)
          else help tl

    let rec rainfall (alon : float list) =
      match alon with
      | [] -> failwith "no proper rainfall amounts"
      | _ :: _ ->
          (List.fold_right (+.) (help alon) 0.) /.
          (float_of_int (List.length (help alon)))

    let rec cut_list (a : int list) =
      match a with
      | [] -> []
      | hd :: tl -> if hd <> (-999) then hd :: (cut_list tl) else []

    let non_neg (a : int list) = List.filter (fun x -> x > 0) a

    let average (a : int list) =
      (List.fold_right (fun x -> fun _val -> x + _val) a 0) / (List.length a)

    let rainfall (a : int list) = average (non_neg (cut_list a))

Figure 1: Two example "Clean First" OCaml programs. Even for the same high-level structure, student solutions exhibit a significant diversity in syntactic and semantic variation, such as the function decomposition strategies shown here.

Overall, the dataset consists of 136 OCaml and 42 Pyret student solutions to the Rainfall problem. The distribution of Clean First / Clean Multiple / Single Loop solutions is .44/.20/.36 in OCaml and .47/.22/.31 in Pyret. Each solution's structure has been hand-labeled by a human expert (either Fisler or the current authors). In Section 4, we describe the methods for automatically classifying Rainfall program structures, and in Section 5 we evaluate the methods on Fisler's dataset.

4 METHODS

We selected which methods to evaluate based on several criteria:

• Generality: we prefer methods that could work with little customization for many languages and problems. So we eliminated any heuristic-based methods (e.g. as in Taherkhani and Malmi [21]), and only considered methods that learn directly from data.
• Interpretability: machine learning methods often trade off interpretability for predictive power. Neural networks are noted for their high accuracy and black-box nature, so we wanted to include classical machine learning methods as well.
• Supervision: as mentioned in Section 2.2, we only want to consider methods that learn provided categories, not emergent ones from the data. So we eliminate any unsupervised machine learning methods from consideration.

Given these criteria, we selected two methods: a nearest-neighbors classifier and a recurrent neural network. We explain each in greater detail below.
4.1 Nearest-Neighbors Classifier

A nearest-neighbors classifier represents a baseline for supervised program structure classification. It has a simple formulation, requires no training algorithm, and makes interpretable decisions. To explain, let's set up the mathematical structure of the problem. The input is a dataset D = {(p_1, l_1), ..., (p_n, l_n)} of n pairs of programs p_i and labels l_i. For example, the two programs in Figure 1 both have l = "Clean First".

Given a new program p′, a nearest-neighbors classifier finds the most similar program p_i from the training data, and assigns p′ the label l_i. Formally, given a distance function

    Dist : Program × Program → ℝ,

a nearest-neighbors classifier labels a new program p′ with the label l_i of the training pair

    (p_i, l_i) = argmin_{(p, l) ∈ D} Dist(p, p′).
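As a minimal sketch (our illustration, not the authors' implementation), the classifier fits in a few lines of Python; `dist` stands in for any program distance function, such as the tree edit distance described next:

    def classify_nearest_neighbor(training_data, dist, new_program):
        # training_data: list of (program, label) pairs. Return the label
        # of the training program with minimum distance to new_program.
        _, nearest_label = min(training_data,
                               key=lambda pair: dist(pair[0], new_program))
        return nearest_label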
Figure 2: RNN classifier architecture (pipeline: source code → token strings → token numbers → embedding → RNN → classifier). Programs are converted into integer sequences where each number uniquely identifies a token. Each token is mapped to an embedding vector, then fed through the RNN to produce a hidden state vector. At the end of the sequence, a classifier predicts the program category from the hidden state.
This method is interpretable in that the classifier can provide the nearest training program p_i as a justification for its classification. See Figure 5 for an example.

The key design decision is choosing a distance metric. One metric that has been widely used for program similarity is tree edit distance. When two programs are represented as their abstract syntax trees (ASTs), the edit distance is the number of tree manipulation operations needed to transform one tree into the other. For our classifier, we use the canonical Zhang-Shasha method [29]. Nearest-neighbors could, of course, be used with other distance metrics (e.g. Euclidean distance in a learned embedding space). But for simplicity in this paper, we will use "nearest-neighbors" to mean "with Zhang-Shasha distance."

Similar to prior work [11], we do not compare syntax trees verbatim. Small syntactic differences like choice of variable name or presence of type annotations do not usually impact the high-level structure of a program. For both OCaml and Pyret, we use their respective compilers to generate a raw AST, then erase variable names, constant values, and type annotations before computing edit distance. A sketch of this normalization and distance computation appears below.
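As an illustrative sketch (not the authors' pipeline), the zss Python package implements Zhang-Shasha tree edit distance; the AST here is a hypothetical generic tree with `kind` and `children` fields:

    from zss import Node, simple_distance

    def normalize(ast):
        # Erase details that don't affect high-level structure: keep only
        # the node kind for variables and constants, drop type annotations.
        if ast.kind in ("Var", "Const"):
            return Node(ast.kind)
        node = Node(ast.kind)
        for child in ast.children:
            if child.kind != "TypeAnnot":
                node.addkid(normalize(child))
        return node

    def tree_edit_distance(ast1, ast2):
        # Zhang-Shasha edit distance between the two normalized ASTs
        return simple_distance(normalize(ast1), normalize(ast2))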
Figure 3: Distribution of model accuracy for different sizes of training and test sets and under each language. Each experimental condition is computed through 30 trials, so its distribution is visualized as a box plot.

4.2 Recurrent Neural Network

A downside to nearest-neighbors is that the distance metric is susceptible to issues where programs may be syntactically similar but structurally different (or vice versa, as in Figure 1). Neural networks are a supervised learning method that automatically learns program similarity features from the training data. In practice, learned features can increase accuracy (with enough training data), but decrease interpretability. We use a recurrent neural network because our input programs do not have a fixed size, unlike e.g. the image inputs to convolutional neural networks. (We cannot simply use a standard feed-forward neural network with a large input size, since that presumes we know the maximum program size at training time. A student could always produce a program larger than previously observed.)
We adapt a basic RNN architecture from Wu et al. [27], as shown in Figure 2. Each token of the source program is mapped to a high-dimensional vector of numbers (an "embedding"). (Although a token is represented as a number, tokens are still categorical, not ordinal, data. For example, if let = 1, fun = 2, end = 10, the network shouldn't learn that "let" and "fun" are somehow more related than "end" by virtue of being assigned closer identifiers.) The RNN is initialized with a different high-dimensional vector of numbers (the "hidden state"). Given a sequence of token embeddings, the RNN iterates through each token and updates the hidden state with information from the embedding. At the end of the sequence, logistic regression is used to classify the hidden state into a probability distribution over the possible program structure categories.

Given the breakneck pace of research on neural networks, there are inevitably countless variations on this architecture that could be applied to our problem. We used the most basic possible recurrent architecture over e.g. transformers [14, 24] or tree-LSTMs [22] to reduce the number of confounding factors that influence accuracy. We consider the specific choice of RNN cell (LSTM [10] vs. GRU [5]), the number of layers within the RNN, and the embedding/hidden vector sizes as hyperparameters.

To train the RNN model on labeled student data, we use gradient descent with Adam [13].
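A rough sketch of this architecture in PyTorch (our illustration, not the paper's exact implementation; since cell type, layer count, and vector sizes are hyperparameters, the GRU and dimensions below are arbitrary choices):

    import torch
    import torch.nn as nn

    class ProgramClassifier(nn.Module):
        def __init__(self, vocab_size, num_classes, embed_dim=64, hidden_dim=128):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)  # token id -> vector
            self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
            self.classifier = nn.Linear(hidden_dim, num_classes)  # logistic regression

        def forward(self, token_ids):           # token_ids: (batch, seq_len)
            embedded = self.embedding(token_ids)
            _, hidden = self.rnn(embedded)      # final hidden state
            return self.classifier(hidden[-1])  # logits over the categories

    model = ProgramClassifier(vocab_size=200, num_classes=3)
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.CrossEntropyLoss()  # softmax over the three category logits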
5 EVALUATION

For each method, we answer the three research questions raised in the introduction.

5.1 RQ1: How much training does this tool need to be accurate?

To estimate the accuracy of a particular method, we can partition the Rainfall dataset into training data and testing data. The method is trained on the training data with knowledge of their ground-truth program structures. Then the method is evaluated on the testing data without knowing their ground-truth. For now, each test prediction is either correct (the predicted category matches the actual category) or incorrect. (Section 5.3 further distinguishes between types of errors.) The accuracy of the model is then the fraction of test programs with a correctly predicted category.

Each train/test partition acts as a simulation of how the tool would be used in practice. For example, if a teacher has 100 student solutions, they could hand-label 30 of them with the correct program structure, then train a model to predict the remaining 70. To understand the "average" scenario, we can run this simulation multiple times on random partitions to find a distribution of accuracy. Technically speaking, we use Monte Carlo cross-validation.

Figure 4: Distribution of row-normalized confusion matrices for the RNN method in OCaml. The x-axis of each subplot shows the probability P(Predicted plan | True plan), e.g. the upper right plot shows the probability a program is classified as Single Loop given its true plan is Clean First. The y-axis shows the number of times a confusion matrix had a given probability across the 30 simulations. Kernel density estimation is used to smooth the empirical probability distribution. The nearest-neighbors matrices look very similar, and so are not shown.

To understand the relationship between accuracy and training data, we fix different amounts of training data as the independent variable, and measure the distribution of accuracy across 30 simulations as the dependent variable. Figure 3 shows the distribution for each method over both languages at each quartile of training data. Some observations:

• The highest mean accuracy in OCaml is 0.91 ± 0.04, achieved by nearest-neighbors when training on 108 programs; the RNN's best mean, also at 108 training programs (± 0.05), is lower. For Pyret, the highest means occur at 33 training programs, where the RNN's mean (± 0.10) exceeds nearest-neighbors' (± 0.19).
• Nearest-neighbors achieves 0.85 ± 0.04 mean accuracy for OCaml with only 27 training programs.
• Nearest-neighbors outperforms RNN under every training set size for OCaml, and the converse is true for Pyret.

For OCaml, relatively little training data is needed to achieve high accuracy with nearest-neighbors, while at least 100 data points are needed for the RNN to achieve comparable accuracy. For Pyret, both methods have uncomfortably high variance even with 33 training programs, suggesting that more training data is needed for high, stable accuracy on Pyret programs with these methods.

    let rec rainfall_help1 (alon : int list) =
      match alon with
      | [] -> 0
      | (-999) :: tl -> 0
      | hd :: tl -> hd + (rainfall_help1 tl)

    let rec rainfall_help2 (alon : int list) =
      match alon with
      | [] -> 0
      | (-999) :: tl -> 0
      | hd :: tl -> 1 + (rainfall_help2 tl)

    let rainfall (alon : int list) =
      (float_of_int (rainfall_help1 alon)) /.
      (float_of_int (rainfall_help2 alon))

    let rec positive lst =
      match lst with
      | [] -> []
      | head :: tail ->
          (match head with
           | (-999) -> []
           | x when x > 0 -> head :: (positive tail)
           | x when x < 0 -> positive tail)

    let rec sum_list lst =
      match lst with
      | [] -> 0
      | head :: tail -> head + (sum_list tail)

    let rec rainfall lst =
      (sum_list (positive lst)) / (List.length (positive lst))

Figure 5: The first program is a Clean Multiple solution that was misclassified by the nearest-neighbors classifier as Clean First, being matched with the Clean First program in the training set shown second. The functions share significant syntactic and structural similarity, e.g. two helper functions, a similar style of match, and similar top-level usage. However, the helper functions are critically used in very different ways, leading to an incorrect classifier prediction.
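The cross-validation simulation described at the start of this section can be sketched as follows (our illustration; `fit` is a placeholder for training either method on labeled programs):

    import random

    def monte_carlo_cv(dataset, fit, train_size, trials=30):
        # dataset: list of (program, label) pairs; fit: hypothetical function
        # from labeled pairs to a model exposing predict(program) -> label.
        # Returns one accuracy estimate per random train/test partition.
        accuracies = []
        for _ in range(trials):
            shuffled = random.sample(dataset, len(dataset))
            train, test = shuffled[:train_size], shuffled[train_size:]
            model = fit(train)
            correct = sum(model.predict(p) == label for p, label in test)
            accuracies.append(correct / len(test))
        return accuracies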
5.2 RQ2: How accurate is the tool on different languages?

We have to carefully compare the results of Figure 3 between OCaml and Pyret due to the difference in dataset size. By comparing e.g. the 27-size OCaml condition vs. the 25-size Pyret condition, we can get close to a fair comparison. Observations:

• Nearest-neighbors gets 0.85 ± 0.04 mean accuracy for OCaml, and 0.67 ± 0.08 mean accuracy for Pyret.
• The RNN gets 0.79 ± 0.04 mean accuracy for OCaml, and 0.79 ± 0.08 mean accuracy for Pyret.

For small amounts of data, the RNN performs consistently with about 79% accuracy in both cases. In OCaml, nearest-neighbors outperforms the RNN with 85% accuracy, but is substantially worse for Pyret with 67% accuracy. This difference suggests that the RNN is more stably language-independent, while nearest-neighbors' performance is language-dependent.

For nearest-neighbors, the performance gap between the two languages is possibly explained by AST size. While the average program length in tokens is 116 for OCaml vs. 127 for Pyret, the average program size in number of AST nodes is 50 for OCaml and 196 for Pyret. Hence, the token-based RNN sees programs of much more similar sizes than the node-based nearest-neighbors classifier does. This difference is likely an artifact of the implementation of ASTs in the respective compilers. Further work in simplifying the AST could potentially improve nearest-neighbors performance.
5.3 RQ3: When and why does this tool make errors?

First, to understand statistically when errors are most likely to occur, we ran 30 simulations of both models for the OCaml dataset in a 70/30 train/test split. Rather than evaluating accuracy (is the prediction correct or not?), we consider the more granular statistic of a confusion matrix: for each category, how often is it classified as a different category? Each simulation generates one confusion matrix, which we summarize as a distribution over matrices in Figure 4.

The plot shows that true Clean First programs are misclassified less often than the other two classes. Clean Multiple has a greater misclassification rate, being most frequently confused with Clean First. And Single Loop is almost exclusively misclassified as Clean Multiple, a somewhat confusing asymmetry given Clean Multiple is rarely misclassified as Single Loop. These observations are consistent between both nearest-neighbors and RNN.

Next, to understand the interpretability of these errors, we will answer a particular "why" question: why is Clean Multiple often misclassified as Clean First? Starting with nearest-neighbors: recall that programs are classified by their edit distance to the nearest program in the training set. Given an incorrectly classified program, we can look at the closest training program to understand why the error occurred.

Figure 5 shows a representative example of a Clean Multiple program misclassified as Clean First by nearest-neighbors. The two programs shared many syntactic features (multiple helper functions, use of standard library functions, similar matching structure), but were subtly distinct in how these pieces of code were used. Through manual inspection, we found most of the errors from nearest-neighbors were caused by such incidental syntactic similarities.

As with many neural-network approaches, the RNN provides no immediately human-interpretable way to understand its predictions. However, research in interpretable machine learning has produced methods of visualizing the internal representations of objects within a neural network. Once a program has been fed to the RNN, it generates a "hidden state" vector of numbers. Using the t-SNE method [16], we can project that high-dimensional representation to a 2D plane as shown in Figure 6.

Figure 6: A scatter plot of the t-SNE projection of hidden state vectors for OCaml programs. Blue is training data, and orange is incorrectly classified test data.

Given a particular random 70/30 split on the OCaml dataset, we generated a t-SNE diagram by projecting the 92 training programs onto a 2-D scatterplot, shown in blue. Then we add the 6 incorrectly classified test programs, shown in orange. Each point's shape indicates its actual category. The t-SNE diagram reveals that roughly three clusters emerge, one for each category: squares in the top-left (Single Loop), crosses in the middle (Clean Multiple), and circles in the bottom-right (Clean First). When a program is misclassified, its embedding is closer to the cluster of another category than its own. For the question of Clean Multiple vs. Clean First: the t-SNE diagram shows that the RNN embedding space learns to position Clean Multiple programs between Clean First and Single Loop. Hence why Clean Multiple is disproportionately misclassified as one or the other, and why Single Loop is rarely misclassified as Clean First. Additionally, the one misclassified Clean Multiple program (the orange cross in Figure 6) is closer to Clean First training points than to Clean Multiple training points, a likely explanation for its misclassification. In sum, the RNN is not learning an embedding space that keeps programs of each category sufficiently far apart.
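Such a projection can be produced with scikit-learn (an illustrative sketch, not the paper's plotting code; `hidden_states` is assumed to be an array with one RNN hidden-state vector per program, saved under a hypothetical filename):

    import numpy as np
    from sklearn.manifold import TSNE

    # hidden_states: (num_programs, hidden_dim) array of RNN hidden states
    hidden_states = np.load("hidden_states.npy")
    points = TSNE(n_components=2).fit_transform(hidden_states)
    # points[:, 0] and points[:, 1] give the 2-D coordinates to scatter-plot,
    # colored by train/test membership and shaped by true category.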
Table 1: Summary of answers to each research question for each method.

• How much training does this tool need to be accurate?
  Nearest-neighbors: few programs for small ASTs, many for large ASTs.
  RNN: more than nearest-neighbors to reach peak accuracy.
• How accurate is the tool on different programming languages?
  Nearest-neighbors: worse as AST size increases.
  RNN: consistent across languages.
• When and why does this tool make errors?
  Nearest-neighbors: confuses syntactically similar and semantically different programs.
  RNN: learns an embedding space that doesn't separate categories well enough.

6 DISCUSSION

We summarize the answers to our research questions in Table 1. Overall, we have found that these methods have the potential to classify functional Rainfall programs with relatively high accuracy (90%+) without huge amounts of training data. The methods work across languages and have varying levels of interpretability.
While we hope that these methods can be used by teachers for their own programming problems, this study's conclusions may not generalize beyond problems like Rainfall. For example:

• Problem complexity: Rainfall is a simple problem, whose solutions in OCaml and Pyret are 10–20 lines. Methods like nearest-neighbors may not scale to more complex problems.
• Language/paradigm: the dataset's programming languages were both functional. These methods may not perform as well on Java, Python, or other more standard CS1 languages.
• Number of categories: with more program structures, more data is needed to distinguish between them.

As more datasets become available with CSE-relevant program classification tasks, we hope that these concerns can be addressed in future work. Additionally, as more powerful machine learning methods are developed, they can be applied to overcome the limitations of the methods evaluated here.
7 FUTURE WORK

Thus far, program classification tools have primarily been the domain of CS education researchers with machine learning expertise. We hope that tools with UIs designed for ease-of-use and transparency/debuggability will make this technology accessible to all CS educators. For starters, all code from this project is free and open-source at https://github.com/willcrichton/autoplan. We have developed a simple Python API to simplify data preprocessing and model training specifically for program classification of Pyret, OCaml, Java and Python programs.

The bigger question is: how would teachers use such a tool? We expect that teachers could use program classification to gain visibility into the strategies used by students without needing to read every program. The average workflow might look like this:

(1) A teacher notices in office hours that some students are writing their solutions a particular way, e.g. performing multiple validation checks up front vs. performing them lazily throughout the program (such as Clean First vs. Clean Multiple).
(2) After collecting assignment solutions, the teacher finds a few examples of solutions that do and don't match this pattern.
(3) The teacher trains a classification model on the examples, and uses it to find more similar programs to label.
(4) The teacher repeats this workflow until they are confident that the model is accurate for their dataset, based on cross-validation simulations like those in this paper.
(5) They run the model on the entire solution dataset, revealing that 1/3 of the class is using the lazy validation strategy.
(6) The teacher prefers students to validate eagerly, and so updates their teaching materials to cover this issue in class.

We hope that this study can contribute toward the foundational knowledge needed to make this process possible.
ACKNOWLEDGEMENTS
We are deeply grateful to Kathi Fisler for providing us access to the Rainfall dataset.
REFERENCES
[1] Umair Z. Ahmed, Renuka Sindhgatta, Nisheeth Srivastava, and Amey Karkare. 2019. Targeted Example Generation for Compilation Errors. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering.
[2] Johannes Bader, Andrew Scott, Michael Pradel, and Satish Chandra. 2019. Getafix: Learning to fix bugs automatically. Proceedings of the ACM on Programming Languages 3, OOPSLA (2019), 1–27.
[3] James Bergstra, Dan Yamins, and David D. Cox. 2013. Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, Vol. 13. Citeseer, 20.
[4] Kevin W. Bowyer and Lawrence O. Hall. 1999. Experience using "MOSS" to detect cheating on programming assignments. In FIE'99 Frontiers in Education. 29th Annual Frontiers in Education Conference. Designing the Future of Science and Engineering Education. Conference Proceedings (IEEE Cat. No. 99CH37011), Vol. 3. IEEE, 13B3–18.
[5] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Association for Computational Linguistics, Doha, Qatar, 103–111. https://doi.org/10.3115/v1/W14-4012
[6] Nghi D. Q. Bui, Yijun Yu, and Lingxiao Jiang. 2019. Bilateral Dependency Neural Networks for Cross-Language Algorithm Classification. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 422–433.
[7] Kathi Fisler. 2014. The recurring rainfall problem. In Proceedings of the Tenth Annual Conference on International Computing Education Research. ACM, 35–42.
[8] Elena L. Glassman, Jeremy Scott, Rishabh Singh, Philip J. Guo, and Robert C. Miller. 2015. OverCode: Visualizing variation in student solutions to programming problems at scale. ACM Transactions on Computer-Human Interaction (TOCHI) 22, 2 (2015), 7.
[9] Sumit Gulwani, Ivan Radiček, and Florian Zuleger. 2018. Automated Clustering and Program Repair for Introductory Programming Assignments. SIGPLAN Not. (2018).
[10] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[11] David Hovemeyer, Arto Hellas, Andrew Petersen, and Jaime Spacco. 2016. Control-flow-only abstract syntax trees for analyzing students' programming progress. In Proceedings of the 2016 ACM Conference on International Computing Education Research. ACM, 63–72.
[12] Jonathan Huang, Chris Piech, Andy Nguyen, and Leonidas Guibas. 2013. Syntactic and functional variability of a million code submissions in a machine learning MOOC. In AIED 2013 Workshops Proceedings Volume, Vol. 25.
[13] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[14] Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised Translation of Programming Languages. arXiv preprint arXiv:2006.03511 (2020).
[15] Andrew Luxton-Reilly, Paul Denny, Diana Kirk, Ewan Tempero, and Se-Young Yu. 2013. On the differences between correct student solutions. In Proceedings of the 18th ACM Conference on Innovation and Technology in Computer Science Education. ACM, 177–182.
[16] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
[17] Ali Malik, Mike Wu, Vrinda Vasavada, Jinpeng Song, John Mitchell, Noah Goodman, and Chris Piech. 2019. Generative Grading: Neural Approximate Parsing for Automated Student Feedback. arXiv preprint arXiv:1905.09916 (2019).
[18] Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. 2016. Convolutional neural networks over tree structures for programming language processing. In Thirtieth AAAI Conference on Artificial Intelligence.
[19] Veselin Raychev, Pavol Bielik, and Martin Vechev. 2016. Probabilistic model for code with decision trees. In ACM SIGPLAN Notices, Vol. 51. ACM, 731–747.
[20] Veselin Raychev, Martin Vechev, and Andreas Krause. 2015. Predicting program properties from big code. In ACM SIGPLAN Notices, Vol. 50. ACM, 111–124.
[21] Ahmad Taherkhani and Lauri Malmi. 2013. Beacon- and schema-based method for recognizing algorithms from students' source code. Journal of Educational Data Mining 5, 2 (2013), 69–101.
[22] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, 1556–1566. https://doi.org/10.3115/v1/P15-1150
[23] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2018. Deep learning similarities from different representations of source code. In Proceedings of the 15th International Conference on Mining Software Repositories (MSR). IEEE, 542–553.
[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[25] Lisa Wang, Angela Sy, Larry Liu, and Chris Piech. 2017. Learning to Represent Student Knowledge on Programming Exercises Using Deep Learning. In EDM.
[26] Pengcheng Wang, Jeffrey Svajlenko, Yanzhao Wu, Yun Xu, and Chanchal K. Roy. 2018. CCAligner: A token based large-gap clone detector. In Proceedings of the 40th International Conference on Software Engineering. ACM, 1066–1077.
[27] Mike Wu, Milan Mosse, Noah Goodman, and Chris Piech. 2019. Zero shot learning for code education: Rubric sampling with deep learning inference. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 782–790.
[28] Lisa Yan, Nick McKeown, and Chris Piech. 2019. The PyramidSnapshot Challenge: Understanding student process from visual output of programs. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education. ACM, 119–125.
[29] Kaizhong Zhang and Dennis Shasha. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing 18, 6 (1989), 1245–1262.