Finding Anomalies in Scratch Assignments
Nina Körber
University of Passau
Passau, Germany
Katharina Geldreich
Technical University of Munich
Munich, Germany
Andreas Stahlbauer
University of Passau
Passau, Germany
Gordon Fraser
University of Passau
Passau, Germany
Abstract—In programming education, teachers need to monitor and assess the progress of their students by investigating the code they write. Code quality of programs written in traditional programming languages can be automatically assessed with automated tests, verification tools, or linters. In many cases these approaches rely on some form of manually written formal specification to analyze the given programs. Writing such specifications, however, is hard for teachers, who are often not adequately trained for this task. Furthermore, automated tool support for popular block-based introductory programming languages like SCRATCH is lacking. Anomaly detection is an approach to automatically identify deviations from common behavior in datasets without any need for writing a specification. In this paper, we use anomaly detection to automatically find deviations of SCRATCH code in a classroom setting, where anomalies can represent erroneous code, alternative solutions, or distinguished work. Evaluation on solutions of different programming tasks demonstrates that anomaly detection can successfully be applied to tightly specified as well as open-ended programming tasks.
Index Terms—Anomaly Detection, Scratch, Block-Based Programming, Program Analysis, Teaching
I. INTRODUCTION
Teachers frequently have to evaluate students' implementations of programming assignments to provide feedback and support, assess progress, identify recurring problems, and to derive grades. These tasks are challenging because they require comprehending, analyzing, and debugging different program variants, often containing creative and unique bugs. These tasks can be supported with automated software analysis tools; for example, a common way to assess the correctness of student solutions is to run automated tests. However, programming is increasingly taught at earlier ages, often as early as elementary school, using educational programming languages such as SCRATCH. This causes several issues: First, automated tools that are common for advanced, text-based programming languages are rarely available for these educational programming languages. Even when they are, teachers at elementary school level often have no training in how to formalize specifications or automated tests; indeed, even professional developers often fail to produce adequate tests. Finally, even a thorough test suite may fail to reveal programs that produce the correct result using an incorrect solution path.

To address this problem, we propose the use of anomaly detection for classroom programming scenarios. Anomaly detection is based on the idea that common behavior is more likely correct behavior, and that rare deviations from common behavior (so-called anomalies) are likely wrong. In the context of software engineering, anomaly detection has been successfully applied to find bugs in large code bases, requiring no specification, no tests, and no manual labor. While code bases in an educational setting tend to be small, they do contain common code constructs which can be exploited to find anomalies that deviate from the common solutions.

Fig. 1: Two scripts aiming to implement the same functionality. (a) Correct script to move the sprite five steps whenever the space key is pressed. (b) Wrong block use: the sprite will go to the same position on the stage every time the space key is pressed.

Fig. 1a shows a common programming example in SCRATCH: the script continuously checks if the user has pressed the space key, and whenever this happens the sprite is moved by five steps. Fig. 1b shows a script that tries to accomplish the same but uses a wrong block: Instead of the move steps block, the go to position block is used. Generic linters would miss this bug as it is project-specific and does not violate any general programming concepts. Even an automated test only pressing the space key once would incorrectly report this behavior as correct. Given a dataset of students' solutions for this task, anomaly detection learns common patterns such as to use a move steps block whenever the when green flag, forever, and if-then blocks are combined. Consequently, the buggy script in Fig. 1b would be flagged as an anomaly.

In this paper, we introduce the concept of anomaly detection in the classroom. In detail, the contributions of this paper are:
• We formally introduce and implement anomaly detection for SCRATCH (Section III).
• We empirically evaluate the practical applicability of anomaly detection for SCRATCH (Section IV).

Evaluated on a dataset of six SCRATCH programming projects with many different student solutions, our implementation of anomaly detection for SCRATCH demonstrates that anomaly detection is a reliable way to find generic defects as well as project-specific ones, such as the one in Fig. 1, without any manual labor required from teachers.

II. BACKGROUND
Since programming knowledge, skills, and mental models cannot be effectively acquired in the abstract, programming education is heavily based on practical exercises [15]. Students typically implement similar tasks based on textual specifications of what the programs should achieve, practicing concepts they first learned about in theory. In the sense of formative and summative assessment, the results of such tasks can provide educators with clues that they can use to evaluate and improve the students' learning [13]: Teachers, tutors, and automated tutoring systems need to interact with students during assignments to provide feedback and help during exercise sessions, or to evaluate and grade submissions; this applies equally to textual and visual programming languages. In this section we explore what means for support exist in this setting, and how anomaly detection can help, focusing particularly on the visual programming language SCRATCH.

A. Evaluating Student Programs
In order to teach programming, educators need to have content knowledge (CK) as well as pedagogical content knowledge (PCK) [20]. The latter is required for planning and conducting programming lessons and comprises various aspects that influence the learning process. According to the model of Magnusson et al. [27], PCK includes, amongst others, knowledge about suitable assessment strategies to evaluate students' understanding. This aspect is particularly important, as studies show that teachers' insufficient understanding of their students reduces the quality of their teaching [35].

While Grover argues for a range of different assessment types during learning to program [12], the most obvious and common method of assessing programming skills is to evaluate the students' programs [22]. This can provide important insights to educators, e.g., by exposing misconceptions or gaps in the students' understanding, but is particularly challenging for novice or inexperienced teachers [37].

A primary means to support the analysis of learners' code is by running automated tests against the solutions. The most common application for this is automated grading: By implementing individual tests for the various requirements that a program should satisfy, the resulting grade can be determined as the ratio of tests that a submission passes. This general principle is implemented in numerous grading tools, which are summarized in various surveys [2], [7], [21]. Automated tests can also serve as feedback to students, or as the basis for producing hints and corrections [14], [33].

In practice, a primary challenge for the application of automated tests lies in their creation. First, it requires the existence of appropriate automation frameworks in which to specify and execute these tests, which are not always available. Second, creating suitable tests is challenging, even for professional developers [3], [4], [32], [34].

Static analysis tools are sometimes applied for checking style, code smells, and bugs in student code. For example, the industrial-strength FINDBUGS [19] tool has been investigated in an educational domain [8], and can be integrated into the build tool chain of modern autograders [24]. Such static analysis tools require no specification effort from the teacher, but the scope of the feedback they can produce is limited: They can only report generic, assignment-independent issues.
B. The SCRATCH Programming Language

SCRATCH [28] is a block-based programming language. Programmers can choose from over one hundred blocks, which resemble puzzle pieces (https://en.scratch-wiki.info/wiki/Blocks, last accessed February 12, 2021). The blocks can be composed visually with each other in the SCRATCH editor (https://scratch.mit.edu/projects/editor/, last accessed February 12, 2021) to define the behavior of SCRATCH programs. A collection of blocks that are connected to one unit is called a script. Usually, a script begins with a hat block, which is an event listener. The hat block is followed by an arbitrary number of blocks that define the actions to execute after the event of the hat block was triggered. Scripts belong to actors [38], that is, either the stage or one of the sprites. The stage is the background of the program; sprites are the objects acting on the stage. Fig. 1 illustrates two SCRATCH scripts; both are triggered by the green-flag event, that is, they start executing when the program starts by clicking the green flag symbol in the SCRATCH editor. More details on SCRATCH and formalizations thereof can be found in the literature [28], [38], [39].

Blocks have different shapes and colors to distinguish between different categories of statements and expressions, for example, event listeners, or control structures. Generally, we distinguish between command blocks and reporter blocks. When executed, a command block performs different actions under specified conditions. Hat blocks, control blocks, stack blocks, and cap blocks are types of command blocks. A reporter block describes an expression to evaluate and produces a scalar value, for example, an integer, a Boolean, or a string.
C. Program Analysis for SCRATCH

The increasing popularity of SCRATCH as an introductory programming environment has triggered research on analyzing the resulting programs. In particular, the observation that SCRATCH programmers tend to develop certain negative habits while coding [30] has led to investigations into the general quality problems in SCRATCH programs using static analysis tools. It has been shown that various types of code smells are prevalent [1], [17], [36], [40] and have a negative impact on code understanding [16]. There are tools for finding code smells in SCRATCH programs such as HAIRBALL [5], QUALITYHOUND [40] or SAT [6], and LITTERBOX [10] detects predefined bug patterns automatically.

Testing frameworks have also been proposed for SCRATCH. In particular, ITCH [23] translates a small subset of SCRATCH programs (say/ask blocks) to Python programs and then runs tests on these programs. The WHISKER tool [39] executes automated tests directly in the SCRATCH IDE, and supports property-based testing. BASTET [38] provides a general program analysis framework that can be used for any configurable program analysis, such as software model checking.

D. Anomaly Detection
An alternative to the common types of program analysis described above is offered by the concept of anomaly detection. The general principle is that likely rules about software projects, programming practices, or API usage are inferred automatically from source code, version histories, or execution traces. Violations of these rules (anomalies) are then reported as likely bugs. The quality of the reported violations depends on how rules are encoded, the algorithms used for mining the rules and determining outliers, as well as the data source. There is a variety of technical approaches: Techniques based on frequent itemset mining capture co-occurrences of methods and variables [25], [26]. These techniques can be extended to capture control flow information using graph models [9], [31], [43], [44]. For example, the JADET tool [44] extracts temporal properties that capture common sequences of method calls on instances of JAVA classes. An alternative lies in the use of n-gram language models to capture the regularities of software source code, and then to report aspects of code with low probabilities as suspicious [41].

A common assumption of these approaches is that anomaly detection is applied on large software projects, or on large collections of software projects that share some properties (e.g., common dependencies), such that the data mining algorithms succeed in extracting relevant patterns. In contrast, programs in an educational context tend to be small and on their own do not provide sufficient opportunity for mining properties. However, in contrast to a regular software engineering scenario there is redundancy in terms of multiple student solutions for the same problem, which we aim to exploit in this paper.

III. ANOMALY DETECTION FOR SCRATCH

In this section, we describe how anomaly detection can be implemented for SCRATCH programs. We build on an existing approach that was presented for object-oriented programs [42], [44] and adjust it for SCRATCH programs.
A. Modeling Control Flow with Script Models
We aim to find violations of temporal activations of blocks, and therefore model the control flow of SCRATCH programs. The control flow between blocks in a script is represented by its script model, which describes how the control of the program execution flow is passed between the blocks of a SCRATCH program. Formally, we define a script model as follows:

Definition 1 (Script Model). A script model is a tuple m = (L, B, G, l_0, L_x), with a finite set L of control locations, a finite set B of command blocks, a control transition relation G ⊆ L × (B ∪ {ε}) × L, an initial control location l_0 ∈ L, and a set of control exit locations L_x ⊆ L.

A control location can be reached by executing the blocks on the transitions in the control transition relation, starting from the initial location l_0. Epsilon (ε) moves are used (1) for abstracting away command blocks that are irrelevant for anomaly detection, and (2) as a convenience feature to create the script models. Epsilon elimination as known from ε-NFAs [18] is applicable. All definitions that follow assume that script models are ε-free, that is, that all ε-moves have been eliminated upfront.

Fig. 2: Abstraction of script models: a script model containing the reporter block key pressed, the same model with that block replaced by an ε-move, and the resulting ε-free model after ε-elimination.

Fig. 3: The script models of the scripts in Fig. 1. Nodes are locations in the code, outgoing transitions are labeled with the blocks that can be executed from this location. The script models show command blocks and abstract away block inputs. (a) Correct: Forever loop resulting in an infinite sequence of moves. (b) Buggy: Differs in the block used between two locations (go to position instead of move steps).
Example 1. Fig. 2 illustrates a script model and (1) how particular blocks can be abstracted away by (2) replacing them by ε-moves and (3) eliminating the ε-moves in the end. In this example, the reporter block key pressed is removed to discover more generic patterns.

Note that we generally abstract away reporter blocks in this work to discover more generic patterns. In contrast to a control flow graph or automaton, a script model contains transitions that are labeled with control blocks, despite the fact that the semantics of these blocks is encoded into the graph structure. Fig. 1 provides an example where the control block forever must precede the control block if-then.

Example 2. Fig. 3 shows the script models of the SCRATCH scripts from Fig. 1. The nodes of this graph refer to control locations in a script, and the edges in between them denote blocks that can be executed from these locations.
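To make Definition 1 and the ε-elimination step concrete, the following is a minimal Python sketch of a script model; it is not part of our actual tool chain (which builds on LITTERBOX, see Section III-D), and the class and function names are illustrative only.

```python
from dataclasses import dataclass, field

EPSILON = None  # an epsilon-labeled transition is represented by None


@dataclass
class ScriptModel:
    """A script model m = (L, B, G, l0, Lx) as in Definition 1."""
    locations: set = field(default_factory=set)    # L, control locations
    blocks: set = field(default_factory=set)       # B, command block labels
    transitions: set = field(default_factory=set)  # G as (src, block, dst) triples
    initial: object = None                         # l0
    exits: set = field(default_factory=set)        # Lx


def eliminate_epsilon(m: ScriptModel) -> ScriptModel:
    """Return an epsilon-free script model, as assumed by all later definitions."""

    def closure(loc):
        # all locations reachable from loc via epsilon-moves only
        seen, stack = {loc}, [loc]
        while stack:
            cur = stack.pop()
            for src, block, dst in m.transitions:
                if src == cur and block is EPSILON and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    new_transitions = set()
    for src, block, dst in m.transitions:
        if block is EPSILON:
            continue
        # the block is executable from every location that epsilon-reaches src
        for loc in m.locations:
            if src in closure(loc):
                new_transitions.add((loc, block, dst))
    new_exits = {loc for loc in m.locations if closure(loc) & m.exits}
    return ScriptModel(set(m.locations), set(m.blocks), new_transitions,
                       m.initial, new_exits)
```

Applied to the middle model in Fig. 2, this removes the ε-move that replaced the key pressed reporter block and yields the rightmost, ε-free model.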
B. Extracting Block Patterns from Script Models
Every single script (represented by a script model) implements a set of temporal properties that define how the script behaves over time. In a later step, behavioral patterns are mined by analyzing the temporal properties of a large set of script models; in contrast to related work [44], we do not use object usage models. Before we define the notion of temporal properties, we define the transitive closure of a script model:

Definition 2 (Transitive Closure). Given an ε-free script model m = (L, B, G, l_0, L_x), we define the transitive closure G⁺ ⊆ L × B × L of its control transition relation G recursively as G⁺ = G ∪ {(l_1, b_1, l_3) | (l_1, b_1, l_2), (l_2, b_2, l_3) ∈ G⁺}.

Fig. 4: The temporal properties of the script model in Fig. 3a. (a) The textual representation of the temporal properties: when green flag ≺ forever, when green flag ≺ if-then, when green flag ≺ move steps, forever ≺ if-then, forever ≺ move steps, if-then ≺ move steps, if-then ≺ if-then, move steps ≺ if-then, move steps ≺ move steps. (b) The graphical representation of the temporal property relation.

Definition 3 (Temporal Properties [44]). The temporal property relation ≺ ⊆ B × B of a script m ∈ M defines the pairs of blocks that occur one after the other in its control flow, possibly interleaved with the execution of other blocks. That is, ≺ = {(b_1, b_2) | (·, b_1, l) ∈ G⁺ ∧ (l, b_2, ·) ∈ G⁺}. We write b_1 ≺ b_2 if and only if (b_1, b_2) ∈ ≺. We use the alternative notation props(m) ⊆ B × B to denote the temporal properties of a given script m.

In other words, the temporal property relation is defined by the blocks that we can reach eventually in the script model starting from a block at hand.
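As an illustration of Definitions 2 and 3 (and assuming the ScriptModel sketch above), the transitive closure and the temporal property relation can be computed with a straightforward, unoptimized fixpoint iteration:

```python
def transitive_closure(transitions):
    """G+ of an epsilon-free control transition relation G (Definition 2)."""
    closure = set(transitions)
    changed = True
    while changed:
        changed = False
        for l1, b1, l2 in list(closure):
            for l2b, _b2, l3 in list(closure):
                if l2 == l2b and (l1, b1, l3) not in closure:
                    closure.add((l1, b1, l3))
                    changed = True
    return closure


def props(m):
    """Temporal property relation of a script model m (Definition 3)."""
    gplus = transitive_closure(m.transitions)
    return {(b1, b2)
            for _, b1, l in gplus
            for l2, b2, _ in gplus
            if l == l2}
```

For the script model in Fig. 3a this yields exactly the nine pairs listed in Fig. 4a.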
Example 3. Using the temporal property relation we can now analyze pairs (b_1, b_2) ∈ ≺ of blocks, where one block b_1 may precede the other block b_2. Fig. 4 shows the temporal property relation for the script model illustrated in Fig. 3a.

We use the notion of patterns to learn about common temporal behavior of scripts (and their models), which is central for detecting anomalies (deviations from common patterns).

Definition 4 (Pattern [44]). A pattern p ⊆ B × B is a set of temporal properties, where one temporal property is a pair of blocks. A pattern p is supported by a script m if p defines a subset of its temporal properties, that is, if p ⊆ props(m). The set of all possible patterns is denoted by the symbol P.

Definition 5 (Pattern Support [44]). Given a list of scripts m̄ = ⟨m_1, ..., m_n⟩, the support supp(p, m̄) ∈ N of a pattern p is the number of scripts that support the pattern, that is, supp(p, m̄) = |{m | p ⊆ props(m) ∧ m ∈ m̄}|.

Example 4. Consider the list m̄ = ⟨m_1, m_2⟩ of script models, which correspond to the scripts illustrated in Fig. 1. When considering the set of the temporal properties in Fig. 4 as one pattern, this pattern has support 1 based on the scripts in m̄. The script in Fig. 1a adheres to every temporal property of this pattern, whereas the script in Fig. 1b does not exhibit several of the temporal properties of the pattern. Fig. 5 shows the missing temporal properties of the script, indicated with the color red and dotted lines. As the script does not have a move steps block, all the temporal properties containing the block move steps are missing. Therefore, the buggy script does not support the pattern and it has a support of 1.

Fig. 5: Comparison between the temporal properties of the scripts in Figs. 1a and 1b. The temporal property relation of the buggy script does not contain the properties related to the missing move block and therefore violates the pattern of the correct script. Missing properties are depicted red and dotted.

Even though script models and block patterns are closely related and their graphical representation is similar, there are some key differences: The level of abstraction of patterns is higher than the level of abstraction of script models. While a script model only abstracts away reporter blocks, and therefore represents a limited set of scripts, there is an unlimited variety of scripts that may support a pattern. For example, a temporal property like "if-then ≺ if-then" is supported by both a script in which a single if-then block occurs in a loop and a script in which there are two directly consecutive if-then blocks. The set of actual patterns found in a set of script models (with corresponding temporal property relations) is computed using frequent itemset mining:

Definition 6 (Frequent Itemsets [44]). Frequent itemset mining freq: 2^(2^(B×B)) × N → 2^(2^(B×B)) takes a set of sets of temporal properties and a minimum support threshold k ∈ N as arguments and produces a set of patterns that occur in at least k sets.
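Definitions 5 and 6 translate into the following sketch, which builds on the props helper above; the brute-force miner shown here is for illustration only, as real implementations (such as the one used by JADET) rely on far more efficient closed frequent itemset mining.

```python
from itertools import combinations


def support(pattern, scripts):
    """Number of scripts whose temporal properties include the pattern (Definition 5)."""
    return sum(1 for m in scripts if set(pattern) <= props(m))


def frequent_patterns(scripts, k, max_size=3):
    """Patterns of up to max_size temporal properties supported by at least k scripts.

    Brute-force enumeration over all observed properties (Definition 6, simplified).
    """
    observed = set().union(*(props(m) for m in scripts)) if scripts else set()
    result = []
    for size in range(1, max_size + 1):
        for candidate in combinations(sorted(observed), size):
            pattern = frozenset(candidate)
            if support(pattern, scripts) >= k:
                result.append(pattern)
    return result
```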
C. Violations of Block Patterns

Based on the concepts that we have described in the previous sections, we now discuss how we identify anomalies in SCRATCH programs. Anomaly detection can help to show the absence of functionality. Note that anomaly detection is performed on closed patterns only, which are defined as follows:

Definition 7 (Closed Pattern [44]). A pattern is called closed if each pattern that is a superset has less support.

Definition 8 (Violation [44]). A script m violates a pattern p if the pattern is not a subset of the temporal properties of the script, that is, if and only if p ⊈ props(m).

Violations hint at scripts that do not support every temporal property of a common pattern. Therefore, the violation of a block pattern always consists of two sets of temporal properties: a set of sequential constraints which are adhered to, and a set of missing temporal properties, the deviation.

Definition 9 (Deviation [42]). Given a script m and a pattern p, the deviation is the set of temporal properties devi(m, p) = p \ props(m) that are missing in the script.

Example 5. Fig. 5, which compares the temporal properties of the scripts in Fig. 1, shows the violation of the buggy script. The deviation consists of all five temporal properties related to the move steps block.

Not all violations hint at defects or contribute new knowledge. The confidence value of a block pattern violation is defined by the confidence of its deviation and measures how many scripts exhibit the exact same deviation from the violated pattern.

Definition 10 (Violation Confidence [44]). Given a list of scripts m̄ = ⟨m_1, ..., m_n⟩, a script m, and a pattern p, the confidence of a violation of pattern p by script m is the ratio c = s/(s + v), with the support s = supp(p, m̄) and the number of violations that violate p the same way m does: v = |{m_i | devi(m_i, p) = devi(m, p) ∧ m_i ∈ m̄}|.

Definition 11 (Anomaly [44]). An anomaly is a violation of a block pattern by a script for which the violation confidence is above a particular threshold (minimum confidence).

The actual identification of anomalies is implemented based on Formal Concept Analysis. A lattice of closed patterns is traversed from the top element (the pattern with the highest support) down to elements with lower support (until a min-support limit is reached) [44]. The anomalies found are ranked and filtered using methods from Association Rule Mining to report anomalies likely pointing at erroneous behaviour [42].
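Definitions 8–11 can likewise be expressed directly in terms of the props and support helpers sketched earlier; the following is a minimal illustration, not the ranking and lattice traversal actually performed by JADET.

```python
def deviation(m, pattern):
    """Temporal properties of the pattern that are missing in script m (Definition 9)."""
    return set(pattern) - props(m)


def violates(m, pattern):
    """A script violates a pattern if it does not support it (Definition 8)."""
    return not set(pattern) <= props(m)


def confidence(m, pattern, scripts):
    """Confidence c = s / (s + v) of the violation of pattern by m (Definition 10)."""
    s = support(pattern, scripts)
    v = sum(1 for mi in scripts if deviation(mi, pattern) == deviation(m, pattern))
    return s / (s + v)


def is_anomaly(m, pattern, scripts, min_confidence):
    """Report a violation as an anomaly if its confidence reaches the threshold (Definition 11)."""
    return violates(m, pattern) and confidence(m, pattern, scripts) >= min_confidence
```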
D. Implementation
Our tool chain for anomaly detection for SCRATCH uses an extended version of LITTERBOX [10] to generate a collection ⟨m_1, ..., m_n⟩ of script models for a collection of SCRATCH projects. These script models are handed over to JADET [44] to mine patterns and check for violations. JADET's algorithms for pattern and violation mining are not JAVA-specific: This allowed us to adapt JADET to check SCRATCH code without algorithmic adaptations. Note that while JADET was designed to operate on object usage models to check for correct API usage, we use script models that are not restricted to code that interacts with particular variables or objects.
E. Application
We envision that a primary application for anomaly detection is to support teachers during formative assessment: A major advantage of anomaly detection is that it highlights noteworthy or problematic behavior without requiring a detailed and laborious inspection of all student programs. It therefore seems particularly suitable also for real-time feedback during programming classes. Anomaly detection could similarly support summative assessment, although teachers would in this case need to be particularly aware that common erroneous behavior does not represent anomalies. Besides a general understanding of what an anomaly is, however, no further training should be required in order to use anomaly detection in the classroom. It is also conceivable that anomaly detection could be integrated into hint generation techniques, such that students receive feedback automatically, without the need for teacher interaction. In this context, richer data, for example using a history of previous solutions to the task at hand, could help to improve the quality of reported anomalies.

TABLE I: Statistics describing the evaluation datasets

Project    Solutions   ∅ Blocks   ∅ Statements   ∅ Scripts   ∅ Sprites   WMC
Monkey     130         5.48       4.56           2.03        2.06        2.83
Elephant   130         9.45       9.40           1.18        1.16        1.87
Cat        129         7.57       5.82           3.08        1.99        5.51
Horse      73          3.70       2.89           1.10        1.10        1.86
Fruit      42          54.26      38.50          6.86        3.05        16.57
Open       295         34.45      28.61          7.38        4.37        13.92
IV. EMPIRICAL EVALUATION
To investigate the practical applicability of anomaly detection in SCRATCH, we aim to empirically answer the following research questions:

RQ1 Can anomalies be found in assignment solutions?
RQ2 Do erroneous solutions lead to more anomalies?
RQ3 Which categories of anomalies can be identified?

We implemented our approach as an extension of LITTERBOX [10] and JADET [42], [44], and it is available at: https://github.com/se2p/scratch-anomalies
A. Datasets
We use a dataset consisting of student solutions for six different programs:
• Monkey: The aim of this program is to make the sprite of a circus director continuously move towards a monkey [11].
• Elephant: The aim of this program is to simulate a dancing elephant by continuously switching its costumes (i.e., images representing different poses) [11].
• Cat: A cat sprite should indicate with a speech bubble whenever it catches the ball [11].
• Horse: A horse sprite should continuously change color, but when it touches the mouse pointer it should rotate [11].
• Fruit: The player controls a fruit bowl with the cursor keys, and has to catch fruit dropping down from the top [39].
• Open: For this dataset, the students first implemented three tightly specified tasks for training, before they were asked to implement something similar to the previous tasks, but were not given any further specification of what programs specifically to create. Thus, unlike the other projects, this is an open task and there is no specification.

For each of these tasks we collected student solutions during programming sessions conducted by qualified teachers. For the Monkey, Elephant, Cat, and Horse tasks, solutions were produced by primary school children aged – , the Open task was solved by children aged – , and the Fruit task was solved by children aged – . The numbers of solutions as well as size and complexity metrics are stated in Table I. Note that we use the full datasets including empty projects of students who did not engage at all, since this also represents the actual use case of a teacher applying our approach.

B. Anomaly Mining
To mine violations, we extract the script models for each of the six datasets, and then use JADET to mine violations.

Extracted Script Models: Table II shows the number of projects and the resulting script models mined for every task. The creation process finished in less than two seconds for every dataset. All experiments on our datasets were conducted on an off-the-shelf laptop computer as would be available to teachers.

Mining Parameters: JADET offers four parameters to configure violation mining: the minimum support and minimum size of a violated pattern, the maximum deviation level of violations, and the minimum confidence. For minimum size and maximum deviation level, we fixed the values at the defaults used by JADET: The minimum size of a violated pattern was set to 2, as we are interested in violations independently of their size, and 10 000 for the maximum deviation level, as we are interested in all violations, no matter how many temporal properties are missing.
C. Experiments
We conducted several experiments to answer the questions:
RQ1: To answer RQ1, we computed statistics on the script models extracted, as well as patterns and violations reported by JADET. Since the chosen approach to anomaly mining has not been used in this context before, it is unclear what parameter values are best for the minimum support and minimum confidence. We therefore conducted a sensitivity analysis on these two parameters with minimum size (= 2) and maximum deviation level (= 10 000) as fixed variables, changing only minimum support and minimum confidence. For the minimum support we tested the values ⟨ , , , , ⟩, where  is the default JADET value. For confidence we tested the values ⟨ . , . , ..., . ⟩. Intuitively, larger values for both parameters are expected to produce higher quality anomalies; however, if the values are too large then there is a risk of missing relevant anomalies. Assuming a teaching scenario, we thus choose the configuration with the highest possible values that reports at least  anomalies for each dataset.
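The selection rule described above can be illustrated with the following sketch; count_anomalies is a hypothetical callback standing in for a complete mining run, and the candidate value grids passed in are placeholders rather than the values used in our study.

```python
def choose_configuration(datasets, support_values, confidence_values,
                         required_anomalies, count_anomalies):
    """Pick the largest (support, confidence) pair that still reports at least
    required_anomalies anomalies for every dataset.

    count_anomalies(dataset, support, confidence) is a hypothetical stand-in
    for running the LITTERBOX/JADET tool chain with that configuration."""
    best = None
    for s in sorted(support_values):
        for c in sorted(confidence_values):
            if all(count_anomalies(d, s, c) >= required_anomalies for d in datasets):
                if best is None or (s, c) > best:
                    best = (s, c)
    return best
```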
RQ2: To answer RQ2, we investigated how the correctness of programs relates to whether anomalies are reported. We used a manual classification [38] of the Monkey, Cat, Elephant, and Horse datasets, for which the programs are small enough to allow a binary correct/incorrect classification; only non-empty projects were classified. For the Fruit dataset, we used the number of failed tests of the grading test suite used in prior work [39] as a measurement of the degree of correctness, and correlated this to the number of anomalies reported. For the Open dataset a classification into correct/incorrect is not possible, since there was no specification.
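For the Fruit dataset this comparison boils down to correlating two per-project counts; a minimal sketch using SciPy is shown below, where the example numbers are made up for illustration and are not values from our study.

```python
from scipy.stats import pearsonr

# Hypothetical per-project counts; in the study these come from the grading
# test suite (failed tests) and from the anomaly detection (reported anomalies).
failed_tests = [0, 2, 5, 1, 7, 3]
reported_anomalies = [0, 1, 4, 0, 6, 2]

r, p_value = pearsonr(failed_tests, reported_anomalies)
print(f"Pearson r = {r:.2f}, p = {p_value:.3f}")
```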
RQ3: To answer RQ3, we manually classified the top-10 violations reported for each of the datasets. Two authors of the paper independently classified each of the violations as either:
• Defective: The violation hints at a defect in a script that stops it from working in the intended way.
• Smelly: The violation hints at a script that has quality issues but does not break the functionality of the program.
• Non-defective: Adherence to the violated pattern would not contribute to the functionality or quality of the program.

To support objective classification, we agreed on subcategories for every category above, by following the principles of Qualitative Content Analysis [29]: One author inspected all violations to classify and inductively developed subcategories on different levels: specific subcategories of the above and more abstract subcategories, moving away from script and violation details. We discussed the resulting abstract subcategories with all the authors and agreed on the following subcategories:
• Bug pattern (defective): The violation hints at a defect that a generic SCRATCH linter such as LITTERBOX [10] could find equally well.
• Missing block (defective): The violation hints at a missing project-specific block.
• Wrong order of blocks (defective): The violation hints at a script with the right blocks assembled in the wrong order.
• Unnecessary block(s) (smelly): The violation hints at blocks which are unnecessary, but do not change the functionality of the program.
• Distinguished work (non-defective): The violation does not hint at defects or smells in a script.

During independent classification by two of the authors, we inspected the full SCRATCH program only if the script itself would not provide sufficient information. We classified every violation into one of the subcategories. Examples for the subcategories are shown in Section IV-G.
D. Threats to Validity
As JADET was left unchanged in all areas that affect the correctness of the results, the main threat to internal validity is our own process that extracts the script models. To mitigate this threat, we wrote automated tests to validate the correctness of the script models it creates, and manually inspected a large number of script models in the development and classification process. To avoid bias in the manual classification process, we agreed on subcategories for every main category, for example, bug pattern is a subcategory of defective. In the classification process, we assigned both the main categories and the less subjective subcategories to every violation. In addition, every violation was classified by two authors and divergent assignments were discussed and resolved. Threats to external validity arise from our choice of parameters as well as the datasets used. We evaluated the effects of the parameters on quantity and quality, but further studies will be necessary to identify parameters that are acceptable for users. Besides the parameters, the quality of violations depends on various properties of the dataset it is applied to, such as the quality of submissions or sizes, and our findings may not generalize to other datasets. However, our dataset covers different scenarios in terms of class sizes as well as programming tasks, and we explicitly included closed as well as open tasks.
E. RQ1: Can anomalies be found in solutions?
Whether or not anomalies can be detected heavily depends on the parameters of the mining procedure. To find an appropriate parameterization to analyze script models extracted from SCRATCH programs, we conducted a sensitivity analysis; the results are shown in Fig. 6. Based on this analysis, we chose a minimum support of  and a minimum confidence of .  for all datasets except the Horse example, where there are fewer solutions and we therefore used minimum support  and minimum confidence . .

Fig. 6: Tuning results: numbers of anomalies reported for each of the datasets (Monkey, Elephant, Cat, Horse, Fruit, Open) with different configurations of minimum support and minimum confidence.

Fig. 7: Pattern size distributions of our datasets (number of patterns per pattern size, per dataset).

Fig. 8: Anomaly distributions of our datasets (density of anomalies per project, per dataset).

Table II summarizes the results of the model extraction and anomaly detection for the chosen parameterization.

TABLE II: Summarized characteristics of the solutions, by task.

The number of models derived for each of the programs depends on the number of scripts in the solutions, and is thus roughly proportional to the number of scripts in the solutions as described in Table I, with
Cat, Fruit, and Open resulting in the most models. The number of patterns extracted is lower than the number of models in all but the Fruit example. For the Open example, the lower number of patterns is expected since there is more variety in the solutions, as students were free to implement their own ideas. In the Fruit game, on the other hand, all students implemented the identical game. In contrast to the other four closed examples, there is some redundancy within the scripts in each project, as the behavior of the apple and the banana sprites share several aspects: both drop from random locations at the top of the stage to the bottom and check whether they touch the bowl or the bottom. This shared behavior contributes to the number of patterns found.

Fig. 7 summarizes the sizes of these patterns. The majority of patterns are small, with only few temporal properties, although all projects have patterns of up to at least nine properties. The larger Fruit task stands out with substantially larger patterns than all other tasks. This is mainly a result of the overall size and complexity of the projects (see WMC and Statements in Table I). Although the Open task contains fairly complex solutions, too, there is less overlap between these solutions, resulting in generally smaller patterns.

These patterns tend to lead to multiple violations, as shown in Table II. However, only a subset of these violations are reported, as mentioned in Section III-C. The number of anomalies generally is roughly proportional to the number of patterns, ranging from the configured minimum of  (Elephant) to  (Fruit). The number of projects that exhibit anomalies seems to directly depend on the number of patterns extracted: For the Elephant, Horse, Monkey, and Cat tasks, less than  % of the projects had at least one anomaly. The Fruit task again stands out with more than half of the projects having reported at least one anomaly. For  % of the Open task solutions, at least one anomaly is reported. Fig. 8 summarizes the distribution of anomalies over projects; most projects have only few anomalies reported, although the Fruit task is the exception with up to  anomalies reported for a single project.

RQ1 Summary. The number of anomalies reported depends on the configuration of the mining process. Our final configuration yields between  and  anomalies per task.

F. RQ2: Do erroneous solutions lead to more anomalies?

Fig. 9: Relation of anomaly reports and correctness of solutions. (a) Anomalies reported. (b) Incorrect solutions with anomalies.

Fig. 10: Correlation between anomalies and failing tests.
Fig. 9a shows how many violations were reported on correct/incorrect programs for the Monkey, Cat, Elephant, and Horse tasks. Overwhelmingly, the programs for which anomalies were reported are incorrect solutions. The proportion of correct programs with anomalies is slightly higher for the Monkey program. This program has two sprites (circus director and monkey), whereas only the director is supposed to contain scripts. Manual inspection showed that many anomalies are triggered by additional code in the monkey sprite, which was not part of the specification (see Section IV-G). As correctness is a more fine-grained question for the Fruit example, Fig. 10 shows the correlation between anomalies reported and tests failed. There is a weak correlation (Pearson correlation coefficient of .  with p = 0. ), demonstrating that solutions with more errors tend to have more anomalies reported, which supports the results on the other tasks.

Fig. 9b shows how many of the incorrect programs had anomalies reported. While for the Cat example half the incorrect programs had an anomaly reported, for the other tasks the proportion is lower. This is a result of the number of patterns and anomalies mined with our parameter settings, and lowering the confidence or minimum support level would lead to more reported anomalies. However, lowering confidence and
support thresholds may come at the price of more irrelevant anomalies: Fig. 11 shows the ratio of incorrect projects with anomalies reported to projects with anomalies reported in total for different parameter values. A higher ratio suggests a likely better quality of the reported anomalies, and Fig. 11 confirms that higher confidence and support increase the ratio.

Fig. 11: Ratio of incorrect programs with anomalies reported vs. total number of programs with anomalies (for the Monkey, Elephant, Cat, and Horse tasks, over minimum support and minimum confidence).

The number of missed incorrect solutions (Fig. 9b) is particularly notable for the
Elephant example, where the overall number of incorrect student solutions is also higher: For Elephant solutions to be considered correct, we required a forever loop with costume changes and wait blocks in between. Only  student solutions used forever loops, while  student solutions used repeat times loops instead, which we counted as incorrect. However, since this solution attempt is so common, it is unlikely to be reported as an anomaly. In contrast,  students used no loops at all, and thus their programs more likely result in anomalies. In general, if the dataset contains more diverse solutions, then fewer patterns will be found with high support. This suggests that different use cases may require different parameter settings. For example, formative assessment at intermediate points may require different thresholds for support and confidence than summative assessment at the end of the assignment. Note that a further 46 solutions to the Elephant task are empty, i.e., consist only of the hat block provided as a starting point. Since at least two blocks are required in order to form a temporal property, no anomalies are reported for such empty projects (Fig. 9 only shows non-empty projects).
RQ2 Summary.
There is a clear relationship between the number of anomalies and the correctness of a solution.
G. RQ3: Which categories of anomalies can be identified?
Fig. 12 summarizes the results of the manual classification of the top ten anomalies for each of the datasets. In total,  of the classified anomalies hint at defective code, with  anomalies in the subcategory bug patterns,  in the subcategory missing blocks, and  in the subcategory wrong order;  anomalies hinted at smelly scripts with unnecessary code, and  hinted at non-defective, distinguished work. Except for the Open task, the majority of the detected anomalies hint at defective code. Most defects ( out of ) are project-specific problems that a generic linter would miss: missing blocks and the wrong order of blocks. The anomalies in the tasks Cat, Elephant, and Monkey predominantly show that a specific block, which is essential for the solution of the task, is missing in the student code.

Fig. 12: Results of the manual anomaly classification, per dataset (defective: bug pattern, missing block, wrong order; smell: unnecessary block(s); non-defective: distinguished work).

As an example for the missing block category, Fig. 13 shows a student solution for the Horse task which does not have the block responsible for the required color change. The anomaly shows the absence of this block, and therefore provides important feedback for both the teacher and the student. The student can be made aware of the missing block and the teacher can use the anomaly as an opportunity to discuss in class when a task is considered solved.

Fig. 13: The anomaly ranked third in the Horse task. It belongs to the missing block subcategory, as the block responsible for the required color change of the horse is missing. (a) Wrong block use to react to touches of the mouse pointer; similar in Fig. 1. (b) The anomaly hints at the missing block in the script.

Fig. 14 shows a project-specific anomaly of the wrong order subcategory: Although the student's solution for the Cat task contains most of the blocks necessary for solving the task, they are not in the correct order; the script is defective. To help the student, a teacher can address the script flow in class or trace the script step by step together with the student.

Fig. 14: The top ranked anomaly in the Cat task: It belongs to the wrong order subcategory, as the correct solution requires the cat to speak after touching the ball. (a) The script makes the cat say something as soon as the game starts. (b) The anomaly shows that the say block should be used after the if block.

Most of the anomalies in the subcategory bug pattern hinted at the bug patterns [10] Missing Loop Sensing (a condition that should be checked repeatedly in a loop is checked only a single time), Forever Inside Loop (an inner infinite loop prevents code in the outer loop from being reached), and Terminated Loop (a loop is unconditionally stopped after the first iteration), as implemented in LITTERBOX. Fig. 15 shows an anomaly of the Open task that shows a problem for which LITTERBOX does not yet define a bug pattern, but which could be found by a generic checker: Before the script checks if its sprite touches another sprite, the hide block is executed. However, while being invisible a sprite cannot touch other sprites, therefore the actions within the true branch of the if-then are never executed. Based on this anomaly, a teacher can not only help the individual student and explain that the student should have used messages to coordinate the program flow; the anomaly also provides clues about what misunderstandings and misconceptions to touch upon in class.

Fig. 15: The anomaly ranked seventh in the Open task. It belongs to the bug pattern subcategory, as the "Hide Before Touching" defect could be detected by linters in principle. (a) The defective script first hides the sprite and then tries to show it when touching another sprite. This does not work as sprites can only touch when visible. (b) The anomaly hints at the usual order of first showing the sprite before hiding it.

Besides anomalies that indicate defective code, there are  cases of smelly code with extraneous scripts or blocks that do not influence the program behavior, but negatively affect the code quality. In the script in Fig. 16a, the student programmed a countdown using a timer variable and a conditional loop breaking when the timer is equal to zero. Subsequently, the script uses an if-then block to check if the timer equals zero. This block is redundant, since the conditional loop already determines the countdown to stop as soon as the timer is set to zero. Even if the anomaly in Fig. 16b does not explicitly state that the if-then is redundant, it directs the attention to the conditional. Building on such examples of smelly code, broader concepts from software engineering, such as code quality issues, can be incorporated into teaching.

Fig. 16: The anomaly ranked second in the Fruit task. It belongs to the unnecessary block(s) subcategory, as the anomaly hints at a script that contains redundant blocks. (a) The script implements a countdown using a timer variable and a conditional loop. As the value of the timer will always be zero, the conditional after the loop is unnecessary. (b) Even though the anomaly does not directly mark the conditional as unnecessary, it directs the attention to the unusual usage and therefore the smelly part of the script.

There are  cases of scripts that trigger anomalies even though the underlying code is not erroneous. The majority of these are, unsurprisingly, in the Open task, where students were free to implement games of their choice, based on common previous tasks. For example, although the anomaly in Fig. 17 suggests to use a forever loop, the programmed animation is neither defective nor smelly and therefore does not need to be changed. In the context of the closed tasks, the anomalies either indicated that the students programmed something different from or additional to what was required in the task, for example, added code to sprites which usually are left empty. Both types of anomalies can be helpful for the classroom context since the student can be made aware of the actual task and the teacher can (if necessary) adjust teaching activities and pacing, or acknowledge and reward creative extensions of the tasks to encourage student creativity.

Fig. 17: The anomaly ranked ninth in the Open task belongs to the distinguished work subcategory, as satisfying the pattern would contribute to neither quality nor correctness. (a) The script implements an animation and works fine. (b) The anomaly suggests to adhere to a pattern similar to the pattern violated in Fig. 15.
Out of classified anomalies, hintedat defective code, hinted at smelly code and hinted atdistinguished student work. All of these anomalies providevaluable feedback for teachers. V. R ELATED W ORK
Alternative approaches for analyzing S
CRATCH programs in-troduced in Section II-C include linting, testing, and verification.All of these require some sort of prior, manual work—tests,checks or specifications, whereas anomaly detection requiresno manual work. Furthermore, in contrast to generic lintersanomaly detection can also find project-specific bugs; on theother hand, anomalies may help to identify new, previouslyunknown generic checks to implement in linters, such as thehide-show defect (Fig. 15) we discovered in our analysis. Thequality and number of reported anomalies, however, dependson the underlying dataset, the number of students, and theiroverall progress in the programming assignment; these arefactors we plan to study in our future work.Our approach for anomaly detection in S
CRATCH is basedon the J
ADET tool, which is originally designed to analyzeobject usage models for J
AVA objects [44]. We chose thisapproach because approaches using the version history [26]are not applicable on S
CRATCH , and our motivation froman educational point of view is to find anomalies in thetemporal relation of blocks, rather than relations betweenvariables and method calls [25] or patterns of interactionsof multiple objects [31]. However, many different anomalymining techniques for software have been proposed over theyears, and others may also be applicable to our specific domain.VI. C
ONCLUSIONS
With programming education becoming more prevalent, evenat earlier ages, there is an increasing demand for tools tosupport educators and learners. To the best of our knowledge,this paper is the first proposal to use anomaly detection onS
CRATCH student code. Anomaly detection requires no manualspecification effort, and, as our evaluation demonstrated, isnevertheless effective at finding relevant issues.Our initial investigation achieved promising results, butalso raised many interesting follow-up questions for futureinvestigation: The specific technique of anomaly detectionwe implemented has several parameters, and other anomalydetection techniques might be able to find other or moreinteresting anomalies. Understanding what techniques andparameters lead to the results that are most helpful will requirefurther experiments, and a better understanding of when andhow teachers and learners would apply anomaly detection. Arelated question is how to best present anomalies to teachersand students in a way that helps them to understand the problemwith their code, and how to fix it. Often, the pattern violatedby an anomaly may be able to serve as a hint on a correction.While programming and code quality are essential aspects ofsoftware engineering education, anomaly detection is applicableto any software engineering artifacts for which patterns can beformalized. It may therefore be possible to support educationwith respect to all phases of the software engineering life cycle.A
CKNOWLEDGEMENTS
This work is supported by DFG project FR 2955/3-1 “Testing,Debugging, and Repairing Blocks-based Programs”.1R
EFERENCES[1] E. Aivaloglou and F. Hermans, “How kids code and how we know: Anexploratory study on the Scratch repository,” in
ACM Conference onInternational Computing Education Research . ACM, 2016, pp. 53–61.[2] K. M. Ala-Mutka, “A survey of automated assessment approaches forprogramming assignments,”
Computer science education , vol. 15, no. 2,pp. 83–102, 2005.[3] M. Beller, G. Gousios, A. Panichella, S. Proksch, S. Amann, andA. Zaidman, “Developer testing in the ide: Patterns, beliefs, and behavior,”
IEEE Transactions on Software Engineering , vol. 45, no. 3, pp. 261–284,2017.[4] V. Blondeau, A. Etien, N. Anquetil, S. Cresson, P. Croisy, and S. Ducasse,“What are the testing habits of developers? A case study in a largeIT company,” in . IEEE, 2017, pp. 58–68.[5] B. Boe, C. Hill, M. Len, G. Dreschler, P. Conrad, and D. Franklin,“Hairball: Lint-inspired static analysis of scratch projects,” in
ACMTechnical Symposium on Computer Science Education . ACM, 2013,pp. 215–220.[6] Z. Chang, Y. Sun, T.-Y. Wu, and M. Guizani, “Scratch analysis Tool(SAT): a modern scratch project analysis tool based on ANTLR toassess computational thinking skills,” in . IEEE,2018, pp. 950–955.[7] C. Douce, D. Livingstone, and J. Orwell, “Automatic test-based assess-ment of programming: A review,”
Journal on Educational Resources inComputing (JERIC) , vol. 5, no. 3, p. 4, 2005.[8] S. Edwards, J. Spacco, and D. Hovemeyer, “Can Industrial-StrengthStatic Analysis Be Used to Help Students Who Are Struggling toComplete Programming Activities?” in
Proceedings of the 52nd HawaiiInternational Conference on System Sciences , 2019.[9] T. Eisenbarth, R. Koschke, and G. Vogel, “Static object trace extractionfor programs with pointers,”
Journal of Systems and Software , vol. 77,no. 3, pp. 263–284, 2005.[10] C. Frädrich, F. Obermüller, N. Körber, U. Heuer, and G. Fraser, “CommonBugs in Scratch Programs,” in
Proceedings of the 2020 ACM Conferenceon Innovation and Technology in Computer Science Education , ser.ITiCSE ’20. ACM, 2020.[11] K. Geldreich, A. Funke, and P. Hubwieser, “A programming circus forprimary schools,” in
ISSEP 2016 , 2016, pp. 49–50.[12] S. Grover, “Assessing Algorithmic and Computational Thinking in K-12: Lessons from a Middle School Classroom,” in
Emerging Research,Practice, and Policy on Computational Thinking , P. J. Rich and C. B.Hodges, Eds. Cham: Springer International Publishing, 2017, pp. 269–288.[13] S. Grover, V. Sedgwick, and K. Powers.[14] S. Gulwani, I. Radicek, and F. Zuleger, “Automated Clustering andProgram Repair for Introductory Programming Assignments,” arXivpreprint arXiv:1603.03165 , 2016.[15] M. Hassinen and H. Mäyrä, “Learning Programming by Programming: aCase Study,” in
Proceedings KolliCalling , A. Berglund and M. Wigbberg,Eds., 2006, pp. 117–119.[16] F. Hermans and E. Aivaloglou, “Do code smells hamper noviceprogramming? A controlled experiment on Scratch programs,” in
Int.Conference on Program Comprehension . IEEE, 2016, pp. 1–10.[17] F. Hermans, K. T. Stolee, and D. Hoepelman, “Smells in Block-BasedProgramming Languages,” in . IEEE, 2016, pp. 68–72.[18] J. E. Hopcroft, R. Motwani, and J. D. Ullman,
Introduction to automatatheory, languages, and computation, 3rd Edition , ser. Pearson interna-tional edition. Addison-Wesley, 2007.[19] D. Hovemeyer and W. Pugh, “Finding bugs is easy,”
ACM SIGPLANNotices , vol. 39, no. 12, pp. 92–106, 2004.[20] P. Hubwieser, J. Magenheim, A. Mühling, and A. Ruf, “Towardsa conceptualization of pedagogical content knowledge for computerscience,” in
ICER ’13 , B. Simon, A. Clear, and Q. Cutts, Eds. ACM,2013, p. 1.[21] P. Ihantola, T. Ahoniemi, V. Karavirta, and O. Seppälä, “Review of recentsystems for automatic assessment of programming assignments,” in
KoliCalling International Conference on Computing Education Research .ACM, 2010, pp. 86–93.[22] D. Insa and J. Silva, “Semi-Automatic Assessment of Unrestrained JavaCode,” in
Proceedings of the 2015 ACM Conference on Innovation and Technology in Computer Science Education , ser. ITiCSE ’15, V. Dagien ˙e,C. Schulte, and T. Jevsikova, Eds. ACM, 2015, pp. 39–44.[23] D. E. Johnson, “ITCH: Individual Testing of Computer Homeworkfor Scratch Assignments,” in
Proceedings of the 47th ACM TechnicalSymposium on Computing Science Education . ACM, 2016, pp. 223–227.[24] S. Krusche and A. Seitz, “ArTEMiS: An automatic assessment man-agement system for interactive learning,” in
Proceedings of the 49thACM Technical Symposium on Computer Science Education , 2018, pp.284–289.[25] Z. Li and Y. Zhou, “PR-Miner: automatically extracting implicitprogramming rules and detecting violations in large software code,”
ACM SIGSOFT Software Engineering Notes , vol. 30, no. 5, pp. 306–315,2005.[26] B. Livshits and T. Zimmermann, “Dynamine: finding common errorpatterns by mining software revision histories,”
ACM SIGSOFT SoftwareEngineering Notes , vol. 30, no. 5, pp. 296–305, 2005.[27] S. Magnusson, J. Krajcik, and H. Borko, “Nature, Sources, andDevelopment of Pedagogical Content Knowledge for Science Teach-ing,” in
Examining Pedagogical Content Knowledge , ser. Science &Technology Education Library, J. Gess-Newsome and N. G. Lederman,Eds. Dordrecht: Kluwer Academic Publishers, 2002, vol. 6, pp. 95–132.[28] J. Maloney, M. Resnick, N. Rusk, B. Silverman, and E. Eastmond, “TheScratch Programming Language and Environment,”
ACM Transactionson Computing Education (TOCE) , vol. 10, p. 16, 11 2010.[29] P. Mayring, “Qualitative content analysis: Theoretical background andprocedures,” in
Approaches to qualitative research in mathematicseducation . Springer, 2015, pp. 365–380.[30] O. Meerbaum-Salant, M. Armoni, and M. Ben-Ari, “Habits of program-ming in scratch,” in . ACM, 2011, pp. 168–172.[31] T. T. Nguyen, H. A. Nguyen, N. H. Pham, J. M. Al-Kofahi, andT. N. Nguyen, “Graph-based mining of multiple object usage patterns,”in
Proceedings of the 7th joint meeting of the European SoftwareEngineering Conference and the ACM SIGSOFT symposium on theFoundations of Software Engineering , 2009, pp. 383–392.[32] F. Pecorelli, G. Catolino, F. Ferrucci, A. De Lucia, and F. Palomba,“Testing of Mobile Applications in the Wild: A Large-Scale EmpiricalStudy on Android Apps,” in
Proceedings of the 28th InternationalConference on Program Comprehension , 2020, pp. 296–307.[33] D. Perelman, S. Gulwani, and D. Grossman, “Test-driven synthesis forautomated feedback for introductory computer science assignments,”
Data Mining for Educational Assessment and Feedback (ASSESS 2014) ,2014.[34] R. Pham, S. Kiesling, O. Liskin, L. Singer, and K. Schneider, “Enablers,inhibitors, and perceptions of testing in novice software teams,” in
Proceedings of the 22nd ACM SIGSOFT International Symposium onFoundations of Software Engineering , 2014, pp. 30–40.[35] E. Rahimi, E. Barendsen, and I. Henze, “Identifying Students’ Miscon-ceptions on Basic Algorithmic Concepts Through Flowchart Analysis,”in
Informatics in Schools , V. Dagien ˙e and A. Hellas, Eds. Cham:Springer International Publishing, 2017, vol. 10696, pp. 155–168.[36] G. Robles, J. Moreno-León, E. Aivaloglou, and F. Hermans, “Softwareclones in scratch projects: On the presence of copy-and-paste incomputational thinking learning,” in . IEEE, 2017, pp. 1–7.[37] S. Sentance and A. Csizmadia, “Computing in the curriculum: Challengesand strategies from a teacher’s perspective,”
Education and InformationTechnologies , vol. 22, no. 2, pp. 469–495, 2017.[38] A. Stahlbauer, C. Frädrich, and G. Fraser, “Verified from Scratch: ProgramAnalysis for Learners’ Programs,” in
ASE . IEEE, 2020.[39] A. Stahlbauer, M. Kreis, and G. Fraser, “Testing scratch programsautomatically,” in
Proceedings of the 2019 27th ACM Joint Meetingon European Software Engineering Conference and Symposium on theFoundations of Software Engineering , 2019, pp. 165–175.[40] P. Techapalokul and E. Tilevich, “Quality Hound — An online codesmell analyzer for scratch programs,” in , Oct 2017, pp.337–338.[41] S. Wang, D. Chollak, D. Movshovitz-Attias, and L. Tan, “Bugram:bug detection with n-gram language models,” in
Proceedings of the 31stIEEE/ACM International Conference on Automated Software Engineering ,2016, pp. 708–719.[42] A. Wasylkowski, “Object Usage: Patterns and Anomalies,” Ph.D. disser-tation, Saarland University, 2010. [43] A. Wasylkowski and A. Zeller, “Mining temporal specifications fromobject usage,” Automated Software Engineering , vol. 18, no. 3-4, pp.263–292, 2011.[44] A. Wasylkowski, A. Zeller, and C. Lindig, “Detecting Object Usage Anomalies,” in