Evaluating the robustness of source code plagiarism detection tools to pervasive plagiarism-hiding modifications
Hayden Cheers · Yuqing Lin · Shamus P. Smith
Abstract
Source code plagiarism is a common occurrence in undergraduate computer science education. In order to identify such cases, many source code plagiarism detection tools have been proposed. A source code plagiarism detection tool evaluates pairs of assignment submissions to detect indications of plagiarism. However, a plagiarising student will commonly apply plagiarism-hiding modifications to source code in an attempt to evade detection. Subsequently, prior work has implied that currently available source code plagiarism detection tools are not robust to the application of pervasive plagiarism-hiding modifications. In this article, 11 source code plagiarism detection tools are evaluated for robustness against plagiarism-hiding modifications. The tools are evaluated with data sets of simulated undergraduate plagiarism, constructed with source code modifications representative of undergraduate students. The results of the performed evaluations indicate that currently available source code plagiarism detection tools are not robust against modifications which apply fine-grained transformations to the source code structure. Of the evaluated tools, JPlag and Plaggie demonstrate the greatest robustness to different types of plagiarism-hiding modifications. However, the results also indicate that graph-based tools (specifically those that compare programs as program dependence graphs) show potentially greater robustness to pervasive plagiarism-hiding modifications.
Keywords
Source code plagiarism detection · Source code similarity · Source code modification · Plagiarism-hiding modification
Hayden Cheers
The School of Electrical Engineering & Computing, The University of Newcastle, Callaghan, NSW, Australia
E-mail: [email protected]

Yuqing Lin
The School of Electrical Engineering & Computing, The University of Newcastle, Callaghan, NSW, Australia
E-mail: [email protected]

Shamus P. Smith
The School of Electrical Engineering & Computing, The University of Newcastle, Callaghan, NSW, Australia
E-mail: [email protected]
1 Introduction

Plagiarism is a common occurrence in tertiary education (Joy and Luck 1999; Sheard et al. 2003; Yeo 2007; Sraka and Kaucic 2009; Curtis and Popal 2011; Pierce and Zilles 2017). In computing education, plagiarism is often encountered as source code plagiarism. This is where one student has appropriated the source code of another student (either as files or fragments), and submitted it as their own work (Parker and Hamblen 1989; Joy and Luck 1999; Cosma and Joy 2008; Sraka and Kaucic 2009). In order to identify cases of source code plagiarism, many automated tools and techniques have been proposed in the form of Source Code Plagiarism Detection Tools (SCPDTs) (Joy and Luck 1999; Martins et al. 2014; Novak et al. 2019). A SCPDT evaluates assignment submissions for similarity in order to identify suspiciously similar assignment pairs. A high similarity implies plagiarism has occurred, mid-range similarity can imply students have collaborated on an assignment, while low similarity does not raise suspicion of plagiarism.

In order to hide plagiarism, a plagiarising student may apply source code modifications to reduce the similarity of the plagiarised work to its original. Such modifications serve to differentiate the source code such that it evades detection by a human reviewer or automated detection tool. In this work, source code modifications used to hide plagiarism are collectively referred to as plagiarism-hiding modifications. Plagiarism-hiding modifications are commonly applied by either transforming the structure and appearance of the source code (e.g. by renaming identifiers, shuffling statements, or modifying comments); or by injecting spurious fragments of source code to give the appearance of distinct implementations (Faidhi and Robinson 1987; Whale 1990a; Prechelt et al. 2002; Freire et al. 2007; Granzer et al. 2013; Novak et al. 2019).

In order to accurately detect plagiarism, a SCPDT must be robust against plagiarism-hiding modifications.
Robustness is considered to be the ability of a SCPDT to withstand plagiarism-hiding modifications without a decrease in the measurement of similarity. The robustness of SCPDTs can be compared relatively by the impact plagiarism-hiding modifications have upon the measurement of similarity between a plagiarised work and its original. A SCPDT with greater robustness to plagiarism-hiding modifications will evaluate a lesser decrease in similarity as a result of the modifications; while a SCPDT that is vulnerable to plagiarism-hiding modifications will evaluate a greater decrease in similarity.

Robustness and accuracy are related but distinct qualities of a SCPDT. A SCPDT is robust when it can accommodate plagiarism-hiding modifications. This will result in a SCPDT reporting a high similarity between a plagiarised assignment with applied source code modifications and its source. Similarly, a SCPDT is accurate when it measures a high level of similarity between a plagiarised assignment and its source to imply plagiarism is present; while also measuring a low similarity between unrelated works, implying that plagiarism is not present.

Prior work has indicated that currently available SCPDTs are not robust to certain plagiarism-hiding modifications, especially when pervasively applied (Schulze and Meyer 2013; Cheers et al. 2020).
Pervasive plagiarism-hiding modifications occur when plagiarism-hiding modifications are applied throughout the body of plagiarised source code. This can result in many fine-grained modifications to the structure of the source code, overall resulting in a large decrease in measured similarity by a SCPDT. When pervasive plagiarism-hiding modifications are applied, it has been indicated that the similarity of plagiarised submission pairs can drop into a range that does not warrant suspicion of plagiarism (Cheers et al. 2020).

The research presented here evaluates the robustness of 11 SCPDTs against pervasively-applied plagiarism-hiding modifications. This is to identify vulnerabilities of available SCPDTs, as well as to identify potential future directions of work for the development of SCPDTs. Robustness is evaluated upon data sets of simulated plagiarism representative of undergraduate students. Cases of simulated plagiarism are generated with different selections of source code modifications, with increasing pervasiveness of modification (i.e. being increasingly modified). This allows for the evaluation of SCPDT robustness against diverse selections of source code modifications, along a sliding scale of pervasiveness of modification.

The remainder of this work is structured as follows. Section 2 presents background on existing SCPDTs and plagiarism-hiding modifications. Section 3 presents the design of software tools used in this evaluation to accommodate for known deficiencies in the evaluation of SCPDTs. Section 4 presents the experimental design and describes the purpose of the evaluation performed in this work. Sections 5 and 6 describe the setup of each evaluation and their performed experiments. Section 7 discusses the results of the evaluations and reflects upon the robustness of the evaluated SCPDTs. Section 8 discusses threats and vulnerabilities identified from the results of this work. Section 9 identifies related works. Finally, Section 10 concludes this research and identifies future directions of work.

1.1 Research Questions & Contributions

This work is guided by three research questions:
RQ1: What are the impacts of source code transformations on SCPDTs?
Source code transformations typically apply cosmetic and structural changes to the source code. However, the impact of specific source code transformations upon the measurement of similarity is unclear. Certain SCPDTs and techniques may be robust to specific transformations, but significantly impacted by others. Hence, how are SCPDTs affected by specific source code transformations?
RQ2: What is the impact of source code injection on SCPDTs?
Source code injection adds new fragments of source code to a program. This effectively changes the 'size' of the source code by introducing new elements for a SCPDT to compare. Injecting source code will undoubtedly lower the similarity scores of the compared submissions. However, it is unclear if this is by a meaningful amount such that it can hide indications of plagiarism. Furthermore, it is unknown if certain SCPDTs and techniques are more robust to the addition of source code fragments than others. Hence, how are SCPDTs affected by source code injection?
RQ3: What SCPDT is most robust to plagiarism-hiding modifications?
When a program is modified with plagiarism-hiding source code modifications, it will contain many changes to the structure of the source code. Hence, what SCPDT shows the least impact upon the evaluation of similarity in the presence of pervasive plagiarism-hiding modifications?

In the exploration of these research questions, four contributions are made:
– A comprehensive evaluation of existing SCPDTs for robustness to plagiarism-hiding modifications.
– The identification of 5 specific source code transformations that have a large impact on the evaluation of source code similarity with existing SCPDTs.
– The implementation of 6 simple SCPDTs used in this evaluation, 2 of which are modelled after unavailable SCPDTs.
– A toolset for the generation and evaluation of simulated undergraduate source code plagiarism data sets for similarity.
2 Background

This section will firstly provide background on existing SCPDTs, and secondly identify commonly encountered examples of plagiarism-hiding source code modifications.

2.1 Source Code Plagiarism Detection Tools

Many SCPDTs have been proposed to identify indications of plagiarism in pairs of undergraduate assignment submissions. This is typically through the evaluation of source code similarity. Approaches to Source Code Plagiarism Detection (SCPD) can be broadly categorised by the aspects of source code compared to evaluate similarity:
– Metric-based,
– Text-based,
– Token-based,
– Tree-based,
– Graph-based, or
– Behavioural.

Metric-based approaches count attributes of source code elements to identify similar documents. Metrics may include (but are not limited to) the number of operands, operators, declared variables or literals in code. Metric-based approaches have been shown to be less effective in comparison to structural approaches (Whale 1990a,b; Kapser and Godfrey 2003), and as such are mostly considered historic. Faidhi and Robinson (1987) identified suspected plagiarism by counting 24 distinct code metrics. Ottenstein (1976) compared source code with Halstead complexity measures (Halstead 1977) to identify programs with similar attribute counts. More recently, Shan et al. (2014) applied attribute counting and the chi-squared test method to evaluate program similarity.

Text-based approaches analyse textual character strings extracted from source code documents. Such approaches are typically applied to identify documents with short edit distances, or those with large quantities of overlapping sub-strings. Identifying such cases implies the documents have a high degree of similarity, providing an indication of plagiarism. For example, two documents with a relatively short edit distance imply they have the same origin, but have been modified. Likewise, documents with many overlapping sub-strings share content in common. Sim-Gitchell (Gitchell and Tran 1999a,b) applies string alignment to measure similarity (similar in concept to overlapping sub-strings). Sherlock-Sydney (Pike n.d.) compares text files through the extraction of digital signatures. Such signatures are simply hashed word sequences extracted from the source documents. These signatures are then compared for similarity. Rani and Singh (2018) proposed an extended Levenshtein string edit distance for measuring similarity.

Token-based approaches represent a source code document as a stream of tokens. A token is a lexically significant term in a programming language. Tokens may include (but are not limited to) identifiers, keywords, grammatical delimiters, and literal values. Streams of tokens can be compared for similarity using techniques similar to text-based approaches. It is also common to see approaches utilise a technique referred to as token tiling, where token strings from one program are placed over another to identify the coverage of token sequences. Two notable token-based plagiarism detection tools include MOSS and JPlag. MOSS (Schleimer et al. 2003) implements a winnowing algorithm to find overlapping sub-strings of token hashes. JPlag (Prechelt et al. 2002) implements a greedy token tiling algorithm that covers one source document with token sub-strings from another document. Sim-Grune (Grune and Huntjens 1989) is another similar token-based tool that identifies similarity through common token sub-strings.
This allows for the identification of similar segments of code that differ in terms of layout, comments, identifiers and literal values. More recently, Anzai and Watanobe (2019) proposed an extended edit distance for the calculation of program similarity that takes into consideration some of the common modifications applied by plagiarisers (e.g. changing the order of blocks and statements).

Tree-based approaches use a language-specific parser to construct Abstract Syntax Trees (AST). An AST represents the syntax of source code in a hierarchical manner, showing the grammatical structure of the source code. The AST is constructed with a stream of lexical tokens extracted from a source code document. The structure of this tree is then compared for similarity, for example, by finding isomorphic or similar sub-trees. Li and Zhong (2010) identified plagiarism by comparing AST structures. Zhao et al. (2015) identified plagiarism by comparing the hash values of AST nodes. Tree-based approaches are known to suffer from a high computational complexity due to the complexity of tree comparison algorithms (Baxter et al. 1998).

Graph-based approaches were developed to represent the semantics of source code. This is commonly through the use of Program Dependence Graphs (PDG) (Ferrante et al. 1987). Like tree-based approaches, graph-based approaches also suffer from high computational complexity in analysing similarity (Baxter et al. 1998). Liu et al. (2006), and Chen et al. (2010) implement such methods by evaluating the similarity of PDGs. These methods are claimed to be immune to plagiarism-hiding modifications such as statement reordering, and mapping statements to semantic equivalents. Alternatively, Chae et al. (2013) use an API-labelled control flow graph to identify similar sequences of API calls.

In recent years there have been a few notable approaches that perform dynamic or symbolic analysis on programs to identify similarity. These approaches consider the execution behaviour of a program in an attempt to be robust to obfuscations. JIVE (Anjali et al. 2015) analyses the call tree of a program to find similarity in method call sequences. The authors claim that the call tree is robust to obfuscations such as renaming and statement reordering. LoPD (Zhang et al. 2014) and CoP (Luo et al. 2017) attempt to identify the similarity of programs based on their implemented program logics, to determine if they are semantically equivalent. VaPD (Jhi et al. 2011) identifies program similarity through runtime execution analysis
by identifying identical values stored in memory during execution. The foundation of VaPD is that, from observation, certain runtime values of a program cannot be changed through semantics-preserving obfuscations.

There also exist hybrid tools that implement one or more of these approaches. Sherlock-Warwick (Joy and Luck 1999) implements both text and token-based similarity measurements, combined with normalisation of the source code to reduce any variation. Furthermore, there are also tools that implement other novel techniques to identify similarity. Chen et al. (2004) evaluate program similarity through approximations of Kolmogorov Complexity (Kolmogorov 1998). Cosma and Joy (2012) apply latent semantic analysis to match source code documents with similar terms with PlaGate. While Karnalim (2016) identified plagiarism through the analysis of Java Bytecode sequences.

2.2 Plagiarism-Hiding Modifications

Plagiarism-hiding modifications differentiate a plagiarised program from its original in an attempt to hide the committed plagiarism. This work explores two types of plagiarism-hiding modifications:
– Source code transformation
– Source code injection

Source code transformation occurs when the original source code is modified to appear different. Such transformations are typically cosmetic or structural in nature, and have no impact on the operation of the program. Behaviourally the plagiarised work will remain the same; however, the transformed program will appear different to a human reviewer. In general, a source code transformation will not introduce new statements into the source code; however, it may split existing statements where appropriate. Overall, the original source code is present, however it may take a structurally or cosmetically distinct form. Examples of source code transformations used to hide plagiarism include: modifying comments, reordering statements or members, replacing control structures with equivalents, mapping expressions to semantic equivalents, or renaming identifiers (Joy and Luck 1999; Jones 2001; Cosma and Joy 2008; Allyson et al. 2019).

Source code injection refers to the addition of new or unrelated source code to a program. In minor cases, this source code can be non-functional 'junk' (Jhi et al. 2011), and only serves to make a plagiarised program appear different. For example, injected source code can consist of simple 'print' statements, or unused variable declarations (Joy and Luck 1999). However, in more advanced cases the injected source code itself may be plagiarised. For example, a plagiariser may appropriate whole source files, classes, methods, or continuous blocks of code, and integrate them into their own work. This constitutes 'partial plagiarism', where only fragments of a plagiariser's work are inappropriately sourced.
A plagiarising student may apply plagiarism-hiding modifications with different intensity. This is subject to the skill of the plagiariser, and subsequently the effort they apply to evade detection. In basic cases of plagiarism, an undergraduate plagiariser will have little understanding of a program or fragment of source code that they have appropriated (Joy and Luck 1999). Hence, applied plagiarism-hiding modifications can be expected to be simple. For example, applying minor cosmetic changes such as reformatting code or modifying comments. However, there are potential cases where a more advanced plagiariser will begin to modify the structure of the source code throughout the entire program. This can be by reordering declarations and statements in the source code, or potentially creating new classes and methods. Furthermore, plagiarism can also be committed by more advanced students with greater programming skills (Cheers et al. 2020). For example, consider a time-poor student who is proficient at programming. They may apply many in-depth modifications to the source code, potentially pervasively modifying it such that the plagiarised code no longer bears resemblance to the original. In this case, the plagiariser has effectively paraphrased (rewritten) another's work to evade detection.

Other works term plagiarism-hiding modifications as source code obfuscation (for example, Jhi et al. (2011); Zhang et al. (2014); Luo et al. (2017); Ko et al. (2017)). Plagiarism-hiding modifications are a form of source code obfuscation. However, obfuscation is typically applied to reduce the comprehension or understandability of source code. Plagiarism-hiding modifications are not applied to reduce the comprehension of source code. The modified source code will still be understandable by a reviewer, but is expected to be superficially distinct compared to the original. Hence, plagiarism-hiding modifications can be seen as 'lesser' source code obfuscations. In this work, the term plagiarism-hiding modification is used to refer to source code obfuscations that are representative of those applied by undergraduate plagiarisers attempting to evade detection.
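As a concrete illustration of the transformation type of modification, consider the following hypothetical original fragment and a plagiarism-hiding variant of it. The fragment is invented for illustration only and is not drawn from any evaluated data set.

// Original fragment (hypothetical).
class Original {
    int sumScores(int[] scores) {
        int total = 0;
        for (int i = 0; i < scores.length; i++) {
            total += scores[i];
        }
        return total;
    }
}

// Plagiarised variant: identifiers renamed, compound assignment and increment
// expanded, for loop swapped for a while loop, and a comment added.
// The behaviour of the method is unchanged.
class Variant {
    int calcSum(int[] values) {
        // add up all the values
        int result = 0;
        int idx = 0;
        while (idx < values.length) {
            result = result + values[idx];
            idx = idx + 1;
        }
        return result;
    }
}

Individually each change is trivial, but applied pervasively across a whole submission such changes can substantially reduce the similarity measured by a SCPDT.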
From a review of existing evaluations of SCPDTs (Novak et al. 2019), two common deficiencies can be identified:
1. Tool availability - many evaluated SCPDTs are not made available for reuse.
2. Data set availability - evaluations do not use reproducible or comprehensive data sets.

The first deficiency, tool availability, is a significant problem in the evaluation of SCPDTs. Many proposed SCPDTs are simply not made available by their authors for reuse after initial publication (Novak et al. 2019; Cheers et al. 2020); and without access to the proposed SCPDTs, it is difficult to determine if newer approaches are suitable for use in the detection of plagiarism. For use in this evaluation, only 6 SCPDTs were identified to be available for reuse:
– MOSS (Schleimer et al. 2003)
– JPlag (Prechelt et al. 2002)
– Plaggie (Ahtiainen et al. 2006)
– Sim-Grune (Grune and Huntjens 1989)
– Sherlock-Warwick (Joy and Luck 1999)
– Sherlock-Sydney (Pike n.d.)

From a recent study of SCPDTs, Novak et al. (2019) confirmed that these 6 SCPDTs are commonly used in comparative evaluations of SCPDTs. Hence, it can be assumed that these 6 tools have potential to be used in the detection of plagiarism at academic institutions (that do not otherwise have their own internal
tools, or use commercial alternatives). However, these tools only represent two methods of measuring source code similarity. MOSS, JPlag, Plaggie and Sim-Grune utilise token-based similarity (or variants thereof). Sherlock-Sydney utilises text-based similarity, while Sherlock-Warwick implements both text and token-based similarity. No SCPDTs that implement metric-based, tree-based, graph-based, or behavioural methods of SCPD are known to be available for reuse.

The second deficiency, data set availability, directly impacts upon the reproducibility and reliability of SCPDT evaluations. In the evaluation of SCPDTs, the utilised evaluation data sets are commonly either not provided, or not comprehensive enough for adequate evaluation. For example, Cheers et al. (2020) identified that, of a total of 17 SCPDT evaluations,
10 did not provide the utilised data set (typically as they contain student assignment submissions) (Pike n.d.; Grune and Huntjens 1989; Prechelt et al. 2002; Schleimer et al. 2003; Chen et al. 2004; Jadalla and Elnagar 2008; Kustanto and Liem 2009; Cosma and Joy 2012; Anjali et al. 2015; Allyson et al. 2019).

It is difficult to evaluate a SCPDT without a data set representative of undergraduate assignment submissions. The most appropriate data sets are collections of undergraduate assignment submissions, as they represent the intended data to be evaluated by a SCPDT. However, as such data sets contain the works of students, they typically cannot be shared. This is due to privacy concerns (in sharing data sets with known cases of plagiarism) and issues with the ownership of the assignments. For example, at the author's institution, students retain ownership of their assignments. Hence, they cannot be freely shared.

The quality of any real data set of undergraduate assignment submissions for use in benchmarking SCPDTs must also be considered. In any real data set, there is no guarantee that plagiarism exists (at least in a form that is readily detectable for ground-truth comparisons); let alone plagiarism with a diverse range of plagiarism-hiding modifications, sourced from students with diverse skill sets. Hence, evaluations of SCPDTs cannot be guaranteed to evaluate a SCPDT against a wide range of plagiarism-hiding transformations, or even identify pervasively transformed cases of plagiarism.

This section will present software tooling used in this evaluation designed to overcome and address these deficiencies. Section 3.1 presents the design of 6 naive SCPDTs. This is to bring greater depth into the evaluation of SCPDTs by introducing SCPDTs that implement otherwise unavailable similarity measurement techniques. Section 3.2 presents the design of a data set generation tool,
SimPlag, that can be used for a reproducible method of source code plagiarism detection data set generation. These tools are subsequently combined in Section 3.3 as an automated SCPDT evaluation pipeline,
PrEP, that is used to facilitate the evaluations performed in this work.

3.1 Naive SCPDTs

In this section, the design of the 6 naive SCPDTs is presented. These tools are referred to as naive as they are simple implementations of source code similarity measurement techniques applied to SCPD. The naive tools do not implement efficiency optimisations, or optimisations to gain greater accuracy or robustness when analysing programs. Of the 6 naive tools, 4 implement similarity measurement techniques of existing SCPDTs (as string and token-based tools), while 2 implement similarity measurement techniques of otherwise unavailable SCPDTs (as tree and graph-based tools). In combination, the naive SCPDTs will be used to measure a baseline robustness of the implemented techniques for comparative purposes, and to add greater depth to the performed evaluations (as the number of currently available SCPDTs is limited). However, these tools do not solve the problem of tool availability, nor can their performance be considered to represent that of similar techniques. The naive tools are simply used as an indication of how robust and effective these techniques may be at measuring similarity in the presence of plagiarism-hiding modifications.
Fig. 1 Common naive SCPDT pipeline. Implementations of the transformer and similarity evaluator modules are dependent on specific source code similarity measurement techniques.
The implementation of all naive tools shares a common abstract pipeline, presented in Fig. 1. The tools firstly accept as input a set of assignment submissions as .java source files. Secondly, the source files in each assignment submission are transformed into a specific program representation using a Transformer Module. Thirdly, the similarity of all source file pairs between the two submissions is measured by the similarity of the derived program representations using a Similarity Evaluator. Finally, the similarity score of the two assignment submissions is calculated by aggregating the file-wise similarity scores as the average of the best-mapped file-wise similarity scores between the programs, shown in eqn. 1.
Sim(A, B) = \frac{ \sum_{n=1}^{|A|} \max(\{ FSim(a_n, b) : b \in B \}) + \sum_{m=1}^{|B|} \max(\{ FSim(b_m, a) : a \in A \}) }{ |A| + |B| }    (1)

Where:
– A, B are assignment submissions as .java source files
– a_n, b_m are .java source files, a ∈ A, b ∈ B
– |A|, |B| is the number of source files in A, B
– FSim(a, b) is the file-wise similarity of the file pair a, b
– max(X) is the maximum value in set X

Fig. 2 A Java "Hello, World" program represented as (a) source code, (b) a plain text string, (c) a lexical token string, (d) an abstract syntax tree (simplified for brevity), and (e) a program dependence graph.

The naive tools are constructed with four common program representations:
– Text (string)
– Token (string)
– Abstract Syntax Tree (AST)
– Program Dependence Graph (PDG)

The text representation interprets a source file as a character string. This is derived simply by reading each source file line-by-line and appending it to a string in memory. The token representation interprets a source file as a string of lexical tokens. Each token represents a lexically-important term in a source file. Tokens are extracted with a tokeniser provided as part of the JavaParser framework (https://github.com/javaparser/javaparser, last accessed May 1 2020). The tree representation interprets a source file as an AST. Each AST represents the source file as the syntactic structure of the Java programming language. ASTs are constructed using the JavaParser framework. The graph representation interprets a program as a set of PDGs (Ferrante et al. 1987). A PDG is constructed for each method declared in the source code by analysing the relations between statements. Statements and data are represented by nodes in the PDG. Control edges are created between nodes to indicate dependencies on the execution of statements. Data edges are placed between statements and data nodes to identify commonly referenced values. These four representations are exemplified in Fig. 2.
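To make the aggregation of eqn. 1 concrete, a minimal sketch of the submission-wise score calculation is given below. The FileSimilarity interface and the method names are illustrative stand-ins only, not the actual API of the naive tools.

import java.util.List;

// Minimal sketch of the submission-wise aggregation in eqn. 1.
// FileSimilarity is a hypothetical stand-in for a file-wise similarity evaluator.
interface FileSimilarity {
    double fileSim(String fileA, String fileB); // returns a value in [0, 1]
}

class SubmissionSimilarity {
    // A and B hold the source files of two assignment submissions.
    static double sim(List<String> A, List<String> B, FileSimilarity f) {
        double total = 0.0;
        // For each file in A, take its best-matching file in B ...
        for (String a : A) {
            double best = 0.0;
            for (String b : B) best = Math.max(best, f.fileSim(a, b));
            total += best;
        }
        // ... and symmetrically for each file in B against A.
        for (String b : B) {
            double best = 0.0;
            for (String a : A) best = Math.max(best, f.fileSim(b, a));
            total += best;
        }
        // Average over the total number of files in both submissions.
        return total / (A.size() + B.size());
    }
}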
The program representations derived from each file are compared using a Similarity Evaluator. Two basic types of similarity evaluation modules are implemented:
– Edit distance
– Greedy string tiling

All edit distance algorithms are conceptually the same irrespective of the representation they operate upon. They simply evaluate the number of edits required to turn one program representation into another. For text and tokens, the Apache Commons Text (http://commons.apache.org/proper/commons-text, last accessed May 1 2020) implementation of Levenshtein string edit distance is applied. Levenshtein string edit distance evaluates edit distance as the number of character/token deletions, insertions or substitutions required to turn one sequence into another. Tree edit distance is evaluated with the Java library APTED (https://github.com/DatabaseGroup/apted, last accessed May 1 2020). APTED implements a robust and memory efficient tree edit distance algorithm (Pawlik and Augsten 2015, 2016) that is suitable for comparing many large ASTs. Graph edit distance is evaluated using a recursive greedy edit distance algorithm to efficiently approximate the number of required edits to transform one PDG into another. With all four representations, the edit-distance-derived similarity is calculated with eqn. 2.

FSim_{ed}(a, b, d) = 1 - \frac{d}{\max(|a|, |b|)}    (2)

Where:
– a, b are derived file representations
– d is the edit distance between a and b
– |a|, |b| is the size of each file representation
– max(|a|, |b|) is the maximum size of program representations a and b

Greedy string tiling attempts to tile subsets of string A over string B to identify the total coverage of A over B. A naive implementation of greedy tiling is approximated by identifying all non-overlapping sub-strings between two documents of length greater than n to measure approximate program coverage. The similarity of two strings (representing source files) with the approximated greedy string tiling is calculated with eqn. 3.

FSim_{gst}(a, b, c) = \frac{2 \times c}{|a| + |b|}    (3)

Where:
– a, b are derived string-based file representations
– c is the number of covered string elements between a and b
– |a|, |b| is the size of each file representation
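As an illustrative sketch of how these two evaluator types derive a score, eqn. 2 and eqn. 3 might be computed as follows. The class and method names are hypothetical (not the naive tools' actual API), and the sketch assumes the Apache Commons Text library referenced above is on the classpath.

import org.apache.commons.text.similarity.LevenshteinDistance;

// Illustrative sketch of the similarity formulas in eqns. 2 and 3.
class NaiveSimilarity {
    // Eqn. 2: edit-distance-derived similarity of two string representations.
    static double editDistanceSim(String a, String b) {
        int d = LevenshteinDistance.getDefaultInstance().apply(a, b);
        return 1.0 - (double) d / Math.max(a.length(), b.length());
    }

    // Eqn. 3: tiling-derived similarity, given the number of covered elements c
    // reported by an (approximate) greedy string tiling of a over b.
    static double tilingSim(int lengthA, int lengthB, int coveredC) {
        return (2.0 * coveredC) / (lengthA + lengthB);
    }
}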
Table 1 Naive source code plagiarism detection tools.

Tool Name   | Program Representation | Similarity Evaluator
String ED   | Text (String)          | Levenshtein Edit Distance
String Tile | Text (String)          | Greedy String Tiling
Token ED    | Token                  | Levenshtein Edit Distance
Token Tile  | Token                  | Greedy String Tiling
Tree ED     | AST                    | Tree Edit Distance
Graph ED    | PDG                    | Graph Edit Distance
A total of six naive SCPDTs were composed from the listed program representations and similarity evaluators, listed in Table 1. In the performed evaluations, the naive tools will be used as a baseline for the robustness of their respective approaches without optimisation, in comparison to the evaluated academic SCPDTs. The implementations of the naive tools are MIT licensed, and can be found at https://github.com/hjc851/NaiveSCPDTools.

3.2 Data Set Generation

The most appropriate method of rectifying the issue of data set availability would be to release a ground truth data set of undergraduate assignment submissions for reuse. This would require identifying and labeling suspicious assignment pairs, and subsequently identifying how they are modified to enable correlation of scores. Such a data set would be similar to the code clone detection benchmarking data set BigCloneBench (Svajlenko and Roy 2015), which is a curated ground-truth data set of code clones. However, a curated data set of undergraduate assignment submissions for plagiarism detection has many associated issues, notably regarding ownership and quality of the data.

It is common that a student will retain ownership of their assignment submissions. Subsequently, it would border on theft to use these data sets and share them amongst peers for use in evaluating SCPDTs. There are also legal obligations guaranteeing confidentiality in student cases of plagiarism (applied to the author's institution). Therefore, any data set would need to be expertly de-identified with no indication of the student, or even the source of the plagiarised works. It is also ethically ambiguous if student works are used for the evaluation of SCPDTs for research purposes. Furthermore, there are issues of data set quality in real data sets sourced from undergraduate assignment submissions. There are no guarantees that in any arbitrary set of undergraduate assignment submissions there exist real cases of plagiarism, nor that they contain diverse examples of plagiarism-hiding modifications.

In the absence of a ground truth data set for SCPD, the generation of test data is an alternative option. This has been utilised in recent works for the evaluation of code similarity tools (Svajlenko et al. 2013; Ko et al. 2017; Ragkhitwetsagul et al. 2018). The basic idea of these tools is to apply source code modifications to a base program, and use it in the generation of test programs. This idea can be reapplied for the generation of source code plagiarism detection data sets. However, the generated test data must be representative of undergraduate source code plagiarism. Prior works have identified commonly applied modifications used to hide source code plagiarism (for example, Faidhi and Robinson (1987); Joy and Luck (1999); Jones (2001); Mozgovoy (2006); Freire et al. (2007); Allyson et al. (2019)). Many of these modifications have in common that they are applied to change the structure and appearance of the source code, while retaining the original semantics and behaviour. Such modifications can be automatically applied, and integrated into a tool that affords the generation of simulated plagiarism representative of undergraduate plagiarisers.
Fig. 3
The SimPlag simulated plagiarism generator pipeline.
An automated tool implementing a source code modification framework is developed to generate simulated cases of source code plagiarism representative of undergraduate programmers. This tool is titled
SimPlag: Simple Plagiarism Generator, and is available at https://github.com/hjc851/SimPlag. SimPlag does not solve the issue of data set availability, however, it aids in providing a reproducible method of generating test data for comparing SCPDTs. Hence, this tool represents a step towards a solution to this issue.
Fig. 3 presents the architecture of SimPlag. SimPlag is implemented with a pipe-and-filter design pattern to afford a simple but extensible implementation. SimPlag can be scripted to automatically produce simulated plagiarised test programs, generated with a configurable selection of source code modifications that can be applied with a weighted random chance to affect how pervasively modified the simulated plagiarised programs are. The operation of SimPlag is a three step process:
1. Pre-processing
2. Modification
3. Saving

Firstly, SimPlag accepts as input a single assignment submission. This submission is referred to as a 'base' program, and is used for the generation of multiple simulated plagiarised 'variant' programs (referred to as variants hereafter). SimPlag will parse the source files of the base program into ASTs using the Eclipse Java Development Tools. The AST of the base program is subsequently cloned once per configured number of variants to be produced, resulting in a set of variant ASTs in preparation for the application of source code modifications. Secondly, SimPlag will modify each clone of the base program to simulate plagiarism. The tool will apply a stack of user-configurable Modification Filters to the variant ASTs. Each modification filter applies an individual source code modification to the AST. The location and count of all applied source code modifications are recorded for analytical purposes. Finally, the variant ASTs are saved and written to disk as simulated plagiarised variants of the original base program. As part of this process, a standardised formatting is applied to each program. This is afforded with the Google Java code formatting tool (https://github.com/google/google-java-format, last accessed June 30 2020). By formatting the source code with this tool, it is enforced that all produced variant programs are parsable, and can be subsequently analysed by all known available SCPDTs. Furthermore, the source code will appear to be neatly formatted in an effort to make the code appear semi-realistic.

A generated data set for source code plagiarism detection needs to simulate as closely as possible a natural data set in terms of the types of modification applied, and the presentation of the program, in order to hide the plagiarism. It is logical to conclude that applied plagiarism-hiding modifications must not be so advanced that they are impossible for a novice programmer to apply, but not so simple that the programs remain effectively unchanged. From reviewed literature, two distinct types of source code modifications were identified: source code transformations, and source code injection. Many works reference examples of source code transformations as being largely cosmetic and structural (Faidhi and Robinson 1987; Joy and Luck 1999; Jones 2001); while examples of injected fragments of source code are typically small, self-contained, and non-functional, but may also result in code being mixed with self-written and plagiarised code (Freire et al. 2007; Granzer et al. 2013). This will be used as a guideline for the implementation of distinct source code transformation and source code injection modification filters.
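The three-step flow described above can be sketched roughly as follows. This is a simplified illustration only: SimPlag itself parses with the Eclipse JDT and applies configurable per-filter chances and limits, whereas this sketch uses JavaParser and a single hypothetical ModificationFilter interface.

import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical filter interface standing in for SimPlag's modification filters.
interface ModificationFilter {
    void apply(CompilationUnit variantAst);
}

class SimulatedPlagiarismGenerator {
    // Pre-processing: parse the base file and clone it once per variant;
    // Modification: run the filter stack over each clone;
    // Saving: write each modified clone out as a variant source file.
    static void generate(Path baseFile, Path outputDir,
                         List<ModificationFilter> filterStack,
                         int variantCount) throws Exception {
        CompilationUnit baseAst = StaticJavaParser.parse(baseFile);
        for (int v = 1; v <= variantCount; v++) {
            CompilationUnit variantAst = baseAst.clone();
            for (ModificationFilter filter : filterStack) {
                filter.apply(variantAst);
            }
            Files.writeString(outputDir.resolve("Variant" + v + ".java"),
                    variantAst.toString());
        }
    }
}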
Transformation Filters.
SimPlag implements a total of 14 source code transformations, listed in Table 2. Each source code transformation is implemented as an individual 'transformation modification filter' (referred to as a transformation filter).

Table 2 Source code transformations implemented in SimPlag.

Code | Source Code Transformation                   | Node(s) of Interest
tAC  | Add comments                                 | Classes, methods, fields, statements
tRC  | Remove comments                              | Classes, methods, fields, statements
tMC  | Mutate comments                              | Classes, methods, fields, statements
tRI  | Rename identifiers                           | Identifiers
tRS  | Reorder statements (within methods)          | Block statements
tRM  | Reorder class member declarations            | Classes
tSO  | Reorder expression operands                  | Binary expressions
tUD  | Up-cast primitive types                      | Primitive type names
tFW  | Swap for statement to while statement        | For statements
tEA  | Expand compound assignment expressions       | Compound assignment expressions
tEU  | Expand unary operator expressions            | Unary expressions
tSV  | Split group variable declarations            | Group variable declarations
tAD  | Assign default value to variable declaration | Variable declarations
tSD  | Split variable declaration and assignment    | Variable declarations

The source code transformations are applied using a simple tree-walk operation to the filter's nodes of interest (i.e. a node that it can be applied at). However, the transformations are not applied globally to each AST, but applied to randomly selected nodes of interest, determined by a configurable transformation chance parameter. The transformation chance indicates how likely each transformation filter will be applied to a node of interest. For example, if a transformation filter is applied with a 10% transformation chance, there is an approximate 10% chance the transformation filter will be applied at any node of interest in the AST. Hence, this parameter is used as a configurable method of changing how pervasively each source code transformation is applied. This value is also used to indirectly represent the 'effort' a plagiariser may apply to modify a program and hide their plagiarism. Higher modification chances indicate the plagiariser has taken more time and effort to hide their plagiarism, while lower modification chances imply lesser time and effort spent.

SimPlag only implements source code transformations that are characteristic of being applied by undergraduate programmers, and that have been listed in prior works (Jones 2001; Mozgovoy 2006; Allyson et al. 2019; Freire et al. 2007; Granzer et al. 2013; Joy and Luck 1999; Faidhi and Robinson 1987; Karnalim 2016). SimPlag does not implement transformations that are considered too difficult, or unrepresentative of a novice programmer with little program understanding. As such, the majority of implemented transformations are simple 'swap' or 'map' type operations, as well as adding or removing non-functional code. There are undoubtedly countless other source code transformations that could be implemented by SimPlag, for example, swapping switch statements to if statements, or in-lining method calls. However, the 14 implemented transformations were selected as they are simple to implement, and hence they are considered to be characteristic of novice programmers. Furthermore, many of the implemented transformations are provided as source code refactoring operations by Integrated Development Environments (IDE) such as Eclipse or IntelliJ IDEA. As such, it is conceivable a plagiariser with basic programming skills could use an IDE to facilitate the application of such source code transformations to hide their plagiarism, with minimal effort on their behalf.
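As an indication of how such a transformation filter might be structured, the following sketch applies a tEA-style expansion of compound assignments at randomly selected nodes of interest, weighted by a transformation chance. It is a hypothetical illustration written against JavaParser, not SimPlag's actual Eclipse JDT based implementation.

import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.expr.AssignExpr;
import com.github.javaparser.ast.expr.BinaryExpr;
import com.github.javaparser.ast.expr.Expression;
import java.util.Random;

// Hypothetical transformation filter in the style of tEA:
// expand compound assignments, e.g. "x += y" becomes "x = x + y".
class ExpandCompoundAssignmentFilter {
    private final double chance;            // transformation chance, e.g. 0.10
    private final Random random = new Random();

    ExpandCompoundAssignmentFilter(double chance) { this.chance = chance; }

    void apply(CompilationUnit variantAst) {
        // Tree-walk: visit every compound "+=" assignment (a node of interest) ...
        for (AssignExpr assign : variantAst.findAll(AssignExpr.class)) {
            if (assign.getOperator() == AssignExpr.Operator.PLUS
                    && random.nextDouble() < chance) {
                // ... and rewrite "target += value" as "target = target + value".
                Expression target = (Expression) assign.getTarget().clone();
                Expression value = (Expression) assign.getValue().clone();
                assign.setOperator(AssignExpr.Operator.ASSIGN);
                assign.setValue(new BinaryExpr(target, value, BinaryExpr.Operator.PLUS));
            }
        }
    }

    public static void main(String[] args) {
        CompilationUnit cu = StaticJavaParser.parse(
                "class T { void m() { int x = 1; x += 2; } }");
        new ExpandCompoundAssignmentFilter(1.0).apply(cu);
        System.out.println(cu); // "x += 2" is printed as "x = x + 2"
    }
}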
Injection Filters.
SimPlag implements injection modification filters (referred to as injection filters) to simulate cases where a student has copied code from a peer (e.g. through collaboration), or taken source code verbatim from a website. The tool supports the injection of source code fragments into a program within 4 fragment scopes:
– Whole file
– Whole class
– Whole method
– Whole statement

The different scopes of source code injection represent different severities of verbatim source code copying. The severity is considered in terms of how hard the plagiarised code is to detect, and how hard it is for the plagiariser to integrate into their own work. In its simplest form, a plagiariser may copy an entire source file and submit it in their own work. This is the easiest form of plagiarism to detect, while also requiring the least effort from the plagiariser. This is opposed to plagiarising individual or sequential statements of code. When statements are plagiarised and placed within the plagiariser's own source code, this becomes a much more difficult task to detect, but also much more difficult to integrate into their own work.

Each injected fragment type of source code is implemented as its own distinct injection filter. Like the transformation filters, injection filters have a configurable injection chance parameter. This parameter determines how likely each filter is to inject fragments of source code into the variant AST. However, injection modification filters also require a secondary configuration value, limiting the number of times they may inject code. This is largely a quality concern, to stop large quantities of source code fragments being injected into a comparatively smaller program. A limit is defined for all 4 types of injected source code fragments, limiting at most: f files can be injected into any variant, c classes can be injected into any file, m methods injected into any class, and s statements injected into any method. The statement injection filter has a variable upper limit on the number of statements that can be injected, prohibiting it from doubling the size of any existing method.

All injection filters share access to a user configurable 'seed pool' of source code files. The injection filters will use this pool for injecting source code into the variant ASTs. On startup, each injection filter indexes the seed pool in terms of its own required seeded fragments (i.e. the file injection filter will index files from the seed pool, etc.) to allow for the later selection of fragments to be injected.

The operation of the injection filters varies slightly from the transformation filters. Class, method and statement injection filters operate in a similar manner to transformation filters. They are applied with a simple tree-walk, with the potential to inject a source code fragment at a node of interest. The class injection filter will inject a type declaration into any AST compilation unit (that represents an individual source file); the method injection filter will inject a method declaration into any AST class declaration; and the statement injection filter will inject a randomly selected source code statement into an AST method declaration. At each node of interest, a random boolean value will be rolled (weighted by the filter's injection chance). If this boolean is true and the filter has not exceeded its pre-configured injection limit, a fragment of source code will be randomly selected from the seed pool and subsequently injected into the AST. However, the file injection filter has a considerably different operation.
The file injection filter does not modify the variant AST directly, but injects new ASTs into the variant. The file injection filter will roll a random boolean to determine if it should be applied, and subsequently, if true, inject up to f randomly selected files (as ASTs) into the variant.

The source code modifications implemented by SimPlag seek to generate data sets of simulated plagiarism that are characteristic of three scenarios:
1. A student has mis-appropriated another's source code in whole, and disguised the plagiarism with source code transformations.
2. A student has mis-appropriated fragments of another's source code and injected them into their own work.
3. Combinations of the above, where fragments are injected into their own work and subsequently transformed.

The first scenario is accommodated for using the source code transformation capabilities of SimPlag. The second scenario is accommodated for with the source code injection capabilities of SimPlag. While the third scenario is accommodated for using both the source code transformation and injection capabilities of SimPlag. It is argued that by only implementing source code modifications that are known to be representative of undergraduate plagiarisers (Faidhi and Robinson 1987; Joy and Luck 1999; Mozgovoy 2006), the generated test data is semi-authentic and can be used to represent similar cases of undergraduate source code plagiarism. However, the generated simulated plagiarised variant programs are synthetic test data. While the implemented modifications are representative of undergraduate plagiarisers (as referenced from literature), and the base programs are intended to be sourced from real undergraduate assignments, the generated variants are not real cases of plagiarism; and hence, they may be readily identifiable as synthetic to a human reviewer. This is first and foremost the biggest limitation of SimPlag, and its use in the performed evaluations. Hence, SimPlag can only be used for evaluating SCPDTs against the implemented plagiarism-hiding modifications (the purpose of this work), and not for their use in the detection of real cases of plagiarism. However, this limitation is balanced by the ability to generate large quantities of simulated plagiarised test data with diverse plagiarism-hiding modifications, as opposed to the collection of real cases of plagiarism that may or may not contain substantial plagiarism-hiding modifications.
There are also important limitations to the functional correctness of the test data generated by SimPlag. All test data produced by SimPlag is guaranteed to be parsable. Being parsable means the source code can be represented as an abstract syntax tree, and therefore the source code is grammatically correct. However, SimPlag does not enforce that the generated simulated plagiarism is semantically or functionally correct. Developing a tool that can guarantee the functional correctness of code is difficult and requires in-depth analysis of the source code and the impact of modifications. As a result of this, the simulated plagiarised programs produced by SimPlag may not compile, or may have strange behaviour at runtime. This limitation is largely caused by the implementation of 2 source code transformations, tRI and tRS, as well as the source code injection modifications.

The implementations of tRI and tRS are not guaranteed to preserve the validity of the variants. tRI will globally change user-defined identifiers in the program (limited to class, field, method, parameter and local variable names), and avoids renaming type names and variables declared in system libraries. However, full semantic analysis is not performed, and as such, there is the potential for errors to occur. tRS does not analyse the dependencies between statements when reordering, as it is implemented as a simple shuffle of statements. This was an intentional design decision, as analysing statement dependencies would severely limit the number of statements that could be shuffled, and hence, impact on the number of times tRS could be applied. Hence, to simulate more invasive shuffling of statements, dependency analysis was omitted. Furthermore, having invalid simulated plagiarised programs can add to the realism of the applied plagiarism-hiding modifications. For example, consider a novice programmer who applies the modifications without the skills to validate the correctness of the program, and consequentially invalidates the program. Such cases still need to be detected by SCPDTs.

The injection filters are likely to invalidate the correctness of the variants to varying extents. The file and class injection modifications should in theory have no or minimal effect on the validity of the generated variants, assuming there are no naming conflicts caused by the injected files or class fragments. However, the method and statement injection filters are expected to potentially invalidate the correctness of the variants, subject to the quality of the seed data. Both the method and statement injection filters do not validate whether the injected fragments have external dependencies (i.e. on the declaring class for methods, or within the declaring method for statements). These filters simply inject fragments of source code into the variants, and hence, will invalidate the variants if the seed data has external dependencies. This is again an intentional design decision. If all fragments of source code were required to be self-contained, it would risk impacting upon the quality and complexity of injected source code fragments. That is, being able to inject complex source code fragments (that the plagiariser themself may not understand), as opposed to simple one-line fragments of source code with no real meaning.

However, the limits on the functional correctness of the simulated plagiarised programs are not expected to have a profound impact upon the evaluated SCPDTs. Of the 6 known available SCPDTs, none require the source code to be parsable, let alone compiled and/or executed.
Hence, it will not affect the operation of the SCPDTs. Similarly, for the 6 naive SCPDTs, only Tree ED and Graph ED require the test programs to be parsable. However, Graph ED in particular is expected to be impacted upon by modifications that affect the semantics of the source code, as
Graph ED measures the semantic similarity of the source code (as the relations between terms) and not the structure of the source code. As tRI, tRS, and the method and statement injection modifications have the potential to invalidate the functional correctness of the program, and by extension change the semantic relations, it is expected that these modifications in particular will have a significant impact upon Graph ED's ability to evaluate similarity. However, these limitations are mitigated by the need for confidence in SCPDTs in that they can detect indications of plagiarism, even if the plagiarised work is not functionally correct.

3.3 Evaluation Pipeline

To facilitate the evaluations of SCPDT robustness performed in this work, a reusable automated evaluation pipeline was developed. This pipeline is titled
PrEP: the Program Evaluation Pipeline. PrEP was developed to automate the batch evaluation of SCPDTs. The framework is implemented in Kotlin (https://kotlinlang.org, last accessed May 1 2020) and runs on the Java Virtual Machine. It utilises multi-processing to enable the structured evaluation of tools in a manner that is easy to monitor and fault tolerant, but scalable based on computing resources. It exposes SCPDTs through Java bindings, enabling both Java and non-Java SCPDTs to be integrated into a single pipeline. An important feature of this framework is the optional seeding of simulated plagiarised submissions through the integration of SimPlag. This integrates both test data generation and evaluation into a single pipeline approach. The implementation of PrEP is available at https://github.com/hjc851/SCPDT-PrEP.
Fig. 4
The PrEP tool evaluation pipeline framework.
The PrEP framework is structured as a modular pipeline to enable future extension. Fig. 4 provides an overview of this pipeline. It is divided into four phases:
1. Input
2. Seeding (Optional)
3. Detection
4. Reporting
Input is the provisioning of test data sets, tool configuration (for both SimPlag and the utilised SCPDTs), and optionally seed data sets to the pipeline. Seeding is the addition of simulated plagiarised submissions into the test data set. The simulated plagiarism is optionally sourced from a seed data set. Detection is the process of invoking the integrated SCPDTs to evaluate the submissions for similarity. These are invoked in batch for the entire data set, evaluating both submission-wise and file-wise similarities. Reporting aggregates the scores from each tool for further comparison and evaluation. In addition to this, a list of simulated plagiarised submissions and their applied source code modifications is generated to enable the accurate analysis of results when used to evaluate SCPDTs.

By integrating data set seeding into the framework, an evaluation data set can be generated automatically as part of the evaluation process. SimPlag is used to generate multiple variants of a single assignment submission that can enable benchmarking against any selection of supported plagiarism-hiding modifications. Furthermore, the framework is able to offer ground-truth evaluations. By generating test data, it is already known which submissions are representative of plagiarism, and how they are modified. This is something that cannot be achieved with standard academic data sets without manual review. As a result of this, the framework aids in the accurate evaluation of tools, and the ability to correlate the impact of each modification on the evaluation of similarity.
A total of 11 SCPDTs are embedded in PrEP. Each tool is exposed through a Java binding (i.e. as a Java class), enabling procedural invocation of the tool from Java code. The binding acts as an adapter, transforming the raw console output of a tool into structured object values. The embedded tools are:

– JPlag
– Plaggie
– Sim-Grune
– Sherlock-Sydney
– Sherlock-Warwick
– The six naive tools (Table 1)

All tools are exposed through three modes of comparison: file-wise, submission-wise, and batch. File-wise comparison lists the pairwise similarity scores of all files in a pair of submissions. Submission-wise comparison provides the similarity score of two individual assignment submissions. Batch comparison lists the pairwise similarity scores of a set of submissions. Not all tools support all three modes of comparison natively. For example, Sim-Grune and Sherlock-Sydney only support file-wise comparison. The tool bindings accommodate this where required by aggregating file-wise scores into submission-wise scores (as per eqn. 1).

While MOSS is known to be an available SCPDT, PrEP omits embedding MOSS into the pipeline. In pilot experiments, it was found that MOSS is unreliable in use for very large data sets. MOSS would often hang for long periods of time while processing the data sets (with each base program and variants submitted as one request), and often fail to provide a response. It could only be used reliably on a small number of comparisons. This is consistent with the observations of MOSS in prior work by Cheers et al. (2020). Hence, for reliability, MOSS is omitted.
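As an illustration of the adapter role that a binding performs, the sketch below shows one plausible shape of such a binding: it invokes a console tool, parses the raw output into structured values, and aggregates file-wise scores into a submission-wise score. The class and method names are hypothetical, and the size-weighted average shown is only an illustrative stand-in for the aggregation defined by eqn. 1.

// Hypothetical binding for a console-based SCPDT: the adapter runs the external tool,
// parses its raw textual output into structured values, and aggregates file-wise scores.
import java.io.File

data class FileScore(val fileA: File, val fileB: File, val similarity: Double)

class ConsoleToolBinding(private val executable: String) {

    // File-wise comparison: invoke the external tool and parse lines of the form "a.java b.java 87.5".
    fun fileWise(submissionA: File, submissionB: File): List<FileScore> {
        val process = ProcessBuilder(executable, submissionA.path, submissionB.path)
            .redirectErrorStream(true).start()
        return process.inputStream.bufferedReader().readLines()
            .mapNotNull { line ->
                val parts = line.trim().split(Regex("\\s+"))
                if (parts.size != 3) return@mapNotNull null
                val score = parts[2].toDoubleOrNull() ?: return@mapNotNull null
                FileScore(File(submissionA, parts[0]), File(submissionB, parts[1]), score)
            }
    }

    // Submission-wise comparison: aggregate file-wise scores, here weighted by file size
    // (an illustrative choice only; the paper's eqn. 1 defines the aggregation actually used).
    fun submissionWise(submissionA: File, submissionB: File): Double {
        val scores = fileWise(submissionA, submissionB)
        if (scores.isEmpty()) return 0.0
        val totalWeight = scores.sumOf { it.fileA.length() + it.fileB.length() }.toDouble()
        return scores.sumOf { (it.fileA.length() + it.fileB.length()) * it.similarity } / totalWeight
    }
}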
JPlag (Prechelt et al. 2002) operates by applying a token tiling algorithm to cover one source code file with tokens extracted from another. If two source files have a large degree of coverage, they can be considered similar and hence a candidate for plagiarism. First, source code files are converted into a stream of tokens. JPlag uses its own set of tokens which abstract standard language tokens to avoid matching the same token with different meanings. Second, extracted tokens are compared between files to determine similarity by the Running-Karp-Rabin Greedy-String-Tiling algorithm, where tokens from one file are covered over another within a tolerance of mis-match. Program similarity is evaluated as the percentage of tokens from one program which can be tiled over another program.

Plaggie (Ahtiainen et al. 2006) is a tool that is claimed to operate similarly to JPlag. However, it is an entirely local application, compared to JPlag which was originally provided as a web service. No known publication describes the operation of Plaggie; however, from examining its implementation, it operates upon tokenised representations of the source code, evaluating similarity by token tiling. Hence, it can be assumed it has similar performance to JPlag.

Sim-Grune (Grune and Huntjens 1989) analyses programs for structural similarity through the use of string alignment. For two programs, Sim will first parse the source code, creating a parse tree. The tool will then represent the parse trees as strings and align them by inserting spaces to obtain a maximal common subsequence of their contained tokens. The similarity of programs is then evaluated as the quantity of matches.

Sherlock-Warwick (Joy and Luck 1999) implements both text and tokenised comparison methods. In the tool, a pair of programs are compared for similarity 5 times: in their original form, with whitespace removed, with comments removed, with whitespace and comments removed, and as a tokenised file. In all cases, the comparisons measure similarity through the identification of runs. A run is a sequence of lines common to two files which may be interrupted by anomalies (e.g. extra lines).

Sherlock-Sydney (Pike n.d.) analyses programs for lexical similarity. Digital signatures of source code are generated by hashing string token sequences (not lexical tokens) extracted from a text file. The digital signatures are then compared, with the similarity of files being evaluated as the number of digital signatures in common.
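The token-tiling idea underlying JPlag, Plaggie and the naive Token Tile tool can be illustrated with a deliberately simplified sketch. It is not JPlag's RKR-GST implementation (there is no mis-match tolerance and no Karp-Rabin hashing), and the minimum match length and similarity formula are assumptions for illustration only.

// Simplified greedy token tiling (illustrative only; not JPlag's actual RKR-GST implementation).
// Repeatedly finds the longest unmarked common token run of at least minMatch tokens,
// marks it in both sequences, and reports coverage as a percentage of tokens tiled.
fun tileSimilarity(a: List<String>, b: List<String>, minMatch: Int = 12): Double {
    if (a.isEmpty() && b.isEmpty()) return 100.0
    val markedA = BooleanArray(a.size)
    val markedB = BooleanArray(b.size)
    var tiled = 0
    while (true) {
        var bestLen = 0; var bestI = -1; var bestJ = -1
        for (i in a.indices) {
            for (j in b.indices) {
                var len = 0
                while (i + len < a.size && j + len < b.size &&
                       !markedA[i + len] && !markedB[j + len] &&
                       a[i + len] == b[j + len]) len++
                if (len > bestLen) { bestLen = len; bestI = i; bestJ = j }
            }
        }
        if (bestLen < minMatch) break
        for (k in 0 until bestLen) { markedA[bestI + k] = true; markedB[bestJ + k] = true }
        tiled += bestLen
    }
    return 200.0 * tiled / (a.size + b.size)   // percentage of tokens covered across both sequences
}

Because every matched tile must contain at least minMatch consecutive tokens, edits that break long common runs into shorter segments directly reduce the coverage such a tool can report.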
Two distinct evaluations are performed in order to address the research questions guiding this work. Firstly, the available SCPDTs are evaluated for robustness against source code transformations in Section 5. Secondly, the available SCPDTs are evaluated for robustness against source code injection in Section 6. The two types of plagiarism-hiding modifications are evaluated separately as they have different impacts on the source code. For example, transformations do not necessarily add code, but modify existing code; while injection adds new code not present in the original program.
Both evaluations are broken down into distinct experimental cases, allowing for the comparison of SCPDT robustness against different selections of source code modifications. Different intensities of modification are used to evaluate the SCPDTs against progressively more pervasively modified cases of simulated plagiarism. The intensity of applied plagiarism-hiding modifications is expressed through the transformation chance and injection chance parameters exposed by SimPlag. Test data generated with higher transformation/injection chances are expected to have more modifications applied to the source code itself (i.e. more nodes of interest transformed, and larger quantities of source code injected), and hence such variants are considered to be more pervasively modified. This will allow for the evaluation of the SCPDTs on a sliding scale of pervasiveness of modification.

This section will provide an overview of the common design and setup of the performed evaluations. Firstly, the scope of the evaluations is defined. Secondly, the measures used to compare the robustness of the evaluated SCPDTs are presented. Thirdly, the data sets used in the evaluation, along with the utilised test data generation process, are described. Fourthly, the configuration of the utilised SCPDTs is defined and justified.

4.1 Scope of Evaluations

The purpose of the performed evaluations is to compare the robustness of SCPDTs to plagiarism-hiding modifications. The evaluations are explicitly not designed to evaluate the accuracy of the compared SCPDTs in detecting instances of plagiarism. An evaluation of accuracy typically compares tools by the number of suspected cases of plagiarism correctly identified. However, a limitation of SCPDTs is that they do not specifically detect plagiarism. Instead, they detect indications of plagiarism, with the identification of plagiarism subject to human review (Joy and Luck 1999; Cosma and Joy 2012). Hence, as the purpose of this work is to evaluate robustness (as a function of evaluated similarity) and not correct detections of simulated plagiarised works, no conclusion can be made from the results of the experiments in regards to the accuracy of the evaluated SCPDTs. Evaluating the accuracy of SCPDTs is subject to future work.

The performed evaluations do not compare tools from similar domains such as code clone detection (CCD). CCD and SCPD have much in common. For example, they both utilise similar techniques in evaluating source code similarity (Roy and Cordy 2007; Ragkhitwetsagul et al. 2018), and use similar taxonomies of source code transformations (see the 6-level taxonomy of Faidhi and Robinson (1987) and the common 3-type code clone taxonomy (Bellon et al. 2007)). As a result of this, it is common to see SCPDTs used in CCD evaluations (e.g. the evaluations performed by Burd and Bailey (2002); Schulze and Meyer (2013); Ragkhitwetsagul et al. (2018)). However, in this article, evaluating Code Clone Detection Tools (CCDTs) is considered out of scope. This is fundamentally due to the performed experiments being designed for the evaluation of SCPDTs and not CCDTs.

The purpose of a CCDT is to detect similar fragments of source code. A CCDT is typically evaluated in its ability to detect smaller fragments of source code injected into another larger body of source code (e.g. as evaluated by Bellon et al. (2007)). The accuracy of the tool can then be measured in terms of how many injected fragments of source code are correctly identified.
If a CCDT was to be evaluated for robustness to modification, it is logical that source code modifications could be applied to the injected code fragments themselves. However, the performed experiments are effectively the opposite of this. Whole programs are cloned, and have plagiarism-hiding modifications applied. There is no concept of a traditional code clone to be detected. Instead, these experiments are interested in the evaluation of overall program similarity, and the effect source code modifications have upon it. Hence, it would not be fair to evaluate CCDTs, as there are no code clones to detect in these experiments.

In addition to this, there are no directly comparable metrics reported between a SCPDT and CCDT for use in these experiments. In general, a CCDT does not report the similarity of two programs, but the quantity of code clones in common between them. In order to effectively compare CCDTs as SCPDTs, a bridging mechanism would have to be used to convert CCDT results into similarity scores. A simple method for this conversion would be to identify the coverage of identified code clones over the size of the programs (e.g. as used by Ragkhitwetsagul et al. (2018)). However, this introduces a new dependent variable in the evaluations: how to calculate the size of a program? Identifying the optimal method for calculating program size is itself a significant undertaking. To focus the performed evaluations, it is considered outside of the scope of this work, and hence, only SCPDTs will be evaluated. This is not to say CCDTs cannot or should not be evaluated in the detection of plagiarism. As future work, it would be interesting to compare SCPDTs and CCDTs in the presence of plagiarism-hiding modifications using a modified experimental method that is fair to both tools. However, such an experiment would most likely be focused on the measurement of tool accuracy in the presence of plagiarism-hiding modifications. This, again, is not the focus here.

4.2 Evaluation Data Sets
Table 3
Base evaluation data set overview.

Assignment Set | Year Level | Assignment Count | Average LLOC | Files | Classes | Methods | Statements
AS1            | 1          | 223              | 389.39       | 3.72  | 3.71    | 34.43   | 500.29
AS2            | 1          | 173              | 396.87       | 3.76  | 3.77    | 46.19   | 525.38
AS3            | 2          | 73               | 225.03       | 5.03  | 5.15    | 29.42   | 294.74
AS4            | 2          | 72               | 227.43       | 5.76  | 5.88    | 39.94   | 312.44
AS5            | 3          | 17               | 46.59        | 2.65  | 2.71    | 9.18    | 165.65
AS6            | 3          | 17               | 242.82       | 13.47 | 13.29   | 57.53   | 352.06
Total          | -          | 575              | 336.03       | 4.41  | 4.44    | 37.96   | 440.99
Both evaluations utilise a large data set of undergraduate assignment submissions as base programs, and use these submissions to generate an even larger pool of plagiarised variants as test data. The base evaluation data set is comprised of 6 sets of assignment submissions. These data sets are referred to as assignment sets 1 through 6 (AS1 to AS6). Each assignment set contains undergraduate assignment submissions of varying size and complexity, representing a total of 3 years of undergraduate study, from students with varying skill levels, and assignments implemented with varying technical complexity. Table 3 presents the average size and metrics of each assignment set, and of the whole data set in total. For the purpose of expressing the size of each individual submission, logical lines of code (LLOC) are used. This is formulated from the distinct non-block statement count in each program.

Each individual experiment generates a distinct test data set of simulated plagiarised variant programs with SimPlag. This allows for the generation of a large number of variants, thereby allowing the impacts of source code modifications to be identified on average across a large sample of data. Each experiment's test data set is generated using a distinct selection of plagiarism-hiding modifications, in line with the goal of the experiment. 6 distinct test data sets are generated in total. Evaluation 1 (Section 5) applies combinations of source code transformations across 3 experiments, as well as one extended case. Evaluation 2 (Section 6) applies combinations of source code injection operations across 2 experiments. In each experiment, 5 variants of each base program from the source data set are created, repeated using 6 incremental transformation chance and injection chance probabilities: 10%, 20%, 40%, 60%, 80%, and 100%. The lower probabilities will cause SimPlag to apply fewer source code modifications when generating test data, while the higher probabilities will cause more modifications to be applied. Using this test data generation method, between 30 (5 variants at 6 chances of modification) and 420 (5 variants at 6 chances of modification for each of the 14 transformations) variants of each base program are created in each experiment.

4.3 Robustness Metrics

Robustness is considered to be the ability of a SCPDT to withstand source code modification without a decrease in the measurement of similarity. In order to compare the robustness of the evaluated SCPDTs, two comparison metrics are used.

Firstly, the robustness of the SCPDTs will be compared using a quantitative metric, measuring the average similarity of all variants compared to their respective base programs. The quantitative metric is used to demonstrate the impact of individual source code modifications upon the generated test data sets. It is calculated with eqn. 4.
AvgSim(S) = \frac{\sum_{i=1}^{|S|} S_i}{|S|}    (4)

Where:
S is the set of similarity scores between each variant and its base program for a single SCPDT
|S| is the number of similarity scores in S

Using this quantitative metric, SCPDTs that measure a higher average similarity will be considered to be more robust to applied plagiarism-hiding modifications, while SCPDTs that measure a lower average similarity will be considered to be less robust. This value will always be bound between 0 and 100, assuming all scores in S are also in this range. Hence, it will show on average the similarity decrease of the SCPDT as a result of applied plagiarism-hiding modifications.

Secondly, a comparative robustness metric is used to compare the robustness of each SCPDT when evaluating similarity on each test data set. This will compare the SCPDTs by the ratio of applied source code modifications to the total decrease in similarity incurred as a result of the plagiarism-hiding modifications. As the evaluations apply modifications in two forms (transformations and injection), two variations of the comparative metric are introduced to compare SCPDT robustness:

– Robustness to Code Transformation (RCT), and
– Robustness to Code Injection (RCI).

RCT measures the ratio of source code transformations applied to a program compared to the decrease in measured similarity. This is expressed through eqn. 5 as the number of transformations required to decrease the evaluated similarity by 1%.
RCT(B, V, n) = \frac{n}{100 - Sim(B, V)}    (5)

Where:
V is a variant of base program B
n > 0 is the number of times source code transformations are applied to B, transforming it into V
Sim(B, V) is the similarity of B and V evaluated with a SCPDT

For example, if applying 50 transformations to B reduces the evaluated similarity of B and V from 100% to 75%, then RCT = 50 / 25 = 2 transformations per 1% decrease in similarity.

RCI measures the ratio of inserted lines of code (LOC) compared to the decrease in measured similarity. This is expressed through eqn. 6 as the number of lines injected in order to decrease the evaluated similarity of a program by 1%.

RCI(B, V, l) = \frac{l}{100 - Sim(B, V)}    (6)

Where:
V is a variant of base program B
l > 0 is the LOC count injected into B, transforming it into V
Sim(B, V) is the similarity of B and V evaluated with a SCPDT

The comparative metrics are strictly for comparing the robustness of SCPDTs on the same data set generated with the same method of applying plagiarism-hiding modifications. Both metrics are similar; however, they are required as the two types of modification have different impacts upon a body of source code (i.e. transformation changes existing source code, injection adds new fragments of source code). Hence, this requires expressing the impacts of the modifications in terms of the aspects of source code modified. A higher RCT/RCI value will imply a SCPDT is more robust than a SCPDT with a lower RCT/RCI value, and vice-versa. This is reflected by more modifications needing to be applied in order to reduce the evaluated similarity by 1%, while a low RCT or RCI reflects that few modifications are required to reduce the evaluated similarity by 1%. However, they are not normalised measures, and cannot be used as a universal determination of robustness when comparing SCPDTs between different data sets. Comparing the RCT and RCI scores for SCPDTs on different data sets is not meaningful. Furthermore, these metrics only account for measuring robustness where there is at least one source code modification applied that results in a decrease in similarity. Similarly, the two robustness equations should not be considered to imply that the number of modifications needed to reduce the similarity by 1% is a constant. It is understandable that when the similarity between two programs is low, the number of modifications needed to reduce the similarity by 1% is higher than when the two programs share greater commonality. The measured RCT/RCI is simply a sample of the robustness of a SCPDT with a specific selection of modifications applied. The performed experiments will evaluate the similarity of programs that are generated from the same base. In this case they share great commonality; thus, roughly speaking, the score provides a lower bound on the number of modifications needed to reduce similarity by 1%.

4.4 Utilised SCPDTs & Configurations

The performed evaluations compare the 11 SCPDTs integrated in PrEP. These consist of the 5 academic SCPDTs: JPlag, Plaggie, Sim, Sherlock-Warwick and Sherlock-Sydney; as well as the 6 naive SCPDTs: String Tile, String ED, Token Tile, Token ED, Tree ED and Graph ED. In the performed experiments, all SCPDTs are executed with their default configuration parameters. While prior works have indicated that code similarity tools in general can gain greater performance through the selection of optimal configuration parameters (Ragkhitwetsagul et al. 2018; Ahadi and Mathieson 2019), it is argued that this is not representative of real-world use of SCPDTs. An academic using a SCPDT will not know in advance the optimal configuration values for any given tool when assessing any arbitrary data set.
Furthermore, time will not typically permit an academic to evaluate a data set for plagiarism using multiple tool configurations to find the best possible result as, again, the best possible result is not known in advance. Hence, this work assumes that the original authors of the evaluated SCPDTs have selected default configuration parameters that produce on average acceptable results.

The naive String Tile and Token Tile tools both require specifying minimum match lengths. String Tile utilises n = 20, as this is the approximate average length of expressions in the base data set (being approximately 16, with a buffer of 2 characters each side to avoid false positives). This will cause the tool to match sub-strings of at least this length. Token Tile utilises n = 12. This is justified as prior works have suggested similar values for token-based tools (e.g. Ragkhitwetsagul et al. (2018); Ahadi and Mathieson (2019)), and JPlag utilises this same default value (Prechelt et al. 2002). The remaining naive SCPDTs (all being edit distance based) do not support configuration, as all edits have a hard-coded weight of 1.

This evaluation will compare the robustness of the SCPDTs to source code transformations. Due to the methods of measuring similarity, some tools are by design more robust to certain transformations. For example, token-based tools ignore commenting and general formatting, hence they are not impacted by such transformations. However, there are numerous source code transformations that can be applied to hide plagiarism, each of which may have a substantial impact on the measurement of similarity.
Three experiments are performed to compare the robustness of the selected tools to specific selections of source code transformations. Firstly, the tools are evaluated using data sets created with each transformation applied in isolation. Secondly, the tools are evaluated using random selections of transformations, over multiple iterations. Thirdly, they are evaluated by applying all transformations in unison. Ideally, each individual selection of transformations would be evaluated in isolation. However, with 14 transformations this leads to 16,383 unique selections, each requiring evaluation separately on each program in the data set. This is computationally infeasible for this evaluation, as it would require generating 282,606,750 variants (from the 575 base programs, creating 5 variants at each of the 6 transformation chances for each selection of transformations), each requiring comparison with each of the 11 SCPDTs; a short calculation of these sizes is sketched below. As such, random selections of transformations are used to gain coverage in the evaluated selections of transformations.

The three experiments are designed to progressively test robustness to source code transformation by combining different types of transformations. The first experiment accommodates the simplest case with minimal transformations, restricted to only one type of source code transformation applied to each variant. The second experiment simulates more realistic cases of plagiarism, where a plagiariser transforms a program with various types of transformations. Thirdly, all transformations are applied to simulate an extreme case of plagiarism with all supported source code transformations. These three experiments provide coverage to identify which transformations are most effective at reducing the similarity of each tool. Doing so will allow for identifying which transformations each tool is most robust against, and subsequently which transformations they are most vulnerable to; and hence explore RQ1.

The 14 source code transformations are always applied in the same order as listed in Table 2. This decision was made to eliminate the complexity introduced through the number of permutations of transformations if the ordering of application was configurable. This is deemed acceptable as there are no direct interactions between the transformations, except in two cases:

1. Add comment, Remove comment, & Modify comment
2. Assign default value to variable, & Split variable declaration and assignment

These interactions will have the greatest impact at the 100% transformation chance. The first case will cause all comments to be removed. This in theory will only affect the string-based SCPDTs, and in a worst case increase the average similarity scores reported by such tools, as a point of variation is removed from the generated variants. In the second case, it will result in all variable declarations having no assigned default value, with all variables being the target of an assignment expression after declaration. However, it is not expected that these interactions will have a profound impact on results even in the most extreme case.

Increasing the transformation chance allows the impact caused by each transformation to be traced, as the probability that the transformation is applied at any node of interest is also increased. The generated test data set for this experiment contains 241,500 variant programs. This is broken down into 5 variants for each base program, with each of the 14 transformations applied in isolation, generated with the 6 transformation chances.
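As a sanity check, the experiment sizes quoted above can be reproduced with a few lines of arithmetic; the constants are taken directly from the text and the snippet itself is illustrative only.

// Reproduces the experiment-size figures quoted in the text.
fun main() {
    val basePrograms = 575
    val variantsPerChance = 5
    val transformationChances = 6
    val transformations = 14

    val selections = (1 shl transformations) - 1               // 16,383 non-empty selections
    val variantsPerSelection = basePrograms * variantsPerChance * transformationChances   // 17,250

    println(selections)                                         // 16383
    println(variantsPerSelection.toLong() * selections)         // 282606750 (exhaustive case, infeasible)
    println(variantsPerSelection * transformations)              // 241500 (isolated-transformation data set)
}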
Fig. 5
Heat maps representing the average similarity of variants generated with each individual source code transformation, with one heat map for each of the six transformation chances (10%, 20%, 40%, 60%, 80%, 100%). Darker colours indicate higher similarity scores, and hence higher robustness to the applied transformation.
Fig. 5 presents the average similarity of the variants created with each individual transformation at the 6 transformation chances. Six heat maps are presented, one for each transformation chance. Darker colours indicate a higher average level of similarity, while lighter colours indicate a lower evaluated average similarity. From this figure, it is clear that the string-based tools (Sherlock-Sydney, String Tile, String ED) are not robust to the application of any source code transformations. This is indicated by the consistently light-coloured rows for these tools. Sherlock-Sydney demonstrates the largest vulnerability to all transformations across all transformation chances. Furthermore, the String Tile and String ED tools show similarly consistent decreases in similarity. This implies the string-based tools are all impacted by any applied transformation. Sherlock-Warwick also suffers a noticeable decrease in similarity across all transformations, as it does rely upon string-based metrics to evaluate similarity. However, its drop in similarity is not as prevalent as the string-only tools, presumably as it integrates token-based similarity measurement.

The results for the non-string-based tools provide insight into the vulnerabilities of token, tree and graph-based techniques. Starting at the 60% transformation chance and continuing until 100%, a trend of common vulnerabilities is demonstrated by the lighter columns in the heat maps. While the lighter columns are not consistent amongst the non-string-based tools (implying certain tools show greater robustness to certain transformations), a common trend can be seen by analysing the scores with the greatest decrease in similarity.
Table 4
Ranking of transformations applied to non-string-based SCPDTs at the 100% transformation chance. Transformations are ranked left (greatest decrease) to right (least decrease). Horizontal bars '—' delimit transformations that have a negligible decrease in similarity.

Tool       | Individual Transformation Rankings
JPlag      | tRM tRS tFW tSD — tEA tRI tEU tSV tSO tAD tRC tUD tMC tAC
Plaggie    | tRM tRS tSD tFW — tEU tRC tRI tSO tSV tEA tMC tUD tAD tAC
Sim        | tRM tSO tRS tSD tFW tEA — tEU tAD tRI tSV tRC tMC tAC tUD
Token Tile | tRM tRS tSO tSD tFW — tAD tEU tEA tSV tUD tAC tMC tRI tRC
Token ED   | tRM tRS tSO tFW — tSD tEA tAD tSV tEU tRI tMC tUD tRC tAC
Tree ED    | tRM tRS tSD tSO tFW — tEA tSV tEU tAD tRC tMC tRI tUD tAC
Graph ED   | tRI tSD tFW tRS — tSV tEA tSO tEU tUD tAD tRC tAC tRM tMC

Table 4 presents the rankings of each transformation by greatest impact upon each non-string-based SCPDT's similarity scores at the 100% transformation chance. The horizontal bar (i.e. '—') delimits transformations that incur a decrease in similarity (on the left) from those that incur a negligible decrease in similarity. A common trend of the scores on the right of this delimiter is that they fall at approximately 96% similarity. All scores to the left fall below this point, and in some cases by a large margin. Initially it would be expected that transformations with no impact would evaluate to a 100% similarity score; however, due to the operation of each tool and potential noise in the data sets, transformations with no impact typically have an average similarity of approximately 96% over the generated test data set.

There is no consistent ordering of transformations that the tools are vulnerable to (listed left of the bar delimiter). Hence, it can be implied that each tool is more robust against certain transformations. However, for the token and tree-based tools, the 5 most common transformations ranked first are:

– tRS (Reorder statements (within methods))
– tRM (Reorder class member declarations)
– tSO (Swap expression operands)
– tFW (Swap for statement to while statement)
– tSD (Split variable declaration and initial assignment)

The decrease for these 5 transformations for the token and tree-based tools can be attributed to fine-grained re-orderings of token sequences. The operation of the token-based tools varies in terms of how they match token sequences. However, in general, to match token sequences, a minimum number of tokens must be matched. If a source file becomes segmented with mis-matched token sequences smaller than this minimum number, it can stop these tools from accurately identifying common token sequences. For example, the Token Tile tool specifies a minimum match length of 12 tokens. For any modifications less than 12 tokens apart, the tool will no longer be able to match the affected sub-sequence between these modifications. Hence, the small re-orderings of tokens applied in these 5 transformations can have a cumulatively large impact on token-based tools.

The Tree ED tool sees a similar decrease in similarity for these 5 transformations, which is explained by a similar cause to the token-based tools' vulnerability. The fine-grained reordering of token sequences is analogous to fine-grained restructuring of an AST. When making large changes to the structure of the tree, the edit distance of the tree also becomes larger, resulting in a lower evaluated similarity.
However, the tree edit distance algorithm is comparatively better at handling the finer-grained statement reordering of tRS, while it is also more vulnerable to the larger changes introduced with tRM when compared against the token-based tools. This greater vulnerability is affected by the naive implementation of the approach. While in an optimal implementation of an AST edit distance tool this may not be an issue, the utilised implementation uses a greedy edit distance algorithm. Hence it is subject to errors in the calculation of the optimal edit distance, and there are corner cases where a disproportionate decrease in similarity can be experienced.

The Graph ED tool does not suffer from a large decrease in similarity from all 5 transformations. This is because it focuses on the semantics of the source code, and not the structure. However, it does suffer from a large decrease in similarity when evaluated against tRI (rename identifiers), tFW and tSD; and to a lesser degree tRS. In the case of tRI, this is an unexpected result. A PDG-based tool should not see a large decrease from identifier renaming, as it does not change the semantics of the program. Likewise, tFW and tSD should not impact upon similarity as these are semantics-preserving operations. Upon investigation, this decrease in similarity is caused by the implemented edit distance algorithm. Being a greedy implementation, it does suffer from a decrease in accuracy. Furthermore, it compares graph nodes by the type of statement they contain. It will therefore see a reduction in similarity caused by modifications that transform statements. In the case of tRS, the implementation of this transformation can modify the semantics of the source code unintentionally. tRS is implemented as a literal shuffling of the statements in code. Hence, it can change the control dependencies in the constructed PDG. Due to these factors, stemming from its naive implementation, the Graph ED tool does suffer from a decrease in similarity.

Other notable modifications that are ranked to the left of the bar (or on the border) include tEA (Expand compound assignment) and tEU (Expand unary expression). However, they do not consistently appear to the left of the bar delimiter for all tools. Hence, they are omitted from this list to focus on the transformations with the greatest average impact. It should be noted, however, that these transformations apply similar fine-grained reordering of the source code token sequences. Hence, they should also be considered a potential vulnerability of the token and tree-based tools, as they share similar characteristics.

In all cases, this is a clear result. The most effective method of impacting upon a tool is to change the representation of the program upon which the tool measures similarity. However, it does emphasise that these tools are profoundly vulnerable to such simple transformations. The five identified transformations are not technically complex to implement, and in many cases can be automated by source code editors. Furthermore, when applying all of these transformations in unison, it is conceivable that there would be a profound impact upon the token and tree-based tools, and potentially even the graph-based tool.

Overall, the results of this experiment are positive. Firstly, it has reinforced that string-based tools are not robust to any source code transformations.
Secondly, it has reinforced that token and tree-based tools are more robust to certain source code transformations, except for those which apply fine-grained modifications to the structure of the source code. This is the most significant impact, as it shows that while token-based tools are more robust than string-based tools, they can still be fooled with simple re-orderings of source code. While the impact of the transformations varies across tools, not being robust to fine-grained reordering of source code is a significant deficiency, as this is a common method students use to hide plagiarism. Finally, this experiment has shown that the graph-based tool is more robust to certain source code transformations. However, as discussed, the implementation is not optimal and suffers from a decrease in similarity. This is a major deficiency of the approach; however, this can be attributed to a naive implementation of a PDG-based tool, and as such these results cannot be claimed to be representative of all other PDG-based approaches.

A large impact on the evaluated similarity was found when applying the five individual transformations: tRS, tRM, tSO, tFW, tSD. This subsection extends the isolated transformation experiment to apply these 5 specific transformations in unison to a generated test data set. This is to compare the robustness of the tools to the application of these five transformations, which each apply fine-grained sub-sequence reordering. A second test data set is generated containing 17,250 variants, broken down into 5 variants for each base program (with the 5 transformations applied), generated with each of the 6 transformation chances.

Fig. 6 presents the average similarity of the generated variants as the chance of transformation increases. This figure reinforces the results from Fig. 5 in that these five transformations have a significant impact upon the evaluation of source code similarity.
Fig. 6
Average similarity of variants generated with the 5 identified source code transformations, evaluated using all 11 SCPDTs, at each transformation chance. Bars indicate the range of similarity scores. Red marks indicate the range of standard deviation around the average.
Initially, there is a small decrease in average similarity for the token, tree and graph-based tools. This decrease becomes progressively larger for the token-based tools as the chance of transformation increases. This can be explained by the token sequences of the source files containing more and more fine-grained variations compared to the base program. As a result of this, the token-based tools can no longer match as many token sub-sequences and therefore report lower similarity scores. The tree-based tool also incurs a consistent decrease in similarity, and provides similar scores. However, the graph-based tool does not suffer as much from these transformations, showing the lowest decrease in similarity at all chances of transformation. This is because the dependencies between the statements in source code are not modified to the same degree as the token sequences. Notably, the range of these scores is quite large. However, the standard deviation is generally tight around the average at the lower transformation chances. While it does slowly increase as the chance of transformation increases, this is seen consistently across most tools. The consistent standard deviation implies that the transformations are having consistent impacts upon the evaluation of similarity by the tools.
Table 5
Average number of applied transformations to each program variant created with the 5 identified source code transformations.

Transformation Chance       | 10%   | 20%   | 40%   | 60%   | 80%   | 100%
Avg. No. of Transformations | 12.79 | 23.43 | 41.37 | 58.09 | 74.04 | 89.33
Fig. 7
Average RCT scores for variants generated with the 5 identified transformations.
Table 5 presents the average number of transformations applied to each variant at each transformation chance. This represents the average number of AST nodes of interest that are transformed within each variant by all 5 transformation filters. The average number of transformations is used to calculate the RCT score for each tool at each chance of transformation. The RCT scores are compared in
Fig. 7. Higher values indicate a greater robustness to transformation. Comparing the performance of these tools, it is clear that the Graph ED tool is more robust with a large number of applied transformations. This is due to the transformations having a lesser impact upon the representation of the programs through PDGs. However, when considering the scores at the 100% transformation chance, the average similarity is only approximately 10 points higher than JPlag and the naive Token Tile tool. Furthermore, at the lower chances of transformation, JPlag out-performed the naive PDG-based tool. As the lower transformation chances are more representative of common undergraduate plagiarism-hiding modifications, JPlag can be seen to be more robust in common cases, while in pervasive cases the Graph ED tool appears superior.

Overall, the results of this experiment imply that the Graph ED tool is most robust to the applied transformations. At the 100% transformation chance, it has the highest RCT score, along with the highest average similarity of variants by approximately 10%. Hence, the structural string, token and tree-based tools appear to be most vulnerable to such structural transformations, compared to the semantics-based tool, whose decrease in similarity is largely due to the previously mentioned deficiencies of its naive implementation. Subsequently, transformations that apply fine-grained modifications to the structure of source code can have a cumulative impact on the effectiveness of the structural tools. Hence, from this it can also be implied that tools which measure the structural similarity of source code are not robust to pervasive applications of source code transformations. This is an interesting point, as many plagiarism-hiding modifications are structural in nature (Faidhi and Robinson 1987; Joy and Luck 1999; Jones 2001; Mozgovoy 2006; Freire et al. 2007; Allyson et al. 2019), with all known available SCPDTs measuring structural similarity. Hence, evaluating similarity through semantics and not source code structure shows potential for robustness to pervasively applied source code transformations, on the assumption that they are semantics-preserving.

5.2 Random Transformations

The purpose of this experiment is to compare the robustness of the SCPDTs to random selections of source code transformations. The use of random selections of transformations is arguably more realistic in the generation of test data. A plagiarising student will not restrict themselves to using only 1 transformation; instead, they would most likely use a diverse selection of transformations. Hence, to add a degree of realism, random selections of transformations are utilised in this experiment.

The test data set for this experiment is generated in a similar manner to the previous 2 experimental cases. This is with 5 variants generated per base program, for each of the 6 transformation chances, but with a random selection of transformations applied per base program. The selection of applied transformations is restricted in that between 2 and 13 transformations must be selected, and they must not be the same 5 as identified in the previous section. This allows for 17,250 variants to be generated. Using random selections of transformations has the added benefit of gaining partial coverage over the total number of combinations of source code transformations provided by SimPlag.
However, in the valuating the robustness of source code plagiarism detection tools 35 A v g S i m il a i r t y ( % ) JPlag 2 4 6 8 10708090100 Repetition A v g S i m il a i r t y ( % ) Sim
10% 20% 40% 60% 80% 100%
Fig. 8
Average similarity of variants generated using random selections of transformations at each transformation chance, over 10 repetitions, for the SCPDTs JPlag and Sim.

However, in the best-case scenario, the maximum number of selections of transformations used in this experiment is only 575, with one unique selection for each base program. This does not afford producing a statistically significant sample of results, as a specific selection of transformations may have a disproportionate effect on certain base programs.

In order to gain a statistical confidence of 95%, this experiment would need to be repeated a total of 385 times, with the results then aggregated. This is infeasible for the equipment used in this experiment, as each set of 17,250 variants requires approximately 12 hours of processing; with 385 repetitions this would require approximately 7 months of processing time. Hence, as an alternative, the experiment is repeated 10 times, generating 10 test data sets and allowing for at most 5,750 selections of transformations to be analysed. This provides a much more comprehensive analysis of results, and will at least allow for the demonstration of a localised trend within the 10 evaluated repetitions of this experiment.

Fig. 8 presents the average similarity of the generated variants over the 10 repetitions using JPlag and Sim. The results for these tools exemplify the results of all 11 SCPDTs. All tools demonstrate a consistent trend, with the average similarities of the variants generated at the same transformation chance being within a tight range with very little variance. Hence, while the 10 sets of variants generated from each base program are constructed with different selections of transformations over the 10 repetitions, on average the transformations have a consistent impact over the entire data set, irrespective of the base program they are applied to. Furthermore, an interesting observation over the 10 repetitions is that the standard deviations measured for the SCPDTs' scores are remarkably consistent. The maximum of all measured standard deviations is at most 0.46. Hence, the distribution of scores measured by each SCPDT on each repetition remains similar, overall indicating a consistent spread of scores. These observations would imply a homogeneous base data set, where the applied transformations have on average a relatively consistent impact upon the generated variants.

Fig. 9 presents the average similarity for all 10 repetitions of the generated variants as the chance of transformation increases.
Fig. 9
Average similarity of variants generated with random source code transformations, evaluated using all 11 SCPDTs. Bars indicate the range of similarity scores. Red marks indicate the range of standard deviation around the average.

Overall, there is not a profound decrease in similarity as the chance of transformation increases (for the non-string-based tools). This is easily explainable as most of the selections of transformations do not have a large impact on the evaluation of similarity. However, there is still a large range of scores, indicating that while the majority of transformation selections do not have a large impact on scores, there are certain selections that the tools are vulnerable to.

Fig. 9 also demonstrates that as the chance of transformation increases, the average similarity decreases. This is an expected result, but it does imply the average similarity decreases proportionally to the increase in transformations. Furthermore, it again reinforces that the string-based tools are not robust to transformation, as all string-based tools at all chances of transformation have relatively poor results. Conversely, the token, tree and graph-based tools do show higher robustness; however, they do suffer from a large increase in standard deviation, indicating a much larger distribution of scores as the chance of transformation increases. This again implies that all string, token and tree-based tools are vulnerable to specific types of transformations.
Fig. 10
Average RCT scores for variants generated with randomly selected transformations.
Table 6
Average number of applied transformations to each program variant with randomly selected transformations.

Transformation Chance       | 10%   | 20%   | 40%   | 60%   | 80%   | 100%
Avg. No. of Transformations | 10.67 | 20.28 | 36.65 | 54.35 | 73.77 | 93.29
Table 6 presents the average number of transformations applied to each program variant at each chance of transformation. This is used to calculate the RCT score for each tool at each chance of transformation, which is compared in Fig. 10. These results show similar rankings to those of the previous experiment, however with one notable difference. The Graph ED tool performs considerably worse compared to the other tools at all chances of transformation. It no longer has a greater robustness at the high chances of transformation, instead performing approximately on par with the token-based tools. In this case, JPlag and the Token ED tool show a higher robustness to transformation.
Table 7
Top 3 selections of transformations each SCPDT is most vulnerable to, identified from the random selections of applied source code transformations. Bold ids are from the 5 identified transformations in Section 5.1.1.

Tool        | Top 3 Selections of Transformations
JPlag       | 1) tRC tMC tRI tRM tSO tFW tAD tSD  2) tAC tRC tRI tRS tSO tEA tSD  3) tMC tRS tFW tSV
Plaggie     | 1) tAC tRS tRM tFW tEU  2) tAC tRC tMC tRS tRM tSO tFW tEA tSD  3) tAC tMC tRI tRM tSO tUD tFW tEA tSV tAD tSD
Sim         | 1) tRC tRM tSO tAD tSD  2) tAC tRC tMC tRS tRM tSO tFW tEA tSD  3) tRC tRM tSO tAD tSD
Sherlock-W  | 1) tAC tMC tRI tRS tRM tSO tUD tFW tEU  2) tRC tRI tRS tRM tSO tFW tEA tSV tAD  3) tAC tRC tMC tRS tRM tSO tUD tEU tSV tSD
Sherlock-S  | 1) tRC tRI tRS tRM tUD tEU tSV tSD  2) tMC tRI tSO tUD tEU tSV tAD  3) tMC tRI tEA tEU tSD
String Tile | 1) tAC tRI tRS tRM tSO tUD tFW tEA tEU tSV  2) tRC tRI tRS tRM tSO tUD tAD  3) tMC tRI tRM tSO tUD tEA tEU tSV tSD
String ED   | 1) tAC tRC tMC tRM tFW tSV tSD  2) tAC tRC tMC tRI tAD  3) tMC tRI tRS tRM tUD tEA tEU tSV tSD
Token Tile  | 1) tMC tRI tRS tRM tUD tEU tSV  2) tAC tRC tMC tRM tSO tUD tEU tSV tAD tSD  3) tRS tRM tSO tEU tSV
Token ED    | 1) tAC tRC tMC tRS tRM tUD tEA tAD tAD tSD  2) tRC tRM tSO tAD tSD  3) tAC tRC tMC tRS tRM tSO tFW tEA tSD
Tree ED     | 1) tAC tRS tRM tFW tEU  2) tMC tRS tRM tSV tAD  3) tRS tRM tSO tEU tSV
Graph ED    | 1) tAC tRC tMC tRI tRS tRM tSO tFW tEA tSD  2) tRC tRS tRM tFW tSV tSD  3) tAC tRC tMC tRI tRS tRM tUD tEA tAD tSD
In order to gain a better understanding of which selections are most effective at reducing similarity, Table 7 ranks the top three selections of transformations for each tool that incur the greatest decreases in similarity. There is a large degree of overlap between the top three selections. Furthermore, there is a correlation between these selections and the transformations identified from Fig. 5 (i.e. tRS, tRM, tSO, tFW, tSD). These selections always include at least one transformation from the five identified in Section 5.1. This again strongly implies that selections of these five transformations have a considerable impact upon the tools, and hence the
SCPDTs are not robust to such transformations. Table 7 also aids in explaining the comparatively poorer results of the Graph ED tool: tRI and tSD are applied in two out of three cases, causing the Graph ED tool to report a lower similarity. Amongst most of the applied selections, comment-changing transformations are also included. Realistically, these transformations only affect the string-based tools, as all others simply remove the commenting. Their inclusion in this table is simply due to random chance, and should not be considered to be a contributing factor to being ranked in the top 3 selections for non-string tools.

From these results where randomised transformations are applied, JPlag and the token-based tools in general are the most robust to transformation. However, it also shows that when fine-grained modifications to token sequences are applied, the tools become less robust to transformation. Nevertheless, a high level of similarity is reported for all non-string-based tools, implying such tools are suitable for detecting real cases of source code plagiarism with plagiarism-hiding source code transformations.

5.3 All Transformations

The purpose of this experiment is to identify the impact of all source code transformations being applied in unison to the generated variants. This is to simulate an extreme case of plagiarism, where a plagiariser applies a diverse range of source code transformations pervasively to hide plagiarism. The generated data set for this experiment contains 17,250 variant programs. This is broken down into 5 variants for each of the 575 base programs, generated with the 14 transformations at each of the 6 transformation chances.

Fig. 11 presents the average similarity of variants evaluated with each SCPDT. In addition to the average similarity, the range and standard deviation of scores are also presented to indicate the distribution of variant similarities and where the similarity of most variants lies. Initially, at the 10% transformation chance, the average similarity is high for all non-string-based tools. The similarity then progressively decreases up until the 100% transformation chance, where all averages lie at or below the 60% similarity mark. Furthermore, the standard deviations do increase slightly as the chance of transformation increases. This indicates that the distribution of scores is increasing with the chance of transformation.

Most interestingly, all tools are severely impacted at the 80% and 100% transformation chances. For example, at the 80% transformation chance, JPlag drops approximately 30% in its measured average similarity. Furthermore, at the 60% chance, there is a noticeable decrease in similarity, with all but two tools dropping below 70% similarity. In such cases, it is conceivable that programs plagiarised to this severity could go unnoticed. Another interesting observation is that the general pattern of the average scores and standard deviations remains relatively consistent between all chances of transformation. This again implies that the number of transformations applied to the source code has a proportional impact on the evaluated similarity, and that the applied transformations have consistent impacts across all tools.

Table 8 identifies the average number of transformations applied to the variants as the chance of transformation increases. This is used to calculate the RCT score for each tool at each chance of transformation, which is compared in Fig. 12.
Fig. 11
Average similarity of variants generated with all 14 source code transformations, evaluated using all 11 SCPDTs. Bars indicate the range of similarity scores. Red marks indicate the range of standard deviation around the average.
Table 8
Average number of applied transformations to each program variant created with all transformations.

Transformation Chance       | 10%   | 20%   | 40%   | 60%    | 80%   | 100%
Avg. No. of Transformations | 21.63 | 41.76 | 80.03 | 120.71 | 165.4 | 212.3
[Figure: RCT score (0-5) versus transformation chance (%) for JPlag, Plaggie, Sim, Sherlock-W, Sherlock-S, String ED, String Tile, Token ED, Token Tile, Tree ED and Graph ED.]
Fig. 12
Average RCT scores for variants generated with all transformations.

From this result, JPlag is initially shown to be the most robust tool; however, at the 100% transformation chance the Token ED tool shows slightly greater robustness. This is interesting, as a naive tool with a similar approach (differing in the algorithm used to compare similarity) has out-performed a mature academic tool in the most pervasive case of simulated plagiarism.

The results of this experiment indicate that JPlag and the Token ED tool are the most robust to the pervasive application of all transformations. While their evaluated similarities have the potential to not raise suspicion and hence evade detection at high chances of transformation, they show the least decrease in average evaluated similarity. Interestingly, while both of these tools are token-based, they utilise different methods to measure similarity. JPlag uses greedy token tiling, which is approximated in the Token Tile tool. However, the Token Tile tool shows much poorer performance than JPlag. This can be attributed to the naive implementation of the tool, which approximates greedy token tiling by repetitively identifying the longest common sub-string. While JPlag implements a tolerance for mis-matched token sub-strings, the naive implementation does not. Hence, it is much less robust than JPlag. The Graph ED tool ranks a close third at the 100% transformation chance. It initially demonstrated a relatively lower RCT score at lower chances of transformation; however, as the number of transformations increases, it shows higher robustness to transformation. This implies that the PDG-based tool is more robust in the more pervasive cases of applied plagiarism-hiding transformations.

Overall, this is a positive result indicating that token-based approaches have good robustness to transformation. However, it also places a large emphasis on tools implementing a tolerance to error in mis-matched portions of a program in order to gain greater robustness to transformation. The Token ED tool can accommodate small errors simply through the deletion or swapping of tokens, whereas the Token Tile tool implements no such feature and hence suffers from a poor evaluated similarity.
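To make the role of mismatch tolerance concrete, the following sketch contrasts an edit-distance measure over token sequences with a greedy tiling measure that enforces a minimum match length and has no mismatch tolerance. It is illustrative only: it is not the implementation of JPlag, Token ED or Token Tile, and the token sequences, minimum match length and class name are invented for the example.

import java.util.List;

// Illustrative sketch (not the implementation of any evaluated tool) contrasting two
// naive token comparisons: an edit-distance measure tolerates isolated token mismatches,
// whereas greedy tiling without mismatch tolerance discards any matched run shorter than
// the minimum match length, so a single changed token can fragment and lose a match.
public class TokenSimilaritySketch {

    /** Similarity from Levenshtein distance over token sequences, normalised to [0,1]. */
    static double editDistanceSimilarity(List<String> a, List<String> b) {
        int[][] d = new int[a.size() + 1][b.size() + 1];
        for (int i = 0; i <= a.size(); i++) d[i][0] = i;
        for (int j = 0; j <= b.size(); j++) d[0][j] = j;
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                int sub = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + sub);
            }
        }
        return 1.0 - (double) d[a.size()][b.size()] / Math.max(a.size(), b.size());
    }

    /** Greedy tiling: repeatedly take the longest unmarked common run of at least minMatchLength. */
    static double greedyTilingSimilarity(List<String> a, List<String> b, int minMatchLength) {
        boolean[] usedA = new boolean[a.size()];
        boolean[] usedB = new boolean[b.size()];
        int covered = 0;
        while (true) {
            int bestLen = 0, bestI = -1, bestJ = -1;
            for (int i = 0; i < a.size(); i++) {
                for (int j = 0; j < b.size(); j++) {
                    int len = 0;
                    while (i + len < a.size() && j + len < b.size()
                            && !usedA[i + len] && !usedB[j + len]
                            && a.get(i + len).equals(b.get(j + len))) {
                        len++;
                    }
                    if (len > bestLen) { bestLen = len; bestI = i; bestJ = j; }
                }
            }
            if (bestLen < minMatchLength) break; // runs below the threshold are ignored
            for (int k = 0; k < bestLen; k++) { usedA[bestI + k] = true; usedB[bestJ + k] = true; }
            covered += bestLen;
        }
        return (2.0 * covered) / (a.size() + b.size());
    }

    public static void main(String[] args) {
        // A single swapped operator token in an otherwise identical token sequence.
        List<String> original = List.of("ID", "=", "ID", "+", "ID", ";", "RETURN", "ID", ";");
        List<String> modified = List.of("ID", "=", "ID", "*", "ID", ";", "RETURN", "ID", ";");
        System.out.println(editDistanceSimilarity(original, modified));    // ~0.89
        System.out.println(greedyTilingSimilarity(original, modified, 4)); // ~0.56
    }
}

With a minimum match length of 4, the single swapped token fragments one of the matched runs below the threshold and drops the tiling score to roughly 0.56, while the edit-distance score stays near 0.89, mirroring the tolerance argument above.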
6 Evaluation 2: Source Code Injection

This evaluation compares the robustness of SCPDTs to source code injection. Code injection poses an issue in the detection of plagiarism because, by design, most SCPDTs present program similarity as the aggregation of similarity scores from all source code files. Aggregating the scores has the potential to hide the fact that code has been injected. Furthermore, code injection increases the size of the program into which code is injected; as a result, it adds code that cannot be matched against the source of the plagiarised work.

Two experiments are performed to evaluate the robustness of tools. The first experiment evaluates the robustness of each SCPDT to data sets generated by injecting one type of source code fragment into each variant (i.e. files, classes, methods and statements, individually). The second experiment evaluates SCPDT robustness to a data set generated by injecting all source code fragments in unison. The source code fragments are injected with injection chances of 10%, 20%, 40%, 60%, 80% and 100%, as in the previous evaluation. However, SimPlag also exposes four injection-specific configuration parameters to limit the quantity of source code fragments injected. Without specifying such limits, SimPlag can generate variants that are substantially larger in size than the original base programs. For the purpose of this experiment, the quantities of injected source code fragments are limited relative to the average composition of the base data set (see Table 3). This means the quantity of injected fragments is limited to at most:
– 12 statements into each method (but no more than the method's original statement count),
– 1 class into each file, and
– 4 files into each program,

with a corresponding limit on the number of methods injected into each class. These limits result in differing quantities of each type of source code fragment being injected, and will be discussed as appropriate. For this evaluation, a data set of assignment submissions from unrelated programming courses is used as the seed for injected source code fragments. This avoids any unintentional cases where a variant is injected with source code originating from its base program, or from an otherwise similar program.

When using SimPlag to generate data sets for evaluating the effect of source code injection, there is one notable problem that may cause unreliability in the evaluated results. SimPlag applies a consistent formatting to all generated variants to make them appear more realistic, and hence applies by default a simple (but pervasive) source code transformation. This has no substantial effect on the non-string-based tools, as they are immune to such a transformation. However, it has the potential to impact upon the results of all string-based tools, as formatting is not ignored by such SCPDTs. Hence, in order to minimise the effect of the consistent formatting applied by SimPlag, the base programs are normalised with the same consistent formatting applied. This improves the overall reliability of the results.
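A minimal sketch of such a normalisation step is shown below. SimPlag applies its own formatter; JavaParser's default pretty-printer is used here purely as a stand-in, and the class name and example source are invented.

import com.github.javaparser.StaticJavaParser;

// Minimal sketch of formatting normalisation: parsing a source file and re-emitting it
// with one pretty-printer gives the base and variant programs an identical layout, so
// string-based tools are not penalised for formatting differences alone.
public class FormatNormaliser {
    static String normalise(String javaSource) {
        // Parsing discards the original layout; toString() re-prints with consistent formatting.
        return StaticJavaParser.parse(javaSource).toString();
    }

    public static void main(String[] args) {
        String messy = "class A{int f(int x){if(x>0){x++;}return x;}}";
        System.out.println(normalise(messy));
    }
}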
6.1 Individual Fragment Types Injected

This experiment will compare the impact of injecting each of the 4 types of source code fragments upon each SCPDT in isolation. The generated data set for this experiment contains 69,000 variant programs. This is broken down into 5 variants for each of the 575 base programs, generated with the 4 types of source code fragment injected individually, at each of the 6 injection chances.

Table 9  Average logical lines of code injected and increase in size (%) for each type of injected source code fragment at each chance of injection.

Injection Chance     10%       20%       40%       60%       80%       100%
File                 38.22     71.27    115.38    187.19    282.88    322.85
                     11.37%    21.21%    34.34%    55.71%    84.19%    96.08%
Class               222.01    242.71    312.69    437.72    598.56    793.96
                     66.07%    72.23%    93.05%   130.26%   178.13%   236.28%
Method               75.87    107.25    173.07    265.28    360.55    471.87
                     22.58%    31.92%    51.50%    78.95%   107.30%   140.42%
Statement            35.86     52.64     83.35    117.09    151.54    185.60
                     10.67%    15.67%    24.80%    34.85%    45.10%    55.23%
Table 9 identifies the average lines of code injected into each variant at each chance of injection. At each chance of injection, the quantity of LLOC injected varies significantly. The file and statement data sets have similar quantities of LLOC injected, but the class and method data sets have comparatively much larger quantities of source code injected. This is a result of the utilised configuration values, and of how fragment injection is implemented in SimPlag. SimPlag uses a shared java.util.Random object (which produces a uniform distribution of random values) to determine if a fragment of source code should be injected into a variant. This is determined at the nodes of interest, which each occur a different number of times in a single variant. For example, the file fragment injector has 1 'node' of interest in each variant (the variant itself), while the class injector has on average 4.44 nodes of interest in each variant. Using the selected configuration values, SimPlag has a single weighted chance to inject up to 4 files into each variant, while it has on average 4.44 weighted chances to inject 1 class into a file. This slight difference in semantics has a noticeable impact upon the generated variants, with approximately 2-2.5 times the quantity of LLOC being injected for the class and method data sets. However, this does not affect the results of this experiment. The RCI score can only be used to compare SCPDTs on the same data set, and not across different data sets. Hence, the generated test data sets allow for comparison of the SCPDTs against the injection of each of the 4 types of source code fragment in isolation.
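The per-node weighted chance described above can be illustrated with a short sketch; it is not SimPlag's actual code, and the class name, method name and node counts below are invented for the illustration.

import java.util.Random;

// Illustrative sketch of the per-node injection semantics described above: each
// 'node of interest' receives an independent weighted chance of injection, so fragment
// types with more nodes of interest accumulate more injected code on average.
public class InjectionChanceSketch {

    private static final Random RANDOM = new Random(); // shared, uniform random source

    /** One weighted roll: inject a fragment at this node of interest? */
    static boolean shouldInject(double injectionChance) {
        return RANDOM.nextDouble() < injectionChance;
    }

    public static void main(String[] args) {
        double chance = 0.40; // 40% injection chance

        // File injection has a single node of interest: the variant itself.
        System.out.println("inject files into variant? " + shouldInject(chance));

        // Class injection is rolled once per source file; with roughly 4.44 files per
        // variant on average, class (and method) injection accumulates far more LLOC.
        int filesInThisVariant = 4;
        for (int file = 0; file < filesInThisVariant; file++) {
            System.out.println("inject a class into file " + file + "? " + shouldInject(chance));
        }
    }
}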
[Figure: heat maps, one per injection chance (10% to 100%), showing the average variant similarity (0% to 100%) for each detection tool (JPlag, Plaggie, Sim, Sherlock-W, Sherlock-S, String Tile, String ED, Token Tile, Token ED, Tree ED, Graph ED) against each injected fragment type (File, Class, Method, Statement).]
Fig. 13
Heat maps representing the average similarity of variants generated with each individual type of source code fragment injected. Darker colours indicate higher similarity scores, and hence higher robustness to injection.
Fig. 13 presents the average similarity of the variants created with each of the four individual source code injection operations. Initially, this appears similar to the heat maps in Fig. 5. The similarity at the 10% injection chance is typically high (greater than 90%) for all non-string-based tools. However, it then proceeds to drop as the injection chance increases. Comparing Figs. 5 and 13, the decrease in similarity is much more profound in the presence of source code injection than under the application of source code transformations. This is expected, given that new code which cannot be matched to the base program is introduced into the variants, and given the relative quantity of injected LLOC.

The string-based tools again produce consistently poor results. They are slightly improved compared to the results in Fig. 5, yet still produce considerably low similarity scores. The non-string-based tools initially evaluate high similarity scores for all types of injected source code fragments. However, this quickly begins to drop as the injection chance increases, showing a consistent progression of decreasing similarity. Notably, the similarity of the variants generated with class and method fragment injection drops much faster than with file and statement injection. In interpreting this result, the greater quantity of LLOC injected for classes and methods, compared to files and statements, must be considered; with this in mind, the result is to be expected.

For the file injection results, it is interesting to observe that the non-string-based naive SCPDTs demonstrate a higher average similarity than the academic SCPDTs. Comparing the token-based tools, Sim has the greatest decrease in file similarity. This is caused by Sim not providing a submission-wise similarity score (only file-wise). To derive an aggregated submission score, the best similarities for all files are averaged during comparison; hence, in this case no files are ignored. All naive tools use a similar method of score aggregation, scoring similarity as the intersection of the two programs, i.e. the coverage of matching sections divided by the total size of the compared programs. However, in this case the naive SCPDTs are capable of evaluating a higher similarity than the other SCPDTs. Against all other fragment injections, the non-naive SCPDTs demonstrate consistently higher similarity scores. In particular, Plaggie and Sim demonstrate the highest average similarity against class and method injection, while JPlag and Sim evaluate the highest average similarities against statement injection.

In order to rank the tools, the average RCI is calculated for all variants created with each mode of source code injection, at each chance of injection. These scores are compared in Fig. 14. The results vary between the tools for different modes of injection. The string-based tools (excluding Sherlock-Warwick) are consistently ranked lower, implying string-based tools are not robust to any form of source code injection. The string-based tools are consistently followed by Sherlock-Warwick, which combines token and string-based methods. Due to this combination of methods, Sherlock-Warwick shows greater robustness than the string-only tools, yet it still performs worse than the dedicated token-based tools; the token, tree and graph-based tools consistently demonstrate greater robustness.
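The two styles of score aggregation mentioned above can be sketched as follows. The sketch is illustrative only; neither function reproduces the exact aggregation of Sim, Plaggie or the naive tools, and the file counts and sizes in the example are invented.

// Illustrative sketch of two ways a submission-wise score can be aggregated from
// file-wise comparisons: averaging each file's best pairwise similarity, and
// coverage-style scoring where matched code is divided by the total compared size.
public class AggregationSketch {

    /** Average, over every file of both submissions, of that file's best pairwise similarity. */
    static double bestPairAverage(double[][] sim /* sim[i][j]: file i of A vs file j of B */) {
        int filesA = sim.length, filesB = sim[0].length;
        double sum = 0;
        for (int i = 0; i < filesA; i++) {
            double best = 0;
            for (int j = 0; j < filesB; j++) best = Math.max(best, sim[i][j]);
            sum += best;
        }
        for (int j = 0; j < filesB; j++) {
            double best = 0;
            for (int i = 0; i < filesA; i++) best = Math.max(best, sim[i][j]);
            sum += best;
        }
        return sum / (filesA + filesB); // injected files that match nothing pull the average down
    }

    /** Coverage-style score: matched code divided by the total size of both programs. */
    static double coverage(int matchedLloc, int sizeA, int sizeB) {
        return (2.0 * matchedLloc) / (sizeA + sizeB);
    }

    public static void main(String[] args) {
        // Submission A: four files of 100 LLOC. Submission B: the same four files plus
        // one injected 300 LLOC file that matches nothing in A.
        double[][] sim = {
            {1.0, 0.0, 0.0, 0.0, 0.0},
            {0.0, 1.0, 0.0, 0.0, 0.0},
            {0.0, 0.0, 1.0, 0.0, 0.0},
            {0.0, 0.0, 0.0, 1.0, 0.0},
        };
        System.out.println(bestPairAverage(sim));    // ~0.89: size of the injected file is ignored
        System.out.println(coverage(400, 400, 700)); // ~0.73: injected code inflates the denominator
    }
}

The averaging strategy is insensitive to the size of an injected file (each file contributes equally to the average), whereas the coverage strategy counts every injected line in the denominator; which behaviour is preferable depends on whether injected code should be allowed to dilute the score.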
Across file, class and method injection, most tools show a slight trend of increasing robustness as the chance of injection increases. This does not imply the tools are less susceptible to greater quantities of injected source code, simply that the rate of similarity change is decreasing. However, this is not the case for statement injection. Comparatively, far fewer statements need to be injected to reduce the RCI score.
[Figure: four panels (File Injection, Class Injection, Method Injection, Statement Injection), each plotting the RCI score against the injection chance (%) for JPlag, Plaggie, Sim, Sherlock-W, Sherlock-S, String ED, String Tile, Token ED, Token Tile, Tree ED and Graph ED.]
Fig. 14
Average RCI for each type of injected source code fragment at each chance of injection. Larger values indicate a tool is more robust to source code injection.

As a result, statement injection has the greatest impact on similarity, and hence the tools demonstrate less robustness to it. This is related to the fine-grained transformations of token sequences identified in Evaluation 1. Injecting whole statements amongst the existing statements has the potential to interrupt these sequences with new, unrelated statements. Hence, there are cases where statement injection lowers the overall score both by increasing the size of the variant programs and by prohibiting the tools from matching existing code. This results in lower similarity and RCI scores.

From these results, it can be generalised that when each mode of source code injection is applied in isolation:

– The naive Graph ED and Token ED tools are most robust to file injection, while Plaggie is the most robust non-naive SCPDT;
– Plaggie and Sim show the greatest robustness to class and method injection; and
– JPlag and Sim are most robust to statement injection.
6.2 All Fragment Types Injected

Table 10  Average logical lines of code injected into each variant generated with all types of injected source code fragments at each injection chance.

Injection Chance       10%      20%      40%      60%       80%        100%
Avg. LLOC Injected   189.50   361.96   643.80   969.95   1,249.63   1,631.64
LLOC Increase (%)     56.39   107.72   191.59   288.65     385.27     485.56
Table 10 presents the average LLOC injected into the generated variants. This table shows a consistent increase in lines injected, averaging approximately 170 additional lines in each variant per 10% increase in injection chance. It should be noted that in this case, the number of LLOC injected into the variants is not expected to be representative of simple undergraduate plagiarism. The extreme cases are more representative of students collaborating, or cases where fragments of source code have been appropriated and integrated into a student's own work. However, this does serve to demonstrate the impact on similarity when large quantities of source code are injected.

Fig. 15 presents the average similarities of the generated data set for this experiment. Immediately, there is a noticeable decrease in similarity for all tools. This is much more profound than the results for the source code fragments injected in isolation. However, one tool consistently ranks with the highest average similarity: Plaggie, which is consistently at most approximately 10 percentage points higher than all other tools. Initially, the results of all non-string-based tools are consistently within a range of 10%. However, as the quantity of injected LLOC increases, this range of scores begins to widen. This implies that with large quantities of source code injected, certain tools begin to become less robust to injection. Notably, the String ED tool begins to perform on par with the token, tree and graph-based tools when large quantities of source code are injected. This does not imply that String ED is more robust in the presence of large quantities of injected source code, simply that all other tools are significantly impaired in this scenario.

The large decrease in similarity for all tools is explained by the significantly larger quantity of LLOC injected compared to the test data set in Section 6.1. More fragments of source code being injected implies less source code that can be matched between the variants and their base programs. Hence, the results from this experiment indicate that in extreme cases where large quantities of source code are injected into a program, the evaluated similarity drops significantly. With these results, the dramatic increase in the size of the variants required to reduce their similarity must also be considered. A 'smaller' quantity of injected code does not produce a dramatic decrease in similarity for the non-string-based tools.
[Figure: one panel per injection chance (10% to 100%), each plotting similarity (%) for JPlag, Plaggie, Sim, Sherlock-W, Sherlock-S, String Tile, String ED, Token Tile, Token ED, Tree ED and Graph ED.]
Fig. 15  Average similarity of variants generated with all 4 source code fragments injected, evaluated using all 11 SCPDTs. Bars indicate the range of similarity scores. Red marks indicate the range of the standard deviation around the average.

For example, JPlag requires 189 LLOC to be injected (equivalent to a 56% increase in LLOC in this data set) before demonstrating a decrease of approximately 20% in similarity. Even at the 20% injection chance, where an average of 362 LLOC are injected, the similarity scores typically show a decrease of approximately 30%. Such scores would most likely raise suspicion with a reviewer, and likewise would require considerable effort on the behalf of a plagiariser to commit. Hence, the tools are generally robust to source code injection within reasonable quantities of injected code. It is only with extremely large amounts of injected code that the tools begin to see a profound decrease in evaluated similarity scores.
[Figure: RCI score versus injection chance (%) for JPlag, Plaggie, Sim, Sherlock-W, Sherlock-S, String ED, String Tile, Token ED, Token Tile, Tree ED and Graph ED.]
Fig. 16
Average RCI scores for variants generated with all source code fragments injected.
In order to rank the tools for robustness against code injection, the average RCI score is evaluated for all tools at each chance of injection. These scores are compared in Fig. 16. All tools show a common trend of an increasing RCI as the quantity of injected LLOC increases. This indicates that smaller quantities of injected code fragments have a comparatively greater impact on the evaluation of similarity. This shows some consistency with the results of Section 6.1; however, the increase in robustness here is considerably larger. The results in Figs. 15 and 16 show a clear ranking of tools for this experiment. String-based tools consistently perform poorly and are typically ranked last, while the token, tree and graph-based tools show consistently higher robustness to injection. Across all chances of injection, the top 3 rankings of tools from this experiment are:

1. Plaggie
2. Token ED & Graph ED
3. JPlag & Tree ED.
7 Discussion

The evaluations performed in this article were designed to explore the robustness of SCPDTs to plagiarism-hiding modifications in order to answer the three research questions:
RQ1: What are the impacts of source code transformations on SCPDTs?
RQ2: What is the impact of source code injection on SCPDTs?
RQ3: What SCPDT is most robust to plagiarism-hiding modifications?
Each question is discussed and answered in the following sub-sections.

7.1 RQ1: Impact of Source Code Transformations

Evaluation 1 was designed to answer RQ1. This question was explored through 3 experiments (plus an extended case) that evaluated the 11 SCPDTs against different selections of source code transformations, allowing for a comparison of SCPDT robustness. The results of the 3 experiments demonstrated four consistent observations regarding the impacts of source code transformations upon the SCPDTs:

1. All string-based tools show poor robustness to all source code transformations, when compared to the non-string-based tools.
2. The token and tree-based tools were impacted little by cosmetic source code transformations.
3. The token and tree-based tools demonstrated vulnerability to fine-grained structural source code transformations.
4. The graph-based tool demonstrated greater robustness against fine-grained structural source code transformations than all other tools; however, it was vulnerable to transformations that were lexical or that modified the program semantics.

These results can be largely explained by considering the SCPDTs in terms of how they represent source code, and how this representation is used to measure source code similarity. The string, token and tree-based tools represent the structure of source code. The string-based tools represent structure as a literal character string, the token-based tools represent structure as sequences of lexical tokens, and the tree-based tool represents structure as an AST. These three representations of structure are then compared for similarity with two basic techniques: coverage (e.g. through tiling) or edit distance. Any transformation to this structural representation will either impact upon the evaluation of coverage, or increase the edit distance. That is, any transformation that changes the character string will impact upon string-based tools, any that changes the lexical token sequence will impact upon token-based tools, and any that changes the AST will impact upon tree-based tools. Hence, in all such cases, the measurement of similarity is impacted.

From the performed experiments, it was demonstrated that of the 14 implemented source code transformations:

– All 14 modified the character string, and hence impacted upon the string-based tools.
– 5 transformations (tRS, tRM, tSO, tFW & tSD) made fine-grained modifications to the lexical token sequences and hence impacted upon the token-based tools, while the remaining 9 transformations had no or negligible impact upon the token-based tools.
– The same 5 transformations (tRS, tRM, tSO, tFW & tSD) have a similar, but largely less pronounced, impact upon the tree-based tool, with the exception of tRM, which has an extremely pronounced impact upon the tree-based tool.
Hence, from these results, it can be summarised that source code transformations impact upon SCPDTs when the representation of source code that a SCPDT uses for comparing similarity is modified. This was specifically found with the 5 fine-grained structural transformations tRS, tRM, tSO, tFW & tSD in Section 5.1.1, and further emphasised in Sections 5.2 and 5.3. This is a notable vulnerability of currently available SCPDTs, as all of them implement structure-based measures of source code similarity; hence, they are all vulnerable to such transformations. This becomes especially pronounced when the transformations are applied pervasively: if such transformations are applied pervasively throughout a body of source code, a large decrease in the evaluated similarity can be observed. Furthermore, this result is consistent with the naive string, token and tree-based tools, indicating that such techniques in general are vulnerable to the same structural transformations.

In order to avoid the impact of plagiarism-hiding source code transformations that modify the structural representation of the source code, the simplest method is to ignore the structural aspects of source code that change due to the transformations. This can be seen with the Graph ED tool, which represents the program as a set of PDGs. This allows Graph ED to measure the semantic similarity of two programs, through the occurrence of similar relations between statements in each procedure. Assuming that source code transformations do not modify the semantics of the source code (an assumption that, as previously identified, is commonly true in committed source code plagiarism), such an approach should in theory be immune to such changes. This was largely demonstrated for the Graph ED tool in Section 5.1, where it was shown to be impacted only by transformations that modified the semantics of the program, and in Section 5.1.1, where it demonstrated the greatest robustness to the 5 fine-grained structural transformations. Hence, the theoretical foundation of a PDG-based tool does show merit in the detection of source code plagiarism.

With this result, it must also be considered how easy the utilised transformations are for a plagiariser to apply. The 14 source code transformations are not technically complex, and with enough time a novice plagiariser could apply them manually. A skilled plagiariser (i.e. somebody proficient at programming) would conceivably have the necessary skills to apply these transformations, as well as more complex transformations such as obfuscating the control flow of the program. Furthermore, it must be considered that all of the applied transformations are automated, and are in fact features of many source code editors and integrated development environments. Hence, applying source code transformations, even in a pervasive manner, is a trivial task that needs to be accounted for by SCPDTs, and this is a threat to currently available SCPDTs.
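As a concrete illustration of why such transformations defeat structure-based measures while leaving semantics-based ones largely intact, the following fragment pairs an original method with a variant produced by a for-to-while conversion (tFW) and a statement split; the example is constructed for this discussion and is not taken from the evaluated data sets.

// Illustrative example of fine-grained structural transformations. Both methods compute
// the same result, so a PDG-based comparison is largely unaffected, but the token
// sequence and AST differ, which is what degrades string-, token- and tree-based measures.
public class TransformationExample {

    // Original fragment.
    static int sumOriginal(int[] values) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    // Transformed fragment: the for loop is rewritten as a while loop and the compound
    // assignment is split across two statements.
    static int sumTransformed(int[] values) {
        int total = 0;
        int i = 0;
        while (i < values.length) {
            int next = values[i];   // split out of the original compound statement
            total = total + next;
            i++;
        }
        return total;
    }
}

The two methods operate over the same data dependencies, so their PDGs are near-identical, yet their token sequences and ASTs differ at several points.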
7.2 RQ2: Impact of Source Code Injection

Evaluation 2 was designed to answer RQ2. This was explored through two experiments comparing the effect of injecting fragments of source code upon the 11 SCPDTs. The impact of source code injection can be generalised as follows: by injecting source code, new code is inserted into a program that cannot be matched between it and its source. However, the results of the two experiments evaluating this impact vary considerably. Similar to Evaluation 1, the string-based tools were demonstrated to show great vulnerability to the injection of all fragments of source code, while in comparison the token, tree and graph-based tools demonstrated considerably greater, but varied, robustness against the injection of source code fragments. Overall, the results varied between the individual tools and the types of source code fragments injected. This indicates that, by design, different tools are more robust to the injection of certain fragments of source code. Hence, there are interesting observations regarding the impact of source code injection, and how robustness against it can be gained.

Against file injection, the non-string-based naive SCPDTs were demonstrated to have greater robustness than all other tools. This is presumably a result of how the naive SCPDTs aggregate the similarity scores of file pairs into a submission-wise similarity score, namely by averaging the similarity scores from the best mapping of file pairs between two submissions. In an ideal case where the naive tools evaluate 100% similarity between the average 4.41 identical file pairs of a base and variant program, the injection of 1 to 4 files by SimPlag would result in evaluated similarity scores in a range of approximately 69% to 90%. This is consistent with the results of Section 6.1, where at the 100% injection chance the non-string naive SCPDTs evaluate similarity scores approximately in this range. The remaining SCPDTs utilise similar methods, but do not always 'map' individual source files. For example, based on the implementation of Plaggie, the tool aggregates all source files together into a single token stream, and uses its relative size for the calculation of similarity. This subtle difference results in the tool being unable to measure any significant similarity for the source code in the injected source files, and hence it produces similar, but in this case lower, results.

The injection of class and method fragments effectively adds 'junk' to the variant programs, increasing the overall size of the variant sources. Injecting such fragments should not affect the matching of any existing code (assuming sufficient size in relation to SCPDT minimum match lengths), but merely increases the size of the programs being compared. The theoretical decrease in similarity can again be assumed to be proportional to the size increase. In the case of method injection at the 100% injection chance, the variants are on average approximately 140% larger than the base programs, equating to approximately 472 LLOC being injected. This is approximately 40% of the total LLOC being compared between any base program and its variants (the total LLOC being 2 times the average base size of 336 LLOC, plus the injected LLOC). Hence, in a best-case scenario, 60% of the LLOC between the base and variant can be matched, which would result in similarity scores of approximately 60%. A similar relation can be assumed for class injection at the 100% injection chance, where approximately 50% of the LLOC being compared is injected, resulting in a similarity score of approximately 50%.
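The back-of-envelope estimate used above can be made explicit. Writing $L_{base}$ for the average base-program size (336 LLOC) and $L_{inj}$ for the injected LLOC, and assuming the best case in which all base code is matched, the expected similarity $\hat{s}$ is the matched proportion of the total compared code (these symbols are introduced here for illustration only and are not notation used in the evaluation):

\[
\hat{s} \;\approx\; \frac{2\,L_{base}}{2\,L_{base} + L_{inj}}
       \;=\; \frac{2 \times 336}{2 \times 336 + 472}
       \;\approx\; 0.59,
\]

which matches the best-case figure of approximately 60% stated above for method injection at the 100% injection chance; the corresponding calculation for class injection ($L_{inj} \approx 794$) gives approximately 0.46, close to the 50% figure quoted above.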
This assumption largely holds for the non-string-based SCPDTs, where against class injection most tools evaluate an average similarity of approximately 50%, while against method injection most tools evaluate an average similarity of approximately 60%. However, Plaggie and Sim are clear outliers, demonstrating greater robustness; in this case, Plaggie and Sim are the least susceptible to the injection of this junk.
The injection of statement fragments has a similar impact, again adding 'junk' to the variants. If the size increase is again used to predict the similarity of the variants, then at the 100% injection chance the variants should be expected to have an average similarity of approximately 80%. However, this is largely not the case. Most SCPDTs evaluate an average similarity in the range of 40% to 60%, while JPlag and Sim evaluate similarity scores of approximately 75%. In this case, the injection of individual statements is somewhat analogous to the fine-grained structural transformations identified in Evaluation 1. By mixing small fragments of junk with existing code, it can be assumed that matching the base code in the variant is prohibited. Hence, the SCPDTs evaluate a lower similarity than expected. A similar occurrence is seen when injecting all 4 types of source code fragments. At the 100% injection chance, with a 485% increase in variant size, it can be expected that approximately 70% of the analysed LLOC will not be found in the base program, and hence an approximate average similarity of 30% will result. This observation holds for most tools. However, Plaggie, Token ED and Graph ED demonstrate higher average similarities of approximately 40%; hence, these tools demonstrated greater robustness when all 4 types of source code fragments are injected in unison.

However, in consideration of these results, it must be acknowledged that the quantities of injected source code fragments border on a 'worst-case scenario'. As discussed, at the higher injection chances the generated data sets are more representative of a plagiarising student integrating another's work into their own. If only the results for the non-string-based tools up to the 10-20% injection chance are considered, all such tools were demonstrated to evaluate a high average similarity. However, in this worst-case scenario, different SCPDTs demonstrate greater robustness to the injection of different types of source code fragments. This result also emphasises the role of file-comparison methods that can ignore injected code, and of how similarity scores are aggregated, in providing robustness to source code injection. Hence, this should be considered in the detection of pervasively modified plagiarised source code.
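Applying the same illustrative size-based estimate as above to the case where all fragment types are injected at the 100% injection chance gives:

\[
\hat{s} \;\approx\; \frac{2 \times 336}{2 \times 336 + 1631.64} \;\approx\; 0.29,
\]

which is consistent with the approximate 30% average similarity discussed above.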
7.3 RQ3: Most Robust Tool

The results of Evaluations 1 and 2 address RQ3. From the 5 performed experiments, it is clear that the string-based tools are not robust to any form of plagiarism-hiding modification. This is demonstrated through the consistently low similarity scores evaluated by these tools; hence, string-based techniques should be avoided for use in SCPD. The token, tree and graph-based tools show much greater robustness to source code modifications. However, identifying which tool is the most robust to plagiarism-hiding modifications is a matter of perspective.

Strictly from the results of Evaluation 1, JPlag demonstrated the greatest robustness to source code transformations. This was shown by JPlag most consistently evaluating the highest similarity scores, and hence RCT scores, in the three experiments of Evaluation 1 (Sections 5.1, 5.2 and 5.3). However, JPlag is out-performed by the naive Graph ED tool in Section 5.1.1 when compared for robustness against the 5 identified fine-grained structural transformations. This is a result of the Graph ED tool being more robust than JPlag, and all other evaluated SCPDTs, against such transformations when they are pervasively applied. Hence, overall, from the results of this evaluation, JPlag should be considered the most robust SCPDT on average to the evaluated plagiarism-hiding transformations. However, when considering the results against source code injection in Evaluation 2, Plaggie was shown to be the most robust in 4 of the 5 generated data sets. This is followed by Sim, showing a similar robustness in 3 of the 5 generated data sets, while a notable mention is the Graph ED tool, which is consistently ranked highly in 4 of the 5 generated data sets. Hence, in considering these results, determining which tool is most robust is a matter of perspective.
To generalise the answer to RQ3: JPlag is the most robust tool to the evaluated plagiarism-hiding transformations, while Plaggie is the most robust tool against the injection of the evaluated source code fragments. However, under certain conditions with pervasively transformed variants, the Graph ED tool does show potential. Hence, utilising such a tool, or potentially one that combines both structural and semantic measures, shows benefit as a future direction of work in SCPD. In Section 5.1.1, this tool demonstrated greater robustness to pervasively applied transformations which apply fine-grained transformations to the source code structure, while it also ranks consistently highly against both source code transformation and injection. While in most experiments it performed approximately on par with the other non-string-based tools, this is attributed to the naive implementation of the tool. It is feasible that an optimised implementation of a PDG-based Graph ED tool would out-perform all other SCPDTs in most, if not all, experiments.

However, in typical cases of plagiarism, all non-string-based tools show sufficient robustness to source code modification, being capable of evaluating high similarity scores. This result does not mean that the other tools are unsuitable for detecting plagiarism in the presence of plagiarism-hiding modifications. Considering the similarity scores of the lesser-transformed program variants (assuming a transformation chance of at most 40%), in almost all cases the non-string-based tools evaluate similarity scores that are generally enough to raise suspicion. Hence, for typical usage, there is no problem with their robustness to plagiarism-hiding modifications. It is only in the extreme evaluated cases, with pervasively applied source code modifications, that JPlag and Plaggie demonstrate greater robustness.
8 Limitations & Threats to Validity

In the evaluations performed in this work, there are numerous design decisions that originate from the utilised tooling and evaluation method. In this section, important limitations of the evaluation and threats to the validity of results are identified and discussed, focusing on configuration bias of the SCPDTs, the authenticity and correctness of the generated test data, and the measures used for comparing SCPDT robustness.

8.1 Configuration Bias of SCPDTs

In evaluations of code similarity tools, it is common to identify an 'optimal' configuration value for a tool on a given data set. This has been shown to provide greater tool performance (e.g. as seen in Ragkhitwetsagul et al. (2018); Ahadi and Mathieson (2019)).
The performed evaluations do not attempt to identify an optimal configuration value. This is an intentional design decision to remove tool bias, as identifying an optimal configuration has the potential to introduce configuration bias into the performed evaluations. As discussed, using optimal configurations is not considered representative of a real-world use of SCPDTs. Using an optimal configuration value for a SCPDT requires the foresight of knowing in advance which submissions are plagiarised. This is of course not the case in a real-world use of SCPDTs, and may give a false impression of how robust a SCPDT would be in real-world use. Hence, to remove any configuration bias, all SCPDTs are executed using their default configuration values, under the assumption that the original developers of each SCPDT selected appropriate defaults.

By extension, utilising purpose-built naive SCPDTs has the potential to introduce bias into this evaluation. It would be trivial to use configuration values to improve or skew the results of these tools; e.g. the performance of the tiling tools can be improved by decreasing the minimum match length. Hence, to reduce bias through the use of naive SCPDTs, they utilise configuration values that can be sourced from similar SCPDTs, or justified based on the utilised data sets; i.e. the naive String Tile tool uses a minimum match length derived from the average expression size in the data set, while the naive Token Tile tool uses the same minimum match length as JPlag.

Furthermore, using custom-built SCPDTs for this evaluation is also a source of bias. However, the implementation of these tools is, as their name suggests, 'naive'. There is very little code written for these tools, and very little room to increase their robustness disproportionately to the available academic SCPDTs. They were implemented by re-using and wrapping existing libraries and algorithms into command-line applications. The exception is the Graph ED tool, which uses its own implementation of a PDG and an associated edit distance algorithm, but both are still very simple and naive re-applications of existing techniques. In general, the only robustness to plagiarism-hiding modification each tool affords is that intrinsically gained through the respective program representation utilised in each naive SCPDT (e.g. token-based tools being robust to cosmetic changes and the renaming of identifiers). There is little to no intentional optimisation of these tools.

8.2 Authenticity & Correctness of Test Data

The use of synthetic test data is a potential threat to this evaluation. The generated simulated plagiarised variants are not real cases of plagiarism. Hence, this evaluation does not conclusively demonstrate the robustness of the evaluated SCPDTs against real cases of undergraduate source code plagiarism, or against real examples of plagiarism-hiding source code modifications. Instead, this evaluation evaluates the SCPDTs against source code modifications that are representative of undergraduate plagiarisers. The utilised modifications are termed representative as they have been referenced in the literature as being used by undergraduate plagiarisers. However, there are always uncertainties in how representative the complexity of the applied source code modifications is in comparison to real cases of undergraduate source code plagiarism.
In particular, it is uncertain how pervasively a plagiarised work is modified by the plagiariser. In order to accommodate this issue, the plagiarism-hiding modifications are applied using a sliding scale of transformation and injection chances. This allows for evaluating SCPDT robustness against both lesser and progressively more transformed samples of simulated plagiarised works.

It must also be acknowledged that this evaluation only utilises 14 source code transformations, and 4 types of injected source code fragments. This is in contrast to the countless other types of source code transformations that may be applied, or combinations of source code fragments that may be injected. Hence, this work is limited to evaluating robustness against the utilised source code modifications only; there may be many more complex source code modifications, with a more profound impact upon the evaluation of source code similarity, that have not been observed here.

The performed evaluations also only evaluate the robustness of the SCPDTs to plagiarism-hiding modifications. They do not evaluate the accuracy of tools (in terms of precision and recall) in the presence of plagiarism-hiding modifications. The results of this experiment show that certain tools and approaches are more robust to specific transformations when applied to the utilised data set. However, the results do not show that certain tools and approaches are more accurate. While the generated data and test conditions try to simulate as closely as possible the real-world situations in which a student may have plagiarised, this is not a substitute for real-world data. However, as also discussed, such data is generally not available in sufficient quantities for a comprehensive experiment, requiring the use of synthetic data in this experiment.

There are also potential threats to the correctness of the generated data. As discussed, SimPlag only ensures that the generated simulated plagiarised variant programs can be parsed (i.e. they are syntactically correct), and not that the variants are compilable or functionally correct. However, plagiarism needs to be identified irrespective of whether a plagiarised work compiles and is functionally correct. Hence, this is not perceived to be an issue, as this work is focused on evaluating the impact of source code modifications on SCPDT robustness in the case that source code modifications are applied, and not in the case that source code modifications can be applied.

Across the data sets there is also a much higher chance that certain transformations will be applied, as there are more nodes of interest for their application. For example, tAC (Add Comment) can be applied to any class, field or method declaration, compared to tFW (for to while), which can only be applied to for statements. This can lead to an uneven application of source code transformations. Furthermore, the impact of a single application of a transformation varies in terms of how much code is modified. For example, a single reordering of a whole block of statements and a single swapping of operands both count as one transformation.
Hence, these transformations are not always uniformly applied to each derived variant, due to the randomness of their application and the nature of the data sets; as such, this may impact upon the identification of specific transformations that have the potential to greatly impact upon the data sets.

8.3 Comparison Measures

Precision and recall are commonly used metrics in the evaluation of SCPDTs, and in code similarity tool evaluations in general (Whale 1990a; Novak 2016; Ragkhitwetsagul et al. 2018). Both metrics express the accuracy of a tool in identifying similar bodies of source code. However, as the purpose of this work is to measure robustness and not accuracy, precision and recall have not been used in this evaluation. This work is focused on measuring the decrease in similarity evaluated by SCPDTs as they are exposed to more pervasively applied plagiarism-hiding source code modifications; this decrease is used as a measure of robustness. Using accuracy metrics would not contribute to the goal of this study, as accuracy metrics are derived from a binary choice of whether or not a SCPDT has detected an indication of plagiarism. As a SCPDT does not directly detect plagiarism, but instead detects indications of plagiarism (Joy and Luck 1999), using such a method for the evaluation of SCPDT robustness would not contribute to this work.

Two metrics are used in the comparison of robustness: the quantitative absolute decrease in similarity, and a comparative measure expressing the decrease in similarity relative to the applied source code modifications (the RCT and RCI scores). The first metric is used to provide an overview of the impact of source code modifications upon the evaluation of similarity. The second metric is used to compare the SCPDTs. However, the interpretation of the second, comparative metric poses a threat to the validity of results in this work. As discussed, it is not a normalised measure that can be used to compare SCPDTs on different data sets. Identifying a 'universal' normalised comparison metric is outside the scope of this work. The RCT and RCI scores are simply used to correlate the impact of source code modifications with the evaluation of similarity, and then to rank the tools under specific experimental conditions on a single data set. Hence, it should not be assumed that RCT and RCI measurements can be compared between evaluations on different data sets; their use is restricted to the comparison and ranking of tools on individual data sets only.

When calculating the RCI, the increase in program size is used. This value is measured as the LLOC, and is calculated as the non-block statement count of a source file. LLOC is used as it is a commonly used metric for expressing the size of code, counting the executable statements in code without considering formatting, declarations or whitespace. For the purpose of measuring robustness, any size metric could be used in this evaluation (e.g. raw lines of code, token counts, AST node counts, etc.) as long as it is used consistently in the evaluation of robustness for all SCPDTs. Using a different size measurement will simply scale the RCI scores of each tool according to the utilised size measurement. This may give the impression of greater or lesser robustness for a SCPDT. However, the RCI (and RCT) scores are relative comparison measures, not absolute comparison measures; hence, using a different size value will not change the relative robustness ranking of the SCPDTs.
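For reference, a logical-lines-of-code count in the spirit described above (the number of non-block statements in a source file) can be sketched as follows. The JavaParser library is used here purely as a convenient stand-in; the paper's own counting rules and tooling may differ in detail.

import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.stmt.BlockStmt;
import com.github.javaparser.ast.stmt.Statement;

// Minimal sketch of an LLOC count as the non-block statement count of a source file.
public class LlocCounter {
    public static long countLloc(String javaSource) {
        CompilationUnit cu = StaticJavaParser.parse(javaSource);
        // Count every statement node except block statements, which only group other statements.
        return cu.findAll(Statement.class).stream()
                 .filter(s -> !(s instanceof BlockStmt))
                 .count();
    }

    public static void main(String[] args) {
        String src = "class A { int f(int x){ if (x > 0) { x++; } return x; } }";
        System.out.println(countLloc(src)); // if, x++, return -> 3
    }
}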
8.4 Performance of Naive and Academic Tools

Throughout the performed experiments, the naive SCPDTs often performed on par with, if not better than, the academic SCPDTs. This raises the question of what the benefits of the mature academic SCPDTs are, when in many cases naive re-applications of existing techniques can be applied with similar results.

The most profound difference between these tools is their runtimes, and by extension the complexity of their implementations. Table 11 presents the average runtime of each tool when comparing each program pair in each set of assignment submissions. There are clear differences between the academic and naive tools in the average runtime of their approaches. JPlag is by far the fastest non-string-based tool on average, while the naive token-based tools run approximately 10 times slower. Hence, it is clear that the academic tools contain optimisations to improve efficiency when compared to the naive tools; such optimisations are not present in the naive tool implementations.
Table 11  Average SCPDT runtimes per program pair in seconds.

Tool          AS1     AS2     AS3    AS4    AS5    AS6    AVG
JPlag         0.01    0.01    0.02   0.02   0.01   0.35   0.07
Plaggie       0.79    0.80    0.64   0.58   0.35   0.71   0.65
Sim           0.50    0.11    0.07   0.09   0.35   0.35   0.25
Sherlock-W    3.66    3.69    2.96   3.02   1.06   3.88   3.05
Sherlock-S    0.02    0.06    0.01   0.02   0.01   0.01   0.02
String ED     3.57    4.06    1.74   1.56   0.35   3.53   2.47
String Tile   1.35    1.21    1.41   1.37   0.35   3.18   1.48
Token ED      0.93    0.94    0.73   0.71   0.35   0.71   0.73
Token Tile    0.92    0.91    0.79   0.75   0.35   0.71   0.74
Tree ED      12.34   11.64    7.32   5.95   1.76   9.88   8.15
Graph ED      3.61    4.39    3.00   2.89   0.71   3.18   2.96

When considering the results of the experiments which implied that the Graph ED tool has the potential to be more robust than JPlag, the runtime of the Graph ED tool also has to be considered. Being graph-based, it is expected to have a much higher complexity and hence runtime. In this case, the Graph ED tool is approximately 42 times slower than JPlag. While the Graph ED tool could feasibly gain greater efficiency through optimisation (e.g. by pruning unnecessary comparisons), this raises the question of whether the potential for greater robustness is worth the substantially greater complexity of the approach.

However, when comparing the academic and naive tools, the accuracy of the tools must also be considered. The performed experiments do not consider the accuracy of the evaluated tools, and it must be noted that in certain circumstances a highly robust SCPDT may have poor accuracy in the detection of plagiarism. For example, consider a SCPDT that is eager in measuring similarity, such that it reports a high similarity between unrelated programs. In such a case, the tool would have a poor false positive rate, and overall a poor accuracy. Hence, the results of these evaluations should not be taken to imply that a robust tool is an accurate tool; a naive tool that is more robust than an academic tool is not necessarily more accurate than that academic tool. Evaluating the accuracy of the utilised SCPDTs is subject to future work.
9 Related Work

Source code plagiarism is a well-explored topic in academia, and subsequently there exist many works which seek to evaluate or compare SCPDTs, for example Whale (1990a); Verco and Wise (1996); Lancaster and Tetlow (2005); Flores et al. (2014); Ahadi and Mathieson (2019). However, a common theme in prior evaluations is that they evaluate tools for accuracy in detecting plagiarised assignment submissions, through the measurement and comparison of the precision and recall of the tools. Similarly, there are a number of similar works in the domain of code clone detection which seek to compare tools, for example Bellon et al. (2007); Roy et al. (2009); Svajlenko and Roy (2015); Walker et al. (2020). However, these works are again more focused on evaluating the accuracy of tools in identifying similar programs. There is no emphasis on measuring the robustness of tools against specific source code modifications.

In the performed experiments, the utilised SCPDTs are evaluated for robustness to plagiarism-hiding modifications, in order to measure the impact of applying plagiarism-hiding modifications upon the similarity evaluated by SCPDTs. To the authors' knowledge, this is the first work that specifically compares SCPDTs by robustness to plagiarism-hiding modifications that are representative of undergraduate programmers. However, there are three other works with similarities in the theme of evaluating source code similarity tools. Ko et al. (2017) evaluated the performance of COAT (a code obfuscation tool) at fooling 4 SCPDTs (Moss, JPlag, Sim and Sherlock). However, this evaluation was against only 8 transformations, many of which are not representative of undergraduate programmers. Furthermore, there is no focus on measuring the impact of the transformations upon the SCPDTs, only the measurement of tool accuracy. Ragkhitwetsagul et al. (2018) compare 30 different tools and techniques on their accuracy in evaluating the similarity of pervasively modified source code. While similar to this work, Ragkhitwetsagul et al. are focused on the accuracy of tools in detecting code clones. Furthermore, their generation of pervasively modified source code is enabled by Java byte-code obfuscators and decompiler tools. Such modifications to source code are not necessarily representative of undergraduate programmers, and there is no measurement of the impact of specific transformations. Schulze and Meyer (2013) evaluate code clone detection tools for robustness against code obfuscations. Schulze and Meyer applied 5 obfuscations to source code, which could be considered representative of plagiarism. However, they only evaluated 3 tools (one of which was JPlag), and the focus of their work was code clone detection. Hence, while their work is similar, it was not performed at the same scale as the evaluations in this article, and it is focused on measuring the accuracy of tools, not their robustness to transformation. The evaluations performed here are a partial extension of previous work in Cheers et al. (2020). The prior work was focused on showing that existing SCPDTs can be fooled with pervasive plagiarism-hiding transformations. While the prior work shares a common theme with this work in evaluating SCPDTs, they are distinct.

The performed evaluations are based on test data generation with SimPlag. There are two other works which enable test data generation for code similarity evaluations: COAT (Ko et al. 2017) and ForkSim (Svajlenko et al. 2013).
COAT is the most similar to SimPlag, as it is intended for use in plagiarism detection. However, it only implements 8 obfuscations, only supports C source code, and does not appear to have been released for reuse. ForkSim implements an injection/mutation framework to simulate software development activities. This is designed for use in code clone detection activities. The injection capabilities of this tool can be used to simulate cases of verbatim source code copying; however, it only implements basic transformations (referred to as mutations in ForkSim) of source code, typically additions, deletions or substitutions. The authors of this work have also presented SPPlagiarise (Cheers et al. 2019), a similar tool to SimPlag which enables the generation of semantics-preserving variants of a base program. There is overlap in the applied transformations of SimPlag and SPPlagiarise; however,
SPPlagiarise emphasises maintaining the correctness of the base program and as such cannot apply some of the transformations used in this work (e.g. shuffling statements) without significant re-engineering.

The performed evaluations are facilitated by the PrEP evaluation pipeline presented as part of this work. There exists one similar tool, proposed by Cebrian et al. (2009), which seeks to benchmark plagiarism detection tools through the automatic generation of test cases for comparison. However, it was designed for the APL2 programming language, which is not known to be supported by any commonly available SCPDT. A similar evaluation pipeline exists for code clone detection: BigCloneBench (Svajlenko and Roy 2015) provides an evaluation pipeline and ground-truth data set for comparing code clone detection tools. However, BigCloneBench is designed for code clone detection tasks, and as previously discussed, in SCPD it is difficult to obtain a ground-truth evaluation data set. Hence, PrEP integrates SimPlag for the generation of test data.
10 Conclusion & Future Work
In this article, the robustness of 11 SCPDTs to plagiarism-hiding modifications has been evaluated. This was performed through two evaluations that, firstly, evaluated robustness to source code transformations, and secondly, evaluated robustness to source code injection. The results of these evaluations demonstrate that while in many cases the evaluated SCPDTs are robust to plagiarism-hiding modifications, there are specific source code transformations to which the evaluated SCPDTs are vulnerable. This applies specifically to transformations that make fine-grained modifications to the structure of a program, for example reordering statements, reordering members, swapping expression operand orders, mapping statements to semantic equivalents, and splitting statements. Applying such transformations changes the structure of the source code, and often resulted in a large impact on the evaluation of program similarity. Hence, for these transformations in particular, the evaluated tools mostly did not show a high degree of robustness.

Overall, the results of the evaluations imply that all non-string-based SCPDTs show comparatively good robustness to plagiarism-hiding source code modifications, and as such are not greatly impacted by such modifications. However, when source code modifications are most pervasively applied, the results of these experiments demonstrate that JPlag is the most robust to the evaluated plagiarism-hiding source code transformations, while Plaggie is the most robust against the injection of source code fragments. The results of the performed evaluations also suggest there is benefit in evaluating program similarity with PDGs to provide robustness to pervasive applications of plagiarism-hiding modifications. This is attributed to the fact that, while tools such as JPlag measure the similarity of the structure of two programs, PDGs capture the semantic similarity of programs; hence, they are more robust to the applied structural modifications that are representative of undergraduate source code plagiarism. The results of the performed evaluations can be summarised as:

1. String-based tools show poor robustness to any modifications;
2. Non-string-based approaches demonstrate satisfactory robustness to modifications in typical cases of source code plagiarism;
3. JPlag shows the greatest robustness to the evaluated plagiarism-hiding transformations;
4. Plaggie shows the greatest robustness to the injection of fragments of source code; and
5. PDG-based tools provide indications of greater robustness against pervasively modified source code.

Three directions of future work have been identified. Firstly, this work is intentionally limited to the evaluation of SCPDTs. However, there exist many other tools which evaluate source code similarity for other domains, such as code clone detection. It would be interesting to evaluate the robustness of code clone detection tools to common source code modifications, and compare the results with SCPDTs. However, this would require a revised experimental method that is fair to both tool types. Secondly, there are many more source code transformations which could be evaluated than the 14 used in Section 5. Specifically, it would be interesting to evaluate the effect of much more invasive source code transformations which change the structure and semantics of a program, but retain the original behaviour. Finally, the evaluations indicated that PDGs show robustness in the presence of pervasively modified source code. It would be interesting to perform a much more in-depth exploration of semantic-based, and potentially behaviour-based, methods of evaluating source code similarity in the presence of pervasive modifications, for example, to see if such approaches can provide greater robustness to pervasive plagiarism-hiding modifications.
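The following listing is the small sketch referred to above. It is a minimal, hypothetical illustration (the class and variable names are invented and do not come from any evaluated tool or data set) of why reordering independent statements disturbs the token sequence observed by string- and structure-based tools, while leaving unchanged the data dependences that a PDG-based comparison relies on.

// Hypothetical illustration; not taken from any evaluated tool or data set.
public class ReorderingSketch {

    // One submission: two independent definitions followed by a use of both.
    static int variantA(int x, int y) {
        int a = x + 1;   // defines a
        int b = y * 2;   // defines b, independent of a
        return a + b;    // uses a and b
    }

    // A modified copy with the two independent statements reordered.
    // A token- or string-based tool observes a different statement sequence,
    // but the dependence relations (the return statement depends on the
    // definitions of a and b) are identical.
    static int variantB(int x, int y) {
        int b = y * 2;
        int a = x + 1;
        return a + b;
    }

    public static void main(String[] args) {
        // Behaviour is unchanged: both variants print 12.
        System.out.println(variantA(3, 4));
        System.out.println(variantB(3, 4));
    }
}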
References
Ahadi A, Mathieson L (2019) A comparison of three popular source code similarity tools for detecting student plagiarism. In: Proceedings of the Twenty-First Australasian Computing Education Conference, Association for Computing Machinery, New York, NY, USA, ACE '19, p 112–117, DOI 10.1145/3286960.3286974
Ahtiainen A, Surakka S, Rahikainen M (2006) Plaggie: Gnu-licensed source code plagiarism detection engine for java exercises. In: Proceedings of the 6th Baltic Sea Conference on Computing Education Research: Koli Calling 2006, Association for Computing Machinery, New York, NY, USA, Baltic Sea '06, p 141–142, DOI 10.1145/1315803.1315831
Allyson FB, Danilo ML, José SM, Giovanni BC (2019) Sherlock n-overlap: Invasive normalization and overlap coefficient for the similarity analysis between source code. IEEE Transactions on Computers 68(5):740–751
Anjali V, Swapna T, Jayaraman B (2015) Plagiarism detection for java programs without source codes. Procedia Computer Science 46:749–758, DOI https://doi.org/10.1016/j.procs.2015.02.143, proceedings of the International Conference on Information and Communication Technologies, ICICT 2014, 3-5 December 2014 at Bolgatty Palace & Island Resort, Kochi, India
Anzai K, Watanobe Y (2019) Algorithm to determine extended edit distance between program codes. In: 2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pp 180–186
Baxter ID, Yahin A, Moura L, Sant'Anna M, Bier L (1998) Clone detection using abstract syntax trees. In: Proceedings of the International Conference on Software Maintenance, IEEE Computer Society, Washington, DC, USA, ICSM '98, pp 368–377
Bellon S, Koschke R, Antoniol G, Krinke J, Merlo E (2007) Comparison and evaluation of clone detection tools. IEEE Transactions on Software Engineering 33(9):577–591
Burd E, Bailey J (2002) Evaluating clone detection tools for use during preventative maintenance. In: Proceedings. Second IEEE International Workshop on Source Code Analysis and Manipulation, pp 36–43
Cebrian M, Alfonseca M, Ortega A (2009) Towards the validation of plagiarism detection tools by means of grammar evolution. IEEE Transactions on Evolutionary Computation 13(3):477–485
Chae DK, Ha J, Kim SW, Kang B, Im EG (2013) Software plagiarism detection: A graph-based approach. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, Association for Computing Machinery, New York, NY, USA, CIKM '13, p 1577–1580, DOI 10.1145/2505515.2507848
Cheers H, Lin Y, Smith SP (2019) Spplagiarise: A tool for generating simulated semantics-preserving plagiarism of java source code. In: 2019 IEEE 10th International Conference on Software Engineering and Service Science (ICSESS), pp 617–622
Cheers H, Lin Y, Smith SP (2020) Detecting pervasive source code plagiarism through dynamic program behaviours. In: Proceedings of the Twenty-Second Australasian Computing Education Conference, Association for Computing Machinery, New York, NY, USA, ACE '20, p 21–30, DOI 10.1145/3373165.3373168
Chen R, Hong L, Chunyan Lü C, Deng W (2010) Author identification of software source code with program dependence graphs. In: 2010 IEEE 34th Annual Computer Software and Applications Conference Workshops, pp 281–286
Chen X, Francia B, Ming Li, McKinnon B, Seker A (2004) Shared information and program plagiarism detection. IEEE Transactions on Information Theory 50(7):1545–1551
Cosma G, Joy M (2008) Towards a definition of source-code plagiarism. IEEE Transactions on Education 51(2):195–200
Cosma G, Joy M (2012) An approach to source-code plagiarism detection and investigation using latent semantic analysis. IEEE Transactions on Computers 61(3):379–394
Curtis G, Popal R (2011) An examination of factors related to plagiarism and a five-year follow-up of plagiarism at an australian university. International Journal for Educational Integrity 7(1):30–42, DOI 10.21913/IJEI.v7i1.742
Faidhi J, Robinson S (1987) An empirical approach for detecting program similarity and plagiarism within a university programming environment. Computers & Education 11(1):11–19, DOI https://doi.org/10.1016/0360-1315(87)90042-X
Ferrante J, Ottenstein KJ, Warren JD (1987) The program dependence graph and its use in optimization. ACM Trans Program Lang Syst 9(3):319–349, DOI 10.1145/24039.24041
Flores E, Rosso P, Moreno L, Villatoro-Tello E (2014) On the detection of source code re-use. In: Proceedings of the Forum for Information Retrieval Evaluation, Association for Computing Machinery, New York, NY, USA, FIRE '14, p 21–30, DOI 10.1145/2824864.2824878
Freire M, Cebrián M, del Rosal E (2007) AC: an integrated source code plagiarism detection environment. CoRR abs/cs/0703136
Gitchell D, Tran N (1999a) Sim: A utility for detecting similarity in computer programs. In: The Proceedings of the Thirtieth SIGCSE Technical Symposium on Computer Science Education, Association for Computing Machinery, New York, NY, USA, SIGCSE '99, p 266–270, DOI 10.1145/299649.299783
Gitchell D, Tran N (1999b) Sim: A utility for detecting similarity in computer programs. SIGCSE Bull 31(1):266–270, DOI 10.1145/384266.299783
Granzer W, Praus F, Balog P (2013) Source code plagiarism in computer engineering courses. Journal on Systemics, Cybernetics and Informatics 11(6):22–26
Grune D, Huntjens M (1989) Het detecteren van kopieën bij informatica-practica. Informatie (in Dutch) 31(11):864–867
Halstead MH (1977) Elements of Software Science (Operating and Programming Systems Series). Elsevier Science Inc., New York, NY, USA
Jadalla A, Elnagar A (2008) Pde4java: Plagiarism detection engine for java source code: A clustering approach. Int J Bus Intell Data Min 3(2):121–135, DOI 10.1504/IJBIDM.2008.020514
Jhi Y, Wang X, Jia X, Zhu S, Liu P, Wu D (2011) Value-based program characterization and its application to software plagiarism detection. In: 2011 33rd International Conference on Software Engineering (ICSE), pp 756–765
Jones E (2001) Metrics based plagarism monitoring. Journal of Computing Sciences in Colleges 16:253–261
Joy M, Luck M (1999) Plagiarism in programming assignments. IEEE Transactions on Education 42(2):129–133
Kapser C, Godfrey MW (2003) Toward a taxonomy of clones in source code: A case study. In: ELISA '03, pp 67–78
Karnalim O (2016) Detecting source code plagiarism on introductory programming course assignments using a bytecode approach. In: 2016 International Conference on Information Communication Technology and Systems (ICTS), pp 63–68
Ko S, Choi J, Kim H (2017) Coat: Code obfuscation tool to evaluate the performance of code plagiarism detection tools. In: 2017 International Conference on Software Security and Assurance (ICSSA), pp 32–37
Kolmogorov A (1998) On tables of random numbers. Theoretical Computer Science 207(2):387–395, DOI https://doi.org/10.1016/S0304-3975(98)00075-9
Kustanto C, Liem I (2009) Automatic source code plagiarism detection. In: 2009 10th ACIS International Conference on Software Engineering, Artificial Intelligences, Networking and Parallel/Distributed Computing, pp 481–486
Lancaster T, Tetlow M (2005) Does automated anti-plagiarism have to be complex? Evaluating more appropriate software metrics for finding collusion
Li X, Zhong XJ (2010) The source code plagiarism detection using ast. In: 2010 International Symposium on Intelligence Information Processing and Trusted Computing, pp 406–408
Liu C, Chen C, Han J, Yu PS (2006) Gplag: Detection of software plagiarism by program dependence graph analysis. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery, New York, NY, USA, KDD '06, p 872–881, DOI 10.1145/1150402.1150522
Luo L, Ming J, Wu D, Liu P, Zhu S (2017) Semantics-based obfuscation-resilient binary code similarity comparison with applications to software and algorithm plagiarism detection. IEEE Transactions on Software Engineering 43(12):1157–1177
Martins VT, Fonte D, Henriques PR, da Cruz D (2014) Plagiarism Detection: A Tool Survey and Comparison. In: Pereira MJV, Leal JP, Simões A (eds) 3rd Symposium on Languages, Applications and Technologies, Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, OpenAccess Series in Informatics (OASIcs), vol 38, pp 143–158, DOI 10.4230/OASIcs.SLATE.2014.143
Mozgovoy M (2006) Desktop tools for offline plagiarism detection in computer programs. Informatics in Education 5(1):97–112
Novak M (2016) Review of source-code plagiarism detection in academia. In: 2016 39th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp 796–801
Novak M, Joy M, Kermek D (2019) Source-code similarity detection and detection tools used in academia: A systematic review. ACM Trans Comput Educ 19(3), DOI 10.1145/3313290
Ottenstein KJ (1976) An algorithmic approach to the detection and prevention of plagiarism. SIGCSE Bull 8(4):30–41, DOI 10.1145/382222.382462
Parker A, Hamblen JO (1989) Computer algorithms for plagiarism detection. IEEE Transactions on Education 32(2):94–99
Pawlik M, Augsten N (2015) Efficient computation of the tree edit distance. ACM Trans Database Syst 40(1), DOI 10.1145/2699485
Pawlik M, Augsten N (2016) Tree edit distance: Robust and memory-efficient. Information Systems 56:157–173, DOI https://doi.org/10.1016/j.is.2015.08.004
Pierce J, Zilles C (2017) Investigating student plagiarism patterns and correlations to grades. In: Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science Education, Association for Computing Machinery, New York, NY, USA, SIGCSE '17, p 471–476, DOI 10.1145/3017680.3017797
Pike R (n.d.) Sherlock Plagiarism Detector. URL https://academic.oup.com/comjnl/article-pdf/39/9/741/993714/390741.pdf