The Adverse Effects of Code Duplication in Machine Learning Models of Code
Miltiadis Allamanis [email protected]
Microsoft Research, Cambridge, UK
Abstract
The field of big code relies on mining large corpora of code to perform some learning task towards creating better tools for software engineers. A significant threat to this approach was recently identified by Lopes et al. [19], who found a large amount of near-duplicate code on GitHub. However, the impact of code duplication has not been noticed by researchers devising machine learning models for source code. In this work, we explore the effects of code duplication on machine learning models, showing that reported performance metrics are sometimes inflated by up to 100% when testing on duplicated code corpora, compared to the performance on de-duplicated corpora, which more accurately represent how machine learning models of code are used by software engineers. We present a duplication index for widely used datasets and list best practices for collecting code corpora and evaluating machine learning models on them. Finally, we release tools to help the community avoid this problem in future research.
CCS Concepts: • Computing methodologies → Machine learning; • Software and its engineering → Software notations and tools.

Keywords: duplication, dataset collection, machine learning, big code, code naturalness
Introduction

Machine learning models of source code have recently received great attention from the research community. At the intersection of the research fields of software engineering, programming languages, machine learning and natural language processing, multiple communities have been brought together into the field of “Big Code” or “code naturalness”, with many fruitful results [1]. Commonly, research in this area relies on large corpora of code which can be used as training and test sets, allowing machine learning methods to learn and probabilistically reason about coding practice at a large scale. The goal is to use the learned models to provide useful tools to software engineers.

However, there is a looming crisis in this newly founded area, caused by a disproportionately large amount of code duplication. This issue — first observed by Lopes et al. [19] — refers to the fact that multiple file-level (near-)clones appear in large corpora of code, such as those mined from GitHub repositories. This is because software engineers often copy — partially or entirely — files from other projects [11, 19]. Despite the findings of Lopes et al. [19], the research community has not yet investigated how and when code duplication negatively affects its research, the machine learning models it devises, and the practical tools it creates. The core issue arises from the fact that identical or highly similar files appear both in the training and test sets that are used to train and evaluate the machine learning models.

In this work, we first describe the impact that code duplication can have on machine learning models. Although not all applications of machine learning models are affected by code duplicates, a large majority of them are. We discuss the biases introduced when evaluating models under duplication and show that duplication can cause the evaluation to overestimate the performance of a model compared to the performance that actual users of the model observe. Then, we replicate the work of Lopes et al. [19] across ten corpora that have been used in “big code” research, and we measure the impact of duplication across datasets and machine learning models, showing that the performance observed by a user is up to 50% worse compared to reported results. Although this paper does not present any results or ideas that would be unexpected to a statistician or a machine learning expert, we hope that it will help programming language, software engineering and machine learning researchers better understand the issue of code duplication for machine learning on code by clearly illustrating its impact. At the same time, we provide tools and some best practices that can help overcome pitfalls when researching machine learning methods that employ source code data. We hope that this paper contributes the following:

• an application-driven principle for deciding if, within the application domain, code corpus deduplication is needed (Section 2);
• the theoretical basis of the effects of code duplication (Section 2) and a demonstration of the effects of code duplication on machine learning models of source code (Section 4);
• an open-source, cross-platform tool that detects near-duplicates in C#, Java, JavaScript and Python corpora (Section 3);
• a set of suggested best practices to mitigate the code duplication problem for machine learning models of code (Section 5).

Code Duplication and Machine Learning

Code duplication refers to the idea that a large snippet of code appears multiple times with no or small differences within a corpus of code.
Duplicates are a relatively small subset of code clones [25] — a well-studied field of software engineering. The existence of duplicates was noticed much earlier [27], but their negative effect became significantly more noticeable due to recent advancements that allowed the collection of large code corpora [19]. In this paper, we are specifically interested in illustrating the effects of code duplication on machine learning models of code. This endeavor sets different parameters for searching, understanding and classifying code duplication. To understand the effects of duplicates, we first need to discuss the practical applications of machine learning models for code.

Why do we want to train machine learning models on source code? At a high level, the goal is to train models on existing code, such that the learned models capture the statistical properties of some particular aspect of coding practice, which can then be useful within a tool used by a software engineer. Some examples of recently researched models include:

• code completion models [14, 15, 20, 24], aiming to assist code construction in an editor when a developer is writing new code. Such models are widely used in practice today;
• type prediction models [13, 23], where the goal is to infer (or provide probabilistic hints for) the types of new, previously untyped, programs (e.g. in JavaScript);
• code summarization models [3, 5, 7, 16], where the goal is to summarize some code into a short natural language utterance.

In most applications, like in the aforementioned examples, the goal is to use trained models to provide recommendations and insights on new and unseen code when the software engineer is creating or maintaining it. Essentially, this necessitates that machine learning models generalize well to new source code or — in statistical machine learning terms — to faithfully model the true distribution of the data as it will be observed by the particular use case of the tool. As we will discuss later in this section, in order for a machine learning model to generalize to the true data distribution, it needs to be trained on data independently drawn from that distribution. Code duplicates commonly violate that.

Furthermore, the true data distribution depends on the target application. Different applications of machine learning models of code will tend to have different true data distributions. (We use the terms “duplicate” and “near-duplicate” interchangeably to refer to code that is highly similar but not necessarily identical.) Therefore, before training any machine learning model of code, we should all ask: “What is the distribution of the data that our machine learning component will need to operate on?”
For example, for a token-level code completion model, the true data distribution refers to the predicted next token that the developer will actually type. It is thus reasonable to assume that duplicate code is not a part of the true data distribution, as a developer will copy-paste whole chunks rather than type duplicate code character-by-character. However, there are other cases where code duplication is part of the true data distribution. For example, if we are interested in deobfuscating code that contains a lot of copy-pasted libraries/functions, then duplicates are part of the true data distribution.

The duplication issue arises because, in practice, it is very rare for researchers to train their model and measure its performance by directly observing its use by engineers, i.e. the true data distribution. Instead, a common practice is to split any existing dataset into two parts: a training set that is used to train the machine learning model and a test set where the performance of the model is measured. And since duplicated datasets are distributed differently from non-duplicated datasets, the machine learning models learn to model a different probability distribution. This is because machine learning makes an important assumption: each of the data points needs to be independent and identically distributed (i.i.d.) over the true distribution of data of the use case. This is not an unreasonable assumption and is widely and successfully used in machine learning and data mining research and practice [21, §7.3]. It is exactly this assumption that code duplication strongly violates for many of the use cases of machine learning models of code.

In this paper, we make two assumptions. First, the true data distribution of the target application contains no duplicates. Second, we assume that duplication happens only across files, similar to Lopes et al. [19]. This means that smaller amounts of code duplication, such as clones that span only a few lines, are not considered duplicates. The last assumption addresses the possibility that the target use case of a machine learning-based software engineering tool contains a few lines of cloned code. For example, a type prediction tool may still be required to suggest types even when a few lines of code have been copy-pasted. These assumptions are central to the thesis of this paper: as we will discuss later, particular use cases may allow for duplicates within the true data distribution. The results presented in this paper do not affect them. Other use cases may need to consider additional types of duplicates, such as smaller cloned snippets or functional (type IV) clones. The results presented here are still valid for those cases and, most probably, the negative effects of code duplication would be more severe when a broader class of code duplicates needs to be considered.
Concepts and Definitions
Assume a dataset D of source code files that is split into a training and a test set (Figure 1). We distinguish three types of duplicates: (1) “in-train” duplicates, i.e. files duplicated within the training set; (2) “in-test” duplicates, i.e. duplicates within the test set; and (3) “cross-set” duplicates, i.e. files that appear both in the training and test sets.
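To make the three categories concrete, the following Python sketch (hypothetical helper names; it assumes near-duplicate clusters have already been computed, e.g. by a detection tool like the one of Section 3) classifies each cluster of duplicates by where its members fall:

```python
def classify_duplicate_clusters(clusters, train_files, test_files):
    """Classify near-duplicate clusters by where their members fall.

    clusters: iterable of sets of file ids, one set per duplicate cluster.
    train_files, test_files: sets of file ids on each side of the split.
    A cluster may fall into several categories at once, e.g. it can be
    duplicated inside the training set *and* span both sets.
    """
    in_train, in_test, cross_set = [], [], []
    for cluster in clusters:
        train_part = cluster & train_files
        test_part = cluster & test_files
        if len(train_part) > 1:
            in_train.append(cluster)   # duplicated within the training set
        if len(test_part) > 1:
            in_test.append(cluster)    # duplicated within the test set
        if train_part and test_part:
            cross_set.append(cluster)  # appears on both sides of the split
    return in_train, in_test, cross_set
```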
Duplication Bias

In machine learning, a measured quantity $f$, such as the loss function minimized during training or a performance (e.g. accuracy) metric, is usually estimated as the average of the metric computed uniformly over the training or test set(s) (because of the i.i.d. hypothesis). Specifically, the estimate of $f$ over a dataset $D = \{x_i\}$ is computed as

$$\hat{f} = \frac{1}{|D|} \sum_{x_i \in D} f(x_i). \qquad (1)$$

Duplication biases this estimate because some $x_i$ will appear multiple times. Specifically, we can equivalently transform $D$ into a multiset $X = \{(x_i, c_i)\}$, where $c_i \in \mathbb{N}^+$ is the number of times that the sample $x_i$ is found in the dataset. Therefore, we can rewrite Equation 1 as

$$\hat{f} = \underbrace{\frac{1-d}{|X|} \sum_{x_i \in X} f(x_i)}_{\text{unbiased estimate } \bar{f}} \;+\; \underbrace{\frac{d}{|D|-|X|} \sum_{x_i \in X} (c_i - 1)\, f(x_i)}_{\text{duplication bias } \beta} \qquad (2)$$

where $d = \frac{|D|-|X|}{|D|} = \frac{\sum_i c_i - |X|}{|D|}$ is the duplication factor and $|X|$ is the number of unique $x_i$ in $X$. Thus $d$ is the proportion of the samples in the dataset that are duplicated ($c_i > 1$). Since $\hat{f} = (1-d)\bar{f} + d\beta$, we see that the larger the duplication factor $d$, the larger the effect of the duplication bias $\beta$.

From a machine learning perspective, the duplication bias in the training loss causes a model to overweight some training samples (the in-train duplicates). During testing, the duplication bias will skew the reported performance metric. Furthermore, we expect cross-set duplicates to artificially improve any metric by taking advantage of the fact that multiple samples that are seen during training also appear in the test set, giving the illusion that the model generalizes, where in fact it memorized duplicates.
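As a concrete illustration of Equations 1 and 2, the following small Python example (toy values, not drawn from any real dataset) verifies that the biased estimate decomposes into the unbiased estimate and the duplication bias:

```python
# Toy "dataset": per-sample metric values f(x_i) and duplication counts c_i.
samples = {"a": (0.9, 3),   # f(a) = 0.9, appears 3 times
           "b": (0.4, 1),
           "c": (0.2, 1)}

D = sum(c for _, c in samples.values())                    # |D| = 5
X = len(samples)                                           # |X| = 3 unique samples
d = (D - X) / D                                            # duplication factor = 0.4

f_hat = sum(f * c for f, c in samples.values()) / D        # biased estimate (Eq. 1)
f_bar = sum(f for f, _ in samples.values()) / X            # unbiased estimate
beta = sum(f * (c - 1) for f, c in samples.values()) / (D - X)  # duplication bias

assert abs(f_hat - ((1 - d) * f_bar + d * beta)) < 1e-12   # Eq. 2 holds
print(f_hat, f_bar)  # 0.66 vs 0.50: duplication inflates the estimate
```

Here the single duplicated sample inflates the estimated metric from 0.50 to 0.66, exactly the mechanism by which cross-set duplicates inflate reported test performance.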
Measuring Code Duplication

To measure code duplication, we need a method that detects (near-)duplicate files across a large corpus of code. As we discussed in the previous section, we are interested in file-level duplication and thus we re-implement SourcererCC’s [26] token-level duplication detection with minor modifications, described next, and release it under a permissive license. These simple modifications adapt SourcererCC to file-level duplicate detection, removing complexity that is required for general-purpose code clone detection, and are similar to those discussed in Lopes et al. [19].
Figure 1. Schematic description of types of duplicates (training set, test set, in-train, in-test and cross-set duplicates). The dashed boxes indicate the subset of files that are duplicates within each set.
Detecting near-duplicates
Although detecting exact duplicates is straightforward, this misses a substantial number of near-exact matches that differ only in a few aspects. To detect those, we follow SourcererCC [26]: we tokenize each file and extract all identifier and literal tokens. For each file, we build two “fingerprints”: a set $T^s$ and a multiset $T^m$ of all the identifiers and literals. We consider two files $i$ and $j$ to be duplicates if the Jaccard similarities $J(T^s_i, T^s_j)$ and $J(T^m_i, T^m_j)$ are above the thresholds $t_s$ and $t_m$ respectively. In this work, we set $t_s = 0.8$ and $t_m = 0.7$.

Our tool is available at https://github.com/Microsoft/near-duplicate-code-detector. It contains tokenizers for Java, JavaScript, C# and Python. The tool accepts a JSONL file (i.e. a file containing a valid JSON document per line) with an id for each file (e.g. its filepath) and the list of identifier and literal tokens within that file. It returns a JSON file with the clusters of near-duplicate files. We also provide a faster, but approximate, Python tool that works on the same principles within the dpu-utils package at https://github.com/Microsoft/dpu-utils.
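The core similarity check can be sketched in a few lines of Python. This is a simplified illustration of the fingerprint comparison described above, not the released tool itself (which additionally uses indexing tricks to avoid comparing all pairs of files):

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def multiset_jaccard(a, b):
    """Jaccard similarity of two multisets, represented as Counters."""
    inter = sum((a & b).values())   # element-wise minimum of counts
    union = sum((a | b).values())   # element-wise maximum of counts
    return inter / union if union else 1.0

def are_near_duplicates(tokens_i, tokens_j, t_set=0.8, t_multiset=0.7):
    """tokens_*: lists of the identifier/literal tokens of each file."""
    if jaccard(set(tokens_i), set(tokens_j)) < t_set:
        return False
    return multiset_jaccard(Counter(tokens_i), Counter(tokens_j)) >= t_multiset
```

The set-based check is cheap and filters out most non-duplicates; the multiset check then also accounts for how often each token appears, which catches files that share vocabulary but use it very differently.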
Table 1. Duplication Statistics across Existing Corpora, over all files (across any provided splits) with more than 20 identifier and literal tokens.

*We place one method per file, since the corpus is split across methods. †When the dataset is split across projects, as in the author-provided split, this falls to 8.9%.
Duplication Statistics
Armed with a reasonable method for detecting duplication, we now report code duplication statistics for ten publicly available datasets that have been used for machine learning on code. It should be noted that, for the studied datasets, all authors have taken significant steps to remove exact file-level clones. However, this process missed a large number of (near-)duplicate files that may differ in minor aspects, such as whitespace, code comments and other small code modifications. Table 1 reports the results. We note that for the JavaScript-150k dataset our tool was able to process only 112k files (the esprima parser failed to parse the rest) and therefore we report results on those files; the rest of the files are ignored. The results show that in many datasets a substantial proportion of the dataset contains duplicated code. Note that these statistics apply when datasets are split into different folds (chunks) across files. When splitting across projects, this percentage is most often reduced. For example, when splitting the Java-Large dataset across projects, following the split provided by Alon et al. [5], 8.9% of the test set is made of cross-set duplicates (compared to the average of 24.1% when splitting across files). This suggests that splitting across projects — when possible — is a helpful strategy.

As expected, smaller datasets, such as those collected over a small and curated set of projects, suffer less from duplication. The Concode dataset [17] seems to be the one suffering the most from duplication, with about 68.7% of its methods being duplicates. However, it should be appreciated that Concode and the Python docstring datasets are datasets where each sample is a single function, rather than a full source code file. If we transform the other datasets such that each file contains a single function or a smaller snippet, their duplication statistics might also worsen. Note that once the data is split into training-test sets, the percentage of cross-set duplicates is smaller than the full dataset duplication factor, since a noticeable proportion of duplicates become in-train or in-test duplicates. Finally, we note that the duplication in all datasets is significantly smaller than that reported by Lopes et al. [19]. This should be attributed to the fact that the corpus collected by Lopes et al. [19] is orders of magnitude larger than any of the datasets in Table 1. Authors of the datasets discussed here made efforts to deduplicate and filter the collected corpora by removing most low-popularity projects and some number of exactly duplicated files. We release the lists of duplicate files at https://ieee-dataport.org/open-access/deduplication-index-big-code-datasets. We hope that these lists can be used as a dataset duplication index in future work.
Human Evaluation
SourcererCC makes some approximations to make the search computationally efficient. This raises the question of its precision. The author of this paper inspected 100 random pairs of duplicates from the JavaScript-150k dataset [22] and 100 random pairs from the Java-Large dataset [5] and annotated each pair as a true or false positive. Overall, the duplicate detection achieves perfect precision for both datasets. This is to be expected, as SourcererCC is a well-validated method and works very well for the special and relatively easy case of detecting file-level duplicates.

Looking at the duplicates, we make a few qualitative, empirical observations. First, we observe that a large majority of duplicates share the same file name. For the JavaScript-150k dataset, the majority of near-duplicates is of two kinds: (a) different versions of the same file and (b) configuration-like files that differ mostly in the configuration values. In contrast, in the Java-Large dataset we find more exact clones, duplicates of the same file but of a different version, and boilerplate code.
Table 2. Terminology for Measuring Performance based on Kinds of Duplicates in Training and Test Sets

Training  | Test set: no dups | Test set: w/ cross-set dups | Test set: w/ all dups
Biased    | Unbiased Test     | Cross-Set Biased            | Fully Biased
Unbiased  | Fully Unbiased    | –                           | –
The Effects of Duplication on Machine Learning Models

So far, we have established that code duplication can — in principle — have adverse effects on the way machine learning models of code are trained and evaluated. But is this actually the case? Analytically measuring the effect of duplication on machine learning models in a generalized way is not possible. This is because machine learning models differ widely in their characteristics and we expect different models and tasks to be affected differently by code duplication. To empirically illustrate the impact of code duplication, we create experimental settings that illuminate separate aspects of the problem. In Section 4.1 and Section 4.2 we focus on code autocompletion through language modeling. This allows us to do an in-depth case study of a single model and a few factors of variation. Then, in Section 4.3, we train state-of-the-art models on other tasks. In all cases, we assume a random 50-10-40 train-validation-test split over the dataset. We use the validation set to evaluate training decisions without exposing the model to the test set — a standard practice in machine learning. For example, in neural networks, where an algorithm iteratively optimizes the model parameters, we pick the parameters of the iteration that achieves the best performance on the validation set. If a model does not use a validation set, we merge the validation samples into the training set.

We note that this section does not attempt to be exhaustive but to replicate some recent work and study the effects of duplication. Our goal is merely to elucidate how these effects manifest for the particular case of machine learning models of source code, to demonstrate that duplication should not be an afterthought when designing and evaluating such models, and to help us distill meaningful best practices.
Terminology
In the absence of existing terms, we introduce a few new terms and annotate them with mnemonic symbols to help the reader. Given a training-test split, and by interpreting Equation 2, we have two possible types of training:
• Unbiased Training: All duplicates are removed ($c_i = 1, \forall i$) and the unbiased loss function $\bar{f}$ is employed during training;
• Biased Training: All in-train duplicates are kept and the biased loss function $\hat{f}$ is used. Since most existing work does not adequately de-duplicate its datasets, it employs biased training.

We now turn our attention to the testing terminology. Within a test set, we distinguish two types of duplicates: the cross-set duplicates and the in-test duplicates (Figure 1). This leads to four types of metrics, summarized in Table 2 and discussed next. The mnemonic symbols can be interpreted as Venn diagrams of the training and test sets: when a set contains duplicates it is shaded (indicating bias on that set), otherwise it is left blank. Finally, we note that when we remove duplicates, we keep exactly one file from each cluster of near-duplicates, such that any duplicate file is used exactly once ($c_i = 1$).

• Fully Unbiased, which represents an “ideal world”, where all duplicates are removed both from the training and test sets and the training and test sets are completely disjoint, allowing us to perform unbiased training and testing.
• Unbiased Test, which represents the performance when the test set contains no duplicates. This is equivalent to the performance observed by a user who is using a machine learning model under the true data distribution, but where the model has been trained in a biased way.
• Cross-Set Biased Test, which is the performance measured when performing biased training and using a test set that retains cross-set duplicates but no in-test duplicates.
• Fully Biased Test, where training and testing happen on the duplicated (original) dataset. This is the metric that is reported by existing work. Compared to the cross-set biased test, this metric is additionally biased by the in-test duplicates. Because this bias is arbitrary, it inhibits us from measuring the exact effect of code duplication. For this reason, we do not report fully biased metrics, but note that empirically they are always very close to the cross-set biased test metrics.

It should be noted that, for estimating the impact of duplication on machine learning models, it is technically incorrect to directly compare the fully unbiased performance with the unbiased test performance. In contrast, comparison between the cross-set biased and unbiased test metrics is technically correct. This is because, when training a model on (slightly) different datasets, there is no method that can distinguish between a model’s capacity to learn from more (but duplicated) data and the effect of duplication. In practice, we observe negligible differences between the fully unbiased and unbiased test metrics, and we report both. The construction of these test sets is sketched below.
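The following Python sketch (hypothetical helper, assuming the near-duplicate clusters of Section 3 are available) constructs the unbiased and cross-set biased test sets from an existing split:

```python
def make_eval_sets(test_files, train_files, clusters):
    """Construct the test sets for the settings of Table 2 (a sketch).

    test_files, train_files: iterables of file ids from the original split.
    clusters: list of sets of file ids, one set per near-duplicate cluster.
    Returns (unbiased_test, cross_set_biased_test).
    """
    train = set(train_files)
    cluster_of = {f: i for i, c in enumerate(clusters) for f in c}
    seen = set()
    unbiased, cross_biased = [], []
    for f in test_files:
        i = cluster_of.get(f)
        if i is None:                  # file has no near-duplicates at all
            unbiased.append(f)
            cross_biased.append(f)
            continue
        if i in seen:                  # in-test duplicate: keep one copy only
            continue
        seen.add(i)
        cross_biased.append(f)         # cross-set duplicates are kept here...
        if not (clusters[i] & train):  # ...but removed from the unbiased set
            unbiased.append(f)
    return unbiased, cross_biased
```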
Biased vs. Unbiased Performance
As we discussed in Section 2, code duplication can result in measuring better performance compared to the one that a user would actually observe, negatively impacting the user’s experience. In this and the next section, we focus on the effects of duplication on a single task, namely code autocompletion with language models. By focusing on a single task and model we can do a deep dive into various aspects of code duplication and illustrate subtle effects. Later, in Section 4.3, we measure the impact of code duplication on other models and on other tasks.
Table 3. Impact of Duplicates on Evaluation Performance on a simple Language Modeling Task on the reshuffled and slightly reduced JavaScript-150k [22] dataset. Δ is the relative difference between the unbiased test and cross-set biased metrics.

Performance Metric | Unbiased Test | Δ      | Fully Unbiased
Acc (%)            | 49.1          | -10.9% | 49.2
Acc-ID (%)         | 8.6           | -51.4% | 8.3
MRR                | 0.674         | -5.1%  | 0.674
MRR-ID             | 0.136         | -39.3% | 0.132
PPL                | 9.4           | +25.3% | 9.4
PPL-ID             | 76.1          | +37.4% | 82.3

Autocompletion via language modeling has been extensively studied both in natural language and in source code. The goal of language models is to capture the statistical characteristics of a language such that the output appears to be “natural”. Language models have been used for autocompletion [14, 15, 20, 24] and it would be unreasonable to assume that the true distribution of this particular use case contains duplicate code.

To demonstrate the effects of code duplication, we employ a simple, yet powerful, neural language model. The goal is to show how even relatively simple models are severely impacted by duplication and to draw observations that generalize to other models. We follow the early work of Bengio et al. [8] for token-level language modeling. Our neural language model (NLM) is described as

$$P(t_i) = \mathrm{softmax}\left( E_o\, \sigma\!\left( W_c \left[ E_i h(t_{i-1}); \ldots; E_i h(t_{i-c}) \right] \right) + b \right) \qquad (3)$$

where $E_o \in \mathbb{R}^{|V| \times K}$ and $E_i \in \mathbb{R}^{D \times |V|}$ are the output and input embedding matrices of tokens, $W_c \in \mathbb{R}^{K \times cD}$ is a matrix, $b$ is a bias vector, and $h(\cdot)$ is a function that takes a token and converts it to a one-hot vector. All parameters are learned. We train our model to minimize the empirical cross-entropy on the training set, and pick the model that achieves the best performance on the validation set. For simplicity, in this work we set $K = D$. Throughout this section, for the vocabulary $V$ we use the top 10k most frequent tokens. All results are averaged across 5 runs on random splits of the data.
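A minimal sketch of such a fixed-context NLM, assuming PyTorch (the dimensionality and context-size values below are illustrative defaults, not the paper’s exact settings):

```python
import torch
import torch.nn as nn

class FixedContextNLM(nn.Module):
    """Bengio-style fixed-context neural language model (Equation 3)."""
    def __init__(self, vocab_size=10000, dim=64, context_size=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)        # E_i (h() is implicit)
        self.hidden = nn.Linear(context_size * dim, dim)  # W_c, with K = D
        self.out = nn.Linear(dim, vocab_size)             # E_o and bias b

    def forward(self, context):
        # context: (batch, context_size) tensor of token ids t_{i-1}..t_{i-c}
        e = self.embed(context).flatten(1)   # concatenate the context embeddings
        h = torch.sigmoid(self.hidden(e))    # the sigma nonlinearity
        return self.out(h)                   # logits; softmax folded into the loss

# Training minimizes the empirical cross-entropy, e.g.:
# loss = nn.functional.cross_entropy(model(context_batch), next_token_batch)
```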
Performance

To accurately measure the impact of duplication, we need to be able to make a fair comparison on the evaluated results. To achieve this, we replicate the conditions of existing work, i.e. we perform biased training of our models. We then compute the unbiased test and cross-set biased performance metrics. Table 3 shows the measured effect of duplication on the reshuffled and slightly smaller JavaScript-150k dataset. Specifically, it highlights the relative % difference (Δ) between the unbiased test and cross-set biased metrics, which directly measures the effect of code duplication on the metrics. We also report the fully unbiased metrics. The metrics computed are (a) the accuracy of correctly predicting the next token (Acc; higher is better), (b) the mean reciprocal rank (MRR; higher is better) over the tokens and (c) the perplexity (PPL; lower is better) assigned by the neural language model. Unknown tokens are counted as incorrect when computing accuracy and MRR. We also compute focused metrics on identifiers (the -ID metrics), since identifiers have been proven to be the hardest tokens to predict [4, 9, 20]. We note that we also computed the fully biased metrics: on average, the NLM’s performance there is similar to the cross-set biased performance. This is expected, since the in-test bias is mostly random.

Based on the results, we notice that all metrics are affected, to different extents, by code duplication. The relative difference Δ ranges from a few percentage points to halved performance. This suggests the seriousness of the code duplication problem. Furthermore, we observe that the identifier-related metrics are those most severely affected by code duplication. This is expected, since code duplication makes identifiers, which would otherwise appear sparsely, appear more frequently and predictably.

Thus, it should be appreciated that not all metrics and tasks are equally affected by code duplication. For example, if an application requires predicting code’s non-identifier tokens (e.g. as in Campbell et al. [10]), duplication would have a much smaller effect compared to an autocompletion application for predicting identifiers.
Duplication has an observable impact on the performance of machine learning models of source code. However, not all models are impacted in the same way. Indeed, some models may be more prone to memorizing code duplicates than others. Since we cannot directly compare the capacity of different models, we perform a case study on the NLM model and illustrate how varying its learning capacity causes the NLM to be affected differently by duplication.

Figure 2 plots the NLM accuracy of predicting tokens (solid lines) or only identifiers (dashed lines). As a proxy for measuring the capacity of the model, we vary the dimensionality D of the vector representations, a common proxy for model capacity in the machine learning literature. Although there are other methods to increase the capacity of the model (e.g. by adding more layers), increasing the dimensionality is a reasonable option for exploring the effect of code duplication.

Figure 2. The impact of code duplication on the NLM with different capacity trained on JavaScript-150k (token accuracy (%) vs. model capacity D). The solid lines show the accuracy of the NLM model when predicting all tokens, whereas the dashed lines show the accuracy of predicting only identifiers. Blue lines indicate the cross-set biased accuracy, and black ones show the unbiased test accuracy. The larger the capacity of the model, the more severe the impact of code duplication (red shaded area).

The shaded (red) area in Figure 2 shows, as expected, that the (negative) effect of duplication increases as model capacity increases. This can be attributed to the fact that additional capacity is used to memorize duplicated code. Therefore, we observe that models that have larger capacity tend to be more heavily affected by code duplication. This suggests an additional and important observation:
Comparison of different models under code duplication may not be indicative of their real performance. This is because some models, having more capacity, can take better “advantage” of code duplication and report improved results only because they are able to better memorize the duplicated cross-set samples.
Previously, we illustrated the impact of code duplication on a relatively simple neural language modeling task where we could control various factors of variation and observe how different aspects of a model are affected by code duplication. Although the reader probably already suspects that code duplication affects many other models, here we select a few state-of-the-art models and tasks to evaluate the impact of code duplication. Again, note that this is not an exhaustive evaluation; it merely indicates how existing methods cope with code duplication on datasets similar (and possibly reshuffled) to the ones used by the authors. Our goal here is to illustrate the adverse effects of duplication across a diverse set of models and tasks where code duplication is not part of the true data distribution. It should be noted that none of the results presented here should be interpreted as negative results for any of the existing methods. Our study merely illustrates how different tasks and state-of-the-art models are also affected by code duplication. For example, the simple neural language model of Section 4.1 still has significantly worse performance compared to PHOG (discussed next), even after removing code duplicates.
Table 4. Impact of Code Duplication on Performance over a Series of Methods/Tasks. Δ refers to the relative % improvement (worsening). Note that some of the evaluated methods are evaluated on different datasets compared to those used in the original works.

Task: Method Naming. Model: code2vec [6]. Dataset: Reshuffled Java-Large [5]
Metric                  | Unbiased Test | Cross-Set Biased | Δ      | Fully Unbiased
F1 (%)                  | 44.71         | 50.98            | -12.3% | 46.04
Precision (%)           | 53.00         | 58.92            | -10.5% | 54.51
Recall (%)              | 38.67         | 44.93            | -13.9% | 39.85

Task: Variable Naming. Model: JsNice [23]. Dataset: Reshuffled & Reduced JavaScript-150k [22]
Accuracy (%)            | 34.44         | 55.04            | -37.4% | 29.41

Task: Code Autocompletion. Model: PHOG [9]. Dataset: Reshuffled & Reduced JavaScript-150k [22]
Accuracy (%) – Types    | 71.80         | 75.69            | -5.1%  | 72.95
Accuracy (%) – Values   | 71.19         | 77.75            | -8.4%  | 71.35
  – Identifiers         | 48.94         | 61.43            | -20.3% | 49.05
  – String Literals     | 25.62         | 43.89            | -41.6% | 24.51

Task: Docstring Prediction. Model: Seq2Seq [7]. Dataset: Python Docstrings v1 [7]
BLEU                    | 12.32         | 13.86            | -11.1% | –
Tasks and Models
We select four reasonably well-known tasks from the literature. Note that we re-split the datasets, randomly assigning each file to a set. This represents cases where a model can be used within projects, which is often a realistic scenario in machine learning-based software engineering tools. Splitting across projects (as in the official Java-Large split) can substantially reduce the impact of code duplication, depending on the characteristics of each dataset.

• Method Naming, the task of predicting the name of a method (function) given the body of the function (i.e. summarization). Here we run the open-source state-of-the-art code2vec model [6] on the Java-Large corpus [5].
• Variable Naming, the task of predicting the names of the variables of a snippet of possibly obfuscated code. Note that we assume that the task is to deobfuscate new, previously unseen code rather than code whose deobfuscated form is known, as discussed in Raychev et al. [23]. We run the state-of-the-art non-neural JsNice model of Raychev et al. [23] on the JavaScript-150k [22] dataset using the author-provided data extraction utility. Note that the split differs from the original one and some of the files are missing, as discussed in Section 3. (This excludes some cases that the JsNice authors have observed in practice when they deployed it as a service. Specifically, in personal correspondence they mentioned to the author that submissions to the JsNice service often contain bundled parts of various projects and libraries. As developers use different versions of common libraries, JsNice needs to train/test on all the versions, not just one. The author of this work agrees with the JsNice authors: indeed, the application of deobfuscating code by matching it to (partially) previously seen code requires training on duplicated data, since the duplicated dataset represents the true data distribution (Section 2) of this partial “soft-matching” use case of JsNice. Thus, this particular use case is one where the true distribution contains duplicates.)
• Code Autocompletion, which is the language modeling task used in the previous section. Instead of using the neural model of Section 4.1, we employ the PHOG model of Bielik et al. [9], another non-neural model. Since the code is not open-source yet, Pavol Bielik kindly helped with training and testing on that model. We provided the split on the reshuffled and slightly reduced JavaScript-150k [22] dataset for this task.
• Documentation Prediction, which is the task of predicting the documentation (e.g. docstring) of a function from its implementation. Here, the most recent approach is that of Barone and Sennrich [7], who use neural machine translation to “translate” code to documentation. Since the authors provided the output of their model, we use it directly to compute the performance, instead of performing our own training.

Additionally, we considered the Variable Misuse task [2], which is the task of predicting which type-correct, in-scope variable to use at a given variable usage location. The only dataset that is available here is that of Allamanis et al. [2]. However, within the variable misuse sites only 0.5% of the datapoints are duplicated, so we do not consider this task.

Note that for all the tasks considered above, it would be unreasonable to assume that the true distribution reflecting the particular use case of each tool contains any duplicates. We train/test all these models with the default parameters as provided by the authors in their open-source releases of their code.
Analysis of Results
Overall, we observe in Table 4 that removing code duplicates noticeably reduces the measured performance of all methods (Δ). Although all metrics worsen, the effect differs. For example, JavaScript-150k and Java-Large have very similar (file-level) duplication, but the impact of duplication on the evaluation metrics of PHOG [9] and code2vec [6] is quite different. This can be attributed to two factors: (a) different models are affected differently (e.g. because of their inductive biases) and (b) different tasks are affected differently by code duplication.

An interesting observation is that training models on a biased dataset almost always results in worse performance compared to training each model in an unbiased fashion (i.e. without duplicates). This may be due to the fact that part of each model’s capacity is spent on learning about duplicates, modeling a different data distribution and thus hindering the performance of the model on the deduplicated test set. Thus, training on a biased dataset usually has negative effects on the model performance observed by end-users. JsNice, a non-neural method, seems to be an exception. This may be attributed to the fact that the reduced size of the deduplicated dataset harms performance more than code duplicates do, under the default hyperparameter values. Finally, as we already observed, different metrics are affected differently. A consistent theme has been that identifier-related metrics (e.g. the identifier accuracy of PHOG and of the NLM) are the most severely impacted. Generalizing this, we can attribute it to the sparsity [1] of some code constructs (e.g. identifier names): rare elements of code are hard to predict, and metrics and methods heavily relying on sparse constructs, such as identifiers, are those most severely affected by code duplication.

In the previous sections, we believe that we were able to document and sufficiently illustrate the negative impact of code duplication on machine learning models of code. We observed that:

• The target application of each machine learning model dictates whether duplicates need to be excluded from the training and testing data.
• Code duplication affects all metrics, and the performance observed by end-users is often significantly worse than the one reported by evaluation metrics.
• Different metrics and applications are affected differently by code duplication.
• Powerful models that have larger capacity are impacted more by code duplication.
• Comparing different models using duplicated code corpora can be unfair to models with smaller capacity.
Best Practices
Through this paper, a set of best practices arises that we recommend to researchers and practitioners:

• Understanding the True Data Distribution of the target use case. Does the distribution over which we expect the tool to be used contain duplicates? If not, then deduplication needs to be performed. If duplicates need to be removed, the granularity of duplicates should be considered. File-level duplication was studied in this work, but other use cases may require more or less fine-grained deduplication.
• Data Collection. Collecting large datasets in batch should be done carefully, and deduplication methods — like the one proposed by Lopes et al. [19] or the one used in this work (available at https://github.com/Microsoft/near-duplicate-code-detector, with an approximate version within the dpu-utils Python package at https://github.com/Microsoft/dpu-utils) — should be used to deduplicate the collected corpus. Simply removing exact matches and forks is a reasonable but clearly insufficient first step. Splitting the dataset across different projects, when possible, usually helps a lot, but duplication often still exists.
• Use of Existing Datasets. This work demonstrates varying levels of duplication for different datasets. However, duplication occurs to some extent in all existing datasets. When using existing datasets, we suggest using the duplication index provided in this work to remove duplicates.
• Model Capacity. Models that have a large capacity to memorize suffer the most from the duplication problem, and special attention should be given when evaluating them. Furthermore, researchers should include naïve memorization methods in their baselines (e.g. k-nearest neighbors). If these baselines perform “too well” compared to other widely used models, this can indicate a duplication issue.

Finally, it should be noted that while removing duplicates is often the easiest option, small variations of (near-)duplicates may still be useful for learning more robust machine learning models. An alternative to discarding duplicates is to down-weight duplicated samples in the loss function and performance metrics, such that each group of duplicated samples has the same weight as a single deduplicated sample, i.e. to transform Equation 2 into

$$\bar{f} = \frac{1}{|X|} \sum_{x_i \in D} \frac{1}{c_i}\, f(x_i). \qquad (4)$$
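A sketch of this down-weighting, assuming PyTorch and that each sample’s duplication count $c_i$ is available (e.g. looked up from the duplicate-cluster sizes); normalizing by the sum of the weights within a batch approximates the $1/|X|$ scaling of Equation 4:

```python
import torch

def deduplication_weighted_loss(per_sample_loss, dup_counts):
    """Down-weight duplicated samples, in the spirit of Equation 4.

    per_sample_loss: 1-D tensor of f(x_i) values for a batch.
    dup_counts: 1-D tensor of c_i, how often each sample occurs in the corpus.
    """
    weights = 1.0 / dup_counts                 # each duplicate cluster gets total weight ~1
    return (weights * per_sample_loss).sum() / weights.sum()
```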
Other Considerations

So far, we have considered the “traditional” option where a fixed dataset is split for training and evaluation purposes. In some cases, temporal data may be available, e.g. the version history of a codebase. Appropriately slicing the dataset through time, training on older code and testing on newer code, should be considered a valid evaluation methodology. Nevertheless, code duplication still
needs to be accounted for. For example, a developer might copy existing code and paste it into a new file, thus “contaminating” the dataset with duplicates.

Similarly, deployment of machine learning models often necessitates that a model is trained on the same codebase as the one it operates on. Although this may sound odd, the deployed machine learning model/tool will only observe previously unseen code and therefore also operates in an unbiased test environment. This emphasizes the divergence between an offline and an online evaluation of some tool. In most cases, we are not able to perform an online evaluation of a model, which would provide the most accurate results. Instead, offline evaluations, common in academia and industry, should strive to replicate the conditions of an online system.

Conclusions

We hope that this paper informs the research community about the negative effects of code duplication on the evaluation of machine learning models and informs practitioners about potential pitfalls when deploying such tools in practice. Removing exact and near duplicates will allow for more accurate comparison of machine learning models and methods and will lead to better machine learning-based tools for programmers.

Despite code duplication’s negative effects, many interesting research opportunities arise. As Kapser and Godfrey [18] observe, code clones are not always bad, as they often give developers additional flexibility over the evolution of a project, and, therefore, methods should embrace them. The work of Hashimoto et al. [12], who combine retrieval methods that find similar snippets within a database of code and then perform edits over those examples, is an interesting example of such a direction.

Additionally, in contrast to most artifacts often studied in machine learning, such as images and text, the independence assumption (i.i.d.) may be too strong: unlike common forms of data, code is created through an evolutionary, incremental process. New software is often created because other code makes the new software possible, and new features often build on functionality that already exists. This evolution-like process of software implies a strong dependence between code that has been written and code that will be written. On one hand, this enables ideas such as big code and naturalness, but at the same time it complicates the evaluation of such ideas, as discussed in this paper. Researching machine learning models and compatible programming language representations that can explicitly take into account the correlations introduced by this evolutionary process may allow for improved tools in this area.

Finally, code duplication is a fact of software engineering life, and interesting questions such as “Can new machine learning tools be created that are robust to code duplication?” and “Can we usefully exploit near-duplicates to produce better software engineering tools?” arise as interesting research problems.
Acknowledgments
The author would like to thank Marc Brockschmidt for useful discussions and for suggesting the mnemonic symbols; Patrick Fernandes for first noticing the severity of the duplication problem and bringing it to the attention of the author; and an anonymous reviewer of some other work of the author who insisted that code duplication is not an important issue in existing datasets. Finally, the author would like to thank Pavol Bielik for running the evaluation on PHOG; Uri Alon for useful discussions on the Java-Large corpus and useful comments on a draft of this work; Charles Sutton and Earl Barr for helpful discussions, suggestions and corrections; and the anonymous reviewers and the SPLASH Onward! PC for helpful comments and suggestions.
References

[1] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51, 4 (2018), 81.
[2] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning to Represent Programs with Graphs. In Proceedings of the International Conference on Learning Representations (ICLR).
[3] Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A convolutional attention network for extreme summarization of source code. In Proceedings of the International Conference on Machine Learning (ICML). 2091–2100.
[4] Miltiadis Allamanis and Charles Sutton. 2013. Mining source code repositories at massive scale using language modeling. In Proceedings of the Working Conference on Mining Software Repositories (MSR). IEEE Press, 207–216.
[5] Uri Alon, Omer Levy, and Eran Yahav. 2019. code2seq: Generating Sequences from Structured Representations of Code. In Proceedings of the International Conference on Learning Representations (ICLR).
[6] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages 3, POPL (2019), 40.
[7] Antonio Valerio Miceli Barone and Rico Sennrich. 2017. A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Vol. 2. 314–319.
[8] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research (JMLR) 3, Feb (2003), 1137–1155.
[9] Pavol Bielik, Veselin Raychev, and Martin Vechev. 2016. PHOG: Probabilistic Model for Code. In Proceedings of the International Conference on Machine Learning (ICML). 2933–2942.
[10] Joshua Charles Campbell, Abram Hindle, and José Nelson Amaral. 2014. Syntax errors just aren’t natural: improving error reporting with language models. In Proceedings of the Working Conference on Mining Software Repositories (MSR). ACM, 252–261.
[11] Mohammad Gharehyazie, Baishakhi Ray, Mehdi Keshani, Masoumeh Soleimani Zavosht, Abbas Heydarnoori, and Vladimir Filkov. 2018. Cross-project code clones in GitHub. Empirical Software Engineering (2018), 1–36.
[12] Tatsunori B. Hashimoto, Kelvin Guu, Yonatan Oren, and Percy Liang. 2018. A Retrieve-and-Edit Framework for Predicting Structured Outputs. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
[13] Vincent J. Hellendoorn, Christian Bird, Earl T. Barr, and Miltiadis Allamanis. 2018. Deep learning type inference. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 152–162.
[14] Vincent J. Hellendoorn and Premkumar Devanbu. 2017. Are deep neural networks the best choice for modeling source code?. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 763–773.
[15] Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 837–847.
[16] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 2073–2083.
[17] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping Language to Code in Programmatic Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 1643–1652.
[18] Cory J. Kapser and Michael W. Godfrey. 2008. “Cloning considered harmful” considered harmful: patterns of cloning in software. Empirical Software Engineering (ESEM) 13, 6 (2008), 645.
[19] Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, and Jan Vitek. 2017. DéjàVu: a map of code duplicates on GitHub. Proceedings of the ACM on Programming Languages 1, OOPSLA (2017), 84.
[20] Chris Maddison and Daniel Tarlow. 2014. Structured generative models of natural source code. In Proceedings of the International Conference on Machine Learning (ICML). 649–657.
[21] Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.
[22] Veselin Raychev, Pavol Bielik, Martin Vechev, and Andreas Krause. 2016. Learning programs from noisy data. In Proceedings of the Symposium on Principles of Programming Languages (POPL), Vol. 51. ACM, 761–774.
[23] Veselin Raychev, Martin Vechev, and Andreas Krause. 2015. Predicting program properties from Big Code. In Proceedings of the Symposium on Principles of Programming Languages (POPL), Vol. 50. ACM, 111–124.
[24] Veselin Raychev, Martin Vechev, and Eran Yahav. 2014. Code completion with statistical language models. In Proceedings of the Symposium on Programming Language Design and Implementation (PLDI), Vol. 49. ACM, 419–428.
[25] Chanchal Kumar Roy and James R. Cordy. 2007. A survey on software clone detection research. Queen’s School of Computing TR (2007).
[26] Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, Chanchal K. Roy, and Cristina V. Lopes. 2016. SourcererCC: scaling code clone detection to big-code. In Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. IEEE, 1157–1168.
[27] Ewan Tempero, Craig Anslow, Jens Dietrich, Ted Han, Jing Li, Markus Lumpe, Hayden Melton, and James Noble. 2010. The Qualitas Corpus: A curated collection of Java code for empirical studies. In Software Engineering Conference (APSEC), 2010 17th Asia Pacific. IEEE, 336–345.
[28] Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4, 2 (2012), 26–31.