DeepClone: Modeling Clones to Generate Code Predictions
Muhammad Hammad, Önder Babur, Hamid Abdul Basit, Mark van den Brand
Muhammad Hammad
Eindhoven University of Technology, Netherlands
Önder Babur
Eindhoven University of Technology, Netherlands
Hamid Abdul Basit
Prince Sultan University, Saudi Arabia
Mark van den Brand
Eindhoven University of Technology, Netherlands
ABSTRACT
During software development, programmers often tend to reuse code for common functionalities available in other source code repositories. This activity helps them to reduce the time and effort of developing code, instead of building it from scratch. Code clones are candidates for reuse in exploratory or rapid development, as they represent often repeated functionality in software systems. To facilitate code clone reuse, we propose a novel approach, DeepClone, where we utilize a deep learning algorithm for modeling code clones and predicting the next possible set of tokens (up to the cloned method body) based on the user input so far. The generated predictions aim to potentially help developers to write code rapidly, with minimum tuning of values later on. DeepClone applies natural language processing techniques to learn from a large code corpus (the BigCloneBench dataset), and generates code tokens (full clone methods where applicable) using the learned model. We have quantitatively evaluated our solution to assess (1) our model's quality and its accuracy in token prediction, and (2) its performance and effectiveness in clone method prediction. With a high quality and accurate model as the foundation, we further discuss scenarios for exploiting our approach.
CCS CONCEPTS
• Computing methodologies → Neural networks; • Software and its engineering → Language features; Reusability; Automatic programming; Software development methods;
KEYWORDS
natural language modeling, deep learning, code clone, code prediction
Many software development and maintenance tasks rely on searching the source code effectively [69]. However, the structure of source code makes it difficult to be read in a linear fashion like normal text, and hinders an effective code search. It is also tedious and impractical to read the entire source code of large software systems. Writing new source code is an expensive activity, consuming considerable time and effort. To overcome this limitation, developers frequently perform ad hoc code reuse [38], which requires selectively reading the source code to identify the relevant parts of the code for reuse [75]. Often, programming of well-defined problems amounts to a simple look-up [75], first in one's own, and then in others' code repositories, followed by judicious copying and pasting. The bigger the dataset of code available, the more likely it is that one will find what one is looking for [25]. For this purpose, developers require features such as code snippet search, code prediction, code auto-completion and code generation, which can help them to write code quickly and easily. These problems have usually been solved by language modeling.

Claude Shannon is considered to be a founder of language modeling [72, 73]. Shannon used this technique to predict the next element following some given text and to bound the entropy of the English language. Since then, several language models have been developed to perform different tasks. A language model (LM) estimates the likelihood of sequences of tokens based on a training dataset, by assigning probabilities to tokens (words, subwords, or punctuation marks) or character sequences (sentences or words occurring after a given sequence [39]). Different techniques, such as statistical ones and deep neural networks (DNN), have been applied for LMs, including n-gram, graph-based, and context-sensitive models [6, 8, 56]. Both kinds of techniques have led to great results on natural language processing (NLP) tasks, mainly because in practice natural language is often repetitive and predictable [31], and thus can be modeled using either. However, statistical modeling techniques do not perform well when the size of the vocabulary is very large. During the software development life-cycle, developers declare new identifier names at a far higher rate, which degrades the performance of statistical language models [41].

Deep neural network (DNN) techniques, on the other hand, are extremely powerful machine learning models that achieve excellent performance on various difficult problems such as speech recognition [21] and visual object recognition [44]. A recent study [41] shows that DNN techniques indeed outperform statistical language modeling techniques. Their power arises from the fact that they can perform arbitrary parallel computation for a modest number of steps. A surprising example of the power of DNNs is their ability to sort N N-bit numbers using only 2 hidden layers of quadratic size [66]. Various DNN techniques have been applied on source code to perform language modeling [14], by learning the features from a large source code dataset. DNN techniques involve an algorithm that improves automatically through experience based on data. It is a way to discover a new algorithm from experience, and involves the study of algorithms that can extract information automatically. These LMs are referred to as neural language models (NLM).
NLMs have been used to perform various tasks such as speech recognition [20], machine translation [36], comment generation [34], fault detection [64], code completion [54, 57], code clone detection [81, 90], code search [28] and code summarization [35].

One such application of a language model is code prediction, which has received a good deal of attention from software engineering researchers [15, 16, 92]. It involves automated techniques for aiding software development and maintenance by proposing the next likely tokens on the basis of user inputs. A prediction model is capable of automatically learning features for representing source code, and of using them for next token prediction in a sequence. Neural language models (NLMs) represent code tokens in a continuous vector space, which has attractive properties. Pythia [82] is an AI-assisted code completion approach, which generates ranked lists of method and API recommendations at edit time. Similarly, Deep TabNine [2] has been launched recently, which is an auto-completion tool trained on approximately two million GitHub files that intends to enhance software developers' workflows. These models can be refined by tuning several parameters such as layering, epoch value, batch size and the size of token sequences. Hence, given these benefits of deep learning, we have used it to build a prediction model. Neural language models can outperform carefully tuned n-gram models when modeling natural language [37]. Researchers have applied various types of deep neural networks (DNN) for language modeling, such as the Recurrent Neural Network (RNN) [53, 94], its variants such as Long Short Term Memory (LSTM) [32, 42, 77], and the Transformer-based Generative Pre-Training-2 (GPT-2) [62], a successor of GPT [60], for performing code predictions.

These code predictions are at the level of a fixed threshold token length. We believe that method level clones of arbitrary length, extracted from a large code repository, can enhance regular code generation and prediction applications. Code clones are repeated patterns in the code, which are usually created with the copy-and-paste practice mentioned above. According to Roy and Cordy [68], around 5 to 50% of the code in software applications can be enclosed in clones, which can be of several granularity levels such as line, method, file, and directory. Most of the time clones are considered to be harmful for a software system, and researchers mainly work on techniques for avoiding and eliminating clones [10, 11, 88, 93]. However, Kapser and Godfrey [40] observe that code clones are not always bad, as they can also have positive effects on software development in certain circumstances. One of the positive use cases is exploratory development, where the rapid development of a feature is required and the remedial unification of a newly generated clone is not clearly justified. Also, a piece of cloned code is expected to be more stable and poses less risk than new development. Hence, it can be argued that developers need to reuse code for desired software features in a way that supports opportunistic programming for increased productivity.

We believe that clone methods can be considered as a useful component for a language model, as they can be used to capture the common code practices of developers, which can then be used for code predictions and completions for new developers.
In this work, we exploit the reusability aspect of code clones for the purpose of source code modeling, predicting code tokens by applying a deep learning algorithm. We believe that our approach can help in improving the quality of code prediction up to full clone method bodies of arbitrary length. In this paper, we have made the following contributions:

(1) We present the first attempt in the literature at explicitly modeling code clones using deep learning, to be utilized for code/clone method prediction.
(2) Our approach can generate predictions up to the full cloned method body (with arbitrary length) on the basis of user input.
(3) We have quantitatively evaluated our approach using the BigCloneBench dataset, in terms of model quality and performance in various tasks including token prediction and clone method prediction.

In this section, we present related work covering source code language modeling techniques, and the role of learning based techniques in the field of code clones.
To the best of our knowledge, no techniques have been presented to model code clones for predicting clone methods. However, many techniques have been introduced to perform language modeling for various other tasks such as token prediction and completion. White et al. [91] apply a Recurrent Neural Network (RNN) to model Java source code and evaluate its performance on the code prediction task. Boldt [16] specifically models Java method statements and compares the performance with English language datasets trained by Long Short Term Memory (LSTM). He argues that method statements closely resemble English language sentences and can be compared with each other. He compares the performance of models on predicting the next element and demonstrates that an LSTM trained on the Java dataset achieves higher performance than one trained on the English dataset. Hindle et al. [31] make a fair comparison between software and natural language by discovering that software is much more repetitive and well structured than natural language, so it is much simpler to model Java code using n-grams than the English language. They compare the performance of language models on the next element prediction task and demonstrate that n-gram models trained on a Java dataset perform much better than n-gram models trained on an English language dataset. Hellendoorn and Devanbu [30] notice that source code based neural language models (NLMs) under-perform due to the unlimited scope of the vocabulary, as in the software development life cycle new identifiers keep appearing at high rates, and limiting the vocabulary is not a good solution for NLMs. So, they propose a nested scope, dynamically updatable, unlimited vocabulary count-based n-gram model, which outperforms the LSTM model on the task of token prediction. In contrast, Karampatsis et al. [41] solve the issue of unlimited vocabulary size by applying the byte-pair encoding (BPE) technique in modeling the code. They compare the performance of n-gram and Gated Recurrent Unit (GRU) language models trained on source code datasets, and demonstrate that NLMs can outperform on code completion and bug detection tasks if the BPE technique is applied.
The only work we have come across that uses clone methods for recommending code completion is by Abid [3]. However, the clone methods considered in that system are based on a different notion of similarity: similar API calls.

Learning based approaches have been extensively used for clone detection. White et al. [90] use a recursive neural network (RNN) for clone detection. Wei et al. [89] use the LSTM model for the functional clone detection problem by learning supervised deep features. CCLearner [48] extracts tokens from known method-level code clones and non-clones in a given codebase to train a classifier, which is then used to detect clones. Tufano et al. [85] use a deep learning based approach to automatically learn code similarities from different representations. Vara et al. [7] propose a method to increase the precision of code clone detection using machine learning techniques. They apply 19 clone class metrics to capture different characteristics of code clones and use them to train a decision tree model.

In the literature, there exist a few code clone benchmark datasets, such as the OCD benchmark [63], Bellon's benchmark [13], Murakami et al.'s benchmark [55] and BigCloneBench [78-80]. The largest one is BigCloneBench, which consists of over 8 million manually validated clone method pairs in IJaDataset 2.0 [1], a large Java repository of 2.3 million source files (365 MLOC) from 25,000 open-source projects. BigCloneBench contains clones with both syntactic and semantic similarities. It contains the references to the starting and ending lines of method clones existing in the code repository. In forming this benchmark, methods that potentially implement a given common functionality were identified using pattern based heuristics. These methods were manually tagged as true or false positives of the target functionality by judges. All true positives of a functionality were grouped as a clone class, where a clone class of size n contains n(n-1)/2 clone pairs. The clone types and similarity of these clone pairs were later identified in a post-processing step. Currently, BigCloneBench contains clones corresponding to 43 distinct functionalities. Further details can be found in the relevant publications [78-80]. BigCloneBench has been primarily developed to measure and compare the recall of clone detection tools [70, 78-80]. However, it can also be used for other clone and software studies [80]. Li et al. [48] have developed a DNN-based clone detector, CCLearner, using BigCloneBench.

IJaDataset [1] is very large, and outside the scalability limits of most clone detection tools. However, the clone detection tools do not need to be executed on the entire IJaDataset, only on the files containing reference clones in BigCloneBench. Svajlenko et al. [78-80] provide a reduced version of IJaDataset, which contains only the relevant source files and is distributed into a number of smaller subsets for clone detection. There is one subset per functionality in BigCloneBench. Each functionality's subset includes all the files which contain methods tagged as true or false positives of that functionality in the creation of BigCloneBench. Therefore each subset is a realistic subject system, containing both true and false positive clones.

We have performed pre-processing steps to build our mutually exclusive training, testing, and validation datasets. The training set is used to train our DeepClone language model.
After each training epoch, the trained model is evaluated on the validation set, and its performance helps in assessing the convergence against hyper-parameters (e.g. the learning rate in gradient searches). The validation set is not used to learn any of the model's parameters. The testing set is used for the empirical evaluation of our DeepClone model. Table 1 demonstrates the pre-processing steps on an example of a binary search clone method.
We have applied the following query to extract true positive clone methods from the BigCloneBench dataset and their references in IJaDataset:

select distinct a.functionality_id, b.type, b.name, b.startline, b.endline
from clones a join functions b on a.function_id_one = b.id
union
select distinct a.functionality_id, b.type, b.name, b.startline, b.endline
from clones a join functions b on a.function_id_two = b.id
In the above query, we have applied the union operation to discard duplicated results. The "functions" table contains information about true and false positive clone methods, including the filename, the starting and ending line positions of the clone method, and the type id of the method, whereas the "clones" table contains the list of true positive clone method pairs, including syntactic similarity and validity measures. The result allows us to include all the files which have true positive clone methods, and to discard those files which have only false positive clone methods from the reduced version of IJaDataset.
In this step, we need to distribute the set of files into training, validation, and testing datasets. We have adopted a strategy of stratified sampling [83] in order to ensure that all types of clone methods appear in the training, validation and testing datasets. We distribute the set of files existing in each functionality folder into portions of 80% training, 10% validation, and 10% testing. Then, we copy those files into three separate folders: training, validation, and testing. If a file already exists in one of those folders, we discard that specific file, thereby avoiding exact duplication; this means that the file has already been placed in one of the splits (training, validation, or testing). Allamanis [5] also notices that model performance is negatively impacted if the same file is used for both training and testing. Tables 2 and 3 depict a detailed statistical overview of the datasets which we have used for training, validation and testing. We have only mentioned the titles of the functionalities in Table 2, and excluded further details; those are out of scope for this paper and can be found in the BigCloneBench dataset. A sketch of this splitting step is shown below.
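The following Python sketch illustrates the stratified 80/10/10 split and the duplicate-file check described above; the folder names and the random seed are illustrative assumptions, not the exact setup of our experiments.

```python
import os
import random
import shutil

random.seed(0)                      # illustrative; any fixed seed works
SRC = "ijadataset_reduced"          # assumed: one sub-folder per BigCloneBench functionality
seen = set()                        # files already copied, to avoid exact duplicates

for functionality in sorted(os.listdir(SRC)):
    files = sorted(os.listdir(os.path.join(SRC, functionality)))
    random.shuffle(files)
    n = len(files)
    splits = {"training": files[:int(0.8 * n)],
              "validation": files[int(0.8 * n):int(0.9 * n)],
              "testing": files[int(0.9 * n):]}
    for split, split_files in splits.items():
        os.makedirs(split, exist_ok=True)
        for name in split_files:
            if name in seen:        # already placed in some split: discard
                continue
            seen.add(name)
            shutil.copy(os.path.join(SRC, functionality, name),
                        os.path.join(split, name))
```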
Researchers in the past have used special meta-tokens to solve various problems. Pichotta and Mooney [59] place ⟨S⟩ and ⟨/S⟩ meta-tokens in modeling sentences for prediction. Chen et al. [18] insert ⟨START BUG⟩ and ⟨END BUG⟩ meta-tokens in the buggy lines of the source code, which helps in automatically repairing programs. We have marked the regions of the true positive clone methods by placing the meta-token ⟨soc⟩ at the start and ⟨eoc⟩ at the end of a clone method in the IJaDataset files, by tracing the clone method references from the BigCloneBench dataset.

We have adopted the javalang Python library (https://github.com/c2nes/javalang), which contains a lexer and parser for the Java 8 programming language. This library helps us to normalize our code by removing whitespace, extra lines and comments, as well as to tokenize the code.
For each set of files, we have replaced integer, float, binary, and hexadecimal constant values with the ⟨num_val⟩ meta-token. Similarly, we have replaced string and character values with ⟨str_val⟩. This reduces our vocabulary size, which leads to faster training of the model. This technique has been used by several researchers in the same manner for data preparation [22, 41, 91]. The sketch below illustrates these normalization, tokenization and replacement steps.
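Below is a minimal sketch of the normalization, tokenization and literal-replacement steps using the javalang tokenizer; the exact spelling of the meta-tokens and the way the clone markers are attached are illustrative assumptions.

```python
import javalang

def normalize_and_tokenize(java_code):
    """Tokenize Java code with javalang (whitespace and comments are dropped by
    the lexer) and replace literal values with meta-tokens, as described above."""
    tokens = []
    for tok in javalang.tokenizer.tokenize(java_code):
        if isinstance(tok, (javalang.tokenizer.Integer,
                            javalang.tokenizer.FloatingPoint)):
            tokens.append("⟨num_val⟩")     # numeric literals (decimal, hex, binary, float)
        elif isinstance(tok, (javalang.tokenizer.String,
                              javalang.tokenizer.Character)):
            tokens.append("⟨str_val⟩")     # string and character literals
        else:
            tokens.append(tok.value)
    return tokens

# The ⟨soc⟩/⟨eoc⟩ markers are then placed around the token sequence of each
# true positive clone method, traced via the BigCloneBench line references.
print(" ".join(normalize_and_tokenize('int x = 42; String s = "hi";')))
# -> int x = ⟨num_val⟩ ; String s = ⟨str_val⟩ ;
```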
We have merged all the tokenized data existing in the training, validation and testing folders, and placed them into separate text files, i.e. train.txt, valid.txt, and test.txt. These tokens are separated by the space character. Table 3 gives a statistical overview of our experimental dataset.
In this section, we discuss our approach in modeling code clones using GPT-2 and the experimental setup used to conduct our study.
OpenAI developed a large-scale unsupervised language model called GPT-2 (Generative Pretrained Transformer 2) [60-62] to generate several sound sentences of realistic text by extending any given seed. GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. It is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data. This model has been trained on a large corpus of web pages and social media. We focus on fine-tuning a pre-trained GPT-2 transformer model [60-62] for generating code clones, even though it has been trained on the English language. We apply fine-tuning of the pre-trained model on IJaDataset (a Java language dataset), as there exists a large amount of overlapping vocabulary with the English language. Secondly, GPT-2 is claimed to be so powerful that the threat of its harmful use is high. Moreover, the GPT-2 transformer has demonstrated the impressive effectiveness of pre-trained language models on various tasks, particularly sound text generation. This architecture was first developed to solve problems in natural language processing. GPT-2 has a built-in Byte Pair Encoding (BPE) tokenizer.

During software development, the number of unique identifiers increases with the size of the codebase [6]. This problem makes it infeasible to train LMs on large corpora. BPE is an algorithm originally designed for data compression, in which bytes that are not used in the data replace the most frequently occurring byte pairs or sequences [26]. BPE provides several benefits [41]. First, no token is considered out-of-vocabulary: unknown tokens at test time are represented by subsequences. Second, it dynamically adapts to the frequency of the sequences by merging common subsequences and keeping less frequent tokens as is. Common tokens will be represented by a single token, while infrequent ones will be split into prefixes, suffixes and roots. This ensures that each sequence is common enough to have valuable embeddings. Finally, it allows for very precise control of the vocabulary size, by tweaking the number of merges performed. A larger vocabulary will contain more complete tokens and fewer sequences, whereas a smaller one will contain longer sequences. It has been reported in the literature that neural language modeling (NLM) along with BPE outperforms several traditional approaches, e.g. using n-grams [30]. The snippet below illustrates this subword behaviour.
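As a small illustration of GPT-2's built-in BPE tokenizer on code-like input (using the HuggingFace GPT2TokenizerFast), an identifier unseen during pre-training is simply split into known subwords, so no token is out-of-vocabulary; the exact subword pieces shown in the comment depend on the learned merges and are indicative only.

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(tokenizer.tokenize("public static int binarySearchRecursive ( int [ ] arr )"))
# splits roughly like: ['public', 'Ġstatic', 'Ġint', 'Ġbinary', 'Search', ...]
# (the 'Ġ' prefix marks a token preceded by a space)
```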
GPT-2 further performs well simultaneously on a variety of language tasks including question answering, reading comprehension, summarization, and translation [62]. We further notice that, in general, better pre-trained models lead to better performance on fine-tuned or transfer tasks [58]. Fine-tuning is one approach to transfer learning, in which the feature weights of an already trained model are adjusted according to the new dataset. Previously, the GPT-2 transformer has been successfully fine-tuned on different types of datasets. Kim et al. [43] have used the GPT-2 transformer for code prediction by revealing the syntactic structure of code to the network. Ziegler et al. [95] have applied a reinforcement learning method on the 774M GPT-2 model to support human-preferred text more often. Lee et al. [46] fine-tune the 345M (medium) version of the GPT-2 pre-trained model for patent claim generation, providing various experimental results for qualitative analysis and future research. Deep TabNine [2], a software programming productivity tool to predict the next chunk of code, has been successfully fine-tuned using GPT-2 on approximately two million GitHub files capturing numerous programming languages. DeepClone is initially inspired by Deep TabNine, and we have fine-tuned only on those files of IJaDataset which contain true positive clone methods from the BigCloneBench dataset.

Table 1: An example of the preprocessing steps applied on a clone method body (columns: Original; Marking; Normalization and Tokenization; Replacement)

We selected the small version (117M) of the GPT-2 pre-trained model as our base model, as it does not take too much time and resources to fine-tune; the other versions are considerably bigger in size, and the small version is sufficient to prove our approach. The 117M pre-trained model has a vocabulary size of 50,257, 117M parameters, 12 layers, 768 hidden units, and 12 heads. We have fine-tuned our DeepClone model on a partition of a GPU-1080Ti cluster (276 CPU cores, 329,728 CUDA cores, 5.9 TB memory) at SURFsara (https://userinfo.surfsara.nl/), for approximately 9 hours, using the HuggingFace Transformers library. In our experiment, we have performed training and evaluation with a batch size per GPU of 1 for 5 epochs. We have used a learning rate of 5e-5 and gradient accumulation steps (the number of update steps to accumulate before performing a backward/update pass) of 5. Default values have been used for the other hyper-parameters, as mentioned in the language modeling example code (https://github.com/huggingface/transformers/blob/master/examples/language-modeling). A sketch of this setup is given below.
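A minimal sketch of this fine-tuning setup with the HuggingFace Trainer API is given below; our experiments used the library's language-modeling example script, so the dataset wrapper, file names and output path here are illustrative and may differ across library versions.

```python
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # 117M "small" GPT-2
model = GPT2LMHeadModel.from_pretrained("gpt2")

train_set = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=1024)
valid_set = TextDataset(tokenizer=tokenizer, file_path="valid.txt", block_size=1024)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="deepclone",               # illustrative checkpoint directory
    num_train_epochs=5,                   # 5 epochs
    per_device_train_batch_size=1,        # batch size 1 per GPU
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=5,        # accumulate 5 steps before each update
    learning_rate=5e-5,
    logging_steps=500,                    # evaluate/log every 500 steps
    save_steps=500,
)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=train_set, eval_dataset=valid_set).train()
```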
Figure 1: Training graphs: (a) perplexity over the validation dataset, (b) training average losses, (c) convergence of the learning rate
In order to determine the performance of our approach, we have performed both intrinsic and extrinsic evaluations.
Table 2: Detailed statistics of the training, validation, and testing datasets along with experimental results (columns: Functionality Id and Name; Files and Clone Methods, each split into Training, Validation, and Testing; Syntactic Similarity µ and σ; PPL)

Table 3: Final distribution of the BigCloneBench dataset (Files, Clone Methods, and Tokens for the Training, Validation, and Testing splits, with Totals)

Figure 2: DeepClone training process

Intrinsic evaluation refers to the evaluation of a model by measuring its quality. For this purpose we have used perplexity (as done in [16, 22, 91, 94]), which is an exponentiation of cross-entropy (as used in [30, 41]). Perplexity is a measurement of how well a given language model predicts sample data. It estimates the average number of code tokens to select from at each point in a sequence [6, 52]. It is a natural evaluation metric for language models, which represent a probability distribution over a subsequence or an entire dataset:

P(L) = \exp\left(-\frac{1}{M}\sum_{i=1}^{M}\log P(t_i \mid t_1 : t_{i-1})\right)   (1)

The formula for perplexity is presented in Equation 1. P(t_i | t_1 : t_{i-1}) is the conditional probability assigned by the model to the token t_i at index i. By applying the log of the conditional probability, the cross-entropy loss is calculated. M refers to the length of the token sequence. Hence, perplexity is an exponentiation of the average cross-entropy loss over the tokens [1, M]. A sketch of this computation in code is given below.
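The following sketch computes Equation 1 from the model's cross-entropy loss, assuming a fine-tuned checkpoint saved under an illustrative path "deepclone"; long inputs would additionally need to be chunked to the model's 1024-token context window.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("deepclone")   # assumed checkpoint path
model = GPT2LMHeadModel.from_pretrained("deepclone").eval()

def perplexity(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # passing labels=ids makes the model return the average cross-entropy loss
        # of predicting each token from the tokens preceding it
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())          # perplexity = exp(average cross-entropy)
```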
Training Evaluation. At each checkpoint (every 500 logging steps) of the training, we have evaluated the DeepClone model performance by calculating the perplexity on the validation set. Figure 1a describes the variations in perplexity on the validation set after each checkpoint. We observe that we achieve the lowest perplexity P1 (2.145) at step 24500. Figure 1c displays the convergence of the learning rate after each checkpoint. The learning rate helps in determining how quickly or slowly a neural network model learns a problem, by adjusting the weights of the network with respect to the value of the loss function, whereas the loss function calculates the model error, which identifies how well a model predicts the expected outcome for any data point in the training set. GPT-2 uses the cross-entropy loss function, which measures the performance of a language model whose output is a probability value between 0 and 1. Figure 1b displays the convergence of training losses after each checkpoint of 500 steps, which indicates how well the model behaves after each checkpoint of optimization. The loss value is finally minimized to 0.75 at step 24500, which is a sign of a well optimized deep learning model. Figure 2 describes the training process of the DeepClone model, which follows the steps described in Section 3 to perform the fine-tuning of our model. We have published our training results online (https://tensorboard.dev/experiment/tk1XqDi8RMqtrMjmVyQ9Sg).

The DeepClone model has been successfully fine-tuned on the BigCloneBench dataset by using a powerful GPT-2 based pre-trained model.
Model Evaluation.
We computed an overall perplexity of 2.146 on the testing dataset for our DeepClone model, as denoted by P2 in Table 4. We have achieved a much better perplexity compared to previous source code LMs [16, 22, 91]. These models, though, use different corpora and deep learning architectures than ours and hence are not directly comparable. We believe our better performance can be attributed to the fact that we model a more repetitious body of code (i.e. clones) to begin with, as well as to the fact that we use a pre-trained model based on the powerful GPT-2 transformer.

Besides the overall perplexity on the testing dataset as previously elaborated, we also calculate the perplexity on the testing dataset without the clone method markers (i.e. ⟨soc⟩ and ⟨eoc⟩). Our motivation for this additional measurement is as follows. Hindle et al. [31] observe that, due to the repetitive nature of code, there exist predictable statistical properties which n-gram language models can capture and leverage for software engineering tasks. The sign of a good model is that it can capture the patterns in the dataset very well, which is particularly important for the task of clone method prediction. In Table 4, we can see an increase when comparing the original perplexity score of 2.146 (P2) with the perplexity on the testing dataset without clone markers of 2.182 (P3). This means that our DeepClone model performs better when the code has clone methods marked with ⟨soc⟩ and ⟨eoc⟩ tokens.

The prediction capability of DeepClone is better on code which has marked clone methods.

Evaluation per Clone Method.
In order to determine which clone method snippets are more predictable than others, we calculated an average perplexity score (PPL) for each type of clone method snippet (see Table 2). We first extracted the code snippets for each type of clone method from our testing dataset, and calculated their perplexity scores. The scores, as depicted in Table 2, indicate how well these can be predicted by our DeepClone model. We also analyze several factors which can affect the perplexity of clone methods. BigCloneBench contains scores of syntactic similarity between clone method pairs on the basis of tokens, which have been calculated by using a line-based metric after full normalization (e.g. removing comments). We have calculated the mean (µ) and variance (σ) to determine the overall syntactic similarity of all the clone methods per type of functionality, as mentioned in Table 2. We observe that the perplexity scores vary according to the syntactic similarity between clone methods, as well as the number of clone method snippets in the training set. From the results, we can see that the "Test palindrome" type of clone method (number 44), which is used to test if a string is a palindrome, has the lowest perplexity score. This indicates that our DeepClone model can predict that clone method snippet very well. We can attribute this to the fact that those clone method snippets have a quite high mean syntactic similarity (0.903), along with a very low variance (0.040). Similarly, 133 clone method snippets of this type have been used in training, which is a reasonably high number.

Table 4: DeepClone evaluation results on the testing dataset (Perplexity: P1, P2, P3, P4, P5; Accuracy: MRR, Top 1, Top 3, Top 5, Top 10)

Non-Clone Method vs Clone Method.
Allamanis [5] notices that a language model achieves low perplexity scores for duplicated code, and high perplexity scores for less duplicated code. In order to observe this difference, we calculated the perplexity scores for all the clone method snippets and non-clone method snippets in the testing dataset. We extracted clone method snippets by tracing the tokens which come inclusively between the ⟨soc⟩ and ⟨eoc⟩ tokens; all other snippets were considered to be part of non-clone method snippets (see the sketch below). We then calculated the perplexity for each snippet. Finally, we took the average of the perplexities for both types of code snippets. P4 represents the average perplexity score for the clone method snippets, and P5 represents the average perplexity of the non-clone method snippets. The result (Table 4) shows that DeepClone predicts clone method snippets much better than non-clone method snippets in general.

DeepClone predicts clone method snippets more accurately than non-clone ones.
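A simple sketch of how a token stream can be separated into clone and non-clone snippets by tracing the ⟨soc⟩/⟨eoc⟩ markers; the function name and the handling of unmatched markers are illustrative.

```python
def split_clone_snippets(tokens):
    """Separate a token stream into clone method snippets (between ⟨soc⟩ and ⟨eoc⟩,
    inclusive) and non-clone snippets, as used for the P4/P5 measurements."""
    clones, others, current, in_clone = [], [], [], False
    for tok in tokens:
        if tok == "⟨soc⟩":
            if current:                     # flush any pending non-clone snippet
                others.append(current)
            current, in_clone = [tok], True
        elif tok == "⟨eoc⟩" and in_clone:
            current.append(tok)
            clones.append(current)          # a complete clone method snippet
            current, in_clone = [], False
        else:
            current.append(tok)
    if current:                             # trailing snippet, if any
        (clones if in_clone else others).append(current)
    return clones, others
```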
Extrinsic evaluation refers to the evaluation of a model's performance on specific tasks. We evaluated DeepClone on the tasks of token prediction and clone method prediction.

Token Prediction.
We collect the top 10 predictions from our DeepClone model, and compute the top-k accuracy (the fraction of times the correct prediction appears in the top k predictions) for k ∈ [1, 10]. Moreover, we measure the Mean Reciprocal Rank (MRR) score of our language model. A simplified description of MRR is that it averages top-k prediction accuracy across various k; in this specific scenario k ∈ [1, 10], since the model outputs a list of the top-10 best tokens. MRR is a rank-based evaluation metric which produces a value between 0 and 1, where a value closer to 1 indicates a better source code prediction model. MRR has previously been used for evaluating code prediction by many researchers [30, 41, 65, 84]. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer, while MRR is the average of the reciprocal ranks over a sample of queries C, defined as in Equation 2:

MRR(C) = \frac{1}{|C|}\sum_{n=1}^{|C|}\frac{1}{y_n}   (2)

where y_n refers to the rank of the first relevant prediction for the n-th query, and MRR(C) is the average over all code sequences C in the testing dataset.

Table 4 shows the top-k accuracies as well as the MRR score. Clearly, the results suggest that the proposed DeepClone model is able to accurately model pre-processed Java source code containing clone methods. The table also indicates that there is an almost 77.808% chance of getting the correct token in the first option, and a 94.999% chance of having the correct output in the top-10 predicted outcomes. To further quantify the accuracy of the DeepClone model for the token prediction task, we report an MRR score of 83%, which indicates an excellent performance in evaluating a ranked list of predictions. A sketch of these computations is given below.

DeepClone produces highly accurate results on the token prediction task.
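The sketch below computes the top-k accuracies and the MRR of Equation 2 from ranked prediction lists; following common practice, a ground-truth token missing from the top-10 list contributes a reciprocal rank of 0.

```python
def topk_accuracy_and_mrr(ranked_predictions, ground_truth, k_max=10):
    """ranked_predictions: list of top-k_max candidate tokens per position;
    ground_truth: the actual next token at each position."""
    hits = [0] * k_max
    reciprocal_ranks = []
    for candidates, truth in zip(ranked_predictions, ground_truth):
        rank = candidates.index(truth) + 1 if truth in candidates else None
        if rank is not None:
            for k in range(rank - 1, k_max):   # a hit at rank r counts for every k >= r
                hits[k] += 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)       # miss beyond top-k_max
    n = len(ground_truth)
    topk = {k + 1: hits[k] / n for k in range(k_max)}
    mrr = sum(reciprocal_ranks) / n
    return topk, mrr
```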
Clone Method Prediction. We further measure the effectiveness of our DeepClone model in predicting code tokens for the clone method prediction task. This demonstrates our model's effectiveness in easing the developer's work by generating clone method snippets. In order to predict a chunk of subsequent tokens from a clone method based on the user input, there exist several text generation methods such as beam search [86], sampling with temperature [4, 24], top-k sampling [23] and nucleus sampling [33]. All these methods have a specific decoding strategy to shape the probability distribution of the language model, such as assigning higher probability to higher-quality text. Among these text generation methodologies, we choose nucleus sampling, as it outperforms the other methodologies [33]. Nucleus sampling is claimed to be the best strategy for generating long-form high-quality text, comparable to human-written text. With a well fine-tuned GPT-2 based DeepClone model, along with nucleus sampling, we can expect an informative set of code tokens for clone method predictions. For this purpose, we extracted subsequences of 20 tokens from the testing dataset, and moved the window one step ahead to obtain further subsequences. Among those, we randomly selected 1,144 subsequences containing 22,880 tokens, such that the ⟨soc⟩ token, which indicates the start of a clone method, is part of each subsequence. We passed these subsequences one by one to our DeepClone model, and kept predicting new tokens with nucleus sampling (threshold value 0.9) until the meta-token ⟨eoc⟩ (i.e. end of clone) appeared (a sketch is given below). With this experiment, we successfully generated 151,348 tokens associated with clone methods. Given the 1,144 cases, this amounts to an average of ∼132 tokens per predicted clone method.
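A sketch of this generation loop with nucleus sampling (top_p = 0.9) via the HuggingFace generate API, assuming a fine-tuned checkpoint at an illustrative path "deepclone" and truncating at the first generated ⟨eoc⟩ marker.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("deepclone")   # assumed checkpoint path
model = GPT2LMHeadModel.from_pretrained("deepclone").eval()

def predict_clone_method(context, max_new_tokens=400):
    ids = tokenizer(context, return_tensors="pt").input_ids
    output = model.generate(ids, do_sample=True, top_p=0.9, top_k=0,
                            max_length=ids.shape[1] + max_new_tokens,
                            pad_token_id=tokenizer.eos_token_id)
    text = tokenizer.decode(output[0][ids.shape[1]:])
    # keep everything up to and including the end-of-clone marker, if generated
    return text.split("⟨eoc⟩")[0] + "⟨eoc⟩" if "⟨eoc⟩" in text else text
```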
The approach we have proposed leads to promising results. The metrics in the training phase (learning rate approaching 0, minimized loss) and in the validation phase (perplexity of 2.145) all indicate a fine-tuned model. The series of perplexity scores we calculated allows us to conclude that our DeepClone model can successfully predict regularities in terms of clone markers, including the code in general and the individual clone snippets in particular. The extrinsic evaluation reveals that we achieve quite a high accuracy, notably 95% in the top 10 suggestions, as well as a larger number of predicted tokens than a threshold-based strategy, even with a generous threshold of 100. With a high quality and accurate model as the foundation, we next discuss the potential use cases for exploiting our model, as well as the limitations of our work.

The DeepClone model can be utilized to assist developers in various use cases. Some of these have already been mentioned above: predicting the next token (as typically done by many LMs), and predicting the whole clone method body. The latter, while seemingly straightforward, can be enhanced with a more elaborate ranking and retrieval mechanism rather than simply generating the most likely sequence of tokens one after another. For that purpose, the additional information in the BigCloneBench dataset, including the exact clone method clusters (hence various methods representing the same functionality), clone types and so on, can be exploited. Another use case might involve clone refactoring (and avoidance), by recommending a clone method call instead of predicting a complete clone method snippet. In combination with some additional effort of transforming the clone methods into reusable assets (e.g. in the form of libraries), the prediction could be tweaked to avoid cloning in the first place and generate method calls instead. Finally, the model can be used to perform code search for common functionality.
Our work has certain limitations. The proposed approach is a first step, which raises the granularity level of code prediction to methods. However, we cannot expect exactly the same clone method to be predicted or completed as trained by our DeepClone model. In prediction tasks, generating well-formed outputs is challenging, which is well known in natural language generation [47, 74]. However, the desired output might be a variation of another, previously observed sample [27, 29, 45, 49, 76]. This is because of the nature of language modeling: a language model is a probabilistic model, which can generate multiple possible code snippets based on the user input. The space of possible clone methods that could be generated grows exponentially with the length of the clone methods: with V tokens in the vocabulary, there can be V^N possible clone methods of length N that could be sampled. An extension would involve displaying the most similar cloned methods (as is) from the dataset to the user. BigCloneBench contains clone method references of only those snippets which belong to the selected list of 43 common functionalities; it does not contain references to all the clones in the dataset. Although the dataset is enough to prove our methodology, there is the possibility of modeling all the clones in the dataset, which may result in interesting findings. Similarly, we observe that the overall syntactic similarity of the clone methods and the number of clone methods used in training can affect perplexity. However, there can be several other factors, such as clone type, syntactic similarity over false positive clones, and the GPT-2 hyper-parameters. In our study, we relied on the HuggingFace transformer implementation of GPT-2 to train and evaluate the DeepClone model. While this is a reliable implementation that has been utilized in a number of NLP experiments [9, 67, 71], it is still an emerging project. However, our results and trends are aligned with those that have been obtained in the field of NLP. Hence, we are positive that the results are reliable.

Allamanis [5] discusses the negative effects of code duplication in the evaluation of language models and proposes two strategies in order to avoid biased evaluation results. The first one is to remove the duplicated code before actually developing a model, and the second one is to down-weight duplicated samples in the loss function and performance metrics. This might have implications for our study, although our motivation for this work is modeling cloned code in the first place; therefore we claim that our true data distribution by definition includes duplication. Furthermore, in his study he only considers exact and near-miss file duplicates. Clones can be of several types and granularity levels, such as simple, structural and method clones [12], which might need to be considered as well. On the other hand, he remarks on the common cloning practice and potentially positive uses of clones, which we agree with and have tried to exploit in this paper.

As can be expected from a DNN-based study, we could not evaluate all the possible combinations (hundreds) of hyper-parameters due to the resources needed. There is a risk in the choice of hyper-parameters for deep learning methods. A change in the training, validation or testing set, or a variation in hyper-parameters, may impact the performance of the proposed method. For this reason, we also did not evaluate other NLM architectures such as GRU [19], LSTMs [32], additional neural cache variants [51, 87] or QRNNs [17].
The dataset used in this study is collected from BigCloneBench, a well-known cloned code dataset. It does not necessarily mean that the codebase used in this study represents Java language source code entirely (a threat to external validity). As for clone method prediction, we only use nucleus sampling [33]. There are various other text generation methods, such as beam search [86], sampling with temperature [4, 24], and top-k sampling [23], which can be explored for predicting clone method snippets.

In this work, we proposed DeepClone, a deep learning-based cloned code language model. We performed intrinsic and extrinsic evaluations to determine the performance of the DeepClone model in predicting clone methods. The extensive evaluation of this work suggests that the proposed approach significantly improves the model's performance by exploiting the concepts of deep learning and code clones. Following from this work, we would like to exploit our model in the potential use cases discussed above (see Section 6.1). From a fundamental point of view, our approach can be improved in several different ways as future work. Our model can be improved by hyper-parameter optimization [50], as well as by better training (e.g. on a larger dataset or with larger pre-trained GPT-2 models). Furthermore, we plan to investigate how we can tackle different types and granularity levels of code clones, such as simple clones, structural clones, and file clones [12].
ACKNOWLEDGMENTS
We acknowledge Dr. Sohaib Khan (CEO at Hazen.ai) for providing us valuable feedback on the experimentation part of neural networks. We acknowledge SURFsara for providing us credits to perform experiments.
REFERENCES
[1] 2020 (accessed May 8, 2020). Ambient Software Evolution Group, IJaDataset 2.0. http://secold.org/projects/seclone. [2] 2020 (accessed May 8, 2020). TabNine. Autocompletion with deep learning. https://tabnine.com/blog/deep/. [3] Shamsa Abid. 2019. Recommending related functions from API usage-based function clone structures. In
Proceedings of the 2019 27th ACM Joint Meeting onEuropean Software Engineering Conference and Symposium on the Foundations ofSoftware Engineering . 1193–1195.[4] David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. 1985. A learningalgorithm for Boltzmann machines.
Cognitive science
9, 1 (1985), 147–169.[5] Miltiadis Allamanis. 2019. The adverse effects of code duplication in machinelearning models of code. In
Proceedings of the 2019 ACM SIGPLAN InternationalSymposium on New Ideas, New Paradigms, and Reflections on Programming andSoftware . 143–153.[6] Miltiadis Allamanis and Charles Sutton. 2013. Mining source code repositoriesat massive scale using language modeling. In
Proceedings of the 10th WorkingConference on Mining Software Repositories . IEEE Press, 207–216.[7] Vara Arammongkolvichai, Rainer Koschke, Chaiyong Ragkhitwetsagul, MorakotChoetkiertikul, and Thanwadee Sunetnanta. 2019. Improving Clone DetectionPrecision Using Machine Learning Techniques. In . IEEE, 31–315.[8] Muhammad Asaduzzaman, Chanchal K Roy, Kevin A Schneider, and DaqingHou. 2016. A Simple, Efficient, Context-sensitive Approach for Code Completion.
Journal of Software: Evolution and Process
28, 7 (2016), 512–541. [9] He Bai, Peng Shi, Jimmy Lin, Luchen Tan, Kun Xiong, Wen Gao, Jie Liu, and MingLi. 2020. Semantics of the Unwritten. arXiv preprint arXiv:2004.02251 (2020).[10] Hamid Abdul Basit, Muhammad Hammad, Stan Jarzabek, and Rainer Koschke.2015. What do we need to know about clones? deriving information needs fromuser goals. In .IEEE, 51–57.[11] Hamid Abdul Basit, Muhammad Hammad, and Rainer Koschke. 2015. A surveyon goal-oriented visualization of clone data. In . IEEE, 46–55.[12] Hamid Abdul Basit and Stan Jarzabek. 2005. Detecting higher-level similaritypatterns in programs.
ACM Sigsoft Software engineering notes
30, 5 (2005), 156–165.[13] Stefan Bellon, Rainer Koschke, Giulio Antoniol, Jens Krinke, and Ettore Merlo.2007. Comparison and evaluation of clone detection tools.
IEEE Transactions onsoftware engineering
33, 9 (2007), 577–591.[14] Yoshua Bengio, R´ejean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. Aneural probabilistic language model.
Journal of machine learning research
3, Feb(2003), 1137–1155.[15] Avishkar Bhoopchand, Tim Rockt¨aschel, Earl Barr, and Sebastian Riedel. 2016.Learning python code suggestion with a sparse pointer network. arXiv preprintarXiv:1611.08307 (2016).[16] Brendon Boldt. 2017. Using LSTMs to Model the Java Programming Language.In
International Conference on Artificial Neural Networks . Springer, 268–275.[17] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2016.Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576 (2016).[18] Zimin Chen, Steve James Kommrusch, Michele Tufano, Louis-No¨el Pouchet,Denys Poshyvanyk, and Martin Monperrus. 2019. Sequencer: Sequence-to-sequence learning for end-to-end program repair.
IEEE Transactions on SoftwareEngineering (2019).[19] Kyunghyun Cho, Bart Van Merri¨enboer, Caglar Gulcehre, Dzmitry Bahdanau,Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phraserepresentations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).[20] Mathias Creutz, Teemu Hirsim¨aki, Mikko Kurimo, Antti Puurula, JannePylkk¨onen, Vesa Siivola, Matti Varjokallio, Ebru Arisoy, Murat Sarac¸lar, andAndreas Stolcke. 2007. Morph-based speech recognition and modeling of out-of-vocabulary words across languages.
ACM Transactions on Speech and LanguageProcessing (TSLP)
5, 1 (2007), 1–29.[21] George E Dahl, Dong Yu, Li Deng, and Alex Acero. 2011. Context-dependentpre-trained deep neural networks for large-vocabulary speech recognition.
IEEETransactions on audio, speech, and language processing
20, 1 (2011), 30–42.[22] Hoa Khanh Dam, Truyen Tran, and Trang Pham. 2016. A deep language modelfor software code. arXiv preprint arXiv:1608.02715 (2016).[23] Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural storygeneration. arXiv preprint arXiv:1805.04833 (2018).[24] Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects inneural language generation. arXiv preprint arXiv:1707.02633 (2017).[25] Mark Gabel and Zhendong Su. 2010. A study of the uniqueness of sourcecode. In
Proceedings of the eighteenth ACM SIGSOFT international symposium onFoundations of software engineering . 147–156.[26] Philip Gage. 1994. A new algorithm for data compression.
C Users Journal
12, 2(1994), 23–38.[27] Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor OK Li. 2018. Search engineguided neural machine translation. In
Thirty-Second AAAI Conference on ArtificialIntelligence .[28] Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In .IEEE, 933–944.[29] Kelvin Guu, Tatsunori B Hashimoto, Yonatan Oren, and Percy Liang. 2018. Gen-erating sentences by editing prototypes.
Transactions of the Association forComputational Linguistics
Proceedings of the 2017 11th JointMeeting on Foundations of Software Engineering . ACM, 763–773.[31] Abram Hindle, Earl T Barr, Mark Gabel, Zhendong Su, and Premkumar Devanbu.2016. On the naturalness of software.
Commun. ACM
59, 5 (2016), 122–131.[32] Sepp Hochreiter and J¨urgen Schmidhuber. 1997. Long short-term memory.
Neuralcomputation
9, 8 (1997), 1735–1780.[33] Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious caseof neural text degeneration. arXiv preprint arXiv:1904.09751 (2019).[34] Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep code commentgeneration. In
Proceedings of the 26th Conference on Program Comprehension .ACM, 200–210.[35] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016.Summarizing source code using a neural attention model. In
Proceedings of the54th Annual Meeting of the Association for Computational Linguistics (Volume 1:Long Papers) . 2073–2083.
[36] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007 (2014). [37] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410 (2016). [38] Elmar Juergens, Florian Deissenboeck, Benjamin Hummel, and Stefan Wagner. 2009. Do code clones matter?. In . IEEE, 485–495. [39] Daniel Jurafsky and James H. Martin. 2009. Speech and Language Processing. [40] Cory J Kapser and Michael W Godfrey. 2008. "Cloning considered harmful" considered harmful: patterns of cloning in software.
Empirical Software Engineering
13, 6 (2008), 645.[41] Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes, Charles Sutton, andAndrea Janes. 2020. Big Code!= Big Vocabulary: Open-Vocabulary Models forSource Code. (2020).[42] Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby,fuzzy far away: How neural language models use context. arXiv preprintarXiv:1805.04623 (2018).[43] Seohyun Kim, Jinman Zhao, Yuchi Tian, and Satish Chandra. 2020. Code Predic-tion by Feeding Trees to Transformers. arXiv preprint arXiv:2003.13848 (2020).[44] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classifica-tion with deep convolutional neural networks. In
Advances in neural informationprocessing systems . 1097–1105.[45] Polina Kuznetsova, Vicente Ordonez, Alexander Berg, Tamara Berg, and YejinChoi. 2013. Generalizing image captions for image-text parallel corpus. In
Proceedings of the 51st Annual Meeting of the Association for ComputationalLinguistics (Volume 2: Short Papers) . 790–796.[46] Jieh-Sheng Lee and Jieh Hsiang. 2019. Patent Claim Generation by Fine-TuningOpenAI GPT-2. arXiv preprint arXiv:1907.02052 (2019).[47] Jiwei Li, Will Monroe, Tianlin Shi, S´ebastien Jean, Alan Ritter, and Dan Juraf-sky. 2017. Adversarial learning for neural dialogue generation. arXiv preprintarXiv:1701.06547 (2017).[48] Liuqing Li, He Feng, Wenjie Zhuang, Na Meng, and Barbara Ryder. 2017.Cclearner: A deep learning-based clone detection approach. In . IEEE,249–260.[49] Rebecca Mason and Eugene Charniak. 2014. Domain-specific image captioning.In
Proceedings of the Eighteenth Conference on Computational Natural LanguageLearning . 11–20.[50] Pawel Matuszyk, Renˆe Tatua Castillo, Daniel Kottke, and Myra Spiliopoulou.2016. A Comparative Study on Hyperparameter Optimization for RecommenderSystems. In
Workshop on Recommender Systems and Big Data Analytics (RS-BDA’16).fi2016.fiS . 13–21.[51] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016.Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843 (2016).[52] Tom´aˇs Mikolov, Anoop Deoras, Stefan Kombrink, Luk´aˇs Burget, and JanˇCernock`y. 2011. Empirical evaluation and combination of advanced languagemodeling techniques. In
Twelfth Annual Conference of the International SpeechCommunication Association .[53] Tom´aˇs Mikolov, Martin Karafi´at, Luk´aˇs Burget, Jan ˇCernock`y, and Sanjeev Khu-danpur. 2010. Recurrent neural network based language model. In
Eleventhannual conference of the international speech communication association .[54] Lili Mou, Rui Men, Ge Li, Lu Zhang, and Zhi Jin. 2015. On end-to-end pro-gram generation from user intention by deep neural networks. arXiv preprintarXiv:1510.07211 (2015).[55] Hiroaki Murakami, Yoshiki Higo, and Shinji Kusumoto. 2014. A dataset of clonereferences with gaps. In
Proceedings of the 11th Working Conference on MiningSoftware Repositories . ACM, 412–415.[56] Anh Tuan Nguyen and Tien N Nguyen. 2015. Graph-based statistical languagemodel for code. In , Vol. 1. IEEE, 858–868.[57] Tung Thanh Nguyen, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N Nguyen.2013. A statistical semantic language model for source code. In
Proceedings of the2013 9th Joint Meeting on Foundations of Software Engineering . ACM, 532–542.[58] Matthew Peters, Sebastian Ruder, and Noah A Smith. 2019. To tune or notto tune? adapting pretrained representations to diverse tasks. arXiv preprintarXiv:1903.05987 (2019).[59] Karl Pichotta and Raymond J Mooney. 2016. Using sentence-level LSTM languagemodels for script inference. arXiv preprint arXiv:1604.02993 (2016).[60] Alec Radford, Karthik Narasimhan, Tim Salimans, and IlyaSutskever. 2018. Improving language understanding by genera-tive pre-training.
URL https://s3-us-west-2. amazonaws. com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf (2018).[61] Alec Radford, Jeffrey Wu, Dario Amodei, Daniela Amodei, Jack Clark, MilesBrundage, and Ilya Sutskever. 2019. Better language models and their implica-tions.
OpenAI Blog https://openai. com/blog/better-language-models (2019). [62] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and IlyaSutskever. 2019. Language models are unsupervised multitask learners.
OpenAIBlog
1, 8 (2019), 9.[63] Chaiyong Ragkhitwetsagul. 2018.
Code similarity and clone search in large-scalesource code data . Ph.D. Dissertation. UCL (University College London).[64] Baishakhi Ray, Vincent Hellendoorn, Saheel Godhane, Zhaopeng Tu, AlbertoBacchelli, and Premkumar Devanbu. 2016. On the” naturalness” of buggy code.In .IEEE, 428–439.[65] Veselin Raychev, Martin Vechev, and Eran Yahav. 2014. Code completion withstatistical language models. In
Proceedings of the 35th ACM SIGPLAN Conferenceon Programming Language Design and Implementation . 419–428.[66] Alexander A Razborov. 1992. On small depth threshold circuits. In
ScandinavianWorkshop on Algorithm Theory . Springer, 42–52.[67] Corbin Rosset, Chenyan Xiong, Xia Song, Daniel Campos, Nick Craswell, SaurabhTiwary, and Paul Bennett. 2020. Leading Conversational Search by SuggestingUseful Questions. In
Proceedings of The Web Conference 2020 . 1160–1170.[68] Chanchal Kumar Roy and James R Cordy. 2007. A survey on software clonedetection research.
Queen's School of Computing TR
Proceedings of the 2015 10th Joint Meeting onFoundations of Software Engineering . 191–201.[70] Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, Chanchal K Roy, and Cristina VLopes. 2016. Sourcerercc: Scaling code clone detection to big-code. In . IEEE,1157–1168.[71] Abigail See, Aneesh Pappu, Rohun Saxena, Akhila Yerukola, and Christopher DManning. 2019. Do Massively Pretrained Language Models Make Better Story-tellers? arXiv preprint arXiv:1909.10705 (2019).[72] Claude E Shannon. 1948. A mathematical theory of communication.
Bell systemtechnical journal
27, 3 (1948), 379–423.[73] Claude E Shannon. 1951. Prediction and entropy of printed English.
Bell systemtechnical journal
30, 1 (1951), 50–64.[74] Louis Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, and RayKurzweil. 2017. Generating high-quality and informative conversation responseswith sequence-to-sequence models. arXiv preprint arXiv:1701.03185 (2017).[75] Susan Elliott Sim, Charles LA Clarke, and Richard C Holt. 1998. Archetypalsource code searches: A survey of software developers and maintainers. In
Proceedings. 6th International Workshop on Program Comprehension. IWPC’98(Cat. No. 98TB100242) . IEEE, 180–187.[76] Yiping Song, Rui Yan, Xiang Li, Dongyan Zhao, and Ming Zhang. 2016. Two arebetter than one: An ensemble of retrieval-and generation-based dialog systems. arXiv preprint arXiv:1610.07149 (2016).[77] Martin Sundermeyer, Ralf Schl¨uter, and Hermann Ney. 2012. LSTM neural net-works for language modeling. In
Thirteenth annual conference of the internationalspeech communication association .[78] Jeffrey Svajlenko, Judith F Islam, Iman Keivanloo, Chanchal K Roy, and Moham-mad Mamun Mia. 2014. Towards a big data curated benchmark of inter-projectcode clones. In . IEEE, 476–480.[79] Jeffrey Svajlenko and Chanchal K Roy. 2015. Evaluating clone detection tools withbigclonebench. In . IEEE, 131–140.[80] Jeffrey Svajlenko and Chanchal K Roy. 2016. Bigcloneeval: A clone detection toolevaluation framework with bigclonebench. In . IEEE, 596–600.[81] Jeffrey Svajlenko and Chanchal K Roy. 2016. A machine learning based approachfor evaluating clone detection tools for a generalized and accurate precision.
International Journal of Software Engineering and Knowledge Engineering
Proceedings of the 25th ACM SIGKDDInternational Conference on Knowledge Discovery & Data Mining . 2727–2735.[83] Jan E Trost. 1986. Statistically nonrepresentative stratified sampling: A samplingtechnique for qualitative studies.
Qualitative sociology
9, 1 (1986), 54–57.[84] Zhaopeng Tu, Zhendong Su, and Premkumar Devanbu. 2014. On the localnessof software. In
Proceedings of the 22nd ACM SIGSOFT International Symposiumon Foundations of Software Engineering . 269–280.[85] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, MartinWhite, and Denys Poshyvanyk. 2018. Deep learning similarities from differentrepresentations of source code. In . IEEE, 542–553.[86] Ashwin K Vijayakumar, Michael Cogswell, Ramprasaath R Selvaraju, Qing Sun,Stefan Lee, David Crandall, and Dhruv Batra. 2018. Diverse beam search forimproved description of complex scenes. In
Thirty-Second AAAI Conference onArtificial Intelligence .[87] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In
Advances in neural information processing systems . 2692–2700.
[88] Wei Wang and Michael W Godfrey. 2014. Recommending clones for refactoring using design, context, and history. In . IEEE, 331–340. [89] Huihui Wei and Ming Li. 2017. Supervised Deep Features for Software Functional Clone Detection by Exploiting Lexical and Syntactical Information in Source Code. In
IJCAI . 3034–3040.[90] Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk.2016. Deep learning code fragments for code clone detection. In
Proceedings ofthe 31st IEEE/ACM International Conference on Automated Software Engineering .ACM, 87–98.[91] Martin White, Christopher Vendome, Mario Linares-V´asquez, and Denys Poshy-vanyk. 2015. Toward deep learning software repositories. In
Proceedings of the12th Working Conference on Mining Software Repositories . IEEE Press, 334–345.[92] Yixiao Yang, Yu Jiang, Ming Gu, Jiaguang Sun, Jian Gao, and Han Liu. 2017.A language model for statements of software code. In
Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE Press, 682–687. [93] Norihiro Yoshida, Seiya Numata, Eunjong Choi, and Katsuro Inoue. 2019. Proactive clone recommendation system for extract method refactoring. In . IEEE, 67–70. [94] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014). [95] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593 (2019).