Towards Semantic Clone Detection via Probabilistic Software Modeling
Hannes Thaller, Lukas Linsbauer, Alexander Egyed
Institute for Software Systems Engineering, Johannes Kepler University Linz, Austria
{hannes.thaller, lukas.linsbauer, alexander.egyed}@jku.at
Abstract—Semantic clones are program components with similar behavior, but different textual representation. Semantic similarity is hard to detect, and semantic clone detection is still an open issue. We present semantic clone detection via Probabilistic Software Modeling (PSM) as a robust method for detecting semantically equivalent methods. PSM inspects the structure and runtime behavior of a program and synthesizes a network of Probabilistic Models (PMs). Each PM in the network represents a method in the program and is capable of generating and evaluating runtime events. We leverage these capabilities to accurately find semantic clones. Results show that the approach can detect semantic clones in the complete absence of syntactic similarity with high precision and low error rates.
Index Terms—clone detection, semantic clone detection, probabilistic modeling, multivariate testing, software modeling, static code analysis, dynamic code analysis, runtime monitoring, inference, simulation, deep learning
I. INTRODUCTION
Copying and pasting source code fragments leads to code clones. Code clones are considered an anti-pattern as they increase maintenance costs, promote bad software design, and propagate bugs [1], [2], [3], [4], [5], [6], [7], [8]. Code clones are traditionally split into four categories. Type 1-3 [9], [10], [11] code clones are textual copies of a program fragment with possible changes. Type 4 code clones are behavioral copies of a program fragment that do not have any syntactic similarity but implement the same functionality (semantic equivalence). For example, the iterative and recursive implementations of the Fibonacci algorithm have no syntactic similarity while implementing the same functionality.

Juergens et al. [12] have shown that existing tools only have limited capabilities for detecting Type 4 clones. This limitation can also be seen in various clone detection tool comparisons [13], [14], [9], [10], [15] through the absence or explicit exclusion of Type 4 clones. Nevertheless, Type 4 clones exist, and tools for detecting them are needed [12], [16].

We present Semantic Clone Detection via Probabilistic Software Modeling (SCD-PSM). SCD-PSM detects semantic clones with no textual and structural similarity. First, a network of Probabilistic Models (PMs) is built via Probabilistic Software Modeling (PSM) [17]. Each PM models an executable (e.g., a method in Java) in the program under analysis. SCD-PSM leverages these PMs and their inferential capabilities to detect semantically equivalent executables. Probabilistic inference enables a similarity measure based on probabilities. These probabilities are used to conduct statistical tests (Generalized Likelihood Ratio Test) that produce the final clone decision.

II. BACKGROUND
A basic understanding of clone detection and probabilistic software modeling is needed to understand the approach. We will use a monospace font to refer to program elements (e.g., factorial_a) and italics to refer to the corresponding model elements (e.g., factorial_a).

A. Clone Detection
Clone detection is the process of finding pairs of similar program fragments as illustrated in Figure 1. Figure 1 shows three different implementations of the factorial computation. Figure 1a uses a for loop, while Figure 1b uses a while loop implementation. Finally, Figure 1c uses recursion to compute the factorial of n. The clone detection process includes the representation (e.g., text fragments), pairing (e.g., of text fragments of similar size), the similarity evaluation (e.g., counting the differences in the text fragments), and the clone decision (e.g., less than 10 differences). Representations can be, for example, text (e.g., source code), graphs (e.g., AST), or probabilistic models (like in this work).
Pairing is the process of selecting two code fragments that are potentially a clone. Each pair is called a candidate clone pair (or candidate pair). The similarity evaluation measures the similarity between the fragments of a candidate pair. The clone decision labels the candidate pair as a clone given that the similarity fulfills some criteria.

The properties of the similarity metric split clones into two groups [9]. Type 1-3 clones capture textual similarity while Type 4 clones capture semantic similarity [10], [14], [18], [9], [11], [19]. These types are increasingly challenging to detect, with Type 4 being the most complex one. Figure 1a and Figure 1b are an instance of a Type 3 clone while Figure 1a (or Figure 1b) and Figure 1c are an instance of a Type 4 clone. Note that the definition of a semantic clone is often relaxed such that up to some degree of syntactic similarity of the code fragments is allowed [13], [20]. However, we consider these clones as complex Type 3 clones (additions, deletions, reordering) and not as semantic clones. This means that semantic clones in the context of this work are clones with no syntactic similarity except for per-chance similarities (e.g., equal parameter names).

factorial_a(n) {
    product = 1
    for (i = 1; i <= n; i++) {
        product *= i
    }
    return product
}

(a) A for implementation of factorial.

factorial_b(n) {
    product = 1
    i = 1
    while (i <= n) {
        product *= i
        i++
    }
    return product
}

(b) A while implementation of factorial.

factorial_c(n) {
    if (n <= 1) {
        return 1
    }
    return factorial_c(n - 1) * n
}

(c) A recursive implementation of factorial.
Figure 1: The for and while implementations are complex Type 3 clones in which new lines were added and some changed. The recursive implementation is a Type 4 clone of the for and while implementations without any syntactic resemblance.
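The Figure 1 implementations translate directly into runnable code. The following Python sketch mirrors them and checks that equal inputs produce equal outputs, which is exactly the behavioral equivalence a Type 4 detector must recognize:

```python
def factorial_a(n):  # for-style loop (Figure 1a)
    product = 1
    for i in range(1, n + 1):
        product *= i
    return product

def factorial_b(n):  # while loop (Figure 1b)
    product, i = 1, 1
    while i <= n:
        product *= i
        i += 1
    return product

def factorial_c(n):  # recursion (Figure 1c)
    if n <= 1:
        return 1
    return factorial_c(n - 1) * n

# No syntactic overlap beyond per-chance similarities, identical behavior.
for n in range(10):
    assert factorial_a(n) == factorial_b(n) == factorial_c(n)
```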
B. Probabilistic Software Modeling
Probabilistic Software Modeling (PSM) [17] is a data-driven modeling paradigm that transforms a program into a network of Probabilistic Models (PMs). PSM extracts a program's structure given by types, properties, and executables (e.g., classes, fields, and methods, respectively, in Java). This structure includes the call dependencies between the different code elements, which defines the topology of the PM network. Each PM is optimized towards a program execution. The program execution can either be synthetic (e.g., random testing), from tests (e.g., developer tests), or from the program in its production environment. In the context of clone detection, synthetic program executions suffice as the results are based on differential comparisons of two elements.

Each PM represents an executable (e.g., a method in Java) in the program. Inputs are parameters, property reads, and invocation return values. Outputs are the method return value, property writes, and invocation parameters. The distinction between inputs and outputs is only a logical view from a software engineering perspective. The actual PMs are multivariate density estimators without such distinction (joint model of all variables). PMs can generate observations that are similar to the initial training data. More importantly, each model can evaluate the likelihood of data. The likelihood is used to detect behavioral equivalence between models, which is then generalized to the semantic equivalence between executables in the program.

The PMs in the network are real Non-Volume Preserving transformations (NVPs) [21], a generative likelihood-based latent-variable model for density estimation. NVPs learn a function that maps data to a known latent-space, e.g., input parameter values n and return values product of factorial_a, to a bivariate normal distribution. More formally, each NVP is a neural network that learns a bijective function f : X → Z (with g = f^-1) between the original data x ∈ X and predefined latent-variables z ∈ Z.
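The change-of-variables mechanics behind such a bijection can be illustrated with a toy one-dimensional affine map in place of a trained NVP. The map g, its inverse f, and the scale and shift values below are purely illustrative assumptions, but the likelihood formula is the same one an NVP uses:

```python
import math
import random

random.seed(0)
s, m = 2.0, 5.0  # assumed "learned" scale and shift of the toy flow

def g(z):  # latent -> data (invertible)
    return s * z + m

def f(x):  # data -> latent, f = g^{-1}
    return (x - m) / s

def log_likelihood(x):
    # Change of variables: log p(x) = log N(f(x); 0, 1) + log |df/dx|
    z = f(x)
    return -0.5 * (z * z + math.log(2.0 * math.pi)) - math.log(abs(s))

# Sampling: draw z ~ N(0, 1) and invert it to the data space.
xs = [g(random.gauss(0.0, 1.0)) for _ in range(1000)]

# Evaluation: data near the model's mode is more likely than outliers.
assert log_likelihood(m) > log_likelihood(m + 10.0 * s)
```

A real NVP stacks learned coupling layers instead of this single affine map, but sampling and likelihood evaluation follow the same two operations.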
The latent-variables are selected such that sampling, conditioning, and likelihood evaluation are efficient and straightforward, e.g., via an isotropic unit norm Gaussian N(0, I).

Sampling generates data x ∈ X by drawing observations from the latent-variables z ~ Z and inverting them via the NVP to the original data-space x = g(z) ~ X.

Conditioning finds a latent-space configuration (i.e., a latent-code) ẑ such that the associated data g(ẑ) = x̂ satisfies a given condition. First, a proposal code ẑ is drawn from the latent-space, which is then inverted to its data form x̂ = g(ẑ). Then the error is measured on the conditioned dimensions via, e.g., Mean Squared Error (MSE). The error is used to update the latent code ẑ, and the procedure is repeated until convergence. For example, one can condition the return value of the fibonacci method on the return value from factorial_a. First, samples are drawn from the factorial_a model, retaining only the dimension associated with the return value. Then, samples are drawn from the fibonacci model and the error between the return value dimensions is computed. This error is back-propagated to the latent-code, which is updated according to the errors. After convergence of the optimization, the fibonacci sample contains the same return values as imposed by the factorial_a sample. Furthermore, the remaining dimension n is resampled (imputed) in such a way that it adheres to the joint relationship of all the variables in fibonacci. Finally, fibonacci can be used to evaluate the likelihood of the conditioned sample.

III. APPROACH
SCD-PSM uses the models built by PSM and compares them for behavioral equivalence. The behavioral equivalence is then generalized to semantic equivalence of executables (i.e., methods).
A. Similarity Evaluation
The similarity evaluation computes the cross-wise likelihood of the models by sampling and conditioning. Given is a pair of candidate PMs, each representing an executable. The similarity evaluation starts by selecting a reference model (null-model) M_null and an alternative model (alt-model) M_alt. Then, null-dimensions M_null,k and alt-dimensions M_alt,k are selected from the models, e.g., parameter n from factorial_a is compared to parameter n of factorial_b. Then, a reference sample D_null,k is generated by M_null as illustrated in Figure 2 (1), representing the behavior of M_null. This reference sample is used to generate a conditioned alternative sample D_alt|null (2), representing the behavior of M_alt given that dimensions k are fixed to the behavior of M_null,k (3). Finally, the likelihood of D_null is evaluated under M_null, resulting in the base likelihood of the reference sample under the null-model LL_null, and D_alt|null is evaluated under M_alt, resulting in the likelihood of the conditioned alternative sample under the alt-model LL_alt|null. Then, the null and alt roles are swapped and the procedure is repeated (see Figure 2, Link B).

Figure 2: SCD-PSM evaluates the similarity of a pair of models via their data likelihood. The likelihoods are combined into the final clone decision.

The swapping of roles is necessary because of sub-model relationships. For instance, one model returns data distributed according to N(0, σ1²) and the other according to N(0, σ2²) with σ1 < σ2. One link will lead to a high likelihood (sub-model is null) while the other link will result in low likelihood (super-model is null).

In conclusion, the similarity evaluation tests the likelihood of the models in the context of each other. The final clone decision is based on these likelihood values.

B. Clone Decision
The final step is to combine the likelihood values from the similarity evaluation to a final decision as shown in Figure 2. The two likelihood ratios (4) are combined by a pooling operator (5) and compared against a critical value, yielding the final clone decision (6).

More formally, the procedure makes use of the Generalized Likelihood Ratio Test (GLRT) [22]. The log-GLRT measures whether the log-likelihoods are significantly different from 0 with

λ = LL_alt − LL_null,   (1)

where LL is the log-likelihood. The null hypothesis is that the models are equal. It is rejected for small ratios λ ≤ c where c is set to an appropriate Type 1 error, i.e., false-positive rate. For example, λ < log(0.01) allows 1 out of 100 candidates to be a false positive, i.e., wrongly rejecting semantic equivalence. The Clone Decision (6) is computed by pooling (5) the link results.
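The link evaluation, the log-GLRT of Eq. (1), and both pooling variants can be sketched end to end. To keep the sketch self-contained, the trained models are replaced by closed-form Gaussians (an assumption for illustration only) and the conditioning step collapses to evaluating the same sample under both models:

```python
import math
import random

random.seed(1)

def log_lik(sample, sigma):
    # Total log-likelihood of a sample under N(0, sigma^2).
    return sum(-0.5 * (x / sigma) ** 2 - math.log(sigma)
               - 0.5 * math.log(2.0 * math.pi) for x in sample)

def link(sigma_null, sigma_alt, n=1000):
    # One link: sample from the null-model, evaluate under null and alt.
    d_null = [random.gauss(0.0, sigma_null) for _ in range(n)]
    ll_null = log_lik(d_null, sigma_null)
    ll_alt = log_lik(d_null, sigma_alt)
    return ll_alt - ll_null  # log-GLRT, Eq. (1)

def clone_decision(sig_a, sig_b, c=math.log(0.01), pooling="soft"):
    lam_ab = link(sig_a, sig_b)  # Link A: a plays the null role
    lam_ba = link(sig_b, sig_a)  # Link B: roles swapped
    if pooling == "hard":        # both links must survive the test
        return lam_ab > c and lam_ba > c
    return (lam_ab + lam_ba) / 2.0 > c  # soft: pooled average

print(clone_decision(1.0, 1.0))  # identical models -> True
print(clone_decision(1.0, 3.0))  # sub-model relation -> False
```

Swapping the roles in `link` mirrors the Link A/Link B evaluation: a wider alternative model scores the null sample poorly in at least one direction, which is what rules out sub-model relationships.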
Hard pooling accepts the candidate pair as a clone if the null hypothesis for both links could not be rejected. Soft pooling accepts the candidate pair as a clone if the average log-likelihood ratio of both links cannot be rejected. Hard pooling does not allow any sub-model relationship, while soft pooling relaxes this condition slightly.

The final requirement is that a candidate pair is only accepted as a clone if the selected dimensions k of both M_null and M_alt contain at least one input and one output dimension. That is, methods are semantically equivalent if at least parts of their input and output relationship are equivalent.

In conclusion, the clone decision combines the link results and controls the results for a predefined false-positive rate.

Table I: The 8 subject examples used in the evaluation.

Subject     Style      Clone Class
Factorial   iterative  A
Factorial   recursive  A
Fibonacci   iterative  B
Fibonacci   recursive  B
BubbleSort  iterative  C
BubbleSort  recursive  C
MergeSort   iterative  C
MergeSort   recursive  C

IV. STUDY
We implemented a prototype for SCD on top of PSM and applied the similarity evaluation given in Section III.

1) The study uses 8 well-known algorithms listed in Table I, distributed in 3 clone classes. Each clone class is a well-understood example of semantic clones with 0% syntactic similarity. Each subject was triggered with positive uniformly distributed random values.
2) The Probabilistic Model Network was computed via Gradient, a PSM prototype [17]. The same hyper-parameters were selected as in our previously reported experiments.
3) The Candidate Clone Pairs were all combinations of dimensions of the PMs. The candidate pairs were formed from all 8 subject systems.
4) Each valid candidate pair was tested for behavioral equality by the cross-wise likelihood evaluation described in Section III-A.
5) The clone decision was computed via the GLRT and the results were pooled as described in Section III-B.
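Step 3 above, forming candidate pairs from all subjects, amounts to enumerating pairwise combinations. A minimal sketch (the subject labels are shorthand for the Table I entries; the full study additionally pairs the dimensions within each model pair):

```python
from itertools import combinations

# Shorthand labels for the 8 subjects of Table I.
subjects = [
    "Factorial-iter", "Factorial-rec",
    "Fibonacci-iter", "Fibonacci-rec",
    "BubbleSort-iter", "BubbleSort-rec",
    "MergeSort-iter", "MergeSort-rec",
]

# Every unordered pair of subjects is a candidate clone pair.
candidate_pairs = list(combinations(subjects, 2))
print(len(candidate_pairs))  # C(8, 2) = 28
```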
A. Controlled Variables
The study controls for pooling, the Type 1 error, and the number of particles used in the similarity evaluation (Section III-A).

Pooling describes how likelihoods are combined to the final clone decision {soft, hard} (see Section III-B).

Type 1 error, or the false-positive rate, defines the critical value c at which clones are considered significantly different {0.001, 0.01} (Section III-B). The critical value is the total Type 1 error for both links.

Number of Particles is the number of samples that are drawn during the similarity evaluation for the reference sample D_null and the alternative sample D_alt. A low number of particles is faster to compute but has a higher variance in the results.

B. Response Variables
The performance of the clone detection is measured via precision, recall, and the balanced accuracy. These metrics are computed by the True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) proportion of detected clone instances, e.g., correctly identifying a clone pair counts towards TP.

Precision measures the performance to detect only relevant instances, given by

TP / (TP + FP).   (2)

Recall measures the performance of detecting all relevant instances, given by

TP / (TP + FN).   (3)

Balanced Accuracy measures the performance of detecting relevant and irrelevant instances but considers a possible imbalance between the number of relevant and irrelevant instances. It is given by

(1/2) (TP / (TP + FN) + TN / (TN + FP)).   (4)

C. Experiment Results
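The three metrics follow directly from the confusion counts. A small sketch with illustrative counts (assumed for the example, not taken from the study):

```python
# Illustrative confusion counts, not from Table II.
tp, fp, tn, fn = 18, 2, 10, 6

precision = tp / (tp + fp)                                    # Eq. (2)
recall = tp / (tp + fn)                                       # Eq. (3)
balanced_accuracy = 0.5 * (tp / (tp + fn) + tn / (tn + fp))   # Eq. (4)

print(round(precision, 2), round(recall, 2), round(balanced_accuracy, 2))
# -> 0.9 0.75 0.79
```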
The study results are given in Table II. Average precision was 0.99, recall was 0.80, and the balanced accuracy was 0.87. The precision across experiments was excellent, indicating that models can reliably detect behavioral equality. This is reflected in the low number of FPs. However, the FNs indicate that some positive examples are missed. Reducing the Type 1 error, i.e., falsely rejecting semantic equality, improves on the FNs. The recall was good for most evaluations. However, hard pooling caused a slight drop in the recall. The balanced accuracy is good to excellent for most experiment configurations. Perfect scores are given for experiments 1 and 12. No effect of Pooling, Type 1, and Particles on the accuracy can be seen.

V. DISCUSSION
The results from Section IV-C are encouraging. The general performance was good to excellent. No significant difference between the different levels of pooling, Type 1, and Particles in Table II can be seen. However, a larger sample size is needed to precisely attribute effects on the performance. In the 10-particle setting, a higher variance of performance can be seen, caused by per-chance errors. The number of FPs is low in all experiments, which is expected given that the Type 1 error was set to 0.001 and 0.01. In contrast, the number of FNs is acceptable. This is reflected in the recall that ranges from 0.64 to 1.00. The balanced accuracy shows high detection rates of the approach in most experiment settings.

VI. LIMITATIONS
SCD-PSM inherits the limitations of PSM. PSM models data. Object references are handles to containers (objects) that store data. Thereby, SCD-PSM cannot detect semantic clones of executables that solely manage object references, e.g., a collection library. However, this limitation only holds if the program never accesses the underlying data. Furthermore, PSM explodes lists into singular values since distributions do not contain any order information. This means executables that change the order of sequences are matched based on the values, not their order. As a consequence, invoking a wrongly implemented, e.g., sorting algorithm, would result in a false positive. Extending PSM to model distributions of sequences will alleviate this issue.

A limitation of the detection process is that it is built on runtime observations. This means that the approach can only be applied to runnable source code.

The final limitation is that the approach cannot detect Type 2-3 clones. Slight changes, e.g., flipping a plus sign to a minus, have large implications on the resulting runtime behavior. These changes will impact the semantic detection process such that the candidate clone pair will not be accepted. For example, common clone detectors will report Listing 1 and Listing 2 as clones since they differ only by one character (ignoring names and reducing minimum size). However, this does not hold for Type 4 detectors because the input and output relationship is different. In contrast, many clone detectors will not detect Listing 1 and Listing 3 as clones because of the many additions. Type 4 detectors will report this pair as clones since the behavior of adding one to the input is identical. This hints that Type 2-3 and Type 4 clones represent detached concepts that share less common ground than expected. More importantly, this raises the question whether existing detectors that report Type 3-4 detection capabilities generalize as expected.

inc(a: Int): Int {
    return a + 1
}

Listing 1: Increment method

dec(a: Int): Int {
    return a - 1
}

Listing 2: Decrement method

inc(a: Int): Int {
    b = 1 * -1
    d = b * -1
    return (Int) a + d
}

Listing 3: Complicated increment method

Table II: Results of the clone detection experiments.
    Controlled Variables          Response Variables
#   Pooling  Type I  Particles   TP  FP  TN  FN  Precision  Recall  Balanced Accuracy
1   hard     0.001    10         22   0  14   0  1.00       1.00    1.00
2   hard     0.001    50         18   0  10   8  1.00       0.69    0.78
3   hard     0.001   100         20   0  12   4  1.00       0.83    0.89
4   soft     0.001    10         14   0  18   4  1.00       0.78    0.89
5   soft     0.001    50         22   0  10   4  1.00       0.85    0.89
6   soft     0.001   100         22   0  10   4  1.00       0.85    0.89
7   hard     0.010    10          8   0  26   2  1.00       0.80    0.94
8   hard     0.010    50         14   0  14   8  1.00       0.64    0.78
9   hard     0.010   100         14   0  14   8  1.00       0.64    0.78
10  soft     0.010    10         16   2  10   8  0.89       0.67    0.72
11  soft     0.010    50         20   0  12   4  1.00       0.83    0.89
12  soft     0.010   100         22   0  14   0  1.00       1.00    1.00
VII. RELATED WORK
Many studies have evaluated textual clones. However, there are only a few studies reporting reliable results on semantic clones without relaxing the definition of Type 4.

Rattan et al. [11] provided a review of clone detection studies. The review also investigated approaches that tackle Type 4 clones. They conclude that some approaches solve approximations (i.e., complex Type 3 clones) of Type 4 clones.

Horwitz [23] detected textual and semantic differences in programs via a Program Representation Graph, which is similar to a Program Dependency Graph (PDG). PDG-based approaches [18], [24], [25] use (static) data and control dependencies to find similar sub-graphs between the candidates. They can detect complex Type 3 clones, e.g., Figure 1a and Figure 1b. However, the compared PDG sub-graphs are a representation of the source code; thereby, the approaches still rely on syntactic similarity [26].

Another category of semantic clone detectors are test-based methods. Test-based methods randomly trigger the execution of two candidates and measure whether equal inputs cause similar outputs. Jiang and Su [27] were able to detect semantic clones without syntactical similarities. A similar approach was presented by Deissenboeck et al. [28]. One issue with test-based clone detection is that candidates need a similar signature. Differences in data types or the number of parameters cannot be effectively handled by the test-case generators or the similarity measurement. SCD-PSM works similar to test-based methods in that it observes the runtime and compares the resulting behavior. However, SCD-PSM builds generative models from the observed behavior capable of generating and evaluating data. Missing dimensions are imputed by conditioning and sampling. This allows SCD-PSM to overcome the issue of signature mismatches.
Furthermore, PSM abstracts the data types into text, integer, and floats, mitigating data type mismatches.

Finally, the clone detector Oreo [20] has also reported Type 3 to Type 4 detection capabilities. Oreo uses a combination of representations and detection stages to find clones. Most important is the semantic similarity comparison based on actions a method takes, e.g., accessing an array, writing a property, or invoking a method. These actions correspond, to some extent, to the dimensions of PSM models, i.e., represent entry-points of information (e.g., field accesses, invocations, etc.). Oreo counts these entry-points and compares them between the fragments in a candidate pair. No analysis of the runtime assignments is conducted, nor is the relationship between the actions analyzed like SCD-PSM does. Oreo reports many complex Type 3 and Type 4 clones up to 50% syntactic similarity based on this semantic similarity (and the additional pipeline steps). However, more research is needed to identify the weaknesses and strengths of both approaches. This highlights the need for a hard but well-understood baseline dataset of Type 4 clones similar to the examples in our study but extended with a larger variety of semantic clones.

VIII. CONCLUSION AND FUTURE WORK
In this work, we presented a viable approach for semantic clone detection: Semantic Clone Detection via Probabilistic Software Modeling (SCD-PSM). SCD-PSM leverages the PMs of PSM to detect method-level semantic clones with 0% syntactic similarity.

We have discussed the similarity evaluation and the clone decision that represent the central aspects of a clone detector. We evaluated the concepts on a set of well-known semantic clones that provide a hard baseline for Type 4 detectors.

Our future work is to evaluate the scalability of the approach with large programs. Furthermore, we want to compare SCD-PSM with existing Type 3 clone detectors.

In conclusion, SCD-PSM is capable of detecting semantic clones with 0% syntactic similarity.

ACKNOWLEDGMENTS
The research reported in this paper has been supported by the Austrian ministries BMVIT and BMDW, and the Province of Upper Austria in terms of the COMET - Competence Centers for Excellent Technologies Programme managed by FFG.
REFERENCES

[1] Mayrand, Leblanc, and Merlo, "Experiment on the automatic detection of function clones in a software system using metrics," in Proceedings of International Conference on Software Maintenance ICSM-96. Monterey, CA, USA: IEEE, 1996, pp. 244–253.
[2] A. Monden, D. Nakae, T. Kamiya, S. Sato, and K. Matsumoto, "Software quality analysis by code clones in industrial legacy software," in Proceedings Eighth IEEE Symposium on Software Metrics. Ottawa, Ont., Canada: IEEE Comput. Soc, 2002, pp. 87–94.
[3] R. C. Martin, Ed., Clean Code: A Handbook of Agile Software Craftsmanship. Upper Saddle River, NJ: Prentice Hall, 2009.
[4] M. Fowler and K. Beck, Refactoring: Improving the Design of Existing Code, ser. The Addison-Wesley Object Technology Series. Reading, MA: Addison-Wesley, 1999.
[5] A. Hunt and D. Thomas, The Pragmatic Programmer: From Journeyman to Master. Reading, Mass: Addison-Wesley, 2000.
[6] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler, "An empirical study of operating systems errors," ACM SIGOPS Operating Systems Review, vol. 35, no. 5, p. 73, Dec. 2001.
[7] Z. Li, S. Lu, S. Myagmar, and Y. Zhou, "CP-Miner: Finding Copy-Paste and Related Bugs in Large-Scale Software Code," IEEE Transactions on Software Engineering, vol. 32, no. 3, pp. 176–192, 2006.
[8] R. Geiger, B. Fluri, H. C. Gall, and M. Pinzger, "Relation of Code Clones and Change Couplings," in Fundamental Approaches to Software Engineering, L. Baresi and R. Heckel, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, vol. 3922, pp. 411–425.
[9] C. K. Roy and J. R. Cordy, "A Survey on Software Clone Detection Research," Queen's School of Computing TR, vol. 115, p. 115, 2007.
[10] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo, "Comparison and Evaluation of Clone Detection Tools," IEEE Transactions on Software Engineering, vol. 33, no. 9, pp. 577–591, 2007.
[11] D. Rattan, R. Bhatia, and M. Singh, "Software clone detection: A systematic review," Information and Software Technology, vol. 55, no. 7, pp. 1165–1199, Jul. 2013.
[12] E. Juergens, F. Deissenboeck, and B. Hummel, "Code Similarities Beyond Copy & Paste." Madrid: IEEE, Mar. 2010, pp. 78–87.
[13] J. Svajlenko and C. K. Roy, "Evaluating clone detection tools with BigCloneBench." Bremen, Germany: IEEE, Sep. 2015, pp. 131–140.
[14] R. Koschke, "Survey of research on software clones," in Duplication, Redundancy, and Similarity in Software, ser. Dagstuhl Seminar Proceedings, R. Koschke, E. Merlo, and A. Walenstein, Eds., no. 06301. Dagstuhl, Germany: Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany, 2007.
[15] F. Farmahinifarahani, V. Saini, D. Yang, H. Sajnani, and C. V. Lopes, "On Precision of Code Clone Detection Tools," Feb. 2019, pp. 84–94.
[16] V. Kafer, S. Wagner, and R. Koschke, "Are there functionally similar code clones in practice?" Campobasso: IEEE, Mar. 2018, pp. 2–8.
[17] H. Thaller, L. Linsbauer, R. Ramler, and A. Egyed, "Probabilistic Software Modeling: A Data-driven Paradigm for Software Analysis," arXiv:1912.07936 [cs], Dec. 2019.
[18] J. Krinke, "Identifying Similar Code with Program Dependence Graphs," Proceedings Eighth Working Conference on Reverse Engineering, pp. 301–309, 2001.
[19] H. Thaller, R. Ramler, J. Pichler, and A. Egyed, "Exploring code clones in programmable logic controller software." Limassol: IEEE, Sep. 2017, pp. 1–8.
[20] V. Saini, F. Farmahinifarahani, Y. Lu, P. Baldi, and C. V. Lopes, "Oreo: Detection of clones in the twilight zone," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering - ESEC/FSE 2018. Lake Buena Vista, FL, USA: ACM Press, 2018, pp. 354–365.
[21] L. Dinh, J. Sohl-Dickstein, and S. Bengio, "Density estimation using Real NVP," arXiv:1605.08803 [cs, stat], May 2016.
[22] J. Fan, C. Zhang, and J. Zhang, "Generalized Likelihood Ratio Statistics and Wilks Phenomenon," The Annals of Statistics, vol. 29, no. 1, pp. 153–193, 2001.
[23] S. Horwitz, "Identifying the Semantic and Textual Differences Between Two Versions of a Program," in PLDI, 1990.
[24] M. Gabel, L. Jiang, and Z. Su, "Scalable Detection of Semantic Clones," in Proceedings of the 13th International Conference on Software Engineering - ICSE '08. ACM Press, 2008, p. 321.
[25] R. Komondoor and S. Horwitz, "Using Slicing to Identify Duplication in Source Code," in Proceedings of the 8th International Symposium on Static Analysis, ser. SAS '01. London, UK: Springer-Verlag, 2001, pp. 40–56.
[26] S. Wagner, A. Abdulkhaleq, I. Bogicevic, J.-P. Ostberg, and J. Ramadani, "How are functionally similar code clones syntactically different? An empirical study and a benchmark," PeerJ Computer Science, vol. 2, p. e49, Mar. 2016.
[27] L. Jiang and Z. Su, "Automatic Mining of Functionally Equivalent Code Fragments via Random Testing," in Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, ser. ISSTA '09. New York, NY, USA: ACM, 2009, pp. 81–92.
[28] F. Deissenboeck, L. Heinemann, B. Hummel, and S. Wagner, "Challenges of the Dynamic Detection of Functionally Similar Code Fragments," in 2012 16th European Conference on Software Maintenance and Reengineering.