
Publication


Featured research published by Jeffrey Svajlenko.


International Conference on Software Engineering | 2016

SourcererCC: scaling code clone detection to big-code

Hitesh Sajnani; Vaibhav Saini; Jeffrey Svajlenko; Chanchal K. Roy; Cristina Videira Lopes

Despite a decade of active research, there has been a marked lack of clone detection techniques that scale to large repositories for detecting near-miss clones. In this paper, we present a token-based clone detector, SourcererCC, that can detect both exact and near-miss clones from large inter-project repositories using a standard workstation. It exploits an optimized inverted index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, and the number of token comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall and precision of SourcererCC, and compare it to four publicly available and state-of-the-art tools. To measure recall, we use two recent benchmarks: (1) a big benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (25K projects, 250MLOC) using a standard workstation.
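To illustrate the core idea, the following is a minimal Java sketch of token-bag clone detection with a prefix-filtered inverted index, in the spirit of SourcererCC. The tokenizer, threshold, and data layout here are illustrative simplifications, not the tool's actual implementation.

```java
import java.util.*;

// Minimal sketch of token-bag clone detection with an inverted index and
// prefix filtering, in the spirit of SourcererCC (simplified, not the
// authors' implementation).
public class TokenCloneSketch {
    static final double THETA = 0.7;  // overlap similarity threshold

    public static void main(String[] args) {
        Map<Integer, List<String>> blocks = new HashMap<>();
        blocks.put(1, tokens("int sum = 0 ; for ( int i = 0 ; i < n ; i ++ ) sum += a [ i ] ;"));
        blocks.put(2, tokens("int total = 0 ; for ( int j = 0 ; j < n ; j ++ ) total += a [ j ] ;"));
        blocks.put(3, tokens("System . out . println ( \"hello\" ) ;"));

        // Inverted index: token -> ids of blocks whose *prefix* contains it.
        // Indexing only the first |B| - ceil(theta*|B|) + 1 sorted tokens is
        // the filtering heuristic: any pair meeting the threshold must share
        // at least one prefix token.
        Map<String, List<Integer>> index = new HashMap<>();
        for (Map.Entry<Integer, List<String>> e : blocks.entrySet()) {
            List<String> sorted = new ArrayList<>(e.getValue());
            Collections.sort(sorted);
            int prefixLen = sorted.size() - (int) Math.ceil(THETA * sorted.size()) + 1;
            for (String t : sorted.subList(0, prefixLen))
                index.computeIfAbsent(t, k -> new ArrayList<>()).add(e.getKey());
        }

        // Query each block against the index, then verify the candidates.
        for (Map.Entry<Integer, List<String>> q : blocks.entrySet()) {
            Set<Integer> candidates = new TreeSet<>();
            for (String t : q.getValue())
                candidates.addAll(index.getOrDefault(t, List.of()));
            for (int id : candidates) {
                if (id <= q.getKey()) continue;  // report each pair once
                double sim = overlap(q.getValue(), blocks.get(id));
                if (sim >= THETA)
                    System.out.printf("clone pair (%d, %d), overlap %.2f%n", q.getKey(), id, sim);
            }
        }
    }

    // Overlap similarity: shared tokens / size of the larger bag.
    static double overlap(List<String> a, List<String> b) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : a) counts.merge(t, 1, Integer::sum);
        int shared = 0;
        for (String t : b) {
            Integer c = counts.get(t);
            if (c != null && c > 0) { shared++; counts.put(t, c - 1); }
        }
        return (double) shared / Math.max(a.size(), b.size());
    }

    static List<String> tokens(String s) { return Arrays.asList(s.split("\\s+")); }
}
```

The prefix filter rests on a pigeonhole argument: if two token bags must share at least a fraction θ of the larger bag, they must share at least one token among the first |B| − ⌈θ·|B|⌉ + 1 tokens of a globally sorted ordering, so only that prefix needs to be indexed.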


International Conference on Software Maintenance | 2014

Evaluating Modern Clone Detection Tools

Jeffrey Svajlenko; Chanchal K. Roy

Many clone detection tools and techniques have been introduced in the literature, and these tools have been used to manage clones and study their effects on software maintenance and evolution. However, the performance of these modern tools is not well known, especially their recall. In this paper, we evaluate and compare the recall of eleven modern clone detection tools using four benchmark frameworks, including: (1) Bellon's Framework, (2) our modification to Bellon's Framework to improve the accuracy of its clone matching metrics, (3) Murakami et al.'s extension of Bellon's Framework, which adds Type-3 gap awareness to the framework, and (4) our Mutation and Injection Framework. Bellon's Framework uses a curated corpus of manually validated clones detected by tools contemporary to 2002. In contrast, our Mutation and Injection Framework synthesizes a corpus of artificial clones using a cloning taxonomy produced in 2009. While still very popular in the clone community, there is some concern that Bellon's corpus may not be accurate for modern clone detection tools. We investigate the accuracy of the frameworks by (1) checking for anomalies in their results, (2) checking for agreement between the frameworks, and (3) checking for agreement with our expectations of these tools. Our expectations are researched and deliberately flexible: while they may contain inaccuracies, they are valuable for identifying possible inaccuracies in a benchmark. We find anomalies in the results of Bellon's Framework, and disagreement with both our expectations and the Mutation Framework. We conclude that Bellon's Framework may not be accurate for modern tools, and that an update of its corpus with clones detected by the modern tools is warranted. The results of the Mutation Framework agree with our expectations in most cases, and we suggest that it is a good solution for evaluating modern tools.
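For context, benchmark frameworks of this kind decide whether a tool-reported clone matches a reference clone using interval-based metrics. Below is a hedged Java sketch of the "ok" (containment-based) and "good" (overlap-based) measures as commonly described for Bellon's Framework; the exact definitions and thresholds vary between the framework variants compared here.

```java
// Hedged sketch of interval-based clone matching metrics in the style of
// Bellon's "ok" and "good" measures. Thresholds and details are
// simplified; the compared frameworks differ in their exact rules.
public class CloneMatchSketch {
    // A clone fragment is a line interval [begin, end] within one file.
    record Fragment(String file, int begin, int end) {}

    public static void main(String[] args) {
        // A reference clone pair vs. a tool-reported pair in the same files.
        Fragment r1 = new Fragment("A.java", 10, 30), r2 = new Fragment("B.java", 5, 25);
        Fragment d1 = new Fragment("A.java", 12, 30), d2 = new Fragment("B.java", 5, 20);

        // A pair matches a reference if the metric clears a threshold (0.7 is common).
        System.out.printf("ok   = %.2f%n", Math.min(okF(r1, d1), okF(r2, d2)));
        System.out.printf("good = %.2f%n", Math.min(overlap(r1, d1), overlap(r2, d2)));
    }

    // contained(f, g): fraction of f's lines that g also covers.
    static double contained(Fragment f, Fragment g) {
        return (double) intersect(f, g) / (f.end() - f.begin() + 1);
    }

    // Per-fragment "ok": the better of the two containment directions.
    static double okF(Fragment f, Fragment g) {
        return Math.max(contained(f, g), contained(g, f));
    }

    // Per-fragment "good": intersection over union of the line intervals.
    static double overlap(Fragment f, Fragment g) {
        int inter = intersect(f, g);
        int union = (f.end() - f.begin() + 1) + (g.end() - g.begin() + 1) - inter;
        return (double) inter / union;
    }

    // Number of lines the two intervals share (0 if in different files).
    static int intersect(Fragment f, Fragment g) {
        if (!f.file().equals(g.file())) return 0;
        return Math.max(0, Math.min(f.end(), g.end()) - Math.max(f.begin(), g.begin()) + 1);
    }
}
```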


International Conference on Software Maintenance | 2014

Towards a Big Data Curated Benchmark of Inter-project Code Clones

Jeffrey Svajlenko; Judith F. Islam; Iman Keivanloo; Chanchal K. Roy; Mohammad Mamun Mia

Recently, new applications of code clone detection and search have emerged that rely upon clones detected across thousands of software systems. Big data clone detection and search algorithms have been proposed as an embedded part of these new applications. However, there is no previous benchmark data for evaluating the recall and precision of these emerging techniques. In this paper, we present a Big Data clone detection benchmark that consists of known true and false positive clones in a Big Data inter-project Java repository. The benchmark was built by mining and then manually checking clones of ten common functionalities. The benchmark contains six million true positive clones of different clone types: Type-1, Type-2, Type-3 and Type-4, including various strengths of Type-3 similarity (strong, moderate, weak). These clones were found by three judges over 216 hours of manual validation effort. We show how the benchmark can be used to measure the recall and precision of clone detection techniques.
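As a rough illustration of how a clone pair might be binned by syntactical similarity, the sketch below measures line-based similarity and maps it to the strong/moderate/weak categories mentioned above. The boundary values are assumptions for illustration, not the benchmark's published definitions, and the benchmark's Type-1/Type-2 normalization step is omitted.

```java
import java.util.*;

// Minimal sketch of binning a clone pair by syntactical similarity, in
// the spirit of BigCloneBench's strong/moderate/weak Type-3 categories.
// Boundary values below are assumed for illustration.
public class SimilarityBinSketch {
    public static void main(String[] args) {
        List<String> f1 = List.of("int sum = 0;", "for (int i = 0; i < n; i++)", "sum += a[i];", "return sum;");
        List<String> f2 = List.of("int sum = 0;", "for (int i = 0; i < n; i++)", "sum += b[i];", "return sum;");
        double sim = lineSimilarity(f1, f2);
        System.out.printf("similarity %.2f -> %s%n", sim, bin(sim));
    }

    // Line-based similarity: shared lines over the larger fragment.
    // (Type-1/2 normalization would be applied first; omitted here.)
    static double lineSimilarity(List<String> a, List<String> b) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : a) counts.merge(line.trim(), 1, Integer::sum);
        int shared = 0;
        for (String line : b) {
            Integer c = counts.get(line.trim());
            if (c != null && c > 0) { shared++; counts.put(line.trim(), c - 1); }
        }
        return (double) shared / Math.max(a.size(), b.size());
    }

    static String bin(double sim) {
        if (sim >= 1.0) return "Type-1/Type-2 (after normalization)";
        if (sim >= 0.7) return "strongly Type-3";     // assumed boundary
        if (sim >= 0.5) return "moderately Type-3";   // assumed boundary
        return "weakly Type-3 / Type-4";
    }
}
```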


International Workshop on Software Clones | 2013

A mutation analysis based benchmarking framework for clone detectors

Jeffrey Svajlenko; Chanchal K. Roy; James R. Cordy

In recent years, an abundance of clone detectors has been proposed in the literature. However, most of the tool papers have lacked a solid performance evaluation of the subject tools. This is due both to the lack of an available and reliable benchmark, and to the manual effort required to hand-check a large number of candidate clones. In this tool demonstration paper we show how a mutation analysis based benchmarking framework can be used by developers and researchers to evaluate clone detection tools at a fine granularity with minimal effort.
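A toy Java sketch of the mutation-analysis idea follows: clone-producing mutation operators are applied to a seed fragment to synthesize artificial clones of known type, which are then injected into subject systems to measure per-operator recall. The operators shown are simplified illustrations, not the framework's actual taxonomy-based operators.

```java
import java.util.*;

// Toy sketch of clone-producing mutation operators applied to a seed
// fragment to synthesize artificial clones of known type. The operators
// here are simplified illustrations.
public class MutationSketch {
    public static void main(String[] args) {
        List<String> seed = new ArrayList<>(List.of(
            "int max = a[0];",
            "for (int i = 1; i < a.length; i++)",
            "    if (a[i] > max) max = a[i];",
            "return max;"));

        // Type-1 operator: change formatting/comments only.
        List<String> t1 = new ArrayList<>();
        for (String line : seed) t1.add("  " + line + " // mutated");

        // Type-2 operator: systematically rename an identifier.
        List<String> t2 = new ArrayList<>();
        for (String line : seed) t2.add(line.replace("max", "largest"));

        // Type-3 operator: delete a line to create a gap.
        List<String> t3 = new ArrayList<>(seed);
        t3.remove(2);

        // Each mutant would be injected into a subject system; the tool
        // under evaluation is then run and checked for a report matching
        // the known original/mutant pair, giving fine-grained recall.
        System.out.println("Type-1 mutant:\n" + String.join("\n", t1));
        System.out.println("Type-2 mutant:\n" + String.join("\n", t2));
        System.out.println("Type-3 mutant:\n" + String.join("\n", t3));
    }
}
```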


International Workshop on Software Clones | 2013

Scaling classical clone detection tools for ultra-large datasets: an exploratory study

Jeffrey Svajlenko; Iman Keivanloo; Chanchal K. Roy

Detecting clones from large datasets is an interesting research topic for a number of reasons. However, building scalable clone detection tools is challenging, and it is often impossible to use existing state-of-the-art tools on such large datasets. In this research we investigate the use of our Shuffling Framework for scaling classical clone detection tools to ultra-large datasets. This framework achieves scalability on standard hardware by partitioning the dataset and shuffling the partitions over a number of detection rounds. This approach does not require modification to the subject tools, which allows their individual strengths and precision to be captured at an acceptable loss of recall. In our study, we explored the performance and applicability of our framework for six clone detection tools. The clones found during our experiment were used to comment on the cloning habits of the global Java open-source development community.
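A minimal sketch of the shuffling idea follows, assuming a placeholder runTool() that stands in for executing the subject clone detector on one subset and parsing its report:

```java
import java.util.*;

// Minimal sketch of the shuffling idea: randomly re-partition the input
// over several rounds so that, across rounds, most file pairs co-occur
// in some subset the unmodified tool can handle.
public class ShufflingSketch {
    public static void main(String[] args) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 1000; i++) files.add("file" + i + ".java");

        int subsetSize = 100, rounds = 10;
        Set<String> mergedReport = new HashSet<>();   // deduplicated clone pairs
        Random rnd = new Random(42);

        for (int r = 0; r < rounds; r++) {
            Collections.shuffle(files, rnd);          // shuffle, then re-partition
            for (int start = 0; start < files.size(); start += subsetSize) {
                List<String> subset = files.subList(start, Math.min(start + subsetSize, files.size()));
                mergedReport.addAll(runTool(subset)); // tool runs unmodified per subset
            }
        }
        System.out.println("clone pairs found: " + mergedReport.size());
    }

    // Placeholder: in the real framework this executes the subject tool
    // on the subset and parses its clone report into pair identifiers.
    static Set<String> runTool(List<String> subset) {
        return Collections.emptySet();
    }
}
```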


Journal of Software: Evolution and Process | 2015

Big data clone detection using classical detectors: an exploratory study

Jeffrey Svajlenko; Iman Keivanloo; Chanchal K. Roy

Big data analysis is an emerging research topic in various domains, and clone detection is no exception. The goal is to create big data inter-project clone corpora across open-source or corporate source-code repositories. Such corpora can be used to study developer behavior and to reduce engineering costs by extracting globally duplicated efforts into new APIs, and as a basis for code completion and API usage support. However, building scalable clone detection tools is challenging. It is often impractical to use existing state-of-the-art tools to analyze big data because the memory and execution time required exceed the average user's resources. Some tools have inherent limitations in their data structures and algorithms that prevent the analysis of big data even when extraordinary resources are available. These limitations are impossible to overcome if the source code of the tool is unavailable or if the user lacks the time or expertise to modify the tool without harming its performance or accuracy. In this research, we have investigated the use of our shuffling framework for scaling classical clone detection tools to big data. The framework achieves scalability on commodity hardware by partitioning the input dataset into subsets manageable by the tool and computing resources. A non-deterministic process is used to randomly 'shuffle' the contents of the dataset into a series of subsets. The tool is executed for each subset, and its output for each is merged into a single report. This approach does not require modification to the subject tools, allowing their individual strengths and precision to be captured at an acceptable loss of recall. In our study, we explored the performance and applicability of the framework for the big data dataset IJaDataset 2.0, which consists of 356 million lines of code from 25,000 open-source Java projects. We began with a computationally inexpensive version of our framework based on pure random shuffling. This version was successful at scaling the tools to IJaDataset, but required many subsets to achieve a desirable recall. Using our findings, we incrementally improved the framework to achieve a satisfactory recall using fewer resources. We investigated the use of efficient file tracking and file-similarity heuristics to bias the shuffling algorithm toward subsets of the dataset that contain undetected clone pairs. These changes were successful in improving the recall performance of the framework. Our study shows that the framework is able to achieve up to 90–95% of a tool's native recall using standard hardware.
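One plausible form of the file-similarity heuristic is sketched below: bucket files by a cheap signature so that textually similar files tend to land in the same subset. The MinHash-flavored signature is an illustrative assumption, not the paper's exact heuristic.

```java
import java.util.*;

// Hedged sketch of biasing subset construction toward likely clone
// pairs: sort files by a cheap similarity signature so similar files
// share a subset. The signature (smallest token hash, MinHash-like) is
// an illustrative assumption.
public class BiasedShuffleSketch {
    public static void main(String[] args) {
        Map<String, String> fileTokens = Map.of(
            "A.java", "int sum for i n sum a i",
            "B.java", "int total for j n total a j",
            "C.java", "System out println hello");

        // Sort files by signature; adjacent files are grouped together.
        List<String> files = new ArrayList<>(fileTokens.keySet());
        files.sort(Comparator.comparing((String f) -> signature(fileTokens.get(f))));

        int subsetSize = 2;
        for (int start = 0; start < files.size(); start += subsetSize)
            System.out.println("subset: " + files.subList(start, Math.min(start + subsetSize, files.size())));
    }

    // MinHash-flavored key: the smallest token hash, hex-encoded so that
    // lexicographic sort order matches numeric order.
    static String signature(String tokens) {
        int min = Integer.MAX_VALUE;
        for (String t : tokens.split("\\s+")) min = Math.min(min, t.hashCode() & 0x7fffffff);
        return String.format("%08x", min);
    }
}
```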


International Conference on Software Maintenance | 2016

BigCloneEval: A Clone Detection Tool Evaluation Framework with BigCloneBench

Jeffrey Svajlenko; Chanchal K. Roy

Many clone detection tools have been proposed in the literature. However, our knowledge of their performance in real software systems is limited, particularly their recall. We previously introduced BigCloneBench, a big clone benchmark of over 8 million clones within a large inter-project Java repository containing 25,000 open-source Java systems. In this paper we present BigCloneEval, a framework for evaluating clone detection tools with BigCloneBench. BigCloneEval makes it very easy for clone detection researchers to evaluate and compare clone detection tools. It automates the execution and evaluation of clone detection tools against the reference clones of BigCloneBench, and summarizes recall performance from a variety of perspectives, including per clone type and per syntactical similarity region.
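The core recall computation such a framework automates can be sketched as follows: intersect the tool's detected pairs with the benchmark's reference clones and aggregate per clone type. Matching by exact pair identifier is a simplification here; in practice reference clones are matched by overlap or subsumption.

```java
import java.util.*;

// Sketch of per-type recall over a set of reference clones. The pair
// identifier format and exact-match lookup are illustrative
// simplifications of how a benchmark-driven evaluation works.
public class RecallSketch {
    // A reference clone: an identifier for the pair plus its clone type.
    record RefClone(String pairId, String type) {}

    public static void main(String[] args) {
        List<RefClone> reference = List.of(
            new RefClone("A:10-30|B:5-25", "Type-1"),
            new RefClone("A:40-60|C:1-20", "Type-2"),
            new RefClone("B:30-55|C:25-50", "Type-3"));
        // Pairs parsed from the subject tool's clone report.
        Set<String> detected = Set.of("A:10-30|B:5-25", "B:30-55|C:25-50");

        Map<String, int[]> perType = new TreeMap<>(); // type -> {found, total}
        for (RefClone rc : reference) {
            int[] stats = perType.computeIfAbsent(rc.type(), k -> new int[2]);
            stats[1]++;
            if (detected.contains(rc.pairId())) stats[0]++;
        }
        perType.forEach((type, s) -> System.out.printf(
            "%s recall: %d/%d = %.2f%n", type, s[0], s[1], (double) s[0] / s[1]));
    }
}
```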


International Conference on Software Engineering | 2017

CloneWorks: a fast and flexible large-scale near-miss clone detection tool

Jeffrey Svajlenko; Chanchal K. Roy

Clone detection within large inter-project source-code repositories has numerous rich applications. CloneWorks is a fast and flexible clone detector for large-scale near-miss clone detection experiments. CloneWorks gives the user full control over the processing of the source code before clone detection, enabling the user to target any clone type or perform custom clone detection experiments. Scalable clone detection is achieved, even on commodity hardware, using our partitioned partial indexes approach. CloneWorks scales to 250MLOC in just four hours on an average workstation with good recall and precision.
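A hedged sketch of the partitioned partial indexes idea follows, with a simplified exact-equality verification step standing in for near-miss matching: when the full inverted index does not fit in memory, blocks are indexed in fixed-size partitions and every block is streamed as a query against each partial index in turn, so each clone pair is compared in exactly one partition and scalability costs extra passes rather than memory.

```java
import java.util.*;

// Hedged sketch of partitioned partial indexes: index blocks in
// fixed-size partitions and query every block against each partial
// index. Exact-equality verification is a simplification of the real
// near-miss matching step.
public class PartitionedIndexSketch {
    public static void main(String[] args) {
        List<List<String>> blocks = List.of(
            List.of("int", "sum", "=", "0"),
            List.of("int", "sum", "=", "0"),
            List.of("print", "hello"));

        int partitionSize = 2; // blocks indexed per pass (the memory budget)
        for (int p = 0; p < blocks.size(); p += partitionSize) {
            // Build the partial inverted index for this partition only.
            Map<String, List<Integer>> index = new HashMap<>();
            for (int i = p; i < Math.min(p + partitionSize, blocks.size()); i++)
                for (String t : blocks.get(i))
                    index.computeIfAbsent(t, k -> new ArrayList<>()).add(i);

            // Stream every block as a query against the partial index.
            for (int q = 0; q < blocks.size(); q++) {
                Set<Integer> candidates = new TreeSet<>();
                for (String t : blocks.get(q))
                    candidates.addAll(index.getOrDefault(t, List.of()));
                for (int id : candidates)
                    if (id > q && blocks.get(id).equals(blocks.get(q)))
                        System.out.println("clone pair: " + q + ", " + id);
            }
        }
    }
}
```

Because each indexed block lives in exactly one partition, every candidate pair is verified exactly once, and the memory needed per pass is bounded by the partition size rather than the input size.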


International Conference on Software Engineering | 2017

Fast and flexible large-scale clone detection with CloneWorks

Jeffrey Svajlenko; Chanchal K. Roy

Clone detection in very-large inter-project repositories has numerous applications in software research and development. However, existing tools do not provide the flexibility researchers need to explore this emerging domain. We introduce CloneWorks, a fast and flexible clone detector for large-scale clone detection experiments. CloneWorks gives the user full control over the representation of the source code before clone detection, including easy plug-in of custom source transformation, normalization and filtering logic. The user can then perform targeted clone detection for any type or kind of clone of interest. CloneWorks uses our fast and scalable partitioned partial indexes approach, which can handle any input size on an average workstation using input partitioning. CloneWorks can detect Type-3 clones in an input as large as 250 million lines of code in just four hours on an average workstation, with good recall and precision as measured by our BigCloneBench.
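The plug-in idea could look something like the following; these interfaces are hypothetical illustrations, not CloneWorks' actual API. Each stage rewrites the token stream before detection, so Type-2 normalization or filtering logic can be swapped in without touching the matcher.

```java
import java.util.*;
import java.util.function.UnaryOperator;

// Illustrative-only sketch of a pluggable normalization pipeline; these
// interfaces are hypothetical, not CloneWorks' API.
public class PipelineSketch {
    public static void main(String[] args) {
        // Each stage rewrites the token stream before clone detection.
        List<UnaryOperator<List<String>>> pipeline = List.of(
            PipelineSketch::blindRename,   // Type-2 normalization stage
            PipelineSketch::dropSeparators // filtering stage
        );

        List<String> tokens = new ArrayList<>(List.of("int", "sum", "=", "0", ";"));
        for (UnaryOperator<List<String>> stage : pipeline) tokens = stage.apply(tokens);
        System.out.println(tokens); // [int, ID, =, 0]
    }

    // Replace identifiers with a placeholder so renamed clones still match
    // (crude keyword check for "int" stands in for a real keyword table).
    static List<String> blindRename(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) out.add(t.matches("[a-zA-Z_]\\w*") && !t.equals("int") ? "ID" : t);
        return out;
    }

    // Drop tokens that carry no similarity signal.
    static List<String> dropSeparators(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens);
        out.removeIf(";"::equals);
        return out;
    }
}
```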


International Journal of Software Engineering and Knowledge Engineering | 2016

A Machine Learning Based Approach for Evaluating Clone Detection Tools for a Generalized and Accurate Precision

Jeffrey Svajlenko; Chanchal K. Roy

An important measure of clone detection performance is precision. However, there has been a marked lack of research into methods for efficiently and accurately measuring the precision of a clone detection tool.

Collaboration


Dive into Jeffrey Svajlenko's collaboration.

Top Co-Authors

Chanchal K. Roy

University of Saskatchewan

Judith F. Islam

University of Saskatchewan

Hitesh Sajnani

University of California

Vaibhav Saini

University of California
