Is this you? Create Your Porfile

Hitesh Sajnani

University of California, Irvine

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Hitesh Sajnani is active.

Explore More

Publication

Featured researches published by Hitesh Sajnani.

international conference on software engineering | 2016

SourcererCC: scaling code clone detection to big-code

Hitesh Sajnani; Vaibhav Saini; Jeffrey Svajlenko; Chanchal K. Roy; Cristina Videira Lopes

Despite a decade of active research, there has been a marked lack in clone detection techniques that scale to large repositories for detecting near-miss clones. In this paper, we present a token-based clone detector, SourcererCC, that can detect both exact and near-miss clones from large inter-project repositories using a standard workstation. It exploits an optimized inverted-index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, as well as the number of required token-comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall and precision of SourcererCC, and compare it to four publicly available and state-of-the-art tools. To measure recall, we use two recent benchmarks: (1) a big benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (25K projects, 250MLOC) using a standard workstation.

international conference on software maintenance | 2011

File cloning in open source Java projects: The good, the bad, and the ugly

Joel Ossher; Hitesh Sajnani; Cristina Videira Lopes

We present a study of the extent to which developers copy entire files or sets of files into their applications with little or no modification. Our aim is to determine the prevalence of such activity within open source Java development, and to identify the circumstances under which files are reused in this manner. To accomplish this aim, we developed a novel method of file-level code clone detection that is scalable to millions of files. We applied our method to the Sourcerer Repository, which contains over 13,000 Java projects aggregated from multiple open source repositories. Our method detected that in excess of 10% of files are clones, and that over 15% of all projects contain at least one cloned file. In addition to computing these raw numbers, we manually examined a large number of the reported clones. We found the most commonly cloned files to be Java extension classes and popular third-party libraries, both large and small. We also discovered a number of projects that occur in multiple online repositories, have been forked, or were divided into multiple subprojects.

mining software repositories | 2012

Trendy bugs: topic trends in the Android bug reports

Lee Martie; Vijay Krishna Palepu; Hitesh Sajnani; Cristina Videira Lopes

Studying vast volumes of bug and issue discussions can give an understanding of what the community has been most concerned about, however the magnitude of documents can overload the analyst. We present an approach to analyze the development of the Android open source project by observing trends in the bug discussions in the Android open source project public issue tracker. This informs us of the features or parts of the project that are more problematic at any given point of time. In turn, this can be used to aid resource allocation (such as time and man power) to parts or features. We support these ideas by presenting the results of issue topic distributions over time using statistical analysis of the bug descriptions and comments for the Android open source project. Furthermore, we show relationships between those time distributions and major development releases of the Android OS.

source code analysis and manipulation | 2015

Can the use of types and query expansion help improve large-scale code search?

Otávio Augusto Lazzarini Lemos; Adriano Carvalho de Paula; Hitesh Sajnani; Cristina Videira Lopes

With the open source code movement, code search with the intent of reuse has become increasingly popular. So much so that researchers have been calling it the new facet of software reuse. Although code search differs from general-purpose document search in essential ways, most tools still rely mainly on keywords matched against source code text. Recently, researchers have proposed more sophisticated ways to perform code search, such as including interface definitions in the queries (e.g., return and parameter types of the desired function, along with keywords; called here Interface-Driven Code Search - IDCS). However, to the best of our knowledge, there are few empirical studies that compare traditional keyword-based code search (KBCS) with more advanced approaches such as IDCS. In this paper we describe an experiment that compares the effectiveness of KBCS with IDCS in the task of large-scale code search of auxiliary functions implemented in Java. We also measure the impact of query expansion based on types and WordNet on both approaches. Our experiment involved 36 subjects that produced real-world queries for 16 different auxiliary functions and a repository with more than 2,000,000 Java methods. Results show that the use of types can improve recall and the number of relevant functions returned (#RFR) when combined with query expansion (~30% improvement in recall, and ~43% improvement in #RFR). However, a more detailed analysis suggests that in some situations it is best to use keywords only, in particular when these are sufficient to semantically define the desired function.

mining software repositories | 2014

A dataset for maven artifacts and bug patterns found in them

Vaibhav Saini; Hitesh Sajnani; Joel Ossher; Cristina Videira Lopes

In this paper, we present data downloaded from Maven, one of the most popular component repositories. The data includes the binaries of 186,392 components, along with source code for 161,025. We identify and organize these components into groups where each group contains all the versions of a library. In order to asses the quality of these components, we make available report generated by the FindBugs tool on 64,574 components. The information is also made available in the form of a database which stores total number, type, and priority of bug patterns found in each component, along with its defect density. We also describe how this dataset can be useful in software engineering research.

international workshop on software clones | 2013

A parallel and efficient approach to large scale clone detection

Hitesh Sajnani; Cristina Videira Lopes

Over the past few years, researchers have implemented various algorithms to improve the scalability of clone detection. Most of these algorithms focus on scaling vertically on a single machine, and require complex intermediate data structures (e.g., suffix tree, etc.). However, several new use-cases of clone detection have emerged, which are beyond the computational capacity of a single machine. Moreover, for some of these use-cases it may be expensive to invest upfront in the cost of building these data structures. In this paper, we propose a technique to horizontally scale clone detection across multiple machines using the popular MapReduce framework. The technique does not require building any complex intermediate data structures. Moreover, in order to increase the efficiency, the technique uses a filtering heuristic to prune the number of code block comparisons. The filtering heuristic is independent of our approach and it can be used in conjunction with other approaches to increase their efficiency. In our experiments, we found that: (i) the computation time to detect clones decreases by almost half every time we double the number of nodes; and (ii) the scaleup is linear, with a decline of not more than 70% compared to the ideal case, on a cluster of 2-32 nodes for 150-2800 projects.

international conference on program comprehension | 2012

Parallel code clone detection using MapReduce

Hitesh Sajnani; Joel Ossher; Cristina Videira Lopes

Code clone detection is an established topic in software engineering research. Many detection algorithms have been proposed and refined but very few exploit the inherent parallelism present in the problem, making large scale code clone detection difficult. To alleviate this shortcoming, we present a new technique to efficiently perform clone detection using the popular MapReduce paradigm. Preliminary experimental results demonstrates speed-up and scale-up of the proposed approach.

working conference on reverse engineering | 2011

Application Architecture Discovery - Towards Domain-driven, Easily-Extensible Code Structure

Hitesh Sajnani; Ravindra Naik; Cristina Videira Lopes

The architecture of a software system and its code structure have a strong impact on its maintainability -- the ability to fix problems, and make changes to the system efficiently. To ensure maintainability, software systems are usually organized as subsystems or modules, each with atomically defined responsibilities. However, as the system evolves, the structure of the system undergoes continuous modifications, drifting away from its original design, leading to functionally non-atomic modules and intertwined dependencies between the modules. In this paper, we propose an approach to improve the code structure and architecture by leveraging the domain knowledge of the system. Our approach exploits the knowledge about the functional architecture of the system to restructure the source code and align physically with the functional elements and the re-usable library layers. The approach is validated by applying to a case study which is an existing financial system. The preliminary analysis for the case-study reveals that the approach creates meaningful structure from the legacy code, which enables the developers to quickly identify the code that implements a given functionality.

international conference on software engineering | 2016

SourcererCC and SourcererCC-I: tools to detect clones in batch mode and during software development

Vaibhav Saini; Hitesh Sajnani; Jaewoo Kim; Cristina Videira Lopes

Given the availability of large source-code repositories, there has been a large number of applications for large-scale clone detection. Unfortunately, despite a decade of active research, there is a marked lack in clone detectors that scale to big software systems or large repositories, specifically for detecting near-miss (Type 3) clones where significant editing activities may take place in the cloned code. This paper demonstrates: (i) SourcererCC, a token-based clone detector that targets the first three clone types, and exploits an index to achieve scalability to large inter-project repositories using a standard workstation. It uses an optimized inverted-index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect clones, as well as the number of required token-comparisons needed to judge a potential clone; and (ii) SourcererCC-I, an Eclipse plug-in, that uses SourcererCCs core engine to identify and navigate clones (both inter and intra project) in real-time during software development. In our experiments, comparing SourcererCC with the state-of-the-art tools we found that it is the only clone detection tool to successfully scale to 250 MLOC on a standard workstation with 12 GB RAM and efficiently detect the first three types of clones (precision 86% and recall 86-100%). Link to the demo: \url{https://youtu.be/l7F_9Qp-ks4}

international conference on software engineering | 2014

Active files as a measure of software maintainability

Lukas Schulte; Hitesh Sajnani; Jacek Czerwonka

In this paper, we explore the set of source files which are changed unusually often. We define these files as active files. Although discovery of active files relies only on version history and defect classification, the simple concept of active files can deliver key insights into software development activities. Active files can help focus code reviews, implement targeted testing, show areas for potential merge conflicts and identify areas that are central for program comprehension. In an empirical study of six large software systems within Microsoft ranging from products to services, we found that active files constitute only between 2-8% of the total system size, contribute 20-40% of system file changes, and are responsible for 60-90% of all defects. Not only this, but we establish that the majority, 65-95%, of the active files are architectural hub files which change due to feature addition as opposed to fixing defects.

Explore More