Lingxiao Jiang | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Lingxiao Jiang is active.

Explore More

Publication

Featured researches published by Lingxiao Jiang.

international conference on software engineering | 2007

DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones

Lingxiao Jiang; Ghassan Misherghi; Zhendong Su; Stéphane Glondu

Detecting code clones has many software engineering applications. Existing approaches either do not scale to large code bases or are not robust against minor code modifications. In this paper, we present an efficient algorithm for identifying similar subtrees and apply it to tree representations of source code. Our algorithm is based on a novel characterization of subtrees with numerical vectors in the Euclidean space Rnmiddot and an efficient algorithm to cluster these vectors w.r.t. the Euclidean distance metric. Subtrees with vectors in one cluster are considered similar. We have implemented our tree similarity algorithm as a clone detection tool called DECKARD and evaluated it on large code bases written in C and Java including the Linux kernel and JDK. Our experiments show that DECKARD is both scalable and accurate. It is also language independent, applicable to any language with a formally specified grammar.

international conference on software engineering | 2008

Scalable detection of semantic clones

Mark Gabel; Lingxiao Jiang; Zhendong Su

Several techniques have been developed for identifying similar code fragments in programs. These similar fragments, referred to as code clones, can be used to identify redundant code, locate bugs, or gain insight into program design. Existing scalable approaches to clone detection are limited to finding program fragments that are similar only in their contiguous syntax. Other, semantics-based approaches are more resilient to differences in syntax, such as reordered statements, related statements interleaved with other unrelated statements, or the use of semantically equivalent control structures. However, none of these techniques have scaled to real world code bases. These approaches capture semantic information from Program Dependence Graphs (PDGs), program representations that encode data and control dependencies between statements and predicates. Our definition of a code clone is also based on this representation: we consider program fragments with isomorphic PDGs to be clones. In this paper, we present the first scalable clone detection algorithm based on this definition of semantic clones. Our insight is the reduction of the difficult graph similarity problem to a simpler tree similarity problem by mapping carefully selected PDG subgraphs to their related structured syntax. We efficiently solve the tree similarity problem to create a scalable analysis. We have implemented this algorithm in a practical tool and performed evaluations on several million-line open source projects, including the Linux kernel. Compared with previous approaches, our tool locates significantly more clones, which are often more semantically interesting than simple copied and pasted code fragments.

international conference on mobile systems, applications, and services | 2011

Real-time trip information service for a large taxi fleet

Rajesh Krishna Balan; Khoa Xuan Nguyen; Lingxiao Jiang

In this paper, we describe the design, analysis, implementation, and operational deployment of a real-time trip information system that provides passengers with the expected fare and trip duration of the taxi ride they are planning to take. This system was built in cooperation with a taxi operator that operates more than 15,000 taxis in Singapore. We first describe the overall system design and then explain the efficient algorithms used to achieve our predictions based on up to 21 months of historical data consisting of approximately 250 million paid taxi trips. We then describe various optimisations (involving region sizes, amount of history, and data mining techniques) and accuracy analysis (involving routes and weather) we performed to increase both the runtime performance and prediction accuracy. Our large scale evaluation demonstrates that our system is (a) accurate --- with the mean fare error under 1 Singapore dollar (~ 0.76 US

automated software engineering | 2007

Context-aware statistical debugging: from bug predictors to faulty control flow paths

Lingxiao Jiang; Zhendong Su

) and the mean duration error under three minutes, and (b) capable of real-time performance, processing thousands to millions of queries per second. Finally, we describe the lessons learned during the process of deploying this system into a production environment.

international symposium on software testing and analysis | 2009

Automatic mining of functionally equivalent code fragments via random testing

Lingxiao Jiang; Zhendong Su

Effective bug localization is important for realizing automated debugging. One attractive approach is to apply statistical techniques on a collection of evaluation profiles of program properties to help localize bugs. Previous research has proposed various specialized techniques to isolate certain program predicates as bug predictors. However, because many bugs may not be directly associated with these predicates, these techniques are often ineffective in localizing bugs. Relevant control flow paths that may contain bug locations are more informative than stand-alone predicates for discovering and understanding bugs. In this paper, we propose an approach to automatically generate such faulty control flow paths that link many bug predictors together for revealing bugs. Our approach combines feature selection (to accurately select failure-related predicates as bug predictors), clustering (to group correlated predicates), and control flow graph traversal in a novel way to help generate the paths. We have evaluated our approach on code including the Siemens test suite and rhythmbox (a large music management application for GNOME). Our experiments show that the faulty control flow paths are accurate, useful for localizing many bugs, and helped to discover previously unknown errors in rhythmbox

conference on software maintenance and reengineering | 2013

Network Structure of Social Coding in GitHub

Ferdian Thung; Tegawendé François D Assise Bissyande; David Lo; Lingxiao Jiang

Similar code may exist in large software projects due to some common software engineering practices, such as copying and pasting code and n-version programming. Although previous work has studied syntactic equivalence and small-scale, coarse-grained program-level and function-level semantic equivalence, it is not known whether significant fine-grained, code-level semantic duplications exist. Detecting such semantic equivalence is also desirable because it can enable many applications such as code understanding, maintenance, and optimization. In this paper, we introduce the first algorithm to automatically mine functionally equivalent code fragments of arbitrary size - down to an executable statement. Our notion of functional equivalence is based on input and output behavior. Inspired by Schwartzs randomized polynomial identity testing, we develop our core algorithm using automated random testing: (1) candidate code fragments are automatically extracted from the input program; and (2) random inputs are generated to partition the code fragments based on their output values on the generated inputs. We implemented the algorithm and conducted a large-scale empirical evaluation of it on the Linux kernel 2.6.24. Our results show that there exist many functionally equivalent code fragments that are syntactically different (i.e., they are unlikely due to copying and pasting code). The algorithm also scales to million-line programs; it was able to analyze the Linux kernel with several days of parallel processing.

working conference on reverse engineering | 2012

Automatic Defect Categorization

Ferdian Thung; David Lo; Lingxiao Jiang

Social coding enables a different experience of software development as the activities and interests of one developer are easily advertised to other developers. Developers can thus track the activities relevant to various projects in one umbrella site. Such a major change in collaborative software development makes an investigation of networkings on social coding sites valuable. Furthermore, project hosting platforms promoting this development paradigm have been thriving, among which GitHub has arguably gained the most momentum. In this paper, we contribute to the body of knowledge on social coding by investigating the network structure of social coding in GitHub. We collect 100,000 projects and 30,000 developers from GitHub, construct developer-developer and project-project relationship graphs, and compute various characteristics of the graphs. We then identify influential developers and projects on this sub network of GitHub by using PageRank. Understanding how developers and projects are actually related to each other on a social coding site is the first step towards building tool supports to aid social programmers in performing their tasks more efficiently.

Journal of Software: Evolution and Process | 2014

Extended comprehensive study of association measures for fault localization

Lucia Lucia; David Lo; Lingxiao Jiang; Ferdian Thung; Aditya Budi

Defects are prevalent in software systems. In order to understand defects better, industry practitioners often categorize bugs into various types. One common kind of categorization is the IBMs Orthogonal Defect Classification (ODC). ODC proposes various orthogonal classification of defects based on much information about the defects, such as the symptoms and semantics of the defects, the root cause analysis of the defects, and many more. With these category labels, developers can better perform post-mortem analysis to find out what the common characteristics of the defects that plague a particular software project are. Albeit the benefits of having these categories, for many software systems, these category labels are often missing. To address this problem, we propose a text mining solution that can categorize defects into various types by analyzing both texts from bug reports and code features from bug fixes. To this end, we have manually analyzed the data about 500 defects from three software systems, and classified them according to ODC. In addition, we propose a classification-based approach that can automatically classify defects into three super-categories that are comprised of ODC categories: control and data flow, structural, and non-functional. Our empirical evaluation shows that the automatic classification approach is able to label defects with an average accuracy of 77.8% by using the SVM multiclass classification algorithm.

international conference on software maintenance | 2012

Inferring semantically related software terms and their taxonomy by leveraging collaborative tagging

Shaowei Wang; David Lo; Lingxiao Jiang

Spectrum‐based fault localization is a promising approach to automatically locate root causes of failures quickly. Two well‐known spectrum‐based fault localization techniques, Tarantula and Ochiai, measure how likely a program element is a root cause of failures based on profiles of correct and failed program executions. These techniques are conceptually similar to association measures that have been proposed in statistics, data mining, and have been utilized to quantify the relationship strength between two variables of interest (e.g., the use of a medicine and the cure rate of a disease). In this paper, we view fault localization as a measurement of the relationship strength between the execution of program elements and program failures. We investigate the effectiveness of 40 association measures from the literature on locating bugs. Our empirical evaluations involve single‐bug and multiple‐bug programs. We find there is no best single measure for all cases. Klosgen and Ochiai outperform other measures for localizing single‐bug programs. Although localizing multiple‐bug programs, Added Value could localize the bugs with on average smallest percentage of inspected code, whereas a number of other measures have similar performance. The accuracies of the measures in localizing multi‐bug programs are lower than single‐bug programs, which provokes future research. Copyright

acm symposium on applied computing | 2013

An empirical study on developer interactions in StackOverflow

Shaowei Wang; David Lo; Lingxiao Jiang

Many software engineering tasks, such as feature location and duplicate bug report detection, leverages similarities among textual corpora. However, due to the different words used by developers to express the same concept, exact matching of words is insufficient. One document can contain a particular word while the other document may contain another word that is semantically related but is not the same. Such word differences may cause inaccuracies in subsequent software engineering tasks. Recently, tagging has impacted the software engineering community. Developers increasingly use tags to describe important features of a software product. Many project hosting sites allow users to tag various projects with their own words. It becomes increasingly important to understand and relate these tags. Based on the tags available from software project hosting websites, we propose a similarity metric to infer semantically related terms, each of which is a tag, and build a taxonomy that could further describe the relationships among these terms. We have built a sample taxonomy from tens of thousands of projects and their tags. Our user studies show that our proposed similarity metric for tags are indeed related to the semantic similarity of the terms, and the resultant semantic taxonomy among terms is reasonably good.

Explore More