Chengnian Sun

University of California, Davis

Publication

Featured research published by Chengnian Sun.

automated software engineering | 2011

Towards more accurate retrieval of duplicate bug reports

Chengnian Sun; David Lo; Siau-Cheng Khoo; Jing Jiang

In a bug tracking system, different testers or users may submit multiple reports on the same bugs, referred to as duplicates, which can add maintenance effort in triaging and fixing bugs. To identify such duplicates accurately, in this paper we propose a retrieval function (REP) to measure the similarity between two bug reports. It fully utilizes the information available in a bug report, including not only the similarity of textual content in the summary and description fields, but also the similarity of non-textual fields such as product, component, version, etc. For more accurate measurement of textual similarity, we extend BM25F, an effective similarity formula from the information retrieval community, specifically for duplicate report retrieval. Lastly, we use a two-round stochastic gradient descent to automatically optimize REP for specific bug repositories in a supervised learning manner. We have validated our technique on three large software bug repositories from Mozilla, Eclipse and OpenOffice. The experiments show 10–27% relative improvement in recall rate@k and 17–23% relative improvement in mean average precision over our previous model. We also applied our technique to a very large dataset consisting of 209,058 reports from Eclipse, resulting in a recall rate@k of 37–71% and mean average precision of 47%.
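
As a rough illustration of the kind of textual similarity this abstract builds on, the sketch below ranks toy bug reports with plain Okapi BM25. It is not REP and not the paper's BM25F extension: there is no per-field weighting, no non-textual features, and no learned parameters. The function name `bm25_scores`, the constants `k1` and `b`, and the sample reports are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with plain BM25.

    Illustrative only: `query` is a list of tokens, `docs` is a list of
    token lists, and k1/b are common default values, not tuned ones.
    """
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical, already-tokenized bug report summaries.
reports = [
    ["crash", "on", "startup"],
    ["button", "color", "wrong"],
    ["startup", "crash", "after", "update"],
]
new_report = ["crash", "at", "startup"]
ranking = bm25_scores(new_report, reports)
```

In this toy data the second report shares no terms with the new report and scores zero, while the two crash reports score above it and would be shown to a triager as duplicate candidates.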

knowledge discovery and data mining | 2009

Classification of software behaviors for failure detection: a discriminative pattern mining approach

David Lo; Hong Cheng; Jiawei Han; Siau-Cheng Khoo; Chengnian Sun

Software is a ubiquitous component of our daily life. We often depend on the correct working of software systems. Due to the difficulty and complexity of software systems, bugs and anomalies are prevalent. Bugs have caused billions of dollars in losses, in addition to privacy and security threats. In this work, we address software reliability issues by proposing a novel method to classify software behaviors based on past history or runs. With the technique, it is possible to generalize past known errors and mistakes to capture failures and anomalies. Our technique first mines a set of discriminative features capturing repetitive series of events from program execution traces. It then performs feature selection to select the best features for classification. These features are then used to train a classifier to detect failures. Experiments and case studies on traces of several benchmark software systems and a real-life concurrency bug from MySQL server show the utility of the technique in capturing failures and anomalies. On average, our pattern-based classification technique outperforms the baseline approach by 24.68% in accuracy.

automated software engineering | 2012

Duplicate bug report detection with a combination of information retrieval and topic modeling

Anh Tuan Nguyen; Tung Thanh Nguyen; Tien N. Nguyen; David Lo; Chengnian Sun

Detecting duplicate bug reports helps reduce triaging efforts and save time for developers in fixing the same issues. Among several automated detection approaches, text-based information retrieval (IR) approaches have been shown to outperform others in terms of both accuracy and time efficiency. However, those IR-based approaches do not reliably detect duplicate reports that describe the same technical issues in different terms. This paper introduces DBTM, a duplicate bug report detection approach that takes advantage of both IR-based features and topic-based features. DBTM models a bug report as a textual document describing certain technical issue(s), and models duplicate bug reports as the ones about the same technical issue(s). Trained with historical data including identified duplicate reports, it is able to learn the sets of different terms describing the same technical issues and to detect other not-yet-identified duplicates. Our empirical evaluation on real-world systems shows that DBTM improves on the state-of-the-art approaches by up to 20% in accuracy.

working conference on reverse engineering | 2012

Information Retrieval Based Nearest Neighbor Classification for Fine-Grained Bug Severity Prediction

Yuan Tian; David Lo; Chengnian Sun

Bugs are prevalent in software systems. Some bugs are critical and need to be fixed right away, whereas others are minor and their fixes could be postponed until resources are available. In this work, we propose a new approach leveraging information retrieval, in particular a BM25-based document similarity function, to automatically predict the severity of bug reports. Our approach automatically analyzes bug reports reported in the past along with their assigned severity labels, and recommends severity labels for newly reported bug reports. Duplicate bug reports are utilized to determine which bug report features, be they textual, ordinal, or categorical, are important. We focus on predicting fine-grained severity labels, namely the different severity labels of Bugzilla: blocker, critical, major, minor, and trivial. Compared to the existing state-of-the-art study on fine-grained severity prediction, namely the work by Menzies and Marcus, our approach brings significant improvement.
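
The nearest-neighbor idea in this abstract can be sketched as follows: rank past reports by similarity to the new one and let the closest ones vote on a severity label. This is a simplified stand-in, not the paper's method. Jaccard token overlap replaces the BM25-based similarity, the vote is unweighted, and the names `jaccard`, `predict_severity`, and the sample history are assumptions introduced here.

```python
from collections import Counter


def jaccard(a, b):
    """Token-set overlap, used here only as a placeholder similarity."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def predict_severity(new_report, history, k=3, similarity=jaccard):
    """Return the majority severity label among the k most similar past reports.

    `history` is a list of (tokens, severity_label) pairs.
    """
    ranked = sorted(
        history,
        key=lambda item: similarity(new_report, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled history of past reports.
history = [
    (["app", "crash", "startup"], "critical"),
    (["crash", "on", "save"], "critical"),
    (["typo", "in", "menu"], "trivial"),
    (["menu", "label", "typo"], "trivial"),
    (["slow", "startup"], "minor"),
]
label = predict_severity(["crash", "during", "startup"], history)
```

Passing a different `similarity` callable is how a BM25-style score could be swapped in for the placeholder.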

conference on software maintenance and reengineering | 2012

Improved Duplicate Bug Report Identification

Yuan Tian; Chengnian Sun; David Lo

Bugs are prevalent in software systems. To improve the reliability of software systems, developers often allow end users to provide feedback on bugs that they encounter. Users can do this by submitting a bug report to a bug report management system like Bugzilla. This process, however, is uncoordinated and distributed, which means that many users could submit bug reports reporting the same problem. These are referred to as duplicate bug reports. The existence of many duplicate bug reports may cause much unnecessary manual effort, as a triager often needs to manually tag bug reports as duplicates. Recently, there have been a number of studies that investigate the duplicate bug report problem, which in effect address the following task: given a new bug report, retrieve k other similar bug reports. This, however, still requires substantial manual effort, which could be reduced further. Jalbert and Weimer were the first to introduce the direct detection of duplicate bug reports, which addresses the task: given a new bug report, classify it as a duplicate bug report or not. In this paper, we extend Jalbert and Weimer's work by improving the accuracy of automated duplicate bug report identification. We experiment with bug reports from the Mozilla bug tracking system that were reported between February 2005 and October 2005, and find that we could improve the accuracy of the previous approach by about 160%.

international conference on software maintenance | 2013

DRONE: Predicting Priority of Reported Bugs by Multi-factor Analysis

Yuan Tian; David Lo; Chengnian Sun

Bugs are prevalent. To improve software quality, developers often allow users to report bugs that they find using a bug tracking system such as Bugzilla. Users specify, among other things, a description of the bug, the component that is affected by the bug, and the severity of the bug. Based on this information, bug triagers then assign a priority level to the reported bug. As resources are limited, bug reports are investigated based on their priority levels. This priority assignment process, however, is a manual one. Could we do better? In this paper, we propose an automated approach based on machine learning that recommends a priority level based on information available in bug reports. Our approach considers multiple factors (temporal, textual, author, related-report, severity, and product) that potentially affect the priority level of a bug report. These factors are extracted as features, which are then used to train a discriminative model via a new classification algorithm that handles ordinal class labels and imbalanced data. Experiments on more than a hundred thousand bug reports from Eclipse show that we can outperform baseline approaches in terms of average F-measure by a relative improvement of 58.61%.

conference on object oriented programming systems languages and applications | 2015

Finding deep compiler bugs via guided stochastic program mutation

Vu Le; Chengnian Sun; Zhendong Su

Compiler testing is important and challenging. Equivalence Modulo Inputs (EMI) is a recent promising approach for compiler validation. It is based on mutating the unexecuted statements of an existing program under some inputs to produce new equivalent test programs w.r.t. these inputs. Orion is a simple realization of EMI that only randomly deletes unexecuted statements. Despite its success in finding many bugs in production compilers, Orion’s effectiveness is still limited by its simple, blind mutation strategy. To more effectively realize EMI, this paper introduces a guided, advanced mutation strategy based on Bayesian optimization. Our goal is to generate diverse programs to more thoroughly exercise compilers. We achieve this with two techniques: (1) the support of both code deletions and insertions in the unexecuted regions, leading to a much larger test program space; and (2) the use of an objective function that promotes control-flow-diverse programs for guiding Markov Chain Monte Carlo (MCMC) optimization to explore the search space. Our technique helps discover deep bugs that require elaborate mutations. Our realization, Athena, targets C compilers. In 19 months, Athena has found 72 new bugs, many of which are deep and important, in GCC and LLVM. Developers have confirmed all 72 bugs and fixed 68 of them.
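
The blind deletion strategy that the abstract attributes to Orion can be pictured with a toy sketch: statements that a given input executed are always kept, and every other statement is dropped with some probability. This is only an illustration under strong assumptions. A program is modeled as a flat list of statement labels, so syntactic nesting and real coverage collection are ignored, and neither insertions nor the MCMC guidance described above are modeled. The name `emi_variant` and its parameters are hypothetical.

```python
import random


def emi_variant(statements, executed, delete_prob=0.5, rng=None):
    """Return a variant that keeps executed statements and randomly drops the rest.

    `statements` is a list of statement labels and `executed` is the set of
    indices covered by the chosen input (both are toy stand-ins).
    """
    rng = rng or random.Random()
    variant = []
    for index, stmt in enumerate(statements):
        # Executed statements must survive so behavior on this input is unchanged.
        if index in executed or rng.random() >= delete_prob:
            variant.append(stmt)
    return variant


program = ["s0", "s1", "s2", "s3", "s4"]
covered = {0, 2, 4}
pruned = emi_variant(program, covered, delete_prob=1.0)
```

With `delete_prob=1.0` every unexecuted statement is removed, so `pruned` contains only `s0`, `s2`, and `s4`. A compiler that produces different behavior for the original and the variant on the same input would be a candidate bug.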

Empirical Software Engineering | 2015

Automated prediction of bug report priority using multi-factor analysis

Yuan Tian; David Lo; Xin Xia; Chengnian Sun

automated software engineering | 2013

TzuYu: learning stateful typestates

Hao Xiao; Jun Sun; Yang Liu; Shang-Wei Lin; Chengnian Sun

Behavioral models are useful for various software engineering tasks. They are, however, often missing in practice. Thus, specification mining was proposed to tackle this problem. Existing work either focuses on learning simple behavioral models such as finite-state automata, or relies on techniques (e.g., symbolic execution) to infer finite-state machines equipped with data states, referred to as stateful typestates. The former is often inadequate as finite-state automata lack expressiveness in capturing behaviors of data-rich programs, whereas the latter is often not scalable. In this work, we propose a fully automated approach to learn stateful typestates by extending the classic active learning process to generate transition guards (i.e., propositions on data states). The proposed approach has been implemented in a tool called TzuYu and evaluated against a number of Java classes. The evaluation results show that TzuYu is capable of learning correct stateful typestates more efficiently.

international symposium on software testing and analysis | 2015

Randomized stress-testing of link-time optimizers

Vu Le; Chengnian Sun; Zhendong Su

Link-time optimization (LTO) is an increasingly important and widely adopted modern optimization technology. It is currently supported by many production compilers, including GCC, LLVM, and Microsoft Visual C/C++. Despite its complexity, and because it is more recent, LTO is less tested than the more mature, traditional optimizations. To evaluate and help improve the quality of LTO, we present the first extensive effort to stress-test the LTO components of GCC and LLVM, the two most widely used production C compilers. In 11 months, we have discovered and reported 37 bugs (12 in GCC; 25 in LLVM). Developers have confirmed 21 of our bugs and fixed 11 of them. Our core technique is differential testing, realized in the tool Proteus. We leverage existing compiler testing tools (Csmith and Orion) to generate single-file test programs and address two important challenges specific to LTO testing. First, to thoroughly exercise LTO, Proteus automatically transforms a single-file program into multiple compilation units and stochastically assigns each an optimization level. Second, for effective bug reporting, we develop a practical mechanism to reduce LTO bugs involving multiple files. Our results clearly demonstrate Proteus’s utility; we plan to make ours a continuous effort in validating link-time optimizers.
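
Differential testing, the core technique named in this abstract, can be summarized in a few lines: feed the same input to several implementations that should agree and flag any input on which their outputs differ. The sketch below is not Proteus. Plain Python functions stand in for compiler configurations, and the name `differential_test` and the sample implementations are assumptions made for illustration.

```python
def differential_test(implementations, test_inputs):
    """Return (input, outputs) pairs where the implementations disagree.

    `implementations` maps a name to a callable; all callables are expected
    to compute the same function.
    """
    disagreements = []
    for value in test_inputs:
        outputs = {name: impl(value) for name, impl in implementations.items()}
        # More than one distinct output means at least one implementation is wrong.
        if len(set(outputs.values())) > 1:
            disagreements.append((value, outputs))
    return disagreements


# Hypothetical stand-ins for two builds of the same program.
implementations = {
    "reference": abs,
    "buggy": lambda x: x,  # wrong for negative inputs
}
found = differential_test(implementations, [-2, 0, 3])
```

Here only the input `-2` is reported, because the two stand-ins agree on `0` and `3`. In a compiler setting, the callables would instead compile and run one test program under different optimization settings.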
