Chaoqiang Zhang | Researchain

Publication

Featured research published by Chaoqiang Zhang.

International Symposium on Software Testing and Analysis | 2013

Comparing non-adequate test suites using coverage criteria

Milos Gligoric; Alex Groce; Chaoqiang Zhang; Rohan Sharma; Mohammad Amin Alipour; Darko Marinov

A fundamental question in software testing research is how to compare test suites, often as a means for comparing test-generation techniques. Researchers frequently compare test suites by measuring their coverage. A coverage criterion C provides a set of test requirements and measures how many requirements a given suite satisfies. A suite that satisfies 100% of the (feasible) requirements is C-adequate. Previous rigorous evaluations of coverage criteria mostly focused on such adequate test suites: given criteria C and C′, are C-adequate suites (on average) more effective than C′-adequate suites? However, in many realistic cases producing adequate suites is impractical or even impossible. We present the first extensive study that evaluates coverage criteria for the common case of non-adequate test suites: given criteria C and C′, which one is better to use to compare test suites? Namely, if suites T1, T2, …, Tn have coverage values c1, c2, …, cn for C and c′1, c′2, …, c′n for C′, is it better to compare suites based on c1, c2, …, cn or based on c′1, c′2, …, c′n? We evaluate a large set of plausible criteria, including statement and branch coverage, as well as stronger criteria used in recent studies. Two criteria perform best: branch coverage and an intra-procedural acyclic path coverage.

International Symposium on Software Testing and Analysis | 2012

Swarm testing

Alex Groce; Chaoqiang Zhang; Eric Eide; Yang Chen; John Regehr

Swarm testing is a novel and inexpensive way to improve the diversity of test cases generated during random testing. Increased diversity leads to improved coverage and fault detection. In swarm testing, the usual practice of potentially including all features in every test case is abandoned. Rather, a large “swarm” of randomly generated configurations, each of which omits some features, is used, with configurations receiving equal resources. We have identified two mechanisms by which feature omission leads to better exploration of a system’s state space. First, some features actively prevent the system from executing interesting behaviors; e.g., “pop” calls may prevent a stack data structure from executing a bug in its overflow detection logic. Second, even when there is no active suppression of behaviors, test features compete for space in each test, limiting the depth to which logic driven by features can be explored. Experimental results show that swarm testing increases coverage and can improve fault detection dramatically; for example, in a week of testing it found 42% more distinct ways to crash a collection of C compilers than did the heavily hand-tuned default configuration of a random tester.
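The configuration step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the feature names, the 50% inclusion probability, and the helper names `swarm_configurations` and `generate_test` are all assumptions made for the example.

```python
import random


def swarm_configurations(features, num_configs, seed=0, include_prob=0.5):
    """Return num_configs random feature subsets (assumed: independent coin flips)."""
    rng = random.Random(seed)
    configs = []
    for _ in range(num_configs):
        config = {f for f in features if rng.random() < include_prob}
        if not config:
            # Avoid an empty configuration, which could not generate any test.
            config = {rng.choice(features)}
        configs.append(config)
    return configs


def generate_test(config, length, rng):
    """Build one random test that uses only the features enabled in config."""
    enabled = sorted(config)
    return [rng.choice(enabled) for _ in range(length)]


# Hypothetical features of a stack API under test.
features = ["push", "pop", "peek", "clear"]
configs = swarm_configurations(features, num_configs=8, seed=42)

# Every configuration receives the same budget: here, one test of length 10.
rng = random.Random(1)
tests = [generate_test(c, length=10, rng=rng) for c in configs]
```

A configuration that happens to omit "pop" produces tests that can only grow the stack, which is the kind of behavior the abstract says is needed to reach overflow logic.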

Software Testing, Verification & Reliability | 2016

Cause reduction: delta debugging, even without bugs

Alex Groce; Mohammad Amin Alipour; Chaoqiang Zhang; Yang Chen; John Regehr

What is a test case for? Sometimes, to expose a fault. Tests can also exercise code, use memory or time, or produce desired output. Given a desired effect, a test case can be seen as a cause, and its components divided into essential (required for effect) and accidental. Delta debugging is used for removing accidents from failing test cases, producing smaller test cases that are easier to understand. This paper extends delta debugging by simplifying test cases with respect to arbitrary effects, a generalization called cause reduction. Suites produced by cause reduction provide effective quick tests for real‐world programs. For Mozilla's JavaScript engine, the reduced suite is possibly more effective for finding faults. The effectiveness of reduction‐based suites persists through changes to the software, improving coverage by over 500 branches for versions up to 4 months later. Cause reduction has other applications, including improving seeded symbolic execution, where using reduced tests can often double the number of additional branches explored.
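The idea of reducing a test with respect to an arbitrary effect can be sketched as a simplified delta-debugging loop whose predicate is "the effect is preserved" instead of "the test fails". The code below is an illustrative sketch under assumptions: `reduce_cause`, the toy `COVERAGE` map, and the step names are hypothetical, and the loop is a simplified complement-removal variant rather than the exact algorithm used in the paper.

```python
def reduce_cause(test, has_effect):
    """Shrink `test` (a list of steps) while has_effect(test) stays True.

    Simplified ddmin: repeatedly try removing one chunk; refine the chunk
    size when no removal preserves the effect.
    """
    assert has_effect(test), "the original test must exhibit the effect"
    n = 2  # current number of chunks
    while len(test) >= 2:
        chunk = max(1, len(test) // n)
        reduced = False
        for start in range(0, len(test), chunk):
            candidate = test[:start] + test[start + chunk:]
            if candidate and has_effect(candidate):
                test = candidate
                n = max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if chunk == 1:
                break  # no single step can be removed: 1-minimal
            n = min(n * 2, len(test))
    return test


# Toy effect: preserve the set of branches covered by the original test.
COVERAGE = {"a": {1}, "b": {2}, "c": {1, 2}, "d": {3}, "e": set()}


def covered(test):
    return set().union(*(COVERAGE[step] for step in test))


original = ["a", "e", "b", "e", "d", "c", "e"]
target = covered(original)
reduced = reduce_cause(original, lambda t: covered(t) == target)
```

Here the effect is branch coverage, so the reduced test is a shorter test that still covers everything the original covered; swapping in a different predicate (for example, "still fails") recovers ordinary delta debugging.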

IEEE Transactions on Software Engineering | 2014

You Are the Only Possible Oracle: Effective Test Selection for End Users of Interactive Machine Learning Systems

Alex Groce; Todd Kulesza; Chaoqiang Zhang; Shalini Shamasunder; Margaret M. Burnett; Weng-Keen Wong; Simone Stumpf; Shubhomoy Das; Amber Shinsel; Forrest Bice; Kevin McIntosh

How do you test a program when only a single user, with no expertise in software testing, is able to determine if the program is performing correctly? Such programs are common today in the form of machine-learned classifiers. We consider the problem of testing this common kind of machine-generated program when the only oracle is an end user: e.g., only you can determine if your email is properly filed. We present test selection methods that provide very good failure rates even for small test suites, and show that these methods work in both large-scale random experiments using a “gold standard” and in studies with real users. Our methods are inexpensive and largely algorithm-independent. Key to our methods is an exploitation of properties of classifiers that is not possible in traditional software testing. Our results suggest that it is plausible for time-pressured end users to interactively detect failures, even very hard-to-find failures, without wading through a large number of successful (and thus less useful) tests. We additionally show that some methods are able to find the arguably most difficult-to-detect faults of classifiers: cases where machine learning algorithms have high confidence in an incorrect result.
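One classifier property that a selection method could exploit is the confidence attached to each prediction. The sketch below ranks predictions by ascending confidence and hands the user only the first few to check. It is an assumption made for illustration and should not be read as a description of the paper's specific methods; the function name, the tuple layout, and the sample data are hypothetical.

```python
def select_tests(predictions, budget):
    """Pick the `budget` least-confident predictions for the user to inspect.

    predictions: list of (item_id, predicted_label, confidence) tuples.
    """
    ranked = sorted(predictions, key=lambda p: p[2])
    return [item_id for item_id, _, _ in ranked[:budget]]


# Hypothetical output of an email-filing classifier.
predictions = [
    ("msg-1", "work", 0.97),
    ("msg-2", "personal", 0.52),
    ("msg-3", "work", 0.61),
    ("msg-4", "spam", 0.99),
    ("msg-5", "personal", 0.55),
]

to_review = select_tests(predictions, budget=2)
```

A ranking like this cannot surface the high-confidence errors mentioned at the end of the abstract, which is one reason a single selection criterion is unlikely to be sufficient on its own.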

ACM Transactions on Software Engineering and Methodology | 2015

Guidelines for Coverage-Based Comparisons of Non-Adequate Test Suites

Milos Gligoric; Alex Groce; Chaoqiang Zhang; Rohan Sharma; Mohammad Amin Alipour; Darko Marinov

A fundamental question in software testing research is how to compare test suites, often as a means for comparing test-generation techniques that produce those test suites. Researchers frequently compare test suites by measuring their coverage. A coverage criterion C provides a set of test requirements and measures how many requirements a given suite satisfies. A suite that satisfies 100% of the feasible requirements is called C-adequate. Previous rigorous evaluations of coverage criteria mostly focused on such adequate test suites: given two criteria C and C′, are C-adequate suites on average more effective than C′-adequate suites? However, in many realistic cases, producing adequate suites is impractical or even impossible. This article presents the first extensive study that evaluates coverage criteria for the common case of non-adequate test suites: given two criteria C and C′, which one is better to use to compare test suites? Namely, if suites T1, T2,…,Tn have coverage values c1, c2,…,cn for C and c1′, c2′,…,cn′ for C′, is it better to compare suites based on c1, c2,…,cn or based on c1′, c2′,…,cn′? We evaluate a large set of plausible criteria, including basic criteria such as statement and branch coverage, as well as stronger criteria used in recent studies, including criteria based on program paths, equivalence classes of covered statements, and predicate states. The criteria are evaluated on a set of Java and C programs with both manually written and automatically generated test suites. The evaluation uses three correlation measures. Based on these experiments, two criteria perform best: branch coverage and an intraprocedural acyclic path coverage. We provide guidelines for testing researchers aiming to evaluate test suites using coverage criteria as well as for other researchers evaluating coverage criteria for research use.
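The comparison question can be made concrete with a rank correlation between coverage values and some measure of suite effectiveness. The abstract does not name its three correlation measures, so the use of Kendall's tau here is an assumption, as are the effectiveness scores and coverage values, which are invented for the example.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical data for suites T1..T5: an effectiveness score per suite
# and its coverage under two criteria C and C'.
effectiveness = [0.30, 0.45, 0.50, 0.62, 0.70]
cov_c = [0.40, 0.55, 0.60, 0.71, 0.80]
cov_c_prime = [0.50, 0.45, 0.65, 0.60, 0.90]

tau_c = kendall_tau(cov_c, effectiveness)
tau_c_prime = kendall_tau(cov_c_prime, effectiveness)

# Under this measure, the criterion whose coverage ranks suites more like
# their effectiveness is the better one to compare suites with.
better = "C" if tau_c > tau_c_prime else "C'"
```

In this toy data C orders the suites exactly as their effectiveness does, while C′ swaps two pairs, so C would be preferred for comparing these suites.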

International Symposium on Software Testing and Analysis | 2014

Using test case reduction and prioritization to improve symbolic execution

Chaoqiang Zhang; Alex Groce; Mohammad Amin Alipour

Scaling symbolic execution to large programs or programs with complex inputs remains difficult due to path explosion and complex constraints, as well as external method calls. Additionally, creating an effective test structure with symbolic inputs can be difficult. A popular symbolic execution strategy in practice is to perform symbolic execution not “from scratch” but based on existing test cases. This paper proposes that the effectiveness of this approach to symbolic execution can be enhanced by (1) reducing the size of seed test cases and (2) prioritizing seed test cases to maximize exploration efficiency. The proposed test case reduction strategy is based on a recently introduced generalization of delta debugging, and our prioritization techniques include novel methods that, for this purpose, can outperform some traditional regression testing algorithms. We show that applying these methods can significantly improve the effectiveness of symbolic execution based on existing test cases.
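For context on the prioritization step, the following sketch shows a traditional greedy "additional coverage" ordering of the kind used in regression testing. It is offered as a baseline illustration only and is not one of the paper's novel methods; the seed names and branch sets are hypothetical.

```python
def prioritize_by_additional_coverage(seed_coverage):
    """Order seed tests greedily by how many not-yet-covered branches each adds.

    seed_coverage: dict mapping seed test name -> set of covered branch ids.
    """
    remaining = dict(seed_coverage)
    covered = set()
    order = []
    while remaining:
        # Iterating over sorted names makes ties resolve alphabetically.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        order.append(best)
        covered |= remaining.pop(best)
    return order


# Hypothetical branch coverage of four seed tests.
seeds = {
    "t1": {1, 2, 3},
    "t2": {3, 4},
    "t3": {1, 2},
    "t4": {5, 6, 7, 8},
}

order = prioritize_by_additional_coverage(seeds)
```

With a fixed time budget for symbolic execution, seeds early in such an order would be explored first, so the ordering decides which parts of the program receive exploration effort.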

International Symposium on Software Reliability Engineering | 2013

Help, help, I'm being suppressed! The significance of suppressors in software testing

Alex Groce; Chaoqiang Zhang; Mohammad Amin Alipour; Eric Eide; Yang Chen; John Regehr

Test features are basic compositional units used to describe what a test does (and does not) involve. For example, in API-based testing, the most obvious features are function calls; in grammar-based testing, the obvious features are the elements of the grammar. The relationship between features as abstractions of tests and produced behaviors of the tested program is surprisingly poorly understood. This paper shows how large-scale random testing modified to use diverse feature sets can uncover causal relationships between what a test contains and what the program being tested does. We introduce a general notion of observable behaviors as targets, where a target can be a detected fault, an executed branch or statement, or a complex coverage entity such as a state, predicate-valuation, or program path. While it is obvious that targets have triggers (features without which they cannot be hit by a test), the notion of suppressors (features which make a test less likely to hit a target) has received little attention despite having important implications for automated test generation and program understanding. For a set of subjects including C compilers, a flash file system, and JavaScript engines, we show that suppression is both common and important.
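The trigger and suppressor definitions above can be illustrated by comparing how often a target is hit in tests with and without a given feature. The sketch below is a naive version under assumptions: it compares raw hit rates with no statistical confidence test, and the run data, feature names, and function names are hypothetical.

```python
def hit_rate(hits):
    """Fraction of runs that hit the target, or None if there are no runs."""
    return sum(hits) / len(hits) if hits else None


def classify(runs, feature):
    """Label `feature` as 'trigger', 'suppressor', 'neither', or 'unknown'.

    runs: list of (enabled_features, hit_target) pairs from random testing.
    """
    with_f = hit_rate([hit for feats, hit in runs if feature in feats])
    without_f = hit_rate([hit for feats, hit in runs if feature not in feats])
    if with_f is None or without_f is None:
        return "unknown"  # feature was always or never enabled
    if without_f == 0 and with_f > 0:
        return "trigger"  # target never hit when the feature is absent
    if with_f < without_f:
        return "suppressor"  # target hit less often when the feature is present
    return "neither"


# Hypothetical runs against a stack; the target is an overflow-handling branch.
runs = [
    ({"push"}, True),
    ({"push", "peek"}, True),
    ({"push", "pop"}, False),
    ({"push", "pop", "peek"}, False),
    ({"push", "pop"}, True),
    ({"push", "pop", "peek"}, True),
    ({"pop"}, False),
    ({"pop", "peek"}, False),
]

labels = {f: classify(runs, f) for f in ["push", "pop", "peek"]}
```

With real data, small differences in hit rate would need a significance test before a feature is called a suppressor; the strict comparison here is only adequate for a toy example.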

International Symposium on Quality Electronic Design | 2015

Exploiting abstraction, learning from random simulation, and SVM classification for efficient dynamic prediction of software health problems

Miroslav N. Velev; Chaoqiang Zhang; Ping Gao; Alex Groce

We present industrial experience on software health monitoring. Our goal was to determine whether we can predict abnormal behavior, based on data captured from software system interfaces. To analyze the system state and predict software health problems, we used Support Vector Machine (SVM) based analysis. To train the SVM, we exploited random testing with feedback and swarm testing with feedback to generate traces that exercise diverse scenarios, including both normal and abnormal behaviors that can be classified based on the system state after completing an API call. We then used the resulting classifier produced by the SVM-based analysis to predict whether an API call will result in abnormal behavior, given the input values to the API, and other system information. We applied this procedure to a subset of the API functions in the YAFFS2 flash file system, with the objective of predicting whether the health parameter of available free space will go below a threshold, relative to the total space in the flash file system. For several API functions, we achieved prediction accuracy of over 96%. We attribute the high prediction accuracy to using random testing with feedback that is optimized to produce execution traces with highly diverse behavior, which combined with the chosen representation of the system state and length of the traces resulted in a sufficient number of training vectors with diverse numeric values for the API functions of interest.

Programming Language Design and Implementation | 2013