Marcel Böhme
National University of Singapore
Publications
Featured research published by Marcel Böhme.
International Conference on Software Engineering | 2013
Marcel Böhme; Bruno C. d. S. Oliveira; Abhik Roychoudhury
Regression verification (RV) seeks to guarantee the absence of regression errors in a changed program version. This paper presents Partition-based Regression Verification (PRV): an approach to RV based on the gradual exploration of differential input partitions. A differential input partition is a subset of the common input space of two program versions that serves as a unit of verification. Instead of proving the absence of regression for the complete input space at once, PRV verifies differential partitions in a gradual manner. If the exploration is interrupted, PRV retains partial verification guarantees at least for the explored differential partitions. This is crucial in practice as verifying the complete input space can be prohibitively expensive. Experiments show that PRV provides a useful alternative to state-of-the-art regression test generation techniques. During the exploration, PRV generates test cases which can expose different behaviour across two program versions. However, while test cases are generally single points in the common input space, PRV can verify entire partitions and moreover give feedback that allows programmers to relate a behavioral difference to those syntactic changes that contribute to this difference.
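For intuition, here is a minimal sketch of the core loop, assuming the two versions are plain Python functions and a differential partition is given as an explicit, finite set of inputs; the names old_version, new_version, and verify_partition are illustrative only, and PRV itself derives differential partitions symbolically rather than by enumeration.

    # Illustrative sketch only: partitions are enumerated explicitly here,
    # whereas PRV computes differential input partitions symbolically.

    def old_version(x: int) -> int:
        return abs(x)

    def new_version(x: int) -> int:
        return x if x >= 0 else -x

    def verify_partition(partition, old, new):
        """A partition is 'verified' if both versions agree on every input;
        otherwise the differing inputs serve as regression-exposing tests."""
        witnesses = [x for x in partition if old(x) != new(x)]
        return (len(witnesses) == 0, witnesses)

    # Gradual exploration: each verified partition is a partial guarantee.
    for partition in (range(-10, 0), range(0, 10)):
        ok, tests = verify_partition(partition, old_version, new_version)
        print(partition, "verified" if ok else f"differs on {tests}")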
International Symposium on Software Testing and Analysis | 2014
Marcel Böhme; Abhik Roychoudhury
Intuitively, we know that some software errors are more complex than others. If an error can be fixed by changing one faulty statement, it is a simple error. The more substantial the fix must be, the more complex we consider the error. In this work, we formally define and quantify the complexity of an error w.r.t. the complexity of the error's least complex, correct fix. As a concrete measure of complexity for such fixes, we introduce Cyclomatic Change Complexity, which is inspired by existing program complexity metrics. Moreover, we introduce CoREBench, a collection of 70 regression errors systematically extracted from several open-source C projects, and compare their complexity with that of the seeded errors in the two most popular error benchmarks, SIR and the Siemens Suite. We find that seeded errors are significantly less complex, i.e., require significantly less substantial fixes, than actual regression errors. For example, more than 42% of the seeded errors are simple compared to 8% of the actual ones. This is a concern for the external validity of studies based on seeded errors, and we propose CoREBench for the controlled study of regression testing, debugging, and repair techniques.
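As a rough illustration of a change-complexity measure, the sketch below computes the classic cyclomatic number M = E - N + 2P over a hand-made control-flow graph of a fix; the graph and the use of this exact formula are assumptions for illustration, while the paper's Cyclomatic Change Complexity is defined precisely over the changed code.

    # Sketch: cyclomatic number M = E - N + 2P of a (made-up) fix's CFG.
    # A one-statement fix with no added branching yields a low value;
    # a fix that adds conditional logic yields a higher one.

    def cyclomatic(nodes, edges, components=1):
        return len(edges) - len(nodes) + 2 * components

    nodes = {"entry", "check", "then", "else", "exit"}
    edges = {("entry", "check"), ("check", "then"), ("check", "else"),
             ("then", "exit"), ("else", "exit")}

    print(cyclomatic(nodes, edges))  # 5 - 5 + 2 = 2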
Foundations of Software Engineering | 2013
Marcel Böhme; Bruno C. d. S. Oliveira; Abhik Roychoudhury
Changes often introduce program errors, and hence recent software testing literature has focused on generating tests which stress changes. In this paper, we argue that changes cannot be treated as isolated program artifacts which are stressed via testing. Instead, it is the complex dependency across multiple changes which introduces subtle errors. Furthermore, the complex dependence structures that need to be exercised to expose such errors ensure that they remain undiscovered even in well-tested and deployed software. We motivate our work with empirical evidence from a well-tested and stable project, Linux GNU Coreutils, where we found that one third of the regressions take more than two (2) years to be fixed, and that two thirds of such long-standing regressions are introduced due to change interactions for the utilities we investigated. To combat change interaction errors, we first define a notion of change interaction where several program changes are found to affect the result of a program statement via program dependencies. Based on this notion, we propose a change sequence graph (CSG) to summarize the control flow and dependencies across changes. The CSG is then used as a guide during program path exploration via symbolic execution, thereby efficiently producing test cases which witness change interaction errors. Our experimental infrastructure was deployed on various utilities of GNU Coreutils, which have been distributed with Linux for almost twenty years. Apart from finding five (5) previously unknown errors in the utilities, we found that only one in five generated test cases exercises a sequence that is critical to exposing a change-interaction error, while being an order of magnitude more likely to expose an error. On the other hand, stressing changes in isolation exposed only half of the change interaction errors. These results demonstrate the importance and difficulty of change dependence-aware regression testing.
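To make the CSG idea concrete, here is a small sketch in which the graph is an adjacency map over changed statements and change sequences are enumerated up to a bounded length; the change names and edges are hypothetical, and the actual CSG is derived from control flow and program dependencies across the real changes.

    # Hypothetical change sequence graph: an edge means one change can reach
    # another via control flow or a program dependency.
    from collections import defaultdict

    csg = defaultdict(list)
    csg["c1"] += ["c2", "c3"]
    csg["c2"] += ["c4"]
    csg["c3"] += ["c4"]

    def change_sequences(graph, start, max_len=4):
        """Enumerate acyclic sequences of changes reachable from `start`;
        each sequence is a candidate interaction to cover with a test."""
        stack = [[start]]
        while stack:
            path = stack.pop()
            yield path
            if len(path) < max_len:
                for nxt in graph[path[-1]]:
                    if nxt not in path:
                        stack.append(path + [nxt])

    for seq in change_sequences(csg, "c1"):
        print(" -> ".join(seq))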
Automated Software Engineering | 2016
Van-Thuan Pham; Marcel Böhme; Abhik Roychoudhury
Many real-world programs take highly structured and complex files as inputs. The automated testing of such programs is non-trivial. If the test input does not adhere to a specific file format, the program returns a parser error. For symbolic execution-based whitebox fuzzing, the corresponding error-handling code becomes a significant time sink: too much time is spent in the parser exploring too many paths that lead to trivial parser errors. Naturally, the time is better spent exploring the functional part of the program, where failure on valid input exposes deep and real bugs. In this paper, we suggest leveraging information about the file format and the data chunks of existing, valid files to swiftly carry the exploration beyond the parser code. We call our approach Model-based Whitebox Fuzzing (MoWF) because the file format input model of blackbox fuzzers can be exploited as a constraint on the vast input space to rule out most invalid inputs during path exploration in symbolic execution. We evaluate MoWF on 13 vulnerabilities in 8 large program binaries with 6 separate file formats and find that it exposes all vulnerabilities, while traditional whitebox fuzzing and model-based blackbox fuzzing each expose less than half. Our experiments also demonstrate that MoWF exposes 70% of the vulnerabilities without any seed inputs.
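The sketch below conveys the flavour of keeping fuzzed inputs well-formed: a made-up, length-prefixed chunk format stands in for the real file-format models, and only the payload of one chunk is fuzzed so that generated inputs survive the parser. The format, function names, and chunk tags are assumptions, not part of MoWF.

    import random
    import struct

    def make_chunk(tag: bytes, data: bytes) -> bytes:
        # Hypothetical format: 4-byte big-endian length, 4-byte tag, data.
        return struct.pack(">I", len(data)) + tag + data

    def well_formed_input(payload: bytes) -> bytes:
        # Header and trailer chunks stay valid; only the payload is fuzzed,
        # so the input passes the parser and reaches deeper program logic.
        return (make_chunk(b"HEAD", b"\x01\x00")
                + make_chunk(b"DATA", payload)
                + make_chunk(b"EOF ", b""))

    random.seed(0)
    fuzzed = well_formed_input(bytes(random.randrange(256) for _ in range(16)))
    print(fuzzed.hex())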
Generative Programming and Component Engineering | 2009
Florian Heidenreich; Jendrik Johannes; Mirko Seifert; Christian Wende; Marcel Böhme
Template languages are widely used within generative programming because they provide intuitive means to generate software artefacts expressed in a specific object language. However, most template languages perform template instantiation on the level of string literals, which allows neither syntax checks nor semantic analysis. To make sure that generated artefacts always conform to the object language, we propose to perform static analysis at template design time. In addition, the increasing popularity of domain-specific languages (DSLs) demands an approach that allows reusing both the concepts of template languages and the corresponding tools. In this paper we address these issues by presenting how existing languages can be automatically extended with generic template concepts (e.g., placeholders, loops, conditions) to obtain safe template languages. These languages provide means for syntax checking and static semantic analysis w.r.t. the object language at template design time. We discuss the prerequisites for this extension, analyse the types of correctness properties that can be assured at template design time, and exemplify the key benefits of this approach on a textual DSL and Java.
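For contrast, the sketch below shows the weaker guarantee that string-level templating can offer at best: parsing the instantiated artefact after the fact, with Python standing in as the object language. The approach in the paper goes further and checks the template itself at design time, before any instantiation; the template text and names here are illustrative assumptions.

    import ast
    from string import Template

    template = Template("def $name():\n    return $value\n")

    def instantiate(name: str, value: str) -> str:
        code = template.substitute(name=name, value=value)
        ast.parse(code)  # rejects malformed artefacts, but only after instantiation
        return code

    print(instantiate("answer", "42"))       # well-formed artefact
    try:
        instantiate("broken name", "42")     # the space makes the definition invalid
    except SyntaxError as err:
        print("rejected malformed artefact:", err.msg)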
IEEE Transactions on Software Engineering | 2016
Marcel Böhme; Soumya Paul
We study the relative efficiencies of the random and systematic approaches to automated software testing. Using a simple but realistic set of assumptions, we propose a general model for software testing and define sampling strategies for random (R) and systematic (S0) testing, where each sampling is associated with a sampling cost: 1 and c units of time, respectively. The two most important goals of software testing are: (i) achieving in minimal time a given degree of confidence x in a program's correctness and (ii) discovering a maximal number of errors within a given time bound n̂. For both (i) and (ii), we show that there exists a bound on c beyond which R performs better than S0 on average. Moreover, for (i) this bound depends asymptotically only on x. We also show that the efficiency of R can be fitted to an exponential curve. Using these results, we design a hybrid strategy H that starts with R and switches to S0 when S0 is expected to discover more errors per unit time. In our experiments we find that H performs similarly to or better than the more efficient of the two, and that S0 may need to be significantly faster than our bounds suggest to retain its efficiency advantage over R.
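A toy simulation of the cost trade-off, under strong simplifying assumptions: here S0's only systematic advantage is sampling without replacement, and the domain size, error rate, and values of c are made up. The paper's analytical model is far more general.

    import random

    def run(strategy, budget, domain, errors, c):
        # R samples with replacement at cost 1; S0 samples each input at
        # most once at cost c. Returns the number of errors discovered.
        found, spent = set(), 0.0
        unseen = list(range(domain))
        random.shuffle(unseen)
        while True:
            cost = 1.0 if strategy == "R" else c
            if spent + cost > budget or (strategy == "S0" and not unseen):
                return len(found)
            spent += cost
            x = random.randrange(domain) if strategy == "R" else unseen.pop()
            if x in errors:
                found.add(x)

    random.seed(1)
    domain, budget = 10_000, 2_000
    errors = set(random.sample(range(domain), 50))
    for c in (1.5, 5.0, 20.0):
        print(f"c={c:>4}: R found {run('R', budget, domain, errors, c)}, "
              f"S0 found {run('S0', budget, domain, errors, c)}")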
International Conference on Software Engineering | 2012
Marcel Böhme
It has been known for more than 20 years that, if the subdomains are not homogeneous, partition testing strategies such as branch or statement testing neither perform significantly better than random input generation nor inspire confidence when a test suite succeeds. Yet, measuring the adequacy of test suites in terms of code coverage is still common practice. The main goal of our research is to develop strategies for the automatic evolution of a test suite that does inspire confidence. When the program is changed, test cases that witness changed output for the same input shall be added (test suite augmentation). If two test cases witness the same partition, one is to be discarded (test suite reduction).
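A minimal sketch of the reduction step, under the assumption that a partition can be identified by the pair of outputs the two versions produce for an input; the functions version_a and version_b are placeholders, not from the paper.

    def version_a(x: int) -> int:
        return x // 3

    def version_b(x: int) -> int:
        return (x + 1) // 3

    def reduce_suite(tests):
        # Keep exactly one test case per witnessed partition, where a
        # partition is identified here by the pair of observed outputs.
        representatives = {}
        for t in tests:
            signature = (version_a(t), version_b(t))
            representatives.setdefault(signature, t)
        return list(representatives.values())

    print("reduced suite:", reduce_suite(range(12)))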
ACM Transactions on Software Engineering and Methodology | 2018
Marcel Böhme
A fundamental challenge of software testing is the statistically well-grounded extrapolation from program behaviors observed during testing. For instance, a security researcher who has run a fuzzer for a week currently has no means (i) to estimate the total number of feasible program branches, given that only a fraction has been covered so far, (ii) to estimate the additional time required to cover 10% more branches, or (iii) to assess the residual risk that a vulnerability exists when no vulnerability has been discovered. Failing to discover a vulnerability does not mean that none exists, even if the fuzzer was run for a week (or a year). Hence, testing provides no formal correctness guarantees. In this article, I establish an unexpected connection with the otherwise unrelated scientific field of ecology and introduce a statistical framework that models Software Testing and Analysis as Discovery of Species (STADS). For instance, in order to study the species diversity of arthropods in a tropical rain forest, ecologists would first sample a large number of individuals from that forest, determine their species, and extrapolate from the properties observed in the sample to properties of the whole forest. The estimation (i) of the total number of species, (ii) of the additional sampling effort required to discover 10% more species, and (iii) of the probability of discovering a new species are classical problems in ecology. The STADS framework draws on over three decades of research in ecological biostatistics to address the fundamental extrapolation challenge for automated test generation. Our preliminary empirical study demonstrates good estimator performance even for a fuzzer with adaptive sampling bias (AFL, a state-of-the-art vulnerability detection tool). The STADS framework provides statistical correctness guarantees with quantifiable accuracy.
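As an example of the kind of biostatistics the framework draws on, the sketch below applies the well-known Chao1 species-richness estimator to branch discovery, with "species" read as program branches; the coverage numbers are made up, and STADS itself discusses which estimators are appropriate under a fuzzer's sampling bias.

    def chao1(observed: int, f1: int, f2: int) -> float:
        """Chao1 lower-bound estimate of total species richness, where f1/f2
        are the numbers of species (branches) seen exactly once/twice."""
        if f2 > 0:
            return observed + (f1 * f1) / (2.0 * f2)
        return observed + f1 * (f1 - 1) / 2.0  # bias-corrected form for f2 = 0

    # Hypothetical campaign: 1,200 branches covered, 150 seen once, 60 twice.
    print(f"estimated total branches: {chao1(1200, 150, 60):.0f}")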
International Conference on Software Engineering | 2017
Marcel Böhme; Ezekiel O. Soremekun; Sudipta Chattopadhyay; Emamurho Ugherughe; Andreas Zeller
How do professional software engineers debug computer programs? In an experiment with 27 real bugs that existed in several widely used programs, we invited 12 professional software engineers, who together spent one month localizing, explaining, and fixing these bugs. This not only allowed us to study the various tools and strategies used to debug the same set of errors. We could also determine exactly which statements a developer would localize as faults, how a developer would diagnose and explain an error, and how a developer would fix an error, all of which software engineering researchers seek to automate. Until now, it has been difficult to evaluate the effectiveness and utility of automated debugging techniques without a user study. We publish the collected data, called DBGBENCH, to facilitate the effective evaluation of automated fault localization, diagnosis, and repair techniques w.r.t. the judgement of human experts.
Advances in Computers | 2013
Marcel Böhme; Abhik Roychoudhury; Bruno C. d. S. Oliveira
Software changes, such as bug fixes or feature additions, can introduce software bugs and reduce code quality. As a result, tests which passed earlier may not pass any more, thereby exposing a regression in software behavior. This survey gives an overview of recent advances in determining the impact of code changes on the program's behavior and on other syntactic program artifacts. Static program analysis can help determine change impact in an approximate manner, while dynamic analysis determines change impact more precisely but requires a regression test suite. Moreover, as the program is changed, the corresponding test suite may need to change, too. Some tests may become obsolete, while other tests that stress the changes are to be added. This article surveys such test generation techniques to stress and propagate program changes. It concludes that a combination of dependency analysis and lightweight symbolic execution shows promise in providing powerful techniques for regression test generation.