Siau-Cheng Khoo | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Siau-Cheng Khoo is active.

Explore More

Publication

Featured researches published by Siau-Cheng Khoo.

automated software engineering | 2011

Towards more accurate retrieval of duplicate bug reports

Chengnian Sun; David Lo; Siau-Cheng Khoo; Jing Jiang

In a bug tracking system, different testers or users may submit multiple reports on the same bugs, referred to as duplicates, which may cost extra maintenance efforts in triaging and fixing bugs. In order to identify such duplicates accurately, in this paper we propose a retrieval function (REP) to measure the similarity between two bug reports. It fully utilizes the information available in a bug report including not only the similarity of textual content in summary and description fields, but also similarity of non-textual fields such as product, component, version, etc. For more accurate measurement of textual similarity, we extend BM25F - an effective similarity formula in information retrieval community, specially for duplicate report retrieval. Lastly we use a two-round stochastic gradient descent to automatically optimize REP for specific bug repositories in a supervised learning manner. We have validated our technique on three large software bug repositories from Mozilla, Eclipse and OpenOffice. The experiments show 10–27% relative improvement in recall rate@k and 17–23% relative improvement in mean average precision over our previous model. We also applied our technique to a very large dataset consisting of 209,058 reports from Eclipse, resulting in a recall rate@k of 37–71% and mean average precision of 47%.

knowledge discovery and data mining | 2009

Classification of software behaviors for failure detection: a discriminative pattern mining approach

David Lo; Hong Cheng; Jiawei Han; Siau-Cheng Khoo; Chengnian Sun

Software is a ubiquitous component of our daily life. We often depend on the correct working of software systems. Due to the difficulty and complexity of software systems, bugs and anomalies are prevalent. Bugs have caused billions of dollars loss, in addition to privacy and security threats. In this work, we address software reliability issues by proposing a novel method to classify software behaviors based on past history or runs. With the technique, it is possible to generalize past known errors and mistakes to capture failures and anomalies. Our technique first mines a set of discriminative features capturing repetitive series of events from program execution traces. It then performs feature selection to select the best features for classification. These features are then used to train a classifier to detect failures. Experiments and case studies on traces of several benchmark software systems and a real-life concurrency bug from MySQL server show the utility of the technique in capturing failures and anomalies. On average, our pattern-based classification technique outperforms the baseline approach by 24.68% in accuracy.

knowledge discovery and data mining | 2007

Efficient mining of iterative patterns for software specification discovery

David Lo; Siau-Cheng Khoo; Chao Liu

Studies have shown that program comprehension takes up to 45% of software development costs. Such high costs are caused by the lack-of documented specification and further aggravated by the phenomenon of software evolution. There is a need for automated tools to extract specifications to aid program comprehension. In this paper, a novel technique to efficiently mine common software temporal patterns from traces is proposed. These patterns shed light on program behaviors, and are termed iterative patterns. They capture unique characteristic of software traces, typically not found in arbitrary sequences. Specifically, due to loops, interesting iterative patterns can occur multiple times within a trace. Furthermore, an occurrence of an iterative pattern in a trace can extend across a sequence of indefinite length. Since a program behavior can be manifested in numerous ways, analyzing a single trace will not be sufficient. Iterative pattern mining extends sequential pattern and episode minings to discover frequent iterative patterns which occur repetitively both within a program trace and across multiple traces. In this paper, we present CLIPER (CLosed Iterative Pattern minER) to efficiently mine a closed set of iterative patterns. A performance study on several simulated and real datasets shows the efficiency of our mining algorithm and effectiveness of our pruning strategy. Our case study on JBoss Application Server confirms the usefulness of mined patterns in discovering interesting software behavioral specification.

international conference on data engineering | 2009

Efficient Mining of Closed Repetitive Gapped Subsequences from a Sequence Database

Bolin Ding; David Lo; Jiawei Han; Siau-Cheng Khoo

There is a huge wealth of sequence data available, for example, customer purchase histories, program execution traces, DNA, and protein sequences. Analyzing this wealth of data to mine important knowledge is certainly a worthwhile goal.In this paper, as a step forward to analyzing patterns in sequences, we introduce the problem of mining closed repetitive gapped subsequences and propose efficient solutions. Given a database of sequences where each sequence is an ordered list of events, the pattern we would like to mine is called repetitive gapped subsequence, which is a subsequence (possibly with gaps between two successive events within it) of some sequences in the database. We introduce the concept of repetitive support to measure how frequently a pattern repeats in the database. Different from the sequential pattern mining problem, repetitive support captures not only repetitions of a pattern in different sequences but also the repetitions within a sequence. Given a userspecified support threshold min_sup, we study finding the set of all patterns with repetitive support no less than min_sup. To obtain a compact yet complete result set and improve the efficiency, we also study finding closed patterns. Efficient mining algorithms to find the complete set of desired patterns are proposed based on the idea of instance growth. Our performance study on various datasets shows the efficiency of our approach. A case study is also performed to show the utility of our approach.

Sigplan Notices | 1999

Calculating sized types

Wei-Ngan Chin; Siau-Cheng Khoo

Many program optimisations and analyses, such as array-bound checking, termination analysis, etc, depend on knowing the size of a functions input and output. However, size information can be difficult to compute. Firstly, accurate size computation requires detecting size relation between different inputs of a function. Secondly, different optimisations and analyses may require slightly different size information, and thus slightly different computation. Literature in size computation has mainly concentrated on size checking, instead of inferencing. In this paper, we provide a generic framework on which different size variants can be expressed and computed. We also describe an effective algorithm for inferring, instead of checking, size information. Size information are expressed in terms of Presburger formulae, and our algorithm utilises the Omega Calculator to compute as exact a size information as possible, within the linear arithmetic capability.

working conference on reverse engineering | 2006

QUARK: Empirical Assessment of Automaton-based Specification Miners

David Lo; Siau-Cheng Khoo

Software is often built without specification. Tools to automatically extract specification from software are needed and many techniques have been proposed. One type of these specifications - temporal API specification - is often specified in the form of automaton. There has been much work on reverse engineering or mining software temporal specification, using dynamic analysis techniques; i.e., analysis of software program traces. Unfortunately, the issues of scalability, robustness and accuracy of these techniques have not been comprehensively addressed. In this paper, we describe QUARK (quality assurance framework) that enables assessments of the performance of a specification miner in generating temporal specification of software through traces recorded from its API interaction. QUARK requires the temporal specification produced by the miner to be expressed as an automaton. It accepts a user-defined simulator automaton and a specification miner. It produces quality assurance measures on the specification generated by the miner. Extensive experiments on 3 specification miners have been performed to demonstrate the usefulness of our proposed framework

Higher-order and Symbolic Computation \/ Lisp and Symbolic Computation | 2001

Calculating Sized Types

Wei-Ngan Chin; Siau-Cheng Khoo

Many program optimizations and analyses, such as array-bounds checking, termination analysis, etc., depend on knowing the size of a functions input and output. However, size information can be difficult to compute. Firstly, accurate size computation requires detecting a size relation between different inputs of a function. Secondly, different optimizations and analyses may require slightly different size information, and thus slightly different computation. Literature in size computation has mainly concentrated on size checking, instead of size inference. In this paper, we provide a generic framework on which different size variants can be expressed and computed. We also describe an effective algorithm for inferring, instead of checking, size information. Size information are expressed in terms of Presburger formulae, and our algorithm utilizes the Omega Calculator to compute as exact a size information as possible, within the linear arithmetic capability.

automated software engineering | 2007

Mining modal scenario-based specifications from execution traces of reactive systems

David Lo; Shahar Maoz; Siau-Cheng Khoo

Specification mining is a dynamic analysis process aimed at automatically inferring suggested specifications of a program from its execution traces. We describe a novel method, framework, and tool, for mining inter-object scenario-based specifications in the form of a UML2-compliant variant of Damm and Harels Live Sequence Charts (LSC). LSC extends the classical partial order semantics of sequence diagrams with temporal liveness and symbolic class level lifelines, in order to generate compact and expressive specifications. The output of our algorithm is a sound and complete set of statistically significant LSCs (i.e., satisfying given thresholds of support and confidence), mined from an input execution trace. We locate statistically significant LSCs by exploring the search space of possible LSCs and checking for their statistical significance. In addition, we use an effective search space pruning strategy, specifically adapted to LSCs, which enables efficient mining of scenarios of arbitrary size. We demonstrate and evaluate the utility of our work in mining informative specifications using a case study on Jeti, a popular, full featured messaging application

ACM Transactions on Programming Languages and Systems | 1993

Parameterized partial evaluation

Charles Consel; Siau-Cheng Khoo

Besides specializing programs with respect to concrete values, it is often necessary to specialize programs with respect to abstract values, i.e., static properties such as signs, ranges, and types. Specializing programs with respect to static properties is a natural extension of partial evaluation and significantly contributes towards adapting partial evaluation to larger varieties of applications. This idea was first investigated by Haraldsson [14] and carried out in practice with a system called Redfun in the late seventies. This system partially evaluates Interlisp programs. It manipulates symbolic values such as data types to describe the possible values of a variable and a processed expression. Although the work on Redfun certainly started in the right direction, it has some limitations: (1) the static properties cannot be defined by the user; they are fixed; (2) the approach is not formally defined: no safety condition for the definition of symbolic values, no finiteness criteria for fixpoint iteration, etc.; and (3) because Redfun is an on-line partial evaluator—the treatment of the

Information Systems | 2009

Non-redundant sequential rules-Theory and algorithm

David Lo; Siau-Cheng Khoo; Limsoon Wong

A sequential rule expresses a relationship between two series of events happening one after another. Sequential rules are potentially useful for analyzing data in sequential format, ranging from purchase histories, network logs and program execution traces. In this work, we investigate and propose a syntactic characterization of a non-redundant set of sequential rules built upon past work on compact set of representative patterns. A rule is redundant if it can be inferred from another rule having the same support and confidence. When using the set of mined rules as a composite filter, replacing a full set of rules with a non-redundant subset of the rules does not impact the accuracy of the filter. We consider several rule sets based on composition of various types of pattern sets-generators, projected-database generators, closed patterns and projected-database closed patterns. We investigate the completeness and tightness of these rule sets. We characterize a tight and complete set of non-redundant rules by defining it based on the composition of two pattern sets. Furthermore, we propose a compressed set of non-redundant rules in a spirit similar to how closed patterns serve as a compressed representation of a full set of patterns. Lastly, we propose an algorithm to mine this compressed set of non-redundant rules. A performance study shows that the proposed algorithm significantly improves both the runtime and compactness of mined rules over mining a full set of sequential rules.

Explore More