Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Toon Calders is active.

Publication


Featured research published by Toon Calders.


Conference on Software Maintenance and Reengineering | 2005

Applying Webmining techniques to execution traces to support the program comprehension process

Andy Zaidman; Toon Calders; Serge Demeyer; Jan Paredaens

Well-designed object-oriented programs typically consist of a few key classes that work tightly together to provide the bulk of the functionality. As such, these key classes are excellent starting points for the program comprehension process. We propose a technique that uses Webmining principles on execution traces to discover these important and tightly interacting classes. Based on two medium-scale case studies - Apache Ant and Jakarta JMeter - and detailed architectural information from their developers, we show that our heuristic does in fact find a sizeable number of the classes deemed important by the developers.
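The webmining principle at work here is link analysis over the runtime call graph: classes that call important classes, and are themselves called by good "hubs", stand out. As a rough illustration (not the authors' exact heuristic, and on an invented call graph), a HITS-style hub/authority iteration:

```python
# Minimal HITS-style link analysis over a hypothetical class call graph.
# An edge ("A", "B") means class A calls class B in the execution trace.
# Classes with high hub and authority scores are candidate "key classes".

def hits(edges, iterations=50):
    nodes = {n for e in edges for n in e}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of callers, then L2-normalize.
        auth = {n: sum(hub[a] for a, b in edges if b == n) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of authority scores of callees, then L2-normalize.
        hub = {n: sum(auth[b] for a, b in edges if a == n) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

calls = [("Main", "Scheduler"), ("Main", "Parser"),
         ("Scheduler", "Task"), ("Parser", "Task"), ("Task", "Logger")]
hub, auth = hits(calls)
print(max(auth, key=auth.get))  # the class most called by good hubs
```

Here "Task" surfaces as the top authority because both "Scheduler" and "Parser" call it; on real traces the same iteration runs over thousands of classes.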


International Conference on Data Mining | 2010

Approximation of Frequentness Probability of Itemsets in Uncertain Data

Toon Calders; Calin Garboni; Bart Goethals

Mining frequent itemsets from transactional datasets is a well-known problem with good algorithmic solutions. Most of these algorithms assume that the input data is free from errors. Real data, however, is often affected by noise. Such noise can be represented by uncertain datasets in which each item has an existence probability. Recently, Bernecker et al. (2009) proposed the frequentness probability, i.e., the probability that a given itemset is frequent, to select itemsets in an uncertain database, together with a dynamic programming approach to evaluate this measure. We argue, however, that for the setting of Bernecker et al. (2009), which assumes independence between the items, well-known statistical tools already exist. We show how the frequentness probability can be approximated extremely accurately using a form of the central limit theorem. We experimentally evaluated our approximation and compared it to the dynamic programming approach. The evaluation shows that our approximation method is extremely accurate even for very small databases, while at the same time it has much lower memory overhead and computation time.
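Under the independence assumption, the support of an itemset is a sum of independent Bernoulli variables (a Poisson binomial), so the frequentness probability is a tail probability that a normal distribution approximates well. A minimal sketch of both the exact dynamic program and the CLT approximation, on made-up per-transaction probabilities:

```python
import math

def frequentness_exact(probs, minsup):
    # Poisson-binomial tail via dynamic programming: dp[k] = P(support = k).
    dp = [1.0]
    for p in probs:
        new = [0.0] * (len(dp) + 1)
        for k, v in enumerate(dp):
            new[k] += v * (1 - p)   # itemset absent from this transaction
            new[k + 1] += v * p     # itemset present in this transaction
        dp = new
    return sum(dp[minsup:])

def frequentness_clt(probs, minsup):
    # Normal approximation with continuity correction:
    # P(support >= minsup) ~ P(N(mu, sigma) >= minsup - 0.5).
    mu = sum(probs)
    sigma = math.sqrt(sum(p * (1 - p) for p in probs))
    z = (minsup - 0.5 - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

probs = [0.3] * 100     # hypothetical: itemset exists with prob 0.3 per transaction
exact = frequentness_exact(probs, 25)
approx = frequentness_clt(probs, 25)
```

The dynamic program costs O(n^2) time and O(n) memory in the number of transactions; the approximation is O(n) time with constant memory, which is the efficiency gap the abstract refers to.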


Knowledge Discovery and Data Mining | 2010

Efficient pattern mining of uncertain data with sampling

Toon Calders; Calin Garboni; Bart Goethals

Mining frequent itemsets from transactional datasets is a well-known problem with good algorithmic solutions. In the case of uncertain data, however, several new techniques have been proposed. Unfortunately, these proposals often perform poorly when many items occur with many different probabilities. Here we propose an approach based on sampling that instantiates “possible worlds” of the uncertain data, on which we subsequently run optimized frequent itemset mining algorithms. As such we gain efficiency at a surprisingly low loss in accuracy. This is confirmed by a statistical and an empirical evaluation on real and synthetic data.
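The sampling idea fits in a few lines: flip a biased coin for each uncertain item to instantiate one possible world, count support as in a certain database, and average over worlds. This toy version (invented data, a single itemset) estimates expected support rather than running a full itemset miner per world as the paper does:

```python
import random

def estimated_support(uncertain_db, itemset, n_worlds=2000, seed=1):
    # uncertain_db: list of transactions, each mapping item -> existence
    # probability. Each iteration instantiates one "possible world" by
    # flipping a biased coin per uncertain item, then counts support exactly.
    rng = random.Random(seed)
    itemset = set(itemset)
    total = 0
    for _ in range(n_worlds):
        for trans in uncertain_db:
            if all(rng.random() < trans.get(i, 0.0) for i in itemset):
                total += 1
    return total / n_worlds

db = [{"a": 0.9, "b": 0.8}, {"a": 0.5}, {"a": 0.9, "b": 0.1}]
# True expected support of {a, b} is 0.9*0.8 + 0 + 0.9*0.1 = 0.81.
est = estimated_support(db, {"a", "b"})
```

Each sampled world is an ordinary transactional database, which is why off-the-shelf optimized miners can be reused on it unchanged.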


International Conference on Data Mining | 2007

Mining Frequent Itemsets in a Stream

Toon Calders; Nele Dexters; Bart Goethals

We study the problem of finding frequent itemsets in a continuous stream of transactions. The current frequency of an itemset in a stream is defined as its maximal frequency over all windows in the stream reaching from any point in the past up to the current state and satisfying a minimal length constraint. Properties of this new measure are studied, and we propose an incremental algorithm that can, at any time, immediately produce the current frequencies of all frequent itemsets. Experimental and theoretical analysis shows that the space requirements of the algorithm are extremely small for many realistic data distributions.
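For reference, the max-frequency measure itself can be stated as a brute force over all suffix windows meeting the length constraint; the paper's contribution is an incremental summary that avoids this recomputation. A sketch of the measure only (toy transactions, not the incremental algorithm):

```python
def max_frequency(stream, target, min_len):
    # Maximal frequency of `target` over all windows that start at any past
    # point, end at the current state, and span at least `min_len` transactions.
    best = 0.0
    for start in range(len(stream) - min_len + 1):
        window = stream[start:]
        best = max(best, sum(target in t for t in window) / len(window))
    return best

stream = [{"a"}, {"b"}, {"a"}, {"a"}]
```

With min_len = 2 the recent all-"a" window wins (frequency 1.0); with min_len = 4 the whole stream is the only admissible window (frequency 0.75), showing how the length constraint damps bursts.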


Symposium on Principles of Database Systems | 2004

Computational complexity of itemset frequency satisfiability

Toon Calders

Computing frequent itemsets is one of the most prominent problems in data mining. We introduce a new, related problem, called FREQSAT: given some itemset-interval pairs, does there exist a database such that for every pair the frequency of the itemset falls in the interval? It is shown in this paper that FREQSAT is not finitely axiomatizable and that it is NP-complete. We also study cases in which other characteristics of the database are given as well. These characteristics can complicate FREQSAT even more. For example, when the maximal number of duplicates of a transaction is known, FREQSAT becomes PP-hard. We describe applications of FREQSAT in frequent itemset mining algorithms and privacy in data mining.
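To make the FREQSAT definition concrete, here is a brute-force checker over databases of bounded size. The real problem places no a-priori bound on the database, so this sketch only illustrates the definition on toy instances; the constraint intervals and items are invented:

```python
from itertools import combinations, combinations_with_replacement

def freqsat(items, constraints, max_size=4):
    # constraints: {itemset: (low, high)} frequency intervals. Search all
    # databases with at most `max_size` transactions for one satisfying
    # every interval; return it, or None if none exists within the bound.
    subsets = [frozenset(c) for r in range(len(items) + 1)
               for c in combinations(items, r)]
    for n in range(1, max_size + 1):
        for db in combinations_with_replacement(subsets, n):
            if all(lo <= sum(I <= t for t in db) / n <= hi
                   for I, (lo, hi) in constraints.items()):
                return db
    return None

# freq({a,b}) = 1 forces both items into every transaction, so freq({a}) = 1,
# contradicting the interval [0.5, 0.5]: this instance is unsatisfiable.
unsat = freqsat("ab", {frozenset("a"): (0.5, 0.5), frozenset("ab"): (1.0, 1.0)})
```

The NP-completeness result says that no checker fundamentally better than such a guess-and-verify search is likely to exist.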


European Conference on Principles of Data Mining and Knowledge Discovery | 2007

Efficient AUC Optimization for Classification

Toon Calders; Szymon Jaroszewicz

In this paper we show an efficient method for inducing classifiers that directly optimize the area under the ROC curve. Recently, AUC gained importance in the classification community as a means of comparing the performance of classifiers. Because most classification methods do not optimize this measure directly, several learning methods are emerging that do. These methods, however, require many costly computations of the AUC, and hence do not scale well to large datasets. In this paper, we develop a method to increase the efficiency of computing AUC based on a polynomial approximation of the AUC. As a proof of concept, the approximation is plugged into the construction of a scalable linear classifier that directly optimizes AUC using gradient descent. Experiments on real-life datasets show the high accuracy and efficiency of the polynomial approximation.
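The expensive part is that AUC is a sum over all (positive, negative) pairs, which is not differentiable. The paper smooths it with a polynomial; as an illustrative stand-in, the sketch below uses a sigmoid-smoothed pairwise surrogate for a linear scorer instead (the pairwise structure of the objective is the same; the surrogate and the data are my own):

```python
import math

def auc(scores, labels):
    # Wilcoxon-Mann-Whitney statistic: fraction of (positive, negative)
    # pairs in which the positive example is scored higher (ties count half).
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    return sum((p > n) + 0.5 * (p == n)
               for p in pos for n in neg) / (len(pos) * len(neg))

def train_linear_auc(X, y, steps=200, lr=0.1):
    # Gradient ascent on a sigmoid-smoothed pairwise AUC surrogate:
    # maximize sum over pairs of sigmoid(w . (x_pos - x_neg)).
    w = [0.0] * len(X[0])
    pairs = [(xp, xn) for xp, lp in zip(X, y) if lp == 1
                      for xn, ln in zip(X, y) if ln == 0]
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xp, xn in pairs:
            d = [a - b for a, b in zip(xp, xn)]
            s = sum(wi * di for wi, di in zip(w, d))
            g = math.exp(-s) / (1.0 + math.exp(-s)) ** 2  # sigmoid derivative
            for i, di in enumerate(d):
                grad[i] += g * di
        w = [wi + lr * gi / len(pairs) for wi, gi in zip(w, grad)]
    return w

X = [[0.0, 1.0], [1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [0, 0, 1, 1]
w = train_linear_auc(X, y)
```

The quadratic number of pairs is exactly the cost the polynomial approximation in the paper avoids.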


Knowledge Discovery and Data Mining | 2006

Mining rank-correlated sets of numerical attributes

Toon Calders; Bart Goethals; Szymon Jaroszewicz

We study the mining of interesting patterns in the presence of numerical attributes. Instead of the usual discretization methods, we propose the use of rank-based measures to score the similarity of sets of numerical attributes. New support measures for numerical data are introduced, based on extensions of Kendall's tau, Spearman's footrule, and Spearman's rho. We show how these support measures are related. Furthermore, we introduce a novel type of pattern combining numerical and categorical attributes. We give efficient algorithms to find all frequent patterns for the proposed support measures, and evaluate their performance on real-life datasets.
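The basic building block is a rank statistic over pairs of rows: Kendall's tau counts concordant versus discordant transaction pairs, and the paper extends such statistics from two attributes to attribute sets. A minimal sketch of the two-attribute case on toy data:

```python
def kendall_tau(x, y):
    # (concordant pairs - discordant pairs) / total pairs; ties count as neither.
    n = len(x)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) // 2)

print(kendall_tau([170, 180, 165], [68, 80, 60]))  # toy heights vs. weights -> 1.0
```

Because only the ordering of values matters, no discretization thresholds have to be chosen, which is the advantage the abstract claims over the usual approach.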


ACM Transactions on Database Systems | 2002

Searching for dependencies at multiple abstraction levels

Toon Calders; Raymond T. Ng; Jef Wijsen

The notion of roll-up dependency (RUD) extends functional dependencies with generalization hierarchies. RUDs can be applied in OLAP and database design. The problem of discovering RUDs in large databases is at the center of this paper. An algorithm is provided that relies on a number of theoretical results. The algorithm has been implemented; results on two real-life datasets are given. The extension of functional dependency (FD) with roll-ups turns out to capture meaningful rules that are outside the scope of classical FD mining. Performance figures show that RUDs can be discovered in linear time in the number of tuples of the input dataset.
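A roll-up dependency can be checked like an ordinary functional dependency after first mapping attribute values up their generalization hierarchies. A toy sketch with invented data, where city rolled up to country still determines the sales currency:

```python
def holds_rud(rows, lhs, rhs, rollup):
    # The dependency holds if the LHS attributes, after being rolled up
    # through their generalization hierarchies, still determine the RHS.
    seen = {}
    for row in rows:
        key = tuple(rollup.get(a, lambda v: v)(row[a]) for a in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True

sales = [{"city": "Antwerp", "currency": "EUR"},
         {"city": "Ghent", "currency": "EUR"},
         {"city": "Boston", "currency": "USD"}]
to_country = {"Antwerp": "Belgium", "Ghent": "Belgium", "Boston": "USA"}.get
```

Rolling city up to country preserves the dependency on currency here, while collapsing everything to a single region would break it; mining searches for the most general roll-up levels at which such rules still hold.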


European Conference on Principles of Data Mining and Knowledge Discovery | 2006

Integrating pattern mining in relational databases

Toon Calders; Bart Goethals; Adriana Prado

Almost a decade ago, Imielinski and Mannila introduced the notion of Inductive Databases to manage KDD applications just as DBMSs successfully manage business applications. The goal is to follow one of the key DBMS paradigms: building optimizing compilers for ad hoc queries. During the past decade, several researchers proposed extensions to the popular relational query language, SQL, in order to express such mining queries. In this paper, we propose a completely different and new approach, which extends the DBMS itself, not the query language, and integrates the mining algorithms into the database query optimizer. To this end, we introduce virtual mining views, which can be queried as if they were traditional relational tables (or views). Every time the database system accesses one of these virtual mining views, a mining algorithm is triggered to materialize all tuples needed to answer the query. We show how this can be done effectively for the popular association rule and frequent set mining problems.
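The flavor of a virtual mining view can be mimicked with an in-memory database: a miner (here a simplified Apriori-style one) materializes frequent itemsets into a relational table, which is then queried with plain SQL. In the paper the materialization is triggered lazily by the query optimizer; here it is done eagerly, and the table name and data are invented:

```python
import sqlite3

def frequent_itemsets(db, minsup):
    # Tiny levelwise (Apriori-flavored) miner: grow itemsets one item at a
    # time, keeping only those with support >= minsup.
    level = [frozenset([i]) for i in sorted({i for t in db for i in t})]
    result = []
    while level:
        freq = [(s, c) for s in level
                if (c := sum(s <= t for t in db)) >= minsup]
        result += freq
        kept = [s for s, _ in freq]
        level = list({a | b for a in kept for b in kept
                      if len(a | b) == len(a) + 1})
    return result

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sets (itemset TEXT, support INTEGER)")
# Materialize the mining result so it can be queried like any relation.
conn.executemany("INSERT INTO Sets VALUES (?, ?)",
                 [(",".join(sorted(s)), c)
                  for s, c in frequent_itemsets(transactions, 2)])
rows = conn.execute("SELECT itemset FROM Sets WHERE support >= 3").fetchall()
```

The point of the virtual-view design is that the `WHERE support >= 3` condition could be pushed into the miner itself, so only the tuples needed to answer the query are ever generated.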


Statistical Analysis and Data Mining | 2014

Mining Compressing Sequential Patterns

Hoang Thanh Lam; Fabian Mörchen; Dmitriy Fradkin; Toon Calders

Pattern mining based on data compression has been successfully applied in many data mining tasks. For itemset data, the Krimp algorithm, based on the minimum description length (MDL) principle, was shown to be very effective in solving the redundancy issue in descriptive pattern mining. For sequence data, however, the redundancy issue of the set of frequent sequential patterns is not fully addressed in the literature. In this article, we study MDL-based algorithms for mining non-redundant sets of sequential patterns from a sequence database. First, we propose an encoding scheme for compressing sequence data with sequential patterns. Second, we formulate the problem of mining the most compressing sequential patterns from a sequence database. We show that this problem is intractable and belongs to the class of inapproximable problems. Therefore, we propose two heuristic algorithms. The first uses a two-phase approach similar to Krimp for itemset data. To overcome performance issues in candidate generation, we also propose GoKrimp, an algorithm that directly mines compressing patterns by greedily extending a pattern until adding the extension to the dictionary yields no additional compression benefit. Since these checks are computationally expensive, we propose a dependency test that chooses only related events for extending a given pattern. This technique improves the efficiency of GoKrimp significantly while preserving the quality of the pattern set. We conduct an empirical study on eight datasets to show the effectiveness of our approach in comparison to state-of-the-art algorithms in terms of interpretability of the extracted patterns, run time, compression ratio, and classification accuracy when the discovered patterns are used as features for different classifiers.
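The MDL idea behind Krimp and GoKrimp: a pattern earns its place in the dictionary only if encoding the data with it shortens the total description. A crude sketch using a uniform code over the code table and contiguous (rather than gapped) patterns, both simplifications of the paper's encoding scheme, on an invented sequence:

```python
import math

def description_length(sequence, dictionary):
    # Greedy left-to-right cover: replace occurrences of dictionary patterns
    # (contiguous here, for simplicity) by a single code, then charge a
    # uniform log2(|code table|) bits per emitted code.
    codes = len(dictionary) + len(set(sequence))
    emitted, i = 0, 0
    while i < len(sequence):
        for pat in dictionary:
            if sequence[i:i + len(pat)] == pat:
                i += len(pat)
                break
        else:
            i += 1
        emitted += 1
    return emitted * math.log2(codes)

seq = list("abcabcabcxyz")
no_dict = description_length(seq, [])
with_dict = description_length(seq, [list("abc")])
```

Adding the pattern "abc" shrinks the description, so a compression-based miner would keep it. A real implementation must also charge for the code table itself and handle gapped occurrences; GoKrimp additionally applies the dependency test before trying an extension.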

Collaboration


Dive into Toon Calders's collaborations.

Top Co-Authors


Mykola Pechenizkiy

Eindhoven University of Technology


Rohit Kumar

Université libre de Bruxelles


Hendrik Blockeel

Katholieke Universiteit Leuven
