Arno J. Knobbe
Leiden University
Publications
Featured research published by Arno J. Knobbe.
European Conference on Machine Learning | 2008
Dennis Leman; Ad Feelders; Arno J. Knobbe
In most databases, it is possible to identify small partitions of the data where the observed distribution is notably different from that of the database as a whole. In classical subgroup discovery, one considers the distribution of a single nominal attribute, and exceptional subgroups show a surprising increase in the occurrence of one of its values. In this paper, we introduce Exceptional Model Mining (EMM), a framework that allows for more complicated target concepts. Rather than finding subgroups based on the distribution of a single target attribute, EMM finds subgroups where a model fitted to that subgroup is somehow exceptional. We discuss regression as well as classification models, and define quality measures that determine how exceptional a given model on a subgroup is. Our framework is general enough to be applied to many types of models, even from other paradigms such as association analysis and graphical modeling.
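A minimal sketch of the idea, not the authors' implementation: fit the same simple model (here a least-squares regression slope, with made-up column names) on a candidate subgroup and on the whole dataset, and take the absolute difference as the subgroup's quality. The slope-difference measure is one illustrative choice among the quality measures discussed in the paper.

```python
# Illustrative sketch of an EMM quality measure: how much does the regression
# slope on the subgroup deviate from the slope on the whole dataset?
# Column names and the absolute slope difference are assumptions.
import numpy as np
import pandas as pd

def slope(x, y):
    """Least-squares slope of y on x (simple linear regression)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

def emm_quality(df, subgroup_mask, x_col="x", y_col="y"):
    """Exceptionality of the subgroup's regression model vs. the whole data."""
    whole = slope(df[x_col], df[y_col])
    sub = slope(df.loc[subgroup_mask, x_col], df.loc[subgroup_mask, y_col])
    return abs(sub - whole)

# Example: a subgroup described by a single condition on an attribute.
df = pd.DataFrame({"x": np.random.rand(200),
                   "age": np.random.randint(18, 80, 200)})
df["y"] = 2 * df["x"] + np.random.normal(0, 0.1, 200)
df.loc[df["age"] > 60, "y"] *= -1          # make one subgroup behave differently
print(emm_quality(df, df["age"] > 60))     # noticeably larger than for random subgroups
```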
European Conference on Principles of Data Mining and Knowledge Discovery | 2001
Arno J. Knobbe; Marc de Haas; Arno Siebes
The fact that data is scattered over many tables causes many problems in the practice of data mining. To deal with this problem, one either constructs a single table by hand, or one uses a Multi-Relational Data Mining algorithm. In this paper, we propose a different approach in which the single table is constructed automatically using aggregate functions, which repeatedly summarise information from different tables over associations in the data model. Following the construction of the single table, we apply traditional data mining algorithms. In addition to an in-depth discussion of our approach, the paper presents results of experiments on three well-known data sets.
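A minimal sketch of such an aggregate-based construction using pandas; the customer/order tables, the chosen aggregates (count, sum, max) and the column names are illustrative assumptions, not taken from the paper.

```python
# Sketch of propositionalisation with aggregate functions: summarise a
# one-to-many association (customer -> orders) into extra columns of a
# single table.  Table and column names are made up for illustration.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "age": [34, 51, 28]})
orders = pd.DataFrame({"customer_id": [1, 1, 2, 2, 2, 3],
                       "amount": [10.0, 25.0, 5.0, 7.5, 12.0, 99.0]})

# Aggregate the detail table over the association ...
summary = (orders.groupby("customer_id")["amount"]
                 .agg(order_count="count", amount_sum="sum", amount_max="max")
                 .reset_index())

# ... and join the summaries back onto the target table, which can then be
# fed to any traditional (single-table) data mining algorithm.
single_table = customers.merge(summary, on="customer_id", how="left")
print(single_table)
```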
Knowledge Discovery and Data Mining | 2006
Arno J. Knobbe; Eric K. Y. Ho
In this paper we present a new approach to mining binary data. We treat each binary feature (item) as a means of distinguishing two sets of examples. Our interest is in selecting from the total set of items an itemset of specified size, such that the database is partitioned with as uniform a distribution over the parts as possible. To achieve this goal, we propose the use of joint entropy as a quality measure for itemsets, and refer to optimal itemsets of cardinality k as maximally informative k-itemsets. We claim that this approach maximises distinctive power and minimises redundancy within the feature set. A number of algorithms are presented for computing optimal itemsets efficiently.
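The joint-entropy quality measure is easy to state in code. The sketch below scores itemsets by brute force on a toy binary database; it illustrates the measure only, not the efficient algorithms proposed in the paper.

```python
# Sketch of the maximally informative k-itemset idea: score candidate itemsets
# by the joint entropy of their binary features and keep the best.  Brute force,
# for illustration only.
from itertools import combinations
from collections import Counter
import math

def joint_entropy(rows, itemset):
    """Joint entropy (in bits) of the binary features in `itemset`."""
    counts = Counter(tuple(row[i] for i in itemset) for row in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_k_itemset(rows, k):
    n_items = len(rows[0])
    return max(combinations(range(n_items), k),
               key=lambda itemset: joint_entropy(rows, itemset))

# Toy binary database: items 0 and 1 are exact copies, so keeping both is redundant.
rows = [(1, 1, 0, 1), (1, 1, 1, 0), (0, 0, 1, 1),
        (0, 0, 0, 0), (1, 1, 0, 0), (0, 0, 1, 0)]
print(best_k_itemset(rows, k=2))   # avoids the redundant pair (0, 1)
```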
Data Mining and Knowledge Discovery | 2012
Matthijs van Leeuwen; Arno J. Knobbe
Large data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with subgroup discovery (SD) and its generalisation, exceptional model mining. To address this, we introduce subgroup set discovery: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these (generic) subgroup selection methods in a beam search, the aim is to improve the balance between exploration and exploitation. The proposed algorithm, dubbed DSSD for diverse subgroup set discovery, is experimentally evaluated and compared to existing approaches. For this, a variety of target types with corresponding datasets and quality measures is used. The subgroup sets that are discovered by the competing methods are evaluated primarily on the following three criteria: (1) diversity in the subgroup covers (exploration), (2) the maximum quality found (exploitation), and (3) runtime. The results show that DSSD outperforms each traditional SD method on all or a (non-empty) subset of these criteria, depending on the specific setting. The more complex the task, the larger the benefit of using our diverse heuristic search turns out to be.
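A minimal sketch of what a cover-based selection step inside the beam search could look like; the multiplicative weighting scheme and its parameter alpha are illustrative assumptions rather than the exact strategy used in DSSD.

```python
# Sketch of cover-based subgroup set selection for a beam, in the spirit of
# diverse subgroup set discovery: pick subgroups greedily, discounting the
# quality of a candidate by how often its records are already covered.
import numpy as np

def select_diverse_beam(candidates, beam_width, n_rows, alpha=0.9):
    """candidates: list of (quality, cover), cover a boolean array of length n_rows."""
    cover_counts = np.zeros(n_rows, dtype=int)
    remaining = list(range(len(candidates)))
    beam = []
    for _ in range(min(beam_width, len(candidates))):
        def weighted(i):
            quality, cover = candidates[i]
            # Each record contributes less the more often it is already covered.
            weights = alpha ** cover_counts[cover]
            return quality * weights.sum() / max(cover.sum(), 1)
        best = max(remaining, key=weighted)
        remaining.remove(best)
        beam.append(candidates[best])
        cover_counts[candidates[best][1]] += 1
    return beam

# Example: the second-best candidate overlaps the best almost completely;
# with a strong discount it is passed over in favour of a fresher cover.
covers = [np.arange(10) < 6, np.arange(10) < 5, np.arange(10) >= 5]
cands = [(0.9, covers[0]), (0.85, covers[1]), (0.6, covers[2])]
beam = select_diverse_beam(cands, beam_width=2, n_rows=10, alpha=0.5)
print([q for q, _ in beam])   # [0.9, 0.6]
```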
Data Mining and Knowledge Discovery | 2016
Wouter Duivesteijn; Adrianus Feelders; Arno J. Knobbe
Finding subsets of a dataset that somehow deviate from the norm, i.e. where something interesting is going on, is a classical Data Mining task. In traditional local pattern mining methods, such deviations are measured in terms of a relatively high occurrence (frequent itemset mining), or an unusual distribution for one designated target attribute (common use of subgroup discovery). These, however, do not encompass all forms of “interesting”. To capture a more general notion of interestingness in subsets of a dataset, we develop Exceptional Model Mining (EMM). This is a supervised local pattern mining framework, where several target attributes are selected, and a model over these targets is chosen to be the target concept. Then, we strive to find subgroups: subsets of the dataset that can be described by a few conditions on single attributes. Such subgroups are deemed interesting when the model over the targets on the subgroup is substantially different from the model on the whole dataset. For instance, we can find subgroups where two target attributes have an unusual correlation, a classifier has a deviating predictive performance, or a Bayesian network fitted on several target attributes has an exceptional structure. We give an algorithmic solution for the EMM framework, and analyze its computational complexity. We also discuss some illustrative applications of EMM instances, including using the Bayesian network model to identify meteorological conditions under which food chains are displaced, and using a regression model to find the subset of households in the Chinese province of Hunan that do not follow the general economic law of demand.
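As an illustration of one EMM instance mentioned above, an unusual correlation between two targets, the sketch below scores subgroups described by a single equality condition; the column names and the restriction to nominal descriptors are simplifying assumptions.

```python
# Sketch of the correlation instance of EMM: the model is the Pearson
# correlation between two targets, and a subgroup is exceptional when its
# correlation deviates strongly from that of the whole dataset.
import pandas as pd

def correlation_quality(df, mask, targets=("y1", "y2")):
    r_all = df[list(targets)].corr().iloc[0, 1]
    r_sub = df.loc[mask, list(targets)].corr().iloc[0, 1]
    return abs(r_sub - r_all)

def top_subgroups(df, descriptors, targets=("y1", "y2"), top_k=3, min_size=20):
    """Rank single-condition subgroups (col == value) by correlation deviation."""
    results = []
    for col in descriptors:
        for value in df[col].unique():
            mask = df[col] == value
            if mask.sum() >= min_size:
                results.append((correlation_quality(df, mask, targets),
                                f"{col} == {value!r}"))
    return sorted(results, reverse=True)[:top_k]
```

Running `top_subgroups` over a handful of nominal descriptor columns returns the descriptions whose correlation between the two targets deviates most from the global one.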
International Conference on Data Mining | 2010
Wouter Duivesteijn; Arno J. Knobbe; Ad Feelders; Matthijs van Leeuwen
Whenever a dataset has multiple discrete target variables, we want our algorithms to consider not only the variables themselves, but also the interdependencies between them. We propose to use these interdependencies to quantify the quality of subgroups, by integrating Bayesian networks with the Exceptional Model Mining framework. Within this framework, candidate subgroups are generated. For each candidate, we fit a Bayesian network on the target variables. Then we compare the network’s structure to the structure of the Bayesian network fitted on the whole dataset. To perform this comparison, we define an edit distance-based distance metric that is appropriate for Bayesian networks. We show interesting subgroups that we experimentally found with our method on datasets from music theory, semantic scene classification, biology and zoogeography.
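A minimal sketch of an edit-distance style comparison between two network structures, treating each structure as a set of directed edges; the metric actually defined in the paper may handle reversed edges or normalisation differently.

```python
# Sketch of comparing two Bayesian network structures: count the edge
# insertions and deletions needed to turn one DAG into the other
# (symmetric difference of the edge sets).  Illustration only.
def structure_edit_distance(edges_a, edges_b):
    """edges_*: sets of (parent, child) pairs over the same target variables."""
    a, b = set(edges_a), set(edges_b)
    return len(a ^ b)          # edges present in exactly one of the two networks

def normalised_distance(edges_a, edges_b, n_variables):
    """Scale to [0, 1] by the maximum possible number of differing edges."""
    max_edges = n_variables * (n_variables - 1)      # all ordered pairs
    return structure_edit_distance(edges_a, edges_b) / max_edges

# Whole-dataset network vs. a network fitted on a candidate subgroup.
whole = {("A", "B"), ("B", "C")}
subgroup = {("A", "B"), ("C", "B"), ("A", "C")}
print(structure_edit_distance(whole, subgroup))   # 3: B->C removed, C->B and A->C added
```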
European Conference on Principles of Data Mining and Knowledge Discovery | 2002
Arno J. Knobbe; Arno Siebes; Bart Marseille
The fact that data is scattered over many tables causes many problems in the practice of data mining. To deal with this problem, one either constructs a single table by propositionalisation, or uses a Multi-Relational Data Mining algorithm. In either case, one has to deal with the non-determinacy of one-to-many relationships. In propositionalisation, aggregate functions have already proven to be powerful tools to handle this non-determinacy. In this paper we show how aggregate functions can be incorporated in the dynamic construction of patterns in Multi-Relational Data Mining.
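A minimal sketch of what such a dynamically evaluated aggregate condition could look like over a one-to-many association; the tables, the count aggregate and the threshold are illustrative assumptions, not the paper's pattern language.

```python
# Sketch of an aggregate condition inside a multi-relational pattern:
# select customers for which COUNT(orders with amount > 50) >= 2,
# evaluated on the fly rather than via an up-front single table.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "segment": ["a", "b", "a"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2, 3, 3, 3],
                       "amount": [60.0, 80.0, 20.0, 55.0, 70.0, 90.0]})

def count_condition(detail, key, predicate, threshold):
    """Keys in the parent table for which COUNT(matching detail rows) >= threshold."""
    counts = detail[predicate(detail)].groupby(key).size()
    return set(counts[counts >= threshold].index)

covered = count_condition(orders, "customer_id",
                          predicate=lambda t: t["amount"] > 50, threshold=2)
print(customers[customers["customer_id"].isin(covered)])   # customers 1 and 3
```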
Plastic and Reconstructive Surgery | 2013
Peter J. F. M. Lohuis; Sara Hakim; Wouter Duivesteijn; Arno J. Knobbe; Abel-Jan Tasman
Background: The authors tested a short, practically designed questionnaire to assess changes in subjective perception of nasal appearance in patients before and after aesthetic rhinoplasty. Methods: A prospective cohort study was conducted in a group of 121 patients who desired aesthetic rhinoplasty and were operated on by one surgeon. The questionnaire contained five questions (E1-E5) based on a five-point Likert scale and a visual analogue scale (range, 0 to 10). Two questions were designed as trick questions to help the surgeon screen for signs of body dysmorphic disorder. Results: All patients rated the appearance of their nose as improved after surgery. The visual analogue scale revealed a Gaussian curve of normal distribution (range, 0.5 to 10) around a significant improvement (mean, 4.36 points; p = 0.018). Also, question E1, question E2, and the sum of questions E1 through E5 showed a statistically significant improvement after surgery (p = 1.74 × 10⁻³⁶, p = 4.29 × 10⁻³³, and p = 9.23 × 10⁻³¹, respectively). The authors found a linear relationship between preoperative score on the trick questions and postoperative increase in visual analogue scale score. Test-retest reliability could be investigated in 74 of 121 patients (61 percent) and showed a positive correlation between the postoperative response (1 year after surgery) and the repeated postoperative response (2 to 4 years after surgery). Conclusions: The authors concluded that a surgeon performing aesthetic rhinoplasty can benefit from using this questionnaire. It is simple, takes no more than 2 minutes to complete, and provides helpful subjective information regarding patients’ preoperative nasal appearance and postoperative surgical outcome. CLINICAL QUESTION/LEVEL OF EVIDENCE: Therapeutic, IV.
International Conference on Data Mining | 2011
Wouter Duivesteijn; Arno J. Knobbe
Subgroup discovery suffers from the multiple comparisons problem: we search through a large space, hence whenever we report a set of discoveries, this set will generally contain false discoveries. We propose a method to compare subgroups found through subgroup discovery with a statistical model we build for these false discoveries. We determine how much the subgroups we find deviate from the model, and hence statistically validate the found subgroups. Furthermore, we propose to use this subgroup validation to objectively compare quality measures used in subgroup discovery, by determining how much the top subgroups we find with each measure deviate from the statistical model generated with that measure. We thus aim to determine how good individual measures are at selecting significant findings. We apply our method to experimentally compare popular quality measures in several subgroup discovery settings.
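A minimal sketch of the validation idea under one plausible choice of baseline: estimate the quality distribution of random subsets of the same size, fit a normal distribution to it, and express a found subgroup's quality as a z-score against that baseline. The way the paper actually generates false discoveries may differ.

```python
# Sketch of validating a discovered subgroup against a model of false
# discoveries: qualities of random "subgroups" of comparable size form the
# baseline distribution; the real subgroup is scored against it.
import numpy as np

def false_discovery_baseline(quality, df, subgroup_size, n_samples=1000, rng=None):
    """Mean and standard deviation of the quality of random subsets."""
    rng = np.random.default_rng(rng)
    n = len(df)
    qualities = np.empty(n_samples)
    for i in range(n_samples):
        mask = np.zeros(n, dtype=bool)
        mask[rng.choice(n, size=subgroup_size, replace=False)] = True
        qualities[i] = quality(df, mask)
    return qualities.mean(), qualities.std(ddof=1)

def z_score(quality, df, subgroup_mask, **kwargs):
    """How many baseline standard deviations above the mean is this subgroup?"""
    mu, sigma = false_discovery_baseline(quality, df, int(subgroup_mask.sum()), **kwargs)
    return (quality(df, subgroup_mask) - mu) / sigma
```

Here `quality` can be any subgroup quality measure, for instance one of the EMM measures sketched above.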
European Conference on Machine Learning | 2011
Matthijs van Leeuwen; Arno J. Knobbe
Large and complex data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with Subgroup Discovery and its generalisation, Exceptional Model Mining. To address this, we introduce subgroup set mining: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these strategies in a beam search, the balance between exploration and exploitation is improved. Experiments clearly show that the proposed methods result in much more diverse subgroup sets than traditional Subgroup Discovery methods.