Publication


Featured research published by Bart Goethals.


International Conference on Information Security and Cryptology | 2004

On private scalar product computation for privacy-preserving data mining

Bart Goethals; Sven Laur; Helger Lipmaa; Taneli Mielikäinen

In mining and integrating data from multiple sources, there are many privacy and security issues. In several different contexts, the security of the full privacy-preserving data mining protocol depends on the security of the underlying private scalar product protocol. We show that two of these private scalar product protocols, one of which was proposed at a leading data mining conference, are insecure. We then describe a provably private scalar product protocol that is based on homomorphic encryption and improve its efficiency so that it can also be used on massive datasets.
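The homomorphic-encryption construction can be illustrated with a toy Paillier-style scalar product. This is a minimal sketch, not the paper's actual protocol: the primes are tiny demo values and all variable names are illustrative.

```python
import math
import random

def keygen(p=293, q=433):
    # Tiny demo primes; a real deployment needs large random primes.
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    mu = pow(lam, -1, n)                                # valid because g = n + 1
    return n, (lam, mu, n)

def encrypt(n, m):
    n2 = n * n
    r = random.randrange(2, n)                # random blinding factor
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (1 + m * n) * pow(r, n, n2) % n2   # g^m * r^n mod n^2, with g = n + 1

def decrypt(sk, c):
    lam, mu, n = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

# Alice encrypts her vector and sends the ciphertexts; Bob raises each one
# to his own coordinate and multiplies them, which -- by the homomorphism
# E(a) * E(b) = E(a + b) -- yields an encryption of the scalar product
# without Bob ever seeing Alice's values.
pk, sk = keygen()
x = [3, 1, 4, 1, 5]          # Alice's private vector
y = [2, 7, 1, 8, 2]          # Bob's private vector
enc_dot = 1
for xi, yi in zip(x, y):
    enc_dot = enc_dot * pow(encrypt(pk, xi), yi, pk * pk) % (pk * pk)
print(decrypt(sk, enc_dot))  # 35 = 3*2 + 1*7 + 4*1 + 1*8 + 5*2
```

The full protocol additionally randomizes the shares of the result and uses realistic key sizes; only the homomorphic core is shown here.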


European Conference on Principles of Data Mining and Knowledge Discovery | 2002

Mining All Non-derivable Frequent Itemsets

Toon Calders; Bart Goethals

Recent studies on frequent itemset mining algorithms resulted in significant performance improvements. However, if the minimal support threshold is set too low, or the data is highly correlated, the number of frequent itemsets itself can be prohibitively large. To overcome this problem, recently several proposals have been made to construct a concise representation of the frequent itemsets, instead of mining all frequent itemsets. The main goal of this paper is to identify redundancies in the set of all frequent itemsets and to exploit these redundancies in order to reduce the result of a mining operation. We present deduction rules to derive tight bounds on the support of candidate itemsets. We show how the deduction rules allow for constructing a minimal representation for all frequent itemsets. We also present connections between our proposal and recent proposals for concise representations and we give the results of experiments on real-life datasets that show the effectiveness of the deduction rules. In fact, the experiments even show that in many cases, first mining the concise representation, and then creating the frequent itemsets from this representation outperforms existing frequent set mining algorithms.
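The deduction rules can be made concrete: for a candidate itemset I and each proper subset X, inclusion-exclusion over the itemsets between X and I yields an upper or a lower bound on sup(I), depending on the parity of |I \ X|. A small sketch on a hypothetical transaction database:

```python
from itertools import combinations

def support(itemset, db):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in db if itemset <= t)

def proper_subsets(itemset):
    items = sorted(itemset)
    return (frozenset(c) for k in range(len(items))
            for c in combinations(items, k))

def ndi_bounds(itemset, db):
    """Bounds on sup(I) deduced from the supports of its proper subsets."""
    I = frozenset(itemset)
    lower, upper = 0, len(db)
    for X in proper_subsets(I):
        delta = sum((-1) ** (len(I - Y) + 1) * support(Y, db)
                    for Y in proper_subsets(I) if X <= Y)
        if len(I - X) % 2 == 1:
            upper = min(upper, delta)   # odd |I \ X| gives an upper bound
        else:
            lower = max(lower, delta)   # even |I \ X| gives a lower bound
    return lower, upper

# Hypothetical transaction database.
db = [frozenset(t) for t in ("abc", "abc", "ab", "ac", "bc", "a", "b", "c")]
print(ndi_bounds(frozenset("abc"), db), support(frozenset("abc"), db))
# (1, 2) 2
```

When the deduced lower and upper bounds coincide, the itemset's support is derivable and the itemset can be dropped from the condensed representation.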


Mining Software Repositories | 2010

Predicting the severity of a reported bug

Ahmed Lamkanfi; Serge Demeyer; Emanuel Giger; Bart Goethals

The severity of a reported bug is a critical factor in deciding how soon it needs to be fixed. Unfortunately, while clear guidelines exist on how to assign the severity of a bug, it remains an inherently manual process left to the person reporting the bug. In this paper we investigate whether we can accurately predict the severity of a reported bug by analyzing its textual description using text mining algorithms. Based on three cases drawn from the open-source community (Mozilla, Eclipse and GNOME), we conclude that given a training set of sufficient size (approximately 500 reports per severity), it is possible to predict the severity with reasonable accuracy (both precision and recall vary between 0.65 and 0.75 for Mozilla and Eclipse, and between 0.70 and 0.85 in the case of GNOME).
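A minimal sketch of the idea: a bag-of-words Naive Bayes classifier in pure Python. The mini training set and labels below are invented for illustration and are far smaller than the roughly 500 reports per severity the paper calls for.

```python
import math
from collections import Counter, defaultdict

def train(reports):
    """reports: list of (summary_text, severity_label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()
    vocab = set()
    for text, label in reports:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Multinomial Naive Bayes with Laplace smoothing, scored in log space."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        logp = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            logp += math.log((word_counts[label][w] + 1) / denom)
        scores[label] = logp
    return max(scores, key=scores.get)

# Invented toy reports (the real experiments used Mozilla, Eclipse and GNOME).
reports = [
    ("crash on startup data loss", "severe"),
    ("segfault crash when saving", "severe"),
    ("application freezes data corruption", "severe"),
    ("typo in menu label", "non-severe"),
    ("button misaligned in dialog", "non-severe"),
    ("cosmetic color issue in toolbar", "non-severe"),
]
model = train(reports)
print(predict(model, "crash with data loss"))   # severe
```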


SIGKDD Explorations | 2004

Advances in frequent itemset mining implementations: report on FIMI'03

Bart Goethals; Mohammed Javeed Zaki

1. WHY ORGANIZE FIMI?

Since the introduction of association rule mining in 1993 by Agrawal, Imielinski and Swami [3], the frequent itemset mining (FIM) tasks have received a great deal of attention. Within the last decade, a phenomenal number of algorithms have been developed for mining all [3; 5; 20; 4; 27; 24; 29; 34; 10; 22; 19; 32], closed [25; 6; 12; 26; 30; 23; 31; 28; 33] and maximal frequent itemsets [18; 21; 7; 2; 1; 11; 36; 16; 17]. Every new paper claims to run faster than previously existing algorithms, based on experimental testing that is oftentimes quite limited in scope, since many of the original algorithms are not available due to intellectual property and copyright issues. Zheng, Kohavi and Mason [35] observed that the performance of several of these algorithms is not always as claimed by their authors when tested on different datasets. Also, from personal experience, we noticed that even different implementations of the same algorithm can behave quite differently for various datasets and parameters.

Given this proliferation of FIM algorithms, and sometimes contradictory claims, there is a pressing need to benchmark, characterize and understand the algorithmic performance space. We would like to understand why and under what conditions one algorithm outperforms another. This means testing the methods for a wide variety of parameters, and on different datasets spanning dense and sparse, real and synthetic, small and large, and so on.

Given the experimental, algorithmic nature of FIM (and most of data mining in general), it is crucial that other researchers be able to independently verify the claims made in a new paper. Unfortunately, the FIM community (with few exceptions) has a very poor track record in this regard. Many new algorithms are not available even as an executable, let alone as source code. How many times have we heard "this is proprietary software, and not available"? This is not the way other sciences work. Independent verifiability is the hallmark of sciences like physics, chemistry, biology, and so on. One may argue that the nature of the research is different: they have detailed experimental procedures that can be replicated, while we have algorithms, and there is more than one way to code an algorithm. However, a good example to emulate is the bioinformatics community, which has espoused the open-source paradigm with more alacrity than we have. It is quite common for journals and conferences in bioinformatics to require that software be available. For example, here is a direct quote from the journal Bioinformatics (http://bioinformatics.oupjournals.org/):


Data Mining and Knowledge Discovery | 2007

Non-derivable itemset mining

Toon Calders; Bart Goethals

All frequent itemset mining algorithms rely heavily on the monotonicity principle for pruning. This principle allows for excluding candidate itemsets from the expensive counting phase. In this paper, we present sound and complete deduction rules to derive bounds on the support of an itemset. Based on these deduction rules, we construct a condensed representation of all frequent itemsets, by removing those itemsets for which the support can be derived, resulting in the so called Non-Derivable Itemsets (NDI) representation. We also present connections between our proposal and recent other proposals for condensed representations of frequent itemsets. Experiments on real-life datasets show the effectiveness of the NDI representation, making the search for frequent non-derivable itemsets a useful and tractable alternative to mining all frequent itemsets.


International Conference on Big Data | 2013

Frequent Itemset Mining for Big Data

Sandy Moens; Emin Aksehirli; Bart Goethals

Frequent Itemset Mining (FIM) is one of the best-known techniques to extract knowledge from data. The combinatorial explosion of FIM methods becomes even more problematic when they are applied to Big Data. Fortunately, recent improvements in the field of parallel programming already provide good tools to tackle this problem. However, these tools come with their own technical challenges, e.g. balanced data distribution and inter-communication costs. In this paper, we investigate the applicability of FIM techniques on the MapReduce platform. We introduce two new methods for mining large datasets: Dist-Eclat focuses on speed, while BigFIM is optimized to run on very large datasets. In our experiments we show the scalability of our methods.
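Not the actual Dist-Eclat or BigFIM code, but a single-machine sketch of one MapReduce candidate-counting round (the Apriori-style first phase that BigFIM builds on), with a hypothetical two-partition "cluster":

```python
from collections import Counter
from itertools import combinations

def map_phase(partition, k):
    """Mapper: emit (k-itemset, 1) for every k-subset of each transaction."""
    for transaction in partition:
        for cand in combinations(sorted(transaction), k):
            yield cand, 1

def reduce_phase(mapped, minsup):
    """Reducer: sum the counts per itemset and keep the frequent ones."""
    counts = Counter()
    for itemset, one in mapped:
        counts[itemset] += one
    return {i: c for i, c in counts.items() if c >= minsup}

# Transactions split across two hypothetical worker partitions.
partitions = [
    [{"a", "b", "c"}, {"a", "b"}],
    [{"a", "c"}, {"b", "c"}, {"a", "b", "c"}],
]
mapped = (kv for p in partitions for kv in map_phase(p, 2))
print(reduce_phase(mapped, minsup=3))
# {('a', 'b'): 3, ('a', 'c'): 3, ('b', 'c'): 3}
```

In a real cluster the mapper outputs would be shuffled by key to the reducers; the balanced-distribution and communication costs mentioned in the abstract arise precisely in that shuffle step.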


Genome Biology | 2011

BioGraph: unsupervised biomedical knowledge discovery via automated hypothesis generation

Anthony Liekens; Jeroen De Knijf; Walter Daelemans; Bart Goethals; Peter De Rijk; Jurgen Del-Favero

We present BioGraph, a data integration and data mining platform for the exploration and discovery of biomedical information. The platform offers prioritizations of putative disease genes, supported by functional hypotheses. We show that BioGraph can retrospectively confirm recently discovered disease genes and identify potential susceptibility genes, outperforming existing technologies, without requiring prior domain knowledge. Additionally, BioGraph allows for generic biomedical applications beyond gene discovery. BioGraph is accessible at http://www.biograph.be.


Knowledge Discovery and Data Mining | 2000

A data mining framework for optimal product selection in retail supermarket data: the generalized PROFSET model

Tom Brijs; Bart Goethals; Gilbert Swinnen; Koen Vanhoof; Geert Wets

In recent years, data mining researchers have developed efficient association rule algorithms for retail market basket analysis. Still, retailers often complain about how to adopt association rules to optimize concrete retail marketing-mix decisions. It is in this context that, in a previous paper, the authors introduced a product selection model called PROFSET. This model selects the most interesting products from a product assortment based on their cross-selling potential, given some retailer-defined constraints. However, this model suffered from an important deficiency: it could not deal effectively with supermarket data, and no provisions were taken to include retail category management principles. Therefore, in this paper, the authors present an important generalization of the existing model in order to make it suitable for supermarket data as well, and to enable retailers to add category restrictions to the model. Experiments on real-world data obtained from a Belgian supermarket chain produce very promising results and demonstrate the effectiveness of the generalized PROFSET model.


ACM Transactions on Database Systems | 2005

Tight upper bounds on the number of candidate patterns

Floris Geerts; Bart Goethals; Jan Van den Bussche

In the context of mining for frequent patterns using the standard levelwise algorithm, the following question arises: given the current level and the current set of frequent patterns, what is the maximal number of candidate patterns that can be generated on the next level? We answer this question by providing tight upper bounds, derived from a combinatorial result from the sixties by Kruskal and Katona. Our result is useful to secure existing algorithms from a combinatorial explosion of the number of candidate patterns.
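The flavor of the bound can be sketched as follows: write the number m of frequent k-itemsets in its cascade (binomial) representation and read off the maximal number of candidate (k+1)-itemsets. This is a simplified rendering of the idea, not the paper's full statement.

```python
from math import comb

def max_candidates(m, k):
    """Upper bound on the candidate (k+1)-itemsets that levelwise
    generation can produce from m frequent k-itemsets, via the cascade
    representation m = C(m_k, k) + C(m_{k-1}, k-1) + ..."""
    bound, i = 0, k
    while m > 0 and i > 0:
        mi = i                          # find the largest mi with C(mi, i) <= m
        while comb(mi + 1, i) <= m:
            mi += 1
        m -= comb(mi, i)
        bound += comb(mi, i + 1)
        i -= 1
    return bound

# With all C(4,2) = 6 pairs over four items frequent, at most C(4,3) = 4
# triples can become candidates; three pairs forming a triangle yield one
# candidate triple; two pairs can never yield any.
print(max_candidates(6, 2), max_candidates(3, 2), max_candidates(2, 2))
# 4 1 0
```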


International Conference on Data Mining | 2010

Approximation of Frequentness Probability of Itemsets in Uncertain Data

Toon Calders; Calin Garboni; Bart Goethals

Mining frequent item sets from transactional datasets is a well-known problem with good algorithmic solutions. Most of these algorithms assume that the input data is free from errors. Real data, however, is often affected by noise. Such noise can be represented by uncertain datasets in which each item has an existence probability. Recently, Bernecker et al. (2009) proposed the frequentness probability, i.e., the probability that a given item set is frequent, to select item sets in an uncertain database. A dynamic programming approach to evaluate this measure was given as well. We argue, however, that for the setting of Bernecker et al. (2009), which assumes independence between the items, well-known statistical tools already exist. We show how the frequentness probability can be approximated extremely accurately using a form of the central limit theorem. We experimentally evaluated our approximation and compared it to the dynamic programming approach. The evaluation shows that our approximation method is extremely accurate even for very small databases, while at the same time it has much lower memory overhead and computation time.
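The contrast between the two approaches can be sketched in a few lines: the exact Poisson-binomial dynamic program versus a normal approximation with continuity correction. The per-transaction containment probabilities below are made up for illustration.

```python
import math

def frequentness_exact(probs, minsup):
    """Exact P(support >= minsup) by dynamic programming:
    dp[j] = P(exactly j of the transactions seen so far contain the itemset)."""
    dp = [1.0]
    for p in probs:
        new = [0.0] * (len(dp) + 1)
        for j, q in enumerate(dp):
            new[j] += q * (1 - p)       # this transaction does not contain it
            new[j + 1] += q * p         # this transaction contains it
        dp = new
    return sum(dp[minsup:])

def frequentness_clt(probs, minsup):
    """Central-limit approximation with continuity correction."""
    mu = sum(probs)
    sigma = math.sqrt(sum(p * (1 - p) for p in probs))
    z = (minsup - 0.5 - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))   # = 1 - Phi(z)

# Hypothetical per-transaction containment probabilities.
probs = [0.5] * 20
print(frequentness_exact(probs, 10))   # ~0.5881
print(frequentness_clt(probs, 10))     # ~0.5885
```

The dynamic program is quadratic in the number of transactions, while the approximation needs only one pass to accumulate the mean and variance, which is the efficiency gap the abstract refers to.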

Collaboration


Dive into Bart Goethals's collaborations.

Top Co-Authors

Toon Calders

Université libre de Bruxelles

Mohammed Javeed Zaki

Rensselaer Polytechnic Institute