Jilles Vreeken
Max Planck Society
Publications
Featured research published by Jilles Vreeken.
Data Mining and Knowledge Discovery | 2011
Jilles Vreeken; Matthijs van Leeuwen; Arno Siebes
One of the major problems in pattern mining is the explosion of the number of results. Tight constraints reveal only common knowledge, while loose constraints lead to an explosion in the number of returned patterns. This is caused by large groups of patterns essentially describing the same set of transactions. In this paper we approach this problem using the MDL principle: the best set of patterns is that set that compresses the database best. For this task we introduce the Krimp algorithm. Experimental evaluation shows that typically only hundreds of itemsets are returned; a dramatic reduction, up to seven orders of magnitude, in the number of frequent itemsets. These selections, called code tables, are of high quality. This is shown with compression ratios, swap-randomisation, and the accuracies of the code table-based Krimp classifier, all obtained on a wide range of datasets. Further, we extensively evaluate the heuristic choices made in the design of the algorithm.
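To make the MDL objective concrete, here is a minimal Python sketch of the two-part score that Krimp-style methods optimise: the size of the code table plus the size of the database compressed with it. The cover order and the code-table cost are heavily simplified (the cost of describing the itemsets themselves is omitted), so this illustrates the principle rather than the published algorithm.

```python
import math
from collections import Counter

def cover(transaction, code_table):
    """Greedily cover a transaction with itemsets from the code table,
    trying larger itemsets first (a simplified version of Krimp's
    standard cover order)."""
    remaining = set(transaction)
    used = []
    for itemset in sorted(code_table, key=len, reverse=True):
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    return used

def total_encoded_size(database, code_table):
    """Two-part MDL score L(CT) + L(D | CT), with Shannon-optimal code
    lengths -log2(usage / total usage). Simplification: the code-table
    cost counts only the codes, not the itemset definitions."""
    covers = [cover(t, code_table) for t in database]
    usage = Counter(itemset for c in covers for itemset in c)
    total = sum(usage.values())
    code_len = {i: -math.log2(u / total) for i, u in usage.items()}
    l_data = sum(code_len[i] for c in covers for i in c)
    l_model = sum(code_len.values())
    return l_model + l_data

db = [frozenset('abc'), frozenset('abc'), frozenset('ab'), frozenset('c')]
singletons = [frozenset(x) for x in 'abc']
print(total_encoded_size(db, singletons))                       # baseline
print(total_encoded_size(db, [frozenset('abc')] + singletons))  # smaller: compresses better
```

Krimp's actual search iterates over frequent itemset candidates, accepting a candidate into the code table only if it lowers this total encoded size.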
International Conference on Data Mining | 2012
B. Aditya Prakash; Jilles Vreeken; Christos Faloutsos
Given a snapshot of a large graph, in which an infection has been spreading for some time, can we identify those nodes from which the infection started to spread? In other words, can we reliably tell who the culprits are? In this paper we answer this question affirmatively, and give an efficient method called NETSLEUTH for the well-known Susceptible-Infected virus propagation model. Essentially, we are after the set of seed nodes that best explains the given snapshot. We propose to employ the Minimum Description Length principle to identify the best set of seed nodes and virus propagation ripple, as the one by which we can most succinctly describe the infected graph. We give a highly efficient algorithm to identify likely sets of seed nodes given a snapshot. Then, given these seed nodes, we show we can optimize the virus propagation ripple in a principled way by maximizing likelihood. With all three combined, NETSLEUTH can automatically identify the correct number of seed nodes, as well as which nodes are the culprits. Experimentation on our method shows high accuracy in the detection of seed nodes, in addition to the correct automatic identification of their number. Moreover, we show NETSLEUTH scales linearly in the number of nodes of the graph.
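The flavour of the approach can be sketched as follows: score a candidate seed set by the bits needed to describe the seeds plus the bits needed to describe the infection ripple they induce, and prefer the cheapest explanation. The encoding below is a toy stand-in invented for illustration, not the encoding derived in the paper.

```python
import math

def description_cost(graph, infected, seeds):
    """Toy two-part score in the spirit of MDL seed identification:
    L(seeds) + L(ripple | seeds). Each seed costs log2(N) bits; each
    propagation step costs a log2(N)-bit step marker plus one bit per
    exposed susceptible neighbour (infected this step or not).
    Returns inf if the seeds cannot explain the snapshot."""
    n = len(graph)
    cost = len(seeds) * math.log2(n)
    active, front = set(seeds), set(seeds)
    while front:
        exposed = {v for u in front for v in graph[u]} - active
        if not exposed:
            break
        cost += math.log2(n) + len(exposed)   # step marker + exposure bits
        front = exposed & infected            # nodes that caught the virus
        active |= front
    return cost if infected <= active else math.inf

# Path graph 0-1-2-3-4, fully infected: the central node is the
# cheapest single-seed explanation under this toy encoding.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
infected = {0, 1, 2, 3, 4}
for seeds in [{2}, {0}, {0, 4}]:
    print(sorted(seeds), round(description_cost(graph, infected, seeds), 2))
```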
IEEE Intelligent Vehicles Symposium | 2004
Marco Wiering; Jilles Vreeken; J. van Veenen; Arne Koopman
Optimal traffic light control is a multi-agent decision problem, for which we propose to use reinforcement learning algorithms. Our algorithm learns the expected waiting times of cars for red and green lights at each intersection, and sets the traffic lights to green for the configuration maximizing individual car gains. For testing our adaptive traffic light controllers, we developed the Green Light District simulator. The experimental results show that the adaptive algorithms can strongly reduce average waiting times of cars compared to three hand-designed controllers.
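In sketch form, the car-based value function and the greedy light choice look roughly like this; the state representation, update rule, and all names are illustrative simplifications rather than the paper's exact algorithm.

```python
from collections import defaultdict

# Q[(intersection, queue_position, light_colour)] estimates the expected
# waiting time of a car in that situation.
Q = defaultdict(float)
ALPHA = 0.1  # learning rate

def update(state, waited_this_tick, next_state):
    """TD-style update: waiting observed this tick plus the estimated
    waiting time of the successor situation."""
    Q[state] += ALPHA * (waited_this_tick + Q[next_state] - Q[state])

def choose_configuration(intersection, configurations, cars):
    """Set lights green for the configuration with the largest summed
    per-car gain: expected wait under red minus expected wait under green."""
    def gain(car):
        pos = car['queue_pos']
        return Q[(intersection, pos, 'red')] - Q[(intersection, pos, 'green')]
    return max(configurations,
               key=lambda cfg: sum(gain(c) for c in cars if c['lane'] in cfg))

# Toy usage: pretend some waiting-time estimates were already learned.
Q[('x1', 0, 'red')], Q[('x1', 1, 'red')] = 5.0, 8.0
# One learning step: a car at queue position 1 waited one tick, then moved up.
update(('x1', 1, 'red'), 1.0, ('x1', 0, 'red'))
cars = [{'lane': 'NS', 'queue_pos': 0},
        {'lane': 'EW', 'queue_pos': 0}, {'lane': 'EW', 'queue_pos': 1}]
print(choose_configuration('x1', [frozenset({'NS'}), frozenset({'EW'})], cars))
```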
Knowledge Discovery and Data Mining | 2012
Nikolaj Tatti; Jilles Vreeken
An ideal outcome of pattern mining is a small set of informative patterns, containing no redundancy or noise, that identifies the key structure of the data at hand. Standard frequent pattern miners do not achieve this goal: due to the pattern explosion, typically very large numbers of highly redundant patterns are returned. We pursue the ideal for sequential data by employing a pattern set mining approach, in which, instead of ranking patterns individually, we consider results as a whole. Pattern set mining has been successfully applied to transactional data, but has been surprisingly understudied for sequential data. In this paper, we employ the MDL principle to identify the set of sequential patterns that summarises the data best. In particular, we formalise how to encode sequential data using sets of serial episodes, and use the encoded length as a quality score. As search strategy, we propose two approaches: the first algorithm selects a good pattern set from a large candidate set, while the second is a parameter-free any-time algorithm that mines pattern sets directly from the data. Experimentation on synthetic and real data demonstrates that we efficiently discover small sets of informative patterns.
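A toy version of the scoring idea: parse the sequence with a candidate set of episodes and measure the resulting code length. For brevity this sketch only matches episodes contiguously, whereas the paper's encoding also handles gapped occurrences, and a full MDL score would add the cost of the pattern set itself.

```python
import math
from collections import Counter

def encode(sequence, patterns):
    """Greedily parse the sequence left to right, preferring the longest
    matching pattern at each position (falling back to singletons)."""
    parse, i = [], 0
    while i < len(sequence):
        match = max((p for p in patterns if sequence[i:i+len(p)] == list(p)),
                    key=len, default=None)
        if match is None:
            match = (sequence[i],)          # no episode fits: emit a singleton
        parse.append(tuple(match))
        i += len(match)
    return parse

def encoded_bits(sequence, patterns):
    """Shannon-optimal code length of the parse: -log2 of each code's
    relative usage, summed over the whole parse."""
    parse = encode(sequence, patterns)
    usage = Counter(parse)
    total = sum(usage.values())
    return sum(-math.log2(usage[p] / total) for p in parse)

seq = list('abcabcabxabc')
print(encoded_bits(seq, []))                  # singletons only
print(encoded_bits(seq, [('a', 'b', 'c')]))  # the episode 'abc' compresses well
```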
Conference on Information and Knowledge Management | 2012
Leman Akoglu; Hanghang Tong; Jilles Vreeken; Christos Faloutsos
Spotting anomalies in large multi-dimensional databases is a crucial task with many applications in finance, health care, security, etc. We introduce COMPREX, a new approach for identifying anomalies using pattern-based compression. Informally, our method finds a collection of dictionaries that describe the norm of a database succinctly, and subsequently flags those points dissimilar to the norm (i.e., with high compression cost) as anomalies. Our approach exhibits four key features: 1) it is parameter-free; it builds dictionaries directly from data, and requires no user-specified parameters such as distance functions or density and similarity thresholds; 2) it is general; we show it works for a broad range of complex databases, including graph, image and relational databases that may contain both categorical and numerical features; 3) it is scalable; its running time grows linearly with respect to both database size and number of dimensions; and 4) it is effective; experiments on a broad range of datasets show large improvements in both compression and anomaly-detection precision, outperforming its state-of-the-art competitors.
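The anomaly-scoring idea reduces to: learn an efficient encoding of the bulk of the data, then rank records by how many bits they need. The sketch below uses one frequency-based dictionary per feature; COMPREX itself learns dictionaries over groups of correlated features, which this simplification omits, and the unseen-value penalty is an arbitrary illustrative constant.

```python
import math
from collections import Counter

def fit_code_lengths(rows):
    """Per-feature Shannon code lengths from value frequencies.
    (COMPREX groups correlated features into shared dictionaries;
    this sketch uses one dictionary per feature.)"""
    n_rows = len(rows)
    tables = []
    for col in zip(*rows):
        counts = Counter(col)
        tables.append({v: -math.log2(c / n_rows) for v, c in counts.items()})
    return tables

def compression_cost(row, tables, unseen_bits=20.0):
    """Bits needed to encode a row; rare or unseen values are expensive,
    so a high cost flags an anomaly."""
    return sum(t.get(v, unseen_bits) for v, t in zip(row, tables))

data = [('http', 'ok'), ('http', 'ok'), ('http', 'ok'), ('ftp', 'ok'),
        ('http', 'ok'), ('http', 'ok'), ('http', 'ok'), ('http', 'ok')]
tables = fit_code_lengths(data)
for row in [('http', 'ok'), ('ftp', 'ok'), ('ssh', 'fail')]:
    print(row, round(compression_cost(row, tables), 2))   # cost rises with rarity
```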
Knowledge Discovery and Data Mining | 2007
Jilles Vreeken; Matthijs van Leeuwen; Arno Siebes
Characterising the differences between two databases is a frequently occurring problem in data mining. Detection of change over time is a prime example; comparing databases from two branches is another. The key problem is to discover the patterns that describe the difference. Emerging patterns provide only a partial answer to this question. In previous work, we showed that the data distribution can be captured in a pattern-based model using compression [12]. Here, we extend this approach to define a generic dissimilarity measure on databases. Moreover, we show that this approach can identify those patterns that characterise the differences between two distributions. Experimental results show that our method provides a well-founded way to independently measure database dissimilarity that allows for thorough inspection of the actual differences. This illustrates the use of our approach in real-world data mining.
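Schematically, the dissimilarity measure asks: how many extra bits does database A need when compressed with the model of database B, and vice versa? The sketch below keeps that structure but swaps the compression-based (Krimp) models for a plain smoothed item-frequency model, so the numbers are only indicative.

```python
import math
from collections import Counter

def fit(db, alpha=0.5):
    """Smoothed item-frequency model of a transaction database.
    (The paper uses pattern-based compression models; a unigram model
    keeps this sketch short.)"""
    counts = Counter(i for t in db for i in t)
    total = sum(counts.values())
    vocab = len(counts)
    return lambda item: -math.log2((counts[item] + alpha) /
                                   (total + alpha * (vocab + 1)))

def bits(db, model):
    return sum(model(i) for t in db for i in t)

def dissimilarity(db_a, db_b):
    """Max of the relative extra bits each database needs under the
    other's model: the structure of the measure in the paper."""
    m_a, m_b = fit(db_a), fit(db_b)
    return max((bits(db_a, m_b) - bits(db_a, m_a)) / bits(db_a, m_a),
               (bits(db_b, m_a) - bits(db_b, m_b)) / bits(db_b, m_b))

branch1 = [{'beer', 'chips'}, {'beer', 'nuts'}, {'chips'}]
branch2 = [{'wine', 'cheese'}, {'wine'}, {'cheese', 'bread'}]
print(round(dissimilarity(branch1, branch1), 3))  # 0.0: identical databases
print(round(dissimilarity(branch1, branch2), 3))  # large: disjoint item usage
```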
European Conference on Principles of Data Mining and Knowledge Discovery | 2006
Matthijs van Leeuwen; Jilles Vreeken; Arno Siebes
Finding a comprehensive set of patterns that truly captures the characteristics of a database is a complicated matter. Frequent item set mining attempts this, but low support levels often result in exorbitant numbers of item sets. Recently we showed that by using MDL we are able to select a small number of item sets that compress the data well [11]. Here we show that this small set is a good approximation of the underlying data distribution. Using the small set in an MDL-based classifier leads to performance on par with well-known rule-induction and association-rule-based methods. Advantages are that no parameters need to be set manually and only very few item sets are used. The classification scores indicate that selecting item sets through compression is an elegant way of mining interesting patterns that can subsequently find use in many applications.
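The classifier itself is pleasingly simple: build one code table per class, compress an unseen transaction with each, and predict the class that compresses it in the fewest bits. The following self-contained sketch simplifies the code tables (real ones also price the itemset definitions, and handle uncovered items differently); the pattern sets and penalty constant are illustrative.

```python
import math
from collections import Counter

def class_code_lengths(rows, patterns):
    """Build Shannon code lengths for one class: greedily cover each
    transaction with the class's patterns (longest first), then derive
    -log2(usage/total) codes. A simplification of a Krimp code table."""
    patterns = sorted(patterns, key=len, reverse=True)
    usage = Counter()
    for t in rows:
        remaining = set(t)
        for p in patterns:
            if p <= remaining:
                usage[p] += 1
                remaining -= p
    total = sum(usage.values()) or 1
    return patterns, {p: -math.log2(u / total) for p, u in usage.items()}

def encode_bits(t, patterns, codes, penalty=20.0):
    remaining, nbits = set(t), 0.0
    for p in patterns:
        if p <= remaining:
            nbits += codes.get(p, penalty)
            remaining -= p
    return nbits + penalty * len(remaining)   # uncoverable items are costly

def classify(t, models):
    """MDL classification: the class whose code table compresses the
    transaction in the fewest bits wins."""
    return min(models, key=lambda c: encode_bits(t, *models[c]))

spam = [frozenset({'win', 'cash'}), frozenset({'win', 'cash', 'now'})]
ham  = [frozenset({'meeting', 'notes'}), frozenset({'notes'})]
models = {'spam': class_code_lengths(spam, [frozenset({'win', 'cash'}), frozenset({'now'})]),
          'ham':  class_code_lengths(ham,  [frozenset({'meeting', 'notes'}), frozenset({'notes'})])}
print(classify(frozenset({'win', 'cash'}), models))   # -> 'spam'
```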
Knowledge Discovery and Data Mining | 2011
Bart Goethals; Sandy Moens; Jilles Vreeken
We present a framework for interactive visual pattern mining. Our system enables the user to browse through the data and patterns easily and intuitively, using a toolbox consisting of interestingness measures, mining algorithms and post-processing algorithms to assist in identifying interesting patterns. By mining interactively, we enable the user to combine their subjective interestingness measure and background knowledge with a wide variety of objective measures to easily and quickly mine the most important and interesting patterns. Basically, we enable the user to become an essential part of the mining algorithm. Our demo currently applies to mining interesting itemsets and association rules, and its extension to episodes and decision trees is ongoing.
ACM Transactions on Knowledge Discovery From Data | 2014
Pauli Miettinen; Jilles Vreeken
Matrix factorizations—where a given data matrix is approximated by a product of two or more factor matrices—are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the “model order selection problem” of determining the proper rank of the factorization, that is, to answer where fine-grained structure stops, and where noise starts. Boolean Matrix Factorization (BMF)—where data, factors, and matrix product are Boolean—has in recent years received increased attention from the data mining community. The technique has desirable properties, such as high interpretability and natural sparsity. Yet, so far no method for selecting the correct model order for BMF has been available. In this article, we propose the use of the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits; for example, it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general—making it applicable for any BMF algorithm. We discuss how to construct an appropriate encoding: starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.
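The model-order selection recipe can be sketched as: for each candidate rank k, add the bits needed for the factor matrices to the bits needed for the residual, and keep the k that minimises the total. The encoding below (one bit per factor cell, an entropy code for the error matrix) is a deliberately naive stand-in for the data-to-model encoding developed in the article, and the candidate factorizations are hand-built for a toy matrix.

```python
import math

def bool_product(B, C):
    """Boolean matrix product: (B o C)[i][j] = OR_l (B[i][l] AND C[l][j])."""
    n, k, m = len(B), len(B[0]) if B else 0, len(C[0]) if C else 0
    return [[int(any(B[i][l] and C[l][j] for l in range(k)))
             for j in range(m)] for i in range(n)]

def entropy_bits(ones, total):
    if ones in (0, total):
        return 0.0
    p = ones / total
    return total * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))

def description_length(D, B, C):
    """Naive two-part MDL score for a BMF: factor matrices cost one bit
    per cell; the residual D XOR (B o C) is encoded with an entropy code.
    (The article derives a much tighter data-to-model encoding.)"""
    n, m = len(D), len(D[0])
    k = len(B[0]) if B and B[0] else 0
    P = bool_product(B, C) if k else [[0] * m for _ in range(n)]
    errors = sum(D[i][j] != P[i][j] for i in range(n) for j in range(m))
    return k * (n + m) + entropy_bits(errors, n * m)

B2 = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
C2 = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]
D = bool_product(B2, C2)                          # planted rank-2 data
candidates = {
    1: ([[r[0]] for r in B2], [C2[0]]),           # under-fitted
    2: (B2, C2),                                  # true order
    3: ([r + [r[0]] for r in B2], C2 + [C2[0]]),  # redundant extra factor
}
for k, (B, C) in candidates.items():
    print(k, round(description_length(D, B, C), 1))   # k=2 scores lowest
```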
ACM Transactions on Knowledge Discovery From Data | 2012
Michael Mampaey; Jilles Vreeken; Nikolaj Tatti
Knowledge discovery from data is an inherently iterative process. That is, what we know about the data greatly determines our expectations, and therefore, what results we would find interesting and/or surprising. Given new knowledge about the data, our expectations will change. Hence, in order to avoid redundant results, knowledge discovery algorithms ideally should follow such an iterative updating procedure. With this in mind, we introduce a well-founded approach for succinctly summarizing data with the most informative itemsets; using a probabilistic maximum entropy model, we iteratively find the itemset that provides us the most novel information—that is, for which the frequency in the data surprises us the most—and in turn we update our model accordingly. As we use the maximum entropy principle to obtain unbiased probabilistic models, and only include those itemsets that are most informative with regard to the current model, the summaries we construct are guaranteed to be both descriptive and nonredundant. The algorithm that we present, called MTV, can either discover the top-k most informative itemsets, or employ either the Bayesian Information Criterion (BIC) or the Minimum Description Length (MDL) principle to automatically identify the set of itemsets that together summarize the data well. In other words, our method will “tell you what you need to know” about the data. Importantly, it is a one-phase algorithm: rather than picking itemsets from a user-provided candidate set, itemsets and their supports are mined on-the-fly. To further its applicability, we provide an efficient method to compute the maximum entropy distribution using Quick Inclusion-Exclusion. Experiments on our method, using synthetic, benchmark, and real data, show that the discovered summaries are succinct, and correctly identify the key patterns in the data. The models they form attain high likelihoods, and inspection shows that they summarize the data well with increasingly specific, yet nonredundant itemsets.
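The iterative loop can be made concrete on a tiny scale: maintain a maximum entropy model over the items, repeatedly pick the itemset whose observed frequency diverges most from the model's prediction, add its frequency as a constraint, and refit. The sketch below fits the model exactly by iterative proportional fitting over all 2^n rows, which is only feasible for a handful of items (the paper's Quick Inclusion-Exclusion machinery is what makes this tractable in practice); the data and candidate-size cap are illustrative.

```python
import math
from itertools import combinations, product

def maxent_model(n_items, constraints, iters=300):
    """Maximum entropy distribution over all 2^n binary rows, fitted by
    iterative proportional fitting to match the given itemset
    frequencies. Exact but exponential in n: fine for a demo only."""
    states = list(product([0, 1], repeat=n_items))
    p = dict.fromkeys(states, 1 / len(states))
    for _ in range(iters):
        for itemset, target in constraints.items():
            cur = sum(v for s, v in p.items() if all(s[i] for i in itemset))
            if 0 < cur < 1:
                for s in p:
                    covered = all(s[i] for i in itemset)
                    p[s] *= (target / cur) if covered else (1 - target) / (1 - cur)
    return p

def freq(data, itemset):
    return sum(all(r[i] for i in itemset) for r in data) / len(data)

def surprise(obs, est):
    """KL divergence (in bits) between the observed and the predicted
    frequency of a single itemset."""
    est = min(max(est, 1e-9), 1 - 1e-9)
    return sum(o * math.log2(o / e)
               for o, e in [(obs, est), (1 - obs, 1 - est)] if o > 0)

def most_informative(data, model, max_size=3):
    """The occurring itemset whose frequency the current model predicts
    worst; this is the next itemset to add to the summary."""
    n = len(data[0])
    return max((x for size in range(1, max_size + 1)
                for x in combinations(range(n), size) if freq(data, x) > 0),
               key=lambda x: surprise(freq(data, x),
                                      sum(v for s, v in model.items()
                                          if all(s[i] for i in x))))

# Toy data over 4 items in which items 0, 1, 2 strongly co-occur; the
# first pick should be the planted itemset (0, 1, 2).
data = [(1, 1, 1, 0)] * 6 + [(0, 0, 0, 1)] * 3 + [(1, 0, 0, 0)]
constraints = {}
for _ in range(3):
    model = maxent_model(4, constraints)
    pick = most_informative(data, model)
    print('adding', pick, 'freq', freq(data, pick))
    constraints[pick] = freq(data, pick)   # update the model's knowledge
```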