
Publications


Featured research published by Steven de Rooij.


Conference on Learning Theory | 2005

Asymptotic log-loss of prequential maximum likelihood codes

Peter Grünwald; Steven de Rooij

We analyze the Dawid-Rissanen prequential maximum likelihood codes relative to one-parameter exponential family models M. If data are i.i.d. according to an (essentially) arbitrary P, then the redundancy grows at rate (c/2) ln n. We show that c = σ₁²/σ₂², where σ₁² is the variance of P and σ₂² is the variance of the distribution M* ∈ M that is closest to P in KL divergence. This shows that prequential codes behave quite differently from other important universal codes such as the two-part MDL, Shtarkov, and Bayes codes, for which c = 1. This behavior is undesirable in an MDL model selection setting.
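
As an illustration (a hedged sketch, not taken from the paper), the constant c can be computed for a concrete case. Assume the model class M is the Poisson family and P is a geometric distribution with a hypothetical mean of 4; for one-parameter exponential families the KL-closest element M* matches the mean of P, so σ₂² equals that mean:

    # Hedged sketch, not from the paper: the redundancy constant
    # c = var(P) / var(M*) when M is the Poisson family and P is a
    # geometric distribution with mean 4 (hypothetical numbers).
    mean = 4.0                  # mean of the data-generating P
    p = 1.0 / (1.0 + mean)      # geometric on {0, 1, 2, ...} with this mean
    var_P = (1.0 - p) / p**2    # variance of the geometric distribution
    var_Mstar = mean            # the KL-closest Poisson matches the mean,
                                # and a Poisson's variance equals its mean
    c = var_P / var_Mstar
    print(f"c = {c:.1f}")       # c = 5.0

Here the prequential redundancy grows like (5/2) ln n, five times the (1/2) ln n rate of the Bayes and Shtarkov codes.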


Journal of Web Semantics | 2015

Substructure counting graph kernels for machine learning from RDF data

Gerben Klaas Dirk de Vries; Steven de Rooij

In this paper we introduce a framework for learning from RDF data using graph kernels that count substructures in RDF graphs. The framework systematically covers most of the existing kernels previously defined and provides a number of new variants, including fast kernel variants that are computed directly on the RDF graph. To improve the performance of these kernels we describe two strategies: the first ignores vertex labels that have a low frequency among the instances, and the second removes hubs to simplify the RDF graphs. We test our kernels in a number of classification experiments with real-world RDF datasets. Overall, the kernels that count subtrees show the best performance, but they are closely followed by simple bag-of-labels baseline kernels. The direct kernels substantially decrease computation time while keeping performance the same; for the walk-counting kernel this decrease is so large that it becomes a computationally viable kernel to use. Ignoring low-frequency labels improves performance on all datasets. The hub removal algorithm increases performance on two of our three smaller datasets, but has little impact on our larger datasets.
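
To make the bag-of-labels baseline concrete, here is a minimal sketch; the toy triples, the feature construction, and the hop depth are hypothetical choices, not the paper's experimental setup:

    from collections import Counter

    # Hedged sketch of a bag-of-labels baseline kernel: each instance is
    # represented by the multiset of labels reachable from its root
    # vertex, and the kernel is the dot product of those counts.
    triples = [
        ("ex:alice", "rdf:type",   "ex:Person"),
        ("ex:alice", "ex:worksAt", "ex:UvA"),
        ("ex:bob",   "rdf:type",   "ex:Person"),
        ("ex:bob",   "ex:worksAt", "ex:VU"),
    ]

    def bag_of_labels(root, depth=2):
        """Count predicate and object labels within `depth` hops of `root`."""
        bag, frontier = Counter(), {root}
        for _ in range(depth):
            nxt = set()
            for s, p, o in triples:
                if s in frontier:
                    bag[p] += 1
                    bag[o] += 1
                    nxt.add(o)
            frontier = nxt
        return bag

    def kernel(a, b):
        ba, bb = bag_of_labels(a), bag_of_labels(b)
        return sum(ba[l] * bb[l] for l in ba.keys() & bb.keys())

    print(kernel("ex:alice", "ex:bob"))  # 3: rdf:type, ex:Person, ex:worksAt shared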


European Semantic Web Conference | 2015

A Compact In-Memory Dictionary for RDF Data

Hamid R. Bazoobandi; Steven de Rooij; Jacopo Urbani; Annette ten Teije; Frank van Harmelen; Henri E. Bal

While almost all dictionary compression techniques focus on static RDF data, we present a compact in-memory RDF dictionary for dynamic and streaming data. To do so, we analysed the structure of terms in real-world datasets and observed a high degree of common prefixes. We studied the applicability of Trie data structures to RDF data, to reduce the memory occupied by common prefixes, and discovered that all existing Trie implementations lead to either poor performance or excessive memory wastage. In our approach, we address the existing limitations of Tries for RDF data and propose a new Trie variant containing optimizations designed explicitly to improve performance on RDF data. Furthermore, we show how to use this Trie as an in-memory dictionary by using a memory address, rather than an integer counter, as the numerical ID. This design removes the need for an additional decoding data structure and further reduces the occupied memory. An empirical analysis on real-world datasets shows that, with a reasonable overhead, our technique uses 50–59% less memory than a conventional uncompressed dictionary.
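
The address-as-ID idea can be sketched in a few lines. The toy per-character trie below, which uses CPython's id() and ctypes to stand in for raw memory addresses, is a hypothetical illustration, not the paper's optimized variant:

    import ctypes

    # Hedged toy sketch: terms sharing a prefix share trie nodes, and a
    # term's numerical ID is the address of its terminal node, so no
    # separate ID-to-term table is needed for decoding. CPython-specific:
    # id() returns an object's address, and ctypes can dereference it.
    class Node:
        __slots__ = ("children", "parent", "label", "terminal")
        def __init__(self, parent=None, label=""):
            self.children, self.parent = {}, parent
            self.label, self.terminal = label, False

    root = Node()

    def encode(term):
        node = root
        for ch in term:
            node = node.children.setdefault(ch, Node(node, ch))
        node.terminal = True
        return id(node)              # the node's address is the term's ID

    def decode(term_id):
        node = ctypes.cast(term_id, ctypes.py_object).value
        parts = []
        while node.parent is not None:
            parts.append(node.label)
            node = node.parent
        return "".join(reversed(parts))

    i = encode("http://example.org/Person")
    encode("http://example.org/Place")   # shares the long common prefix
    print(decode(i))                     # http://example.org/Person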


Statistics & Probability Letters | 2011

Insuring against loss of evidence in game-theoretic probability

A. Philip Dawid; Steven de Rooij; Glenn Shafer; Alexander Shen; Nikolai K. Vereshchagin; Vladimir Vovk

Statistical testing can be framed as a repetitive game between two players, Forecaster and Sceptic. In each round, Forecaster sets prices for various gambles, and Sceptic chooses which gambles to make. If Sceptic multiplies the capital he puts at risk by a large factor, he has evidence against Forecaster's ability. His capital at the end of each round is a measure of his evidence against Forecaster so far. This capital can go up and then back down; if you report the maximum so far instead of the current value, you are exaggerating the evidence against Forecaster. In this article, we show how to remove the exaggeration. Removing it means systematically reducing the maximum in such a way that a rival to Sceptic can always play so as to obtain current evidence as good as Sceptic's reduced maximum. We characterize the functions that can achieve such reductions. Because these functions may impose only modest reductions, we think of our result as a method of insuring against loss of evidence. In the context of an actual market, it is a method of insuring against the loss of what an investor has gained so far.
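
A small simulation makes the setting concrete. The fair-coin prices, the 20% betting fraction, and the adjustment F(y) = √y below are hypothetical choices, used only to show how the running maximum exaggerates the current evidence:

    import random

    # Hedged simulation of the Forecaster/Sceptic game. Forecaster prices
    # a fair coin; Sceptic always bets 20% of his capital on heads. The
    # adjustment F(y) = sqrt(y) is one illustrative reduction, not the
    # paper's characterization of admissible functions.
    random.seed(1)
    capital, running_max = 1.0, 1.0
    for _ in range(1000):
        stake = 0.2 * capital
        capital += stake if random.random() < 0.5 else -stake
        running_max = max(running_max, capital)

    print(f"current capital : {capital:.4f}")           # evidence right now
    print(f"running maximum : {running_max:.4f}")       # exaggerated evidence
    print(f"adjusted maximum: {running_max**0.5:.4f}")  # one possible reduction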


Journal of Computer and System Sciences | 2011

Luckiness and Regret in Minimum Description Length Inference

Steven de Rooij; Peter Grünwald

Minimum Description Length (MDL) inference is based on the intuition that understanding the available data can be defined in terms of the ability to compress the data, i.e. to describe it in full using a shorter representation. This brief introduction discusses the design of the various codes used to implement MDL, focusing on the philosophically intriguing concepts of luckiness and regret: a good MDL code exhibits good performance in the worst case over all possible data sets, but achieves even better performance when the data turn out to be simple (although we suggest making no a priori assumptions to that effect). We then discuss how data compression relates to performance in various learning tasks, including parameter estimation, parametric and nonparametric model selection, and sequential prediction of outcomes from an unknown source. Last, we briefly outline the history of MDL and its technical and philosophical relationship to other approaches to learning, such as Bayesian, frequentist and prequential statistics.
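
As a hedged worked example of regret (standard material, not specific to this paper): the Bayes mixture code for the Bernoulli model with Jeffreys prior, known as the Krichevsky-Trofimov code, has regret roughly (1/2) log₂ n bits against the best Bernoulli code in hindsight. The input counts below are hypothetical:

    from math import lgamma, log

    # Code length (in bits) assigned by the Krichevsky-Trofimov code to a
    # binary sequence with k ones out of n, via its Beta(1/2,1/2) marginal.
    def kt_codelength(k, n):
        logp = (lgamma(k + 0.5) + lgamma(n - k + 0.5) + lgamma(1.0)
                - lgamma(0.5) - lgamma(0.5) - lgamma(n + 1.0))
        return -logp / log(2)

    # Code length of the maximum-likelihood Bernoulli chosen in hindsight.
    def ml_codelength(k, n):
        if k in (0, n):
            return 0.0
        p = k / n
        return -(k * log(p, 2) + (n - k) * log(1 - p, 2))

    n, k = 100, 30                      # hypothetical sequence statistics
    regret = kt_codelength(k, n) - ml_codelength(k, n)
    print(f"regret = {regret:.2f} bits")  # roughly (1/2) log2 n + O(1)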


IEEE Transactions on Information Theory | 2013

Universal Codes From Switching Strategies

Wouter M. Koolen; Steven de Rooij

We discuss algorithms for combining sequential prediction strategies, a task which can be viewed as a natural generalization of the concept of universal coding. We describe a graphical language based on hidden Markov models for defining prediction strategies, and we provide both existing and new models as examples. The models include efficient, parameterless models for switching between the input strategies over time, including a model for the case where switches tend to occur in clusters, and finally a new model for the scenario where the prediction strategies have a known relationship, and where jumps are typically between strongly related ones. This last model is relevant for coding time series data where parameter drift is expected. As theoretical contributions, we introduce an interpolation construction that is useful in the development and analysis of new algorithms, and we establish a new sophisticated lemma for analyzing the individual sequence regret of parameterized models.
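
The simplest model of this kind can be sketched as a two-state hidden Markov model whose states are the input strategies; the fixed switching rate alpha below is a hypothetical choice (the paper's models are parameterless and more refined), and prediction is by the standard forward algorithm:

    from math import log2

    # Hedged sketch: combine two prediction strategies with a two-state
    # switching HMM. Strategy 0 always predicts P(1) = 0.9; strategy 1
    # predicts P(1) = 0.1. Outcomes and alpha are hypothetical.
    alpha = 0.05                       # probability of switching each round
    w = [0.5, 0.5]                     # forward weights over the strategies
    preds = [0.9, 0.1]
    bits = [1, 1, 1, 1, 0, 0, 0, 0]
    logloss = 0.0

    for x in bits:
        p = sum(wi * (pi if x == 1 else 1 - pi) for wi, pi in zip(w, preds))
        logloss += -log2(p)
        # condition on the outcome, then apply the switch transition
        w = [wi * (pi if x == 1 else 1 - pi) / p for wi, pi in zip(w, preds)]
        w = [(1 - alpha) * w[0] + alpha * w[1],
             alpha * w[0] + (1 - alpha) * w[1]]

    print(f"accumulated log loss: {logloss:.2f} bits")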


International Semantic Web Conference | 2016

Are Names Meaningful? Quantifying Social Meaning on the Semantic Web

Steven de Rooij; Wouter Beek; Peter Bloem; Frank van Harmelen; Stefan Schlobach

According to its model-theoretic semantics, Semantic Web IRIs are individual constants or predicate letters whose names are chosen arbitrarily and carry no formal meaning. At the same time it is a well-known aspect of Semantic Web pragmatics that IRIs are often constructed mnemonically, in order to be meaningful to a human interpreter. The latter has traditionally been termed ‘social meaning’, a concept that has been discussed but not yet quantitatively studied by the Semantic Web community. In this paper we use measures of mutual information content and methods from statistical model learning to quantify the meaning that is (at least) encoded in Semantic Web names. We implement the approach and evaluate it over hundreds of thousands of datasets in order to illustrate its efficacy. Our experiments confirm that many Semantic Web names are indeed meaningful and, more interestingly, we provide a quantitative lower bound on how much meaning is encoded in names on a per-dataset basis. To our knowledge, this is the first paper about the interaction between social and formal meaning, as well as the first paper that uses statistical model learning as a method to quantify meaning in the Semantic Web context. These insights are useful for the design of a new generation of Semantic Web tools that take such social meaning into account.
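
A toy version of the underlying measurement: estimate the empirical mutual information between a crude name feature and a resource's class. The pairs and the feature extractor below are hypothetical; the paper operates on hundreds of thousands of real datasets:

    from collections import Counter
    from math import log2

    # Hedged toy: mutual information between the last token of a local
    # name and the resource's class, from empirical counts.
    pairs = [
        ("alice_smith", "Person"), ("bob_jones", "Person"),
        ("berlin_city", "Place"),  ("paris_city", "Place"),
        ("carol_smith", "Person"), ("tokyo_city", "Place"),
    ]

    def feature(name):
        return name.split("_")[-1]     # e.g. 'smith' or 'city'

    n = len(pairs)
    pf  = Counter(feature(a) for a, _ in pairs)
    pc  = Counter(c for _, c in pairs)
    pfc = Counter((feature(a), c) for a, c in pairs)

    mi = sum((k / n) * log2((k / n) / ((pf[f] / n) * (pc[c] / n)))
             for (f, c), k in pfc.items())
    print(f"I(feature; class) = {mi:.2f} bits")  # 1.00 here: names carry meaning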


Algorithmic Learning Theory | 2014

A Safe Approximation for Kolmogorov Complexity

Peter Bloem; Francisco Mota; Steven de Rooij; Luís Antunes; Pieter W. Adriaans

Kolmogorov complexity (K) is an incomputable function. It can be approximated from above, but not to any given precision, and it cannot be approximated from below. By restricting the source of the data to a specific model class, we can construct a computable function κ̄ that approximates K in a probabilistic sense: the probability that the error is greater than k decays exponentially with k. We apply the same method to the normalized information distance (NID) and discuss conditions that affect the safety of the approximation.
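
The "approximable from above" part is easy to demonstrate with any real compressor: the output length of a lossless compressor upper-bounds K(x) up to an additive constant. The zlib-based sketch below is a generic illustration, not the model-class construction κ̄ of the paper:

    import os
    import zlib

    # Hedged illustration: any lossless compressor's output length gives
    # an upper bound on K(x), up to a constant for the decompressor.
    def k_upper_bound(data: bytes) -> int:
        return 8 * len(zlib.compress(data, 9))    # bound in bits

    structured = b"ab" * 500         # highly regular input
    noise = os.urandom(1000)         # incompressible input

    print(k_upper_bound(structured))   # small: the pattern compresses well
    print(k_upper_bound(noise))        # around 8000 bits: no structure found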


Algorithmic Learning Theory | 2015

Two Problems for Sophistication

Peter Bloem; Steven de Rooij; Pieter W. Adriaans

Kolmogorov complexity measures the amount of information in data, but does not distinguish structure from noise. Kolmogorov's definition of the structure function was the first attempt to measure only the structural information in data, by measuring the complexity of the smallest model that allows for optimal compression of the data. Since then, many variations of this idea have been proposed, for which we use sophistication as an umbrella term. We describe two fundamental problems with existing proposals, showing many of them to be unsound. Consequently, we put forward the view that the problem is fundamental: it may be impossible to objectively quantify sophistication.
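
A toy version of the structure-function idea helps: restrict the models (hypothetically) to sets containing the data x, and compare two-part code lengths L(model) + log₂|model|. The string and model class below are illustrative choices, not the paper's:

    from math import comb, log2

    # Hedged toy: three models, each a set containing x; the two-part
    # code describes the model, then x's index within it.
    x = "0000000100000000"          # hypothetical data
    n, k = len(x), x.count("1")

    candidates = {
        "singleton {x}":  (float(n), 0.0),                   # spell out x itself
        "type class S_k": (log2(n + 1), log2(comb(n, k))),   # k, then index in S_k
        "all of {0,1}^n": (0.0, float(n)),                   # no structure claimed
    }
    for name, (model_bits, data_bits) in candidates.items():
        print(f"{name:16s} total = {model_bits + data_bits:5.2f} bits")
    # S_k wins here: x's structure (a single 1) is captured by a simple
    # model. The paper argues that turning this intuition into a robust,
    # objective definition of sophistication is problematic.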


Journal of Machine Learning Research | 2014

Follow the leader if you can, hedge if you must

Steven de Rooij; Tim van Erven; Peter Grünwald; Wouter M. Koolen
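
The title refers to two classic strategies for prediction with expert advice. The sketch below shows the textbook versions, not the paper's adaptive algorithms: Follow-the-Leader plays the expert with the smallest cumulative loss, while Hedge plays exponential weights; the losses and the learning rate eta are hypothetical:

    from math import exp

    # Hedged sketch: two experts, five rounds of alternating losses.
    losses = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0]]
    eta = 0.5
    cum = [0.0, 0.0]               # cumulative loss per expert
    ftl_loss = hedge_loss = 0.0

    for round_losses in losses:
        leader = min(range(2), key=lambda i: cum[i])
        ftl_loss += round_losses[leader]           # Follow-the-Leader
        w = [exp(-eta * c) for c in cum]           # Hedge weights
        z = sum(w)
        hedge_loss += sum(wi / z * l for wi, l in zip(w, round_losses))
        cum = [c + l for c, l in zip(cum, round_losses)]

    print(f"FTL loss  : {ftl_loss:.2f}")    # 5.00: fooled every round
    print(f"Hedge loss: {hedge_loss:.2f}")  # about 2.7: hedging helps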

Collaboration


Dive into Steven de Rooij's collaborations.

Top Co-Authors

Peter Bloem
University of Amsterdam

A. Philip Dawid
University College London

Alexander Shen
University of Montpellier