Peter Grünwald
Centrum Wiskunde & Informatica
Publications
Featured research published by Peter Grünwald.
Annals of Statistics | 2004
Peter Grünwald; A. Philip Dawid
We describe and develop a close relationship between two problems that have customarily been regarded as distinct: that of maximizing entropy, and that of minimizing worst-case expected loss. Using a formulation grounded in the equilibrium theory of zero-sum games between Decision Maker and Nature, these two problems are shown to be dual to each other, the solution to each providing that to the other. Although Topsoe described this connection for the Shannon entropy over 20 years ago, it does not appear to be widely known even in that important special case. We here generalize this theory to apply to arbitrary decision problems and loss functions. We indicate how an appropriate generalized definition of entropy can be associated with such a problem, and we show that, subject to certain regularity conditions, the above-mentioned duality continues to apply in this extended context. This simultaneously provides a possible rationale for maximizing entropy and a tool for finding robust Bayes acts. We also describe the essential identity between the problem of maximizing entropy and that of minimizing a related discrepancy or divergence between distributions. This leads to an extension, to arbitrary discrepancies, of a well-known minimax theorem for the case of Kullback-Leibler divergence (the redundancy-capacity theorem of information theory). For the important case of families of distributions having certain mean values specified, we develop simple sufficient conditions and methods for identifying the desired solutions. We use this theory to introduce a new concept of generalized exponential family linked to the specific decision problem under consideration, and we demonstrate that this shares many of the properties of standard exponential families. 
Finally, we show that the existence of an equilibrium in our game can be rephrased in terms of a Pythagorean property of the related divergence, thus generalizing previously announced results for Kullback-Leibler and Bregman divergences.
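In the classical Shannon-entropy special case, with the mean of the distribution specified, the maximum-entropy solution is a member of the exponential family, p_k proportional to exp(λk). The following sketch (our own illustration, not code from the paper; the six-sided-die example and the bisection solver are assumptions) finds that solution numerically:

```python
import math

def maxent_dice(target_mean, lo=-10.0, hi=10.0, iters=200):
    """Max-entropy distribution on {1,...,6} with a given mean.
    The solution is the exponential family p_k proportional to
    exp(lam * k); we solve for lam by bisection on the mean."""
    def mean(lam):
        w = [math.exp(lam * k) for k in range(1, 7)]
        z = sum(w)
        return sum(k * wk for k, wk in zip(range(1, 7), w)) / z

    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [math.exp(lam * k) for k in range(1, 7)]
    z = sum(w)
    return [wk / z for wk in w]

# a die constrained to have mean 4.5 instead of the uniform 3.5
p = maxent_dice(4.5)
```

Because the target mean exceeds 3.5, the solver returns λ > 0 and the resulting probabilities increase monotonically in the face value, as the exponential-family form predicts.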
Machine Learning | 2005
Teemu Roos; Hannes Wettig; Peter Grünwald; Petri Myllymäki; Henry Tirri
Discriminative learning of the parameters in the naive Bayes model is known to be equivalent to a logistic regression problem. Here we show that the same fact holds for much more general Bayesian network models, as long as the corresponding network structure satisfies a certain graph-theoretic property. The property holds for naive Bayes but also for more complex structures such as tree-augmented naive Bayes (TAN) as well as for mixed diagnostic-discriminative structures. Our results imply that for networks satisfying our property, the conditional likelihood cannot have local maxima so that the global maximum can be found by simple local optimization methods. We also show that if this property does not hold, then in general the conditional likelihood can have local, non-global maxima. We illustrate our theoretical results by empirical experiments with local optimization in a conditional naive Bayes model. Furthermore, we provide a heuristic strategy for pruning the number of parameters and relevant features in such models. For many data sets, we obtain good results with heavily pruned submodels containing many fewer parameters than the original naive Bayes model.
international joint conference on artificial intelligence | 1996
Peter Grünwald
In forming hollow, molecularly oriented articles having a substantially oval or substantially equilateral-triangular cross section from thermoplastic preforms, by a method that includes distending the preforms in a mold while at molecular orientation temperature, the improvement provides better material distribution in the regions furthest from the axes of the articles when the thermoplastic is moldable polyalkylene terephthalate having an inherent viscosity of at least 0.55. The improvement involves limiting the corner definition ratio to between about 3 and about 9, limiting the circular deviation ratio to no greater than about 2.4 at the cross section during distension, and controlling the axial and maximum radial stretch of the preforms within predetermined limits.
Statistics and Computing | 2000
Petri Kontkanen; Petri Myllymäki; Tomi Silander; Henry Tirri; Peter Grünwald
In this paper we are interested in discrete prediction problems for a decision-theoretic setting, where the task is to compute the predictive distribution for a finite set of possible alternatives. This question is first addressed in a general Bayesian framework, where we consider a set of probability distributions defined by some parametric model class. Given a prior distribution on the model parameters and a set of sample data, one possible approach for determining a predictive distribution is to fix the parameters to the instantiation with the maximum a posteriori probability. A more accurate predictive distribution can be obtained by computing the evidence (marginal likelihood), i.e., the integral over all the individual parameter instantiations. As an alternative to these two approaches, we demonstrate how to use Rissanen's new definition of stochastic complexity for determining predictive distributions, and show how the evidence predictive distribution with Jeffreys prior approaches the new stochastic complexity predictive distribution in the limit with increasing amount of sample data. To compare the alternative approaches in practice, each of the predictive distributions discussed is instantiated in the Bayesian network model family case. In particular, to determine Jeffreys prior for this model family, we show how to compute the (expected) Fisher information matrix for a fixed but arbitrary Bayesian network structure. In the empirical part of the paper the predictive distributions are compared by using the simple tree-structured Naive Bayes model, which is used in the experiments for computational reasons. The experimentation with several public domain classification datasets suggests that the evidence approach produces the most accurate predictions in the log-score sense. The evidence-based methods are also quite robust in the sense that they predict surprisingly well even when only a small fraction of the full training set is used.
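For the simplest instance of this comparison, a single Bernoulli parameter rather than a full Bayesian network, all three predictive distributions have closed forms. The sketch below (our own illustration, not code from the paper) uses a uniform Beta(1, 1) prior for the MAP and evidence approaches and the Jeffreys Beta(1/2, 1/2) prior, whose evidence predictive is the Krichevsky-Trofimov estimator and approaches the stochastic-complexity predictive as the sample grows:

```python
def predictives(s, n):
    """Three predictive probabilities that the next outcome is 1,
    after observing s ones in n Bernoulli trials (n >= 1).
    Returns (MAP plug-in, uniform-prior evidence, Jeffreys evidence)."""
    # MAP plug-in under a uniform prior coincides with the ML estimate
    map_pred = s / n
    # evidence (marginal likelihood) with uniform prior: Laplace's rule
    evidence_uniform = (s + 1) / (n + 2)
    # evidence with Jeffreys prior: Krichevsky-Trofimov estimator
    evidence_jeffreys = (s + 0.5) / (n + 1)
    return map_pred, evidence_uniform, evidence_jeffreys
```

The extreme case s = 0 shows why plug-in prediction scores badly in log loss: the plug-in assigns probability 0 to the next outcome being 1 (infinite log loss if it occurs), whereas both evidence-based predictives stay strictly positive.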
Psychological Science | 2006
Eric-Jan Wagenmakers; Peter Grünwald
In a recent article, Killeen (2005a) proposed an alternative to traditional null-hypothesis significance testing (NHST). This alternative test is based on the statistic prep, which is the probability of replicating an effect. We share Killeen's skepticism with respect to null-hypothesis testing, and we sympathize with the proposed conceptual shift toward issues such as replicability. One of the problems associated with NHST is that p values are prone to misinterpretation (cf. Nickerson, 2000, pp. 246–263). Another problem with NHST is that it can provide highly misleading evidence against the null hypothesis (Killeen, 2005a, p. 345): NHST can lead one to reject the null hypothesis when there is really not enough evidence to do so. Killeen's prep statistic successfully addresses the problem of misinterpretation, and this is a major contribution (cf. Cumming, 2005; Doros & Geier, 2005; Killeen, 2005b; Macdonald, 2005). However, the prep statistic does not remedy the second, more fundamental NHST problem mentioned by Killeen. Here we perform the standard analysis to show that prep can provide misleading evidence against the null hypothesis (cf. Berger & Sellke, 1987; Edwards, Lindman, & Savage, 1963). This analysis demonstrates the discrepancy between Bayesian hypothesis testing and prep, and highlights the necessity of considering the plausibility of both the null hypothesis and the alternative hypothesis. Consider an experiment in taste perception in which a participant has to determine which of two beverage samples contains sugar. After n trials, with s successes (i.e., correct decisions) and f failures, we wish to choose between two hypotheses: H0 (i.e., random guessing) and H1 (i.e., gustatory discriminability). For inference, we use the binomial model, in which the likelihood L(θ) is proportional to θ^s (1 − θ)^f, where θ denotes the probability of a correct decision on any one trial.
A Bayesian hypothesis test (Jeffreys, 1961) proceeds by contrasting two quantities: the probability of the observed data D given H0 (i.e., θ = 1/2) and the probability of the observed data D given H1 (i.e., θ ≠ 1/2). The ratio B01 = p(D|H0) / p(D|H1) is the Bayes factor, and it quantifies the evidence that the data provide for H0 vis-à-vis H1. Assuming equal prior plausibility for the models, the posterior probability for H0 is given by B01 / (1 + B01). In the taste perception experiment, p(D|H0) = (1/2)^n. The quantity p(D|H1) is more difficult to calculate, because it depends on our prior beliefs about θ. Specifically, when prior knowledge of θ is given by a prior distribution p(θ), one obtains p(D|H1) by integrating L(θ) over all possible values of θ, weighted by the prior distribution p(θ): p(D|H1) = ∫ from 0 to 1 of L(θ) p(θ) dθ. We consider two classes of priors.
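Under a uniform prior on θ, one of the simplest priors one might adopt for H1, the integral has the closed form ∫ θ^s (1 − θ)^f dθ = 1 / ((n + 1) C(n, s)), so the Bayes factor can be computed exactly. A minimal sketch (the uniform prior and the function names are our assumptions, not necessarily the priors analyzed in the article):

```python
from math import comb

def bayes_factor_01(s, f):
    """B01 = p(D | H0) / p(D | H1) for s successes and f failures.
    H0 fixes theta = 1/2; H1 puts a uniform prior on theta, so
    p(D | H1) = integral of theta^s (1-theta)^f = 1/((n+1)*C(n, s))."""
    n = s + f
    p_d_h0 = 0.5 ** n
    p_d_h1 = 1 / ((n + 1) * comb(n, s))
    return p_d_h0 / p_d_h1

def posterior_h0(s, f):
    """Posterior probability of H0 under equal prior odds."""
    b01 = bayes_factor_01(s, f)
    return b01 / (1 + b01)
```

For example, 9 correct decisions out of 10 give B01 = (1/2)^10 / (1/110) = 110/1024, i.e. roughly 9-to-1 evidence against H0, a much more moderate verdict than the corresponding p value would suggest.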
Machine Learning | 2007
Peter Grünwald; John Langford
We show that forms of Bayesian and MDL inference that are often applied to classification problems can be inconsistent. This means that there exists a learning problem such that for all amounts of data the generalization errors of the MDL classifier and the Bayes classifier relative to the Bayesian posterior both remain bounded away from the smallest achievable generalization error. From a Bayesian point of view, the result can be reinterpreted as saying that Bayesian inference can be inconsistent under misspecification, even for countably infinite models. We extensively discuss the result from both a Bayesian and an MDL perspective.
conference on learning theory | 2004
Peter Grünwald; John Langford
We show that forms of Bayesian and MDL inference that are often applied to classification problems can be inconsistent. This means there exists a learning problem such that for all amounts of data the generalization errors of the MDL classifier and the Bayes classifier relative to the Bayesian posterior both remain bounded away from the smallest achievable generalization error.
algorithmic learning theory | 2012
Peter Grünwald
Standard Bayesian inference can behave suboptimally if the model is wrong. We present a modification of Bayesian inference which continues to achieve good rates with wrong models. Our method adapts the Bayesian learning rate to the data, picking the rate minimizing the cumulative loss of sequential prediction by posterior randomization. Our results can also be used to adapt the learning rate in a PAC-Bayesian context. The results are based on an extension of an inequality due to T. Zhang and others to dependent random variables.
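The learning-rate idea can be illustrated in miniature: raise the likelihood to a power η in the Bayesian update, and select the η whose sequential predictions accumulate the least loss. The following toy sketch (our own simplification with a finite set of candidate Bernoulli models and a grid of rates, not the algorithm from the paper) scores each η by the posterior-expected log loss of predicting each point before seeing it, which mimics prediction by posterior randomization:

```python
import math

def safe_bayes_eta(xs, thetas, etas):
    """Pick the learning rate eta minimizing the cumulative sequential
    log loss of the eta-generalized posterior over candidate Bernoulli
    parameters `thetas` on binary data `xs` (a toy sketch)."""
    def run(eta):
        log_post = [0.0] * len(thetas)   # uniform prior, in log space
        total = 0.0
        for x in xs:
            m = max(log_post)
            ws = [math.exp(lp - m) for lp in log_post]
            z = sum(ws)
            # posterior-expected log loss on x before observing it
            total += sum(w / z * -math.log(t if x else 1 - t)
                         for w, t in zip(ws, thetas))
            # eta-generalized Bayesian update: likelihood raised to eta
            for i, t in enumerate(thetas):
                log_post[i] += eta * math.log(t if x else 1 - t)
        return total
    return min(etas, key=run)
```

On well-specified data the standard rate η = 1 is competitive; the point of the paper is that under misspecification a smaller, data-selected rate can restore good behavior.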
conference on learning theory | 1999
Peter Grünwald
In order to apply the Minimum Description Length Principle, one must associate each model in the model class under consideration with a corresponding code. For probabilistic model classes, there is a principled and generally agreed-upon method for doing this; for non-probabilistic model classes (i.e. classes of functions together with associated error functions) it is not so clear how to do this. Here, we present a new method for associating codes with models that works for probabilistic and non-probabilistic model classes alike. Our method can be re-interpreted as mapping arbitrary model classes to associated classes of probability distributions. The method can therefore also be applied in a Bayesian context. In contrast to earlier proposals by Barron, Yamanishi and Rissanen and to the ad-hoc solutions found in applications of MDL, our method involves learning the optimal scaling factor in the mapping from models to codes/probability distributions from the data at hand. We show that this method satisfies several optimality properties. We present several theorems that suggest that with the help of our mapping of models to codes, one can successfully learn using MDL and/or Bayesian methods when (1) almost arbitrary model classes and error functions are allowed, and (2) none of the models in the class under consideration are close to the 'truth' that generates the data.
conference on learning theory | 2005
Peter Grünwald; Steven de Rooij
We analyze the Dawid-Rissanen prequential maximum likelihood codes relative to one-parameter exponential family models M. If data are i.i.d. according to an (essentially) arbitrary P, then the redundancy grows at rate (c/2) ln n. We show that c = σ₁²/σ₂², where σ₁² is the variance of P, and σ₂² is the variance of the distribution M* ∈ M that is closest to P in KL divergence. This shows that prequential codes behave quite differently from other important universal codes such as the 2-part MDL, Shtarkov and Bayes codes, for which c = 1. This behavior is undesirable in an MDL model selection setting.