Radford M. Neal
University of Toronto
Publications
Featured publications by Radford M. Neal.
Communications of The ACM | 1987
Ian H. Witten; Radford M. Neal; John G. Cleary
The state of the art in data compression is arithmetic coding, not the better-known Huffman method. Arithmetic coding gives greater compression, is faster for adaptive models, and clearly separates the model from the channel encoding.
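The separation of model and coder described above can be illustrated with a minimal sketch. It is not the implementation from the article: it assumes a fixed (non-adaptive) frequency table, uses exact rational arithmetic instead of the fixed-precision integer arithmetic a practical coder needs, and the names `build_model`, `encode`, and `decode` are hypothetical.

```python
from fractions import Fraction


def build_model(freqs):
    """Model: map each symbol to its cumulative-probability interval [low, high)."""
    total = sum(freqs.values())
    intervals, cum = {}, 0
    for sym in sorted(freqs):
        intervals[sym] = (Fraction(cum, total), Fraction(cum + freqs[sym], total))
        cum += freqs[sym]
    return intervals


def encode(message, intervals):
    """Coder: narrow [low, low + width) once per symbol, using only the model's intervals."""
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        s_low, s_high = intervals[sym]
        low = low + width * s_low
        width = width * (s_high - s_low)
    # Any number in [low, low + width) identifies the message (given its length).
    return low, low + width


def decode(value, length, intervals):
    """Invert the narrowing: find the symbol whose interval holds value, then rescale."""
    out = []
    for _ in range(length):
        for sym, (s_low, s_high) in intervals.items():
            if s_low <= value < s_high:
                out.append(sym)
                value = (value - s_low) / (s_high - s_low)
                break
    return "".join(out)


model = build_model({"a": 3, "b": 1})
low, high = encode("aab", model)
print(low, high, decode(low, 3, model))
```

Because `encode` and `decode` only ever consult the interval table, a different model (for example an adaptive one that updates counts after each symbol) could be swapped in without changing the interval-narrowing logic.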
Proceedings of the NATO Advanced Study Institute on Learning in graphical models | 1998
Radford M. Neal; Geoffrey E. Hinton
The EM algorithm performs maximum likelihood estimation for data in which some variables are unobserved. We present a function that resembles negative free energy and show that the M step maximizes this function with respect to the model parameters and the E step maximizes it with respect to the distribution over the unobserved variables. From this perspective, it is easy to justify an incremental variant of the EM algorithm in which the distribution for only one of the unobserved variables is recalculated in each E step. This variant is shown empirically to give faster convergence in a mixture estimation problem. A variant of the algorithm that exploits sparse conditional distributions is also described, and a wide range of other variant algorithms are also seen to be possible.
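The function referred to above can be written out as follows. The notation is an assumption made here for illustration (x for the observed data, z for the unobserved variables, q for a distribution over z, θ for the parameters) and may differ from the symbols used in the paper.

```latex
\begin{aligned}
F(q,\theta) &= \mathbb{E}_{q}\!\left[\log P(x, z \mid \theta)\right] + H(q)
             = \log P(x \mid \theta) - \mathrm{KL}\!\left(q(z)\,\|\,P(z \mid x, \theta)\right), \\
\text{E step:}\quad q^{(t)} &= \arg\max_{q} F\!\left(q, \theta^{(t-1)}\right) = P\!\left(z \mid x, \theta^{(t-1)}\right), \\
\text{M step:}\quad \theta^{(t)} &= \arg\max_{\theta} F\!\left(q^{(t)}, \theta\right).
\end{aligned}
```

Since the KL term is non-negative, F is a lower bound on the log likelihood, and both steps can only increase it. When the data items are independent, q factorizes as a product of per-item distributions q_i(z_i) and F is a sum of per-item terms, which is why recalculating a single q_i in a partial E step still increases F.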
Journal of Computational and Graphical Statistics | 2000
Radford M. Neal
This article reviews Markov chain methods for sampling from the posterior distribution of a Dirichlet process mixture model and presents two new classes of methods. One new approach is to make Metropolis-Hastings updates of the indicators specifying which mixture component is associated with each observation, perhaps supplemented with a partial form of Gibbs sampling. The other new approach extends Gibbs sampling for these indicators by using a set of auxiliary parameters. These methods are simple to implement and are more efficient than previous ways of handling general Dirichlet process mixture models with non-conjugate priors.
Neural Computation | 1999
Peter Dayan; Geoffrey E. Hinton; Radford M. Neal; Richard S. Zemel
Discovering the structure inherent in a set of patterns is a fundamental aim of statistical inference or learning. One fruitful approach is to build a parameterized stochastic generative model, independent draws from which are likely to produce the patterns. For all but the simplest generative models, each pattern can be generated in exponentially many ways. It is thus intractable to adjust the parameters to maximize the probability of the observed patterns. We describe a way of finessing this combinatorial explosion by maximizing an easily computed lower bound on the probability of the observations. Our method can be viewed as a form of hierarchical self-supervised learning that may relate to the function of bottom-up and top-down cortical processing pathways.
Statistics and Computing | 2001
Radford M. Neal
Simulated annealing—moving from a tractable distribution to a distribution of interest via a sequence of intermediate distributions—has traditionally been used as an inexact method of handling isolated modes in Markov chain samplers. Here, it is shown how one can use the Markov chain transitions for such an annealing sequence to define an importance sampler. The Markov chain aspect allows this method to perform acceptably even for high-dimensional problems, where finding good importance sampling distributions would otherwise be very difficult, while the use of importance weights ensures that the estimates found converge to the correct values as the number of annealing runs increases. This annealed importance sampling procedure resembles the second half of the previously-studied tempered transitions, and can be seen as a generalization of a recently-proposed variant of sequential importance sampling. It is also related to thermodynamic integration methods for estimating ratios of normalizing constants. Annealed importance sampling is most attractive when isolated modes are present, or when estimates of normalizing constants are required, but it may also be more generally useful, since its independent sampling allows one to bypass some of the problems of assessing convergence and autocorrelation in Markov chain samplers.
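A minimal sketch of the procedure follows, under assumptions chosen here for illustration and not taken from the article: a one-dimensional problem, a standard normal as the tractable distribution, an unnormalized Gaussian target centred at 3 with standard deviation 0.5, geometric interpolation between the two, and random-walk Metropolis transitions. All function names are hypothetical.

```python
import math
import random


def log_f0(x):
    """Normalized N(0, 1) log density: the tractable starting distribution."""
    return -0.5 * x * x - 0.5 * math.log(2 * math.pi)


def log_f1(x):
    """Unnormalized target; its true normalizing constant is 0.5 * sqrt(2 * pi)."""
    return -0.5 * ((x - 3.0) / 0.5) ** 2


def log_fj(x, beta):
    """Intermediate distribution: geometric interpolation between f0 and f1."""
    return (1 - beta) * log_f0(x) + beta * log_f1(x)


def ais_run(betas, rng, step=0.5, n_mh=3):
    """One annealing run; returns the final point and its log importance weight."""
    x = rng.gauss(0, 1)  # exact draw from the tractable distribution
    log_w = 0.0
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # Weight update uses the point reached before moving to the next distribution.
        log_w += log_fj(x, b) - log_fj(x, b_prev)
        # Metropolis updates that leave the distribution at beta = b invariant.
        for _ in range(n_mh):
            prop = x + rng.gauss(0, step)
            delta = log_fj(prop, b) - log_fj(x, b)
            if rng.random() < math.exp(min(0.0, delta)):
                x = prop
    return x, log_w


def estimate_z(n_runs, betas, rng):
    """Average of the importance weights estimates Z_target / Z_start."""
    weights = [math.exp(ais_run(betas, rng)[1]) for _ in range(n_runs)]
    return sum(weights) / n_runs


rng = random.Random(0)
betas = [j / 100 for j in range(101)]
print(estimate_z(500, betas, rng), 0.5 * math.sqrt(2 * math.pi))
```

Because the starting density is normalized, the average weight estimates the normalizing constant of the target directly. Each run is independent of the others, so the spread of the weights gives a direct indication of how reliable the estimate is.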
The American Statistician | 1998
Robert E. Kass; Bradley P. Carlin; Andrew Gelman; Radford M. Neal
Markov chain Monte Carlo (MCMC) methods make possible the use of flexible Bayesian models that would otherwise be computationally infeasible. In recent years, a great variety of such applications have been described in the literature. Applied statisticians who are new to these methods may have several questions and concerns, however: How much effort and expertise are needed to design and use a Markov chain sampler? How much confidence can one have in the answers that MCMC produces? How does the use of MCMC affect the rest of the model-building process? At the Joint Statistical Meetings in August, 1996, a panel of experienced MCMC users discussed these and other issues, as well as various “tricks of the trade.” This article is an edited recreation of that discussion. Its purpose is to offer advice and guidance to novice users of MCMC, and to not-so-novice users as well. Topics include building confidence in simulation results, methods for speeding and assessing convergence, estimating standard error...
Artificial Intelligence | 1992
Radford M. Neal
Connectionist learning procedures are presented for “sigmoid” and “noisy-OR” varieties of probabilistic belief networks. These networks have previously been seen primarily as a means of representing knowledge derived from experts. Here it is shown that the “Gibbs sampling” simulation procedure for such networks can support maximum-likelihood learning from empirical data through local gradient ascent. This learning procedure resembles that used for “Boltzmann machines”, and like it, allows the use of “hidden” variables to model correlations between visible variables. Due to the directed nature of the connections in a belief network, however, the “negative phase” of Boltzmann machine learning is unnecessary. Experimental results show that, as a result, learning in a sigmoid belief network can be faster than in a Boltzmann machine. These networks have other advantages over Boltzmann machines in pattern classification and decision making applications, are naturally applicable to unsupervised learning problems, and provide a link between work on connectionist learning and work on the representation of expert knowledge.
Journal of Computational and Graphical Statistics | 2004
Sonia Jain; Radford M. Neal
This article proposes a split-merge Markov chain algorithm to address the problem of inefficient sampling for conjugate Dirichlet process mixture models. Traditional Markov chain Monte Carlo methods for Bayesian mixture models, such as Gibbs sampling, can become trapped in isolated modes corresponding to an inappropriate clustering of data points. This article describes a Metropolis-Hastings procedure that can escape such local modes by splitting or merging mixture components. Our algorithm employs a new technique in which an appropriate proposal for splitting or merging components is obtained by using a restricted Gibbs sampling scan. We demonstrate empirically that our method outperforms the Gibbs sampler in situations where two or more components are similar in structure.
Statistics and Computing | 1996
Radford M. Neal
I present a new Markov chain sampling method appropriate for distributions with isolated modes. Like the recently developed method of ‘simulated tempering’, the ‘tempered transition’ method uses a series of distributions that interpolate between the distribution of interest and a distribution for which sampling is easier. The new method has the advantage that it does not require approximate values for the normalizing constants of these distributions, which are needed for simulated tempering, and can be tedious to estimate. Simulated tempering performs a random walk along the series of distributions used. In contrast, the tempered transitions of the new method move systematically from the desired distribution, to the easily-sampled distribution, and back to the desired distribution. This systematic movement avoids the inefficiency of a random walk, an advantage that is unfortunately cancelled by an increase in the number of interpolating distributions required. Because of this, the sampling efficiency of the tempered transition method in simple problems is similar to that of simulated tempering. On more complex distributions, however, simulated tempering and tempered transitions may perform differently. Which is better depends on the ways in which the interpolating distributions are ‘deceptive’.
Archive | 1992
Radford M. Neal
It is shown that Bayesian inference from data modeled by a mixture distribution can feasibly be performed via Monte Carlo simulation. This method exhibits the true Bayesian predictive distribution, implicitly integrating over the entire underlying parameter space. An infinite number of mixture components can be accommodated without difficulty, using a prior distribution for mixing proportions that selects a reasonable subset of components to explain any finite training set. The need to decide on a “correct” number of components is thereby avoided. The feasibility of the method is shown empirically for a simple classification task.