Publication


Featured research published by Yee Whye Teh.


Neural Computation | 2006

A fast learning algorithm for deep belief nets

Geoffrey E. Hinton; Simon Osindero; Yee Whye Teh

We show how to use complementary priors to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.
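
As a rough illustration of the greedy, layer-wise procedure described above, each layer can be trained as a restricted Boltzmann machine with one step of contrastive divergence (CD-1), and its hidden activations then serve as training data for the next layer. The NumPy sketch below makes assumptions not in the abstract (CD-1 updates, learning rate, toy layer sizes) and is meant only to show the shape of the algorithm, not the paper's exact procedure:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.1, epochs=10, rng=np.random.default_rng(0)):
    """Train one RBM with a single contrastive-divergence step (CD-1)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    bv, bh = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        ph = sigmoid(data @ W + bh)                      # positive phase
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + bv)                       # one Gibbs step down
        ph_neg = sigmoid(pv @ W + bh)                    # and back up
        W += lr * (data.T @ ph - pv.T @ ph_neg) / len(data)
        bv += lr * (data - pv).mean(axis=0)
        bh += lr * (ph - ph_neg).mean(axis=0)
    return W, bv, bh

def greedy_pretrain(data, layer_sizes):
    """Stack RBMs one layer at a time, as in greedy layer-wise learning."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, bv, bh = train_rbm(x, n_hidden)
        layers.append((W, bv, bh))
        x = sigmoid(x @ W + bh)      # hidden activations feed the next layer
    return layers

# toy usage with made-up binary data and illustrative layer sizes
data = (np.random.default_rng(1).random((100, 196)) < 0.1).astype(float)
stack = greedy_pretrain(data, [64, 64, 128])
```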


Journal of the American Statistical Association | 2006

Hierarchical Dirichlet Processes

Yee Whye Teh; Michael I. Jordan; Matthew J. Beal; David M. Blei

We consider problems involving groups of data where each observation within a group is a draw from a mixture model and where it is desirable to share mixture components between groups. We assume that the number of mixture components is unknown a priori and is to be inferred from the data. In this setting it is natural to consider sets of Dirichlet processes, one for each group, where the well-known clustering property of the Dirichlet process provides a nonparametric prior for the number of mixture components within each group. Given our desire to tie the mixture models in the various groups, we consider a hierarchical model, specifically one in which the base measure for the child Dirichlet processes is itself distributed according to a Dirichlet process. Such a base measure being discrete, the child Dirichlet processes necessarily share atoms. Thus, as desired, the mixture models in the different groups necessarily share mixture components. We discuss representations of hierarchical Dirichlet processes in terms of a stick-breaking process, and a generalization of the Chinese restaurant process that we refer to as the “Chinese restaurant franchise.” We present Markov chain Monte Carlo algorithms for posterior inference in hierarchical Dirichlet process mixtures and describe applications to problems in information retrieval and text modeling.
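
Written out, the hierarchical construction described in the abstract is the now-standard HDP mixture specification (with H the base distribution and gamma, alpha_0 the concentration parameters):

```latex
G_0 \mid \gamma, H \;\sim\; \mathrm{DP}(\gamma, H)
G_j \mid \alpha_0, G_0 \;\sim\; \mathrm{DP}(\alpha_0, G_0) \qquad \text{for each group } j
\theta_{ji} \mid G_j \;\sim\; G_j
x_{ji} \mid \theta_{ji} \;\sim\; F(\theta_{ji})
```

Because G_0 is almost surely discrete, the group-level measures G_j place mass on a shared set of atoms, which is exactly how mixture components come to be shared across groups.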


Meeting of the Association for Computational Linguistics | 2006

A Hierarchical Bayesian Language Model Based On Pitman-Yor Processes

Yee Whye Teh

We propose a new hierarchical Bayesian n-gram model of natural languages. Our model makes use of a generalization of the commonly used Dirichlet distributions called Pitman-Yor processes, which produce power-law distributions more closely resembling those in natural languages. We show that an approximation to the hierarchical Pitman-Yor language model recovers the exact formulation of interpolated Kneser-Ney, one of the best smoothing methods for n-gram language models. Experiments verify that our model gives cross-entropy results superior to interpolated Kneser-Ney and comparable to modified Kneser-Ney.
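
For context, the predictive probability under the hierarchical Pitman-Yor model, in its Chinese-restaurant-franchise representation, has roughly the following form (notation paraphrased rather than copied from the paper; d and theta are the discount and strength parameters for contexts of length |u|, and pi(u) is the context u with its earliest word dropped):

```latex
P(w \mid u) \;=\; \frac{c_{uw} - d\, t_{uw}}{\theta + c_{u\cdot}}
\;+\; \frac{\theta + d\, t_{u\cdot}}{\theta + c_{u\cdot}}\; P(w \mid \pi(u))
```

Here c_{uw} counts how often w followed context u and t_{uw} is the number of tables labelled w in the restaurant for context u; constraining t_{uw} = min(1, c_{uw}) yields the interpolated Kneser-Ney estimator, which is the correspondence the abstract refers to.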


Archive | 2010

Bayesian Nonparametrics: Hierarchical Bayesian nonparametric models with applications

Yee Whye Teh; Michael I. Jordan

Hierarchical modeling is a fundamental concept in Bayesian statistics. The basic idea is that parameters are endowed with distributions which may themselves introduce new parameters, and this construction recurses. In this review we discuss the role of hierarchical modeling in Bayesian nonparametrics, focusing on models in which the infinite-dimensional parameters are treated hierarchically. For example, we consider a model in which the base measure for a Dirichlet process is itself treated as a draw from another Dirichlet process. This yields a natural recursion that we refer to as a hierarchical Dirichlet process. We also discuss hierarchies based on the Pitman-Yor process and on completely random measures. We demonstrate the value of these hierarchical constructions in a wide range of practical applications, spanning computational biology, computer vision, and natural language processing.


International Conference on Machine Learning | 2008

Beam sampling for the infinite hidden Markov model

Jurgen Van Gael; Yunus Saatci; Yee Whye Teh; Zoubin Ghahramani

The infinite hidden Markov model is a non-parametric extension of the widely used hidden Markov model. Our paper introduces a new inference algorithm for the infinite hidden Markov model, called beam sampling. Beam sampling combines slice sampling, which limits the number of states considered at each time step to a finite number, with dynamic programming, which samples whole state trajectories efficiently. Our algorithm typically outperforms the Gibbs sampler and is more robust. We present applications of iHMM inference using the beam sampler on changepoint detection and text prediction problems.
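
A simplified sketch of one sweep of the sampler, assuming the transition matrix pi and emission matrix emis have already been instantiated (in the actual iHMM the represented state set grows dynamically, which this toy version ignores; all variable names are placeholders):

```python
import numpy as np

def beam_sweep(pi, emis, obs, states, rng=np.random.default_rng(0)):
    """One beam-sampling sweep. pi[i, j]: transition probabilities (row 0 is
    treated here as the initial-state distribution), emis[i, y]: emission
    probabilities, obs: observed symbols, states: current state sequence."""
    T, K = len(obs), pi.shape[0]
    # slice variables cut the reachable transitions down to a finite set
    u = np.array([rng.uniform(0, pi[states[t - 1] if t > 0 else 0, states[t]])
                  for t in range(T)])
    # forward filtering restricted to transitions that clear the slice
    alpha = np.zeros((T, K))
    alpha[0] = (pi[0] > u[0]) * emis[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        reachable = pi > u[t]                       # K x K boolean mask
        alpha[t] = emis[:, obs[t]] * (reachable.T @ alpha[t - 1])
        alpha[t] /= alpha[t].sum()
    # backward sampling of an entire state trajectory
    new_states = np.empty(T, dtype=int)
    new_states[-1] = rng.choice(K, p=alpha[-1])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * (pi[:, new_states[t + 1]] > u[t + 1])
        new_states[t] = rng.choice(K, p=w / w.sum())
    return new_states

# toy usage: 3 states, 2 symbols
rng = np.random.default_rng(1)
pi = np.array([[.6, .3, .1], [.2, .5, .3], [.3, .3, .4]])
emis = np.array([[.9, .1], [.5, .5], [.1, .9]])
obs = np.array([0, 0, 1, 1, 0])
states = np.array([0, 0, 2, 2, 0])
print(beam_sweep(pi, emis, obs, states, rng))
```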


International Conference on Machine Learning | 2009

A stochastic memoizer for sequence data

Frank D. Wood; Cédric Archambeau; Jan Gasthaus; Lancelot F. James; Yee Whye Teh

We propose an unbounded-depth, hierarchical, Bayesian nonparametric model for discrete sequence data. This model can be estimated from a single training sequence, yet shares statistical strength between subsequent symbol predictive distributions in such a way that predictive performance generalizes well. The model builds on a specific parameterization of an unbounded-depth hierarchical Pitman-Yor process. We introduce analytic marginalization steps (using coagulation operators) to reduce this model to one that can be represented in time and space linear in the length of the training sequence. We show how to perform inference in such a model without truncation approximation and introduce fragmentation operators necessary to do predictive inference. We demonstrate the sequence memoizer by using it as a language model, achieving state-of-the-art results.
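
As a loose illustration of why the reduced representation can be linear in the sequence length (this is not the paper's data structure or its coagulation argument, just a toy showing the effect of collapsing non-branching context chains):

```python
def suffix_trie(seq):
    """Naive trie of all suffixes; the node count grows roughly quadratically."""
    root = {}
    for i in range(len(seq)):
        node = root
        for ch in seq[i:]:
            node = node.setdefault(ch, {})
    return root

def compress(node):
    """Collapse non-branching chains into single edges (a path-compressed,
    suffix-tree-like structure); the node count becomes linear in len(seq)."""
    new = {}
    for label, child in node.items():
        while len(child) == 1:                 # follow a unary chain
            (nxt, grandchild), = child.items()
            label, child = label + nxt, grandchild
        new[label] = compress(child)
    return new

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.values())

s = "abracadabra"
t = suffix_trie(s)
print(count_nodes(t), count_nodes(compress(t)))   # many nodes vs. O(len(s))
```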


Statistical Science | 2013

MCMC for Normalized Random Measure Mixture Models

Stefano Favaro; Yee Whye Teh

This paper concerns the use of Markov chain Monte Carlo methods for posterior sampling in Bayesian nonparametric mixture models with normalized random measure priors. Making use of some recent posterior characterizations for the class of normalized random measures, we propose novel Markov chain Monte Carlo methods of both marginal type and conditional type. The proposed marginal samplers are generalizations of Neal's well-regarded Algorithm 8 for Dirichlet process mixture models, whereas the conditional sampler is a variation of those recently introduced in the literature. For both the marginal and conditional methods, we consider as a running example a mixture model with an underlying normalized generalized Gamma process prior, and describe comparative simulation results demonstrating the efficacies of the proposed methods.
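
For reference, a normalized random measure mixture can be summarized as follows, with mu a completely random measure with Levy intensity nu; the generalized Gamma parameterization shown here follows a common convention and may differ in constants from the one used in the paper:

```latex
P(\cdot) = \frac{\mu(\cdot)}{\mu(\Theta)}, \qquad
\theta_i \mid P \overset{\mathrm{iid}}{\sim} P, \qquad
y_i \mid \theta_i \sim F(\cdot \mid \theta_i)

% generalized Gamma process (one common parameterization of the Levy intensity):
\nu(ds, d\theta) = \frac{a}{\Gamma(1-\sigma)}\, s^{-1-\sigma} e^{-\tau s}\, ds\; \alpha(d\theta),
\qquad \sigma \in (0,1), \ \tau \ge 0
```

Taking sigma to 0 recovers the Gamma process, whose normalization is the Dirichlet process, which is why the Dirichlet process mixture appears as a special case in this framework.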


Artificial Intelligence | 2003

Approximate inference in Boltzmann machines

Max Welling; Yee Whye Teh

Inference in Boltzmann machines is NP-hard in general. As a result, approximations are often necessary. We discuss first-order mean field and second-order Onsager truncations of the Plefka expansion of the Gibbs free energy. The Bethe free energy is introduced and rewritten as a Gibbs free energy. From there a convergent belief optimization algorithm is derived to minimize the Bethe free energy. An analytic expression for the linear response estimate of the covariances is found which is exact on Boltzmann trees. Finally, a number of theorems are proven concerning the Plefka expansion, relating the first-order mean field and the second-order Onsager approximation to the Bethe approximation. Experiments compare the mean field approximation, the Onsager approximation, belief propagation, and belief optimization.
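
To make the first- and second-order truncations concrete: for a Boltzmann machine with plus/minus-one units, weights W_ij, and biases b_i (an assumed parameterization; the paper's units and scaling may differ), the mean-field magnetizations m_i satisfy the fixed-point equations below, the second adding the Onsager correction term:

```latex
m_i = \tanh\Big( b_i + \sum_j W_{ij}\, m_j \Big)

m_i = \tanh\Big( b_i + \sum_j W_{ij}\, m_j \;-\; m_i \sum_j W_{ij}^2 \big(1 - m_j^2\big) \Big)
```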


Meeting of the Association for Computational Linguistics | 2007

NUS-ML: Improving Word Sense Disambiguation Using Topic Features

Jun Fu Cai; Wee Sun Lee; Yee Whye Teh

We participated in the SemEval-1 English coarse-grained all-words task (task 7), English fine-grained all-words task (task 17, subtask 3), and English coarse-grained lexical sample task (task 17, subtask 1). The same method with different labeled data is used for the tasks; SemCor is the labeled corpus used to train our system for the all-words tasks, while the labeled corpus that is provided is used for the lexical sample task. The knowledge sources include part-of-speech of neighboring words, single words in the surrounding context, local collocations, and syntactic patterns. In addition, we constructed a topic feature, targeted to capture the global context information, using the latent Dirichlet allocation (LDA) algorithm with an unlabeled corpus. A modified naive Bayes classifier is constructed to incorporate all the features. We achieved 81.6%, 57.6%, and 88.7% on the coarse-grained all-words, fine-grained all-words, and coarse-grained lexical sample tasks, respectively.
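
A rough sketch of the topic-feature idea, using scikit-learn stand-ins rather than the authors' own modified naive Bayes (the corpora, feature choices, and variable names here are illustrative placeholders):

```python
from scipy.sparse import hstack, csr_matrix
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# placeholder corpora: unlabeled text for the topic model, labeled contexts for WSD
unlabeled_docs = ["the bank raised interest rates", "the river bank was muddy"]
train_contexts = ["deposit money at the bank", "fish near the bank of the river"]
train_senses = ["bank_finance", "bank_river"]

vec = CountVectorizer()
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(vec.fit_transform(unlabeled_docs))       # topic model from unlabeled text

def featurize(texts):
    """Local bag-of-words features plus global LDA topic-proportion features."""
    local = vec.transform(texts)
    topics = csr_matrix(lda.transform(local))    # document-topic proportions
    return hstack([local, topics])

clf = MultinomialNB().fit(featurize(train_contexts), train_senses)
print(clf.predict(featurize(["open a bank account"])))
```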


Communications of the ACM | 2011

The sequence memoizer

Frank D. Wood; Jan Gasthaus; Cédric Archambeau; Lancelot F. James; Yee Whye Teh

Probabilistic models of sequences play a central role in most machine translation, automated speech recognition, lossless compression, spell-checking, and gene identification applications, to name but a few. Unfortunately, real-world sequence data often exhibit long-range dependencies which can only be captured by computationally challenging, complex models. Sequence data arising from natural processes also often exhibit power-law properties, yet common sequence models do not capture such properties. The sequence memoizer is a new hierarchical Bayesian model for discrete sequence data that captures long-range dependencies and power-law characteristics while remaining computationally attractive. Its utility as a language model and general-purpose lossless compressor is demonstrated.

Collaboration


Dive into Yee Whye Teh's collaborations.

Top Co-Authors

Max Welling

University of Amsterdam

Vinayak Rao

University College London

Wee Sun Lee

National University of Singapore
