
Publication


Featured research published by Steven J. Rennie.


Computer Speech & Language | 2010

Super-human multi-talker speech recognition: A graphical modeling approach

John R. Hershey; Steven J. Rennie; Peder A. Olsen; Trausti Kristjansson

We present a system that can separate and recognize the simultaneous speech of two people recorded in a single channel. Applied to the monaural speech separation and recognition challenge, the system outperformed all other participants, including human listeners, with an overall recognition error rate of 21.6%, compared to the human error rate of 22.3%. The system consists of a speaker recognizer, a model-based speech separation module, and a speech recognizer. For the separation models we explored a range of speech models that incorporate different levels of constraints on temporal dynamics to help infer the source speech signals. The system achieves its best performance when the model of temporal dynamics closely captures the grammatical constraints of the task. For inference, we compare a 2-D Viterbi algorithm and two loopy belief-propagation algorithms. We show how belief-propagation reduces the complexity of temporal inference from exponential to linear in the number of sources and the size of the language model. The best belief-propagation method results in nearly the same recognition error rate as exact inference.


Computer Vision and Pattern Recognition | 2017

Self-Critical Sequence Training for Image Captioning

Steven J. Rennie; Etienne Marcheret; Youssef Mroueh; Jarret Ross; Vaibhava Goel

Recently it has been shown that policy-gradient methods for reinforcement learning can be utilized to train deep end-to-end systems directly on non-differentiable metrics for the task at hand. In this paper we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using the test metrics of the MSCOCO task, significant gains in performance can be realized. Our systems are built using a new optimization approach that we call self-critical sequence training (SCST). SCST is a form of the popular REINFORCE algorithm that, rather than estimating a baseline to normalize the rewards and reduce variance, utilizes the output of its own test-time inference algorithm to normalize the rewards it experiences. Using this approach, estimating the reward signal (as actor-critic methods must do) and estimating normalization (as REINFORCE algorithms typically do) are both avoided, while at the same time harmonizing the model with respect to its test-time inference procedure. Empirically we find that directly optimizing the CIDEr metric with SCST and greedy decoding at test-time is highly effective. Our results on the MSCOCO evaluation server establish a new state-of-the-art on the task, improving the best result in terms of CIDEr from 104.9 to 114.7.
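The core of SCST fits in a few lines: the reward of the greedily decoded sequence serves as the baseline for the sampled sequence. The state-free toy policy and counting reward below are illustrative stand-ins for a captioning decoder and the CIDEr metric, not the paper's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def scst_gradient(logits, reward_fn, seq_len=3):
    """One SCST update for a toy, state-free policy.

    logits: (V,) unnormalized scores shared across time steps (a
    hypothetical simplification of a captioning decoder).
    Returns the gradient of the SCST loss w.r.t. the logits.
    """
    p = softmax(logits)
    # Sample a sequence from the current policy.
    sampled = [int(rng.choice(len(p), p=p)) for _ in range(seq_len)]
    # The test-time (greedy) decode supplies the reward baseline.
    greedy = [int(np.argmax(p))] * seq_len
    advantage = reward_fn(sampled) - reward_fn(greedy)
    # REINFORCE: gradient of -advantage * sum_t log p(w_t).
    grad = np.zeros_like(logits)
    for w in sampled:
        g = p.copy()          # d(-log p_w)/d logits = p - onehot(w)
        g[w] -= 1.0
        grad += advantage * g
    return grad, advantage

# Toy reward: how many times token 2 appears in the sequence.
grad, adv = scst_gradient(np.array([0.1, 0.2, 0.3]), lambda s: s.count(2))
```

Sequences that beat the greedy decode get a positive advantage and are pushed up; sequences that fall short are pushed down, with no learned critic or running baseline.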


IEEE Signal Processing Magazine | 2010

Single-Channel Multitalker Speech Recognition

Steven J. Rennie; John R. Hershey; Peder A. Olsen

We have described some of the problems with modeling mixed acoustic signals in the log spectral domain using graphical models, as well as some current approaches to handling these problems for multitalker speech separation and recognition. We have also reviewed methods for inference on FHMMs (factorial hidden Markov model) and methods for handling the nonlinear interaction function in the log spectral domain. These methods are capable of separating and recognizing speech better than human listeners on the SSC task.


Spoken Language Technology Workshop | 2014

Annealed dropout training of deep networks

Steven J. Rennie; Vaibhava Goel; Samuel Thomas

Recently it has been shown that when training neural networks on a limited amount of data, randomly zeroing, or “dropping out”, a fixed percentage of the outputs of a given layer for each training case can improve test set performance significantly. Dropout training discourages the detectors in the network from co-adapting, which limits the capacity of the network and prevents overfitting. In this paper we show that annealing the dropout rate from a high initial value to zero over the course of training can substantially improve the quality of the resulting model. As dropout (approximately) implements model aggregation over an exponential number of networks, this procedure effectively initializes the ensemble of models that will be learned during a given iteration of training with an ensemble of models that has a lower average number of neurons per network, and higher variance in the number of neurons per network, which regularizes the structure of the final model toward models that avoid unnecessary co-adaptation between neurons. Importantly, this regularization procedure is stochastic, and so promotes the learning of “balanced” networks with neurons that have high average entropy, and low variance in their entropy, by smoothly transitioning from “exploration” with high learning rates to “fine tuning” with full support for co-adaptation between neurons where necessary. Experimental results demonstrate that annealed dropout leads to significant reductions in word error rate over standard dropout training.
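A linear schedule is one simple way to realize the annealing described above (the schedule form here is an assumption; the paper does not prescribe this particular one), paired with standard inverted dropout:

```python
import numpy as np

def annealed_dropout_rate(step, total_steps, initial_rate=0.5):
    # Linearly anneal the dropout rate from initial_rate down to zero
    # over the course of training.
    frac = min(step / max(total_steps, 1), 1.0)
    return initial_rate * (1.0 - frac)

def dropout(x, rate, rng):
    # Inverted dropout: zero a `rate` fraction of activations and
    # rescale the survivors so expected activation is unchanged.
    if rate <= 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)
```

Early in training the network behaves like a high-variance ensemble of thin subnetworks; by the end, dropout is off and all neurons may co-adapt where useful.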


International Conference on Acoustics, Speech, and Signal Processing | 2006

Dynamic Noise Adaptation

Steven J. Rennie; Trausti T. Kristjansson; Peder A. Olsen; Ramesh A. Gopinath

We consider the problem of robust speech recognition in the car environment. We present a new dynamic noise adaptation algorithm, called DNA, for the robust front-end compensation of evolving semi-stationary noise as typically encountered in the car setting. A large dataset of in-car noise was collected for the evaluation of the new algorithm. This dataset was combined with the Aurora II framework to produce a new, publicly available framework, called DNA + AURORA II, for the evaluation of adaptive noise compensation algorithms. We show that DNA consistently outperforms several existing, related state-of-the-art front-end denoising techniques.
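DNA infers the noise jointly with a speech model; as a much-simplified illustration of the underlying idea of tracking semi-stationary noise with evolving dynamics, a random-walk Kalman filter over a (noise-dominated) log-spectral level can be sketched as follows. This is a stand-in for the concept, not the DNA algorithm:

```python
import numpy as np

def track_noise(obs, q=0.01, r=0.5):
    """Random-walk Kalman filter over a scalar log-spectral noise level.

    obs: per-frame observations assumed noise-dominated (an assumption
    this sketch makes; DNA itself handles speech-dominated frames via
    joint inference). q: drift variance per frame, r: observation variance.
    """
    mean, var = float(obs[0]), r
    means = [mean]
    for y in obs[1:]:
        var += q                    # predict: noise drifts slowly
        k = var / (var + r)         # Kalman gain
        mean += k * (y - mean)      # correct with the new frame
        var *= (1.0 - k)
        means.append(mean)
    return np.array(means)
```

The small drift variance `q` is what lets the estimate follow slowly evolving (semi-stationary) noise instead of assuming it fixed.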


International Conference on Acoustics, Speech, and Signal Processing | 2008

Efficient model-based speech separation and denoising using non-negative subspace analysis

Steven J. Rennie; John R. Hershey; Peder A. Olsen

We present a new probabilistic architecture for analyzing composite non-negative data, called Non-negative Subspace Analysis (NSA). The NSA model provides a framework for understanding the relationships between sparse subspace and mixture model based approaches, and encompasses a range of models, including Sparse Non-negative Matrix Factorization (SNMF) [1] and mixture-model based analysis as special cases. We present a convenient instantiation of the NSA model, and an efficient variational approximate learning and inference algorithm that combines the advantages of SNMF and mixture model-based approaches. Preliminary recognition results on the Pascal Speech Separation Challenge 2006 test set [2], based on NSA separation results, are presented. The results fall short of those achieved by Algonquin [3], a state-of-the-art mixture-model based method, but considering that NSA runs an order of magnitude faster, the results are impressive. NSA outperforms SNMF in terms of word error rate (WER) on the task by a significant margin of over 9% absolute.
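For reference, the SNMF baseline that NSA generalizes can be sketched with standard multiplicative updates plus an ℓ1 penalty on the activations. This is the baseline method only, not the NSA model or its variational inference, and the normalization choice below is one common convention:

```python
import numpy as np

def sparse_nmf(V, rank, n_iter=200, lam=0.1, seed=0):
    """Sparse NMF via multiplicative updates: minimize
    ||V - W H||^2 + lam * ||H||_1 with W, H >= 0."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + 1e-3
    H = rng.random((rank, T)) + 1e-3
    for _ in range(n_iter):
        # L1 penalty enters the activation update's denominator.
        H *= (W.T @ V) / (W.T @ W @ H + lam + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
        # Normalize basis columns so the sparsity penalty on H is
        # meaningful (resolves the W/H scale ambiguity).
        W /= W.sum(axis=0, keepdims=True)
    return W, H
```

Each source is modeled as a sparse combination of non-negative basis spectra; mixture-model approaches instead commit each frame to a single state, which is the gap NSA bridges.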


International Conference on Acoustics, Speech, and Signal Processing | 2009

Single-channel speech separation and recognition using loopy belief propagation

Steven J. Rennie; John R. Hershey; Peder A. Olsen

We address the problem of single-channel speech separation and recognition using loopy belief propagation in a way that enables efficient inference for an arbitrary number of speech sources. The graphical model consists of a set of N Markov chains, each of which represents a language model or grammar for a given speaker. A Gaussian mixture model with shared states is used to model the hidden acoustic signal for each grammar state of each source. The combination of sources is modeled in the log spectrum domain using non-linear interaction functions. Previously, temporal inference in such a model has been performed using an N-dimensional Viterbi algorithm that scales exponentially with the number of sources. In this paper, we describe a loopy message passing algorithm that scales linearly with language model size. The algorithm achieves human levels of performance, and is an order of magnitude faster than competitive systems for two speakers.
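The linear-in-sources scaling can be illustrated with a coordinate-wise scheme: re-decode one chain at a time with the other chains held fixed, under a max interaction in the log spectrum. This is a simplified stand-in for the paper's max-product message passing, not the algorithm itself, and the Gaussian observation model below is a toy assumption:

```python
import numpy as np

def viterbi(logA, loglik):
    # Standard max-product decode for a single Markov chain.
    T, K = loglik.shape
    delta = loglik[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA           # (K, K)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + loglik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

def separate(y, mus, logA, n_sweeps=3, var=1.0):
    """Coordinate-wise decoding of N >= 2 sources under a max interaction.

    y: (T, F) mixed log spectrum; mus: list of (K, F) state mean spectra.
    Each sweep re-decodes one chain with the others fixed, so the cost
    per sweep is linear in N rather than exponential.
    """
    T, F = y.shape
    N = len(mus)
    states = [np.zeros(T, dtype=int) for _ in range(N)]
    for _ in range(n_sweeps):
        for n in range(N):
            # Fixed "background": elementwise max of the other sources.
            others = np.max(
                [mus[m][states[m]] for m in range(N) if m != n], axis=0)
            # Likelihood of each state k of chain n under max interaction.
            pred = np.maximum(mus[n][None, :, :], others[:, None, :])
            loglik = -0.5 * ((y[:, None, :] - pred) ** 2).sum(-1) / var
            states[n] = np.array(viterbi(logA, loglik))
    return states
```

A joint Viterbi over the product state space would cost O(K^N) per frame; the sweep above costs O(N K) likelihood evaluations per frame, which is the scaling the paper's message-passing algorithm achieves.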


IEEE Automatic Speech Recognition and Understanding Workshop | 2011

Sparse Maximum A Posteriori adaptation

Peder A. Olsen; Jing Huang; Vaibhava Goel; Steven J. Rennie

Maximum A Posteriori (MAP) adaptation is a powerful tool for building speaker specific acoustic models. Modern speech applications utilize acoustic models with millions of parameters, and serve millions of users. Storing an acoustic model for each user in such settings is costly. However, speaker specific acoustic models are generally similar to the acoustic model being adapted. By imposing sparseness constraints, we can save significantly on storage, and even improve the quality of the resulting speaker-dependent model. In this paper we utilize the ℓ1 or ℓ0 norm as a regularizer to induce sparsity. We show that we can obtain up to 95% sparsity with negligible loss in recognition accuracy, with both penalties. By removing small differences, which constitute “adaptation noise”, sparse MAP is actually able to improve upon MAP adaptation. Sparse MAP reduces the MAP word error rate by 2% relative at 89% sparsity.
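One illustrative reading of sparse MAP for Gaussian means: perform standard MAP mean adaptation, then soft-threshold the speaker-dependent differences so that small "adaptation noise" is zeroed and most parameters remain shared. The two-step form below is an assumption for clarity; the paper optimizes the ℓ1/ℓ0-penalized objective directly:

```python
import numpy as np

def sparse_map_means(prior_means, stats_sum, stats_count, tau, lam):
    """MAP mean adaptation with L1 soft-thresholding of the deltas.

    prior_means: (K, D) speaker-independent means
    stats_sum:   (K, D) per-component sums of adaptation frames
    stats_count: (K,)   per-component frame counts
    tau:         MAP relevance factor
    lam:         soft threshold on the adaptation differences
    """
    counts = stats_count[:, None]
    # Classic MAP interpolation between prior mean and sample mean.
    map_means = (tau * prior_means + stats_sum) / (tau + counts)
    delta = map_means - prior_means
    # Zero small differences; only large deltas are stored per speaker.
    delta = np.sign(delta) * np.maximum(np.abs(delta) - lam, 0.0)
    return prior_means + delta
```

Only the surviving non-zero entries of `delta` need to be stored per speaker, which is the source of the storage savings the abstract describes.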


IEEE Automatic Speech Recognition and Understanding Workshop | 2007

Variational Kullback-Leibler divergence for Hidden Markov models

John R. Hershey; Peder A. Olsen; Steven J. Rennie

Divergence measures are widely used tools in statistics and pattern recognition. The Kullback-Leibler (KL) divergence between two hidden Markov models (HMMs) would be particularly useful in the fields of speech and image recognition. Whereas the KL divergence is tractable for many distributions, including Gaussians, it is not in general tractable for mixture models or HMMs. Recently, variational approximations have been introduced to efficiently compute the KL divergence and Bhattacharyya divergence between two mixture models, by reducing them to the divergences between the mixture components. Here we generalize these techniques to approach the divergence between HMMs using a recursive backward algorithm. Two such methods are introduced, one of which yields an upper bound on the KL divergence, the other of which yields a recursive closed-form solution. The KL and Bhattacharyya divergences, as well as a weighted edit-distance technique, are evaluated for the task of predicting the confusability of pairs of words.
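The mixture-level building block, the variational approximation to the KL divergence between two Gaussian mixtures, can be sketched as follows. The HMM extension via a recursive backward algorithm, which is the paper's contribution, is not shown here:

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    # Exact KL divergence between two diagonal Gaussians.
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def kl_var_gmm(w_f, means_f, vars_f, w_g, means_g, vars_g):
    """Variational approximation to KL(f || g) for Gaussian mixtures,
    built from the pairwise component divergences."""
    d = 0.0
    for a, wa in enumerate(w_f):
        num = sum(w_f[ap] * np.exp(-kl_gauss(means_f[a], vars_f[a],
                                             means_f[ap], vars_f[ap]))
                  for ap in range(len(w_f)))
        den = sum(w_g[b] * np.exp(-kl_gauss(means_f[a], vars_f[a],
                                            means_g[b], vars_g[b]))
                  for b in range(len(w_g)))
        d += wa * np.log(num / den)
    return d
```

For single-component mixtures the approximation reduces to the exact Gaussian KL, and it is exactly zero when the two mixtures are identical, two sanity checks worth keeping in mind.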


Spoken Language Technology Workshop | 2014

Deep Order Statistic Networks

Steven J. Rennie; Vaibhava Goel; Samuel Thomas

Recently, Maxout networks have demonstrated state-of-the-art performance on several machine learning tasks, which has fueled aggressive research on Maxout networks and generalizations thereof. In this work, we propose the utilization of order statistics as a generalization of the max non-linearity. A particularly general example of an order-statistic non-linearity is the “sortout” non-linearity, which outputs all input activations, but in sorted order. Such Order-statistic networks (OSNs), in contrast with other recently proposed generalizations of Maxout networks, leave the determination of the interpolation weights on the activations to the network, and remain conditionally linear given the input, and so are well suited for powerful model aggregation techniques such as dropout, drop connect, and annealed dropout. Experimental results demonstrate that the use of order statistics rather than Maxout networks can lead to substantial improvements in the word error rate (WER) performance of automatic speech recognition systems.
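The relationship between maxout and the sortout non-linearity is easy to state in code: maxout keeps only the largest activation in each group, while sortout emits all activations in sorted order and leaves the weighting to later layers. The flat grouping below is a minimal sketch of the idea:

```python
import numpy as np

def maxout(x, group_size):
    # Maxout: keep only the maximum of each group of activations.
    return x.reshape(-1, group_size).max(axis=1)

def sortout(x, group_size):
    # Sortout: output every activation in each group, largest first.
    # Maxout is the special case where later layers weight only the
    # top order statistic; here the network can learn any weighting.
    return np.sort(x.reshape(-1, group_size), axis=1)[:, ::-1].reshape(-1)
```

Because sorting is a permutation, the output remains conditionally linear in the input given the sort order, which is why these units compose well with dropout-style aggregation.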
