Publication


Featured research published by Vaibhava Goel.


Computer Speech & Language | 2000

Minimum Bayes-risk automatic speech recognition

Vaibhava Goel; William Byrne

In this paper we address the application of minimum Bayes-risk classifiers to tasks in automatic speech recognition (ASR). Minimum-risk classifiers are useful because they produce hypotheses that aim to be optimal under a specified task-dependent performance criterion. While the form of the optimal classifier is well known, its implementation is prohibitively expensive. We present efficient approximations that can be used to implement these procedures. In particular, an A* search over word lattices produced by a conventional ASR system is described. This algorithm extends the previously proposed N-best list rescoring approximation to minimum-risk classifiers. We provide experimental results showing that both the A* and N-best list rescoring implementations of minimum-risk classifiers yield better recognition accuracy than the commonly used maximum a posteriori probability (MAP) classifier in word transcription and keyword identification. The A* implementation is compared to the N-best list rescoring implementation and is found to obtain modest but significant improvements in accuracy at little additional computational cost. Another application of minimum-risk classifiers, to the identification of named entities from speech, is presented. Only the N-best list rescoring approximation could be implemented for this task, and it was found to yield better named entity identification performance than the MAP classifier.
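To make the N-best list rescoring approximation concrete, here is a minimal Python sketch of MBR selection over an N-best list under a word-error (Levenshtein) loss. The hypotheses and posterior probabilities are illustrative placeholders, not data from the paper.

```python
# Minimal sketch of minimum Bayes-risk (MBR) rescoring over an N-best list.

def edit_distance(ref, hyp):
    """Levenshtein distance between two word sequences (word error loss)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def mbr_rescore(nbest):
    """nbest: list of (word_sequence, posterior) pairs.
    Returns the hypothesis minimizing the expected word error loss."""
    def expected_loss(cand):
        return sum(p * edit_distance(hyp, cand) for hyp, p in nbest)
    return min((cand for cand, _ in nbest), key=expected_loss)

nbest = [("the cat sat".split(), 0.40),
         ("the cats sat".split(), 0.31),
         ("the cats sit".split(), 0.29)]
print(mbr_rescore(nbest))  # ['the', 'cats', 'sat'] -- differs from the MAP choice
```

Note how the second hypothesis wins under MBR even though the first has the highest posterior: it sits "closer" to the bulk of the posterior mass under the word error loss, which is exactly the behavior that distinguishes the minimum-risk classifier from MAP decoding.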


Computer Vision and Pattern Recognition | 2017

Self-Critical Sequence Training for Image Captioning

Steven J. Rennie; Etienne Marcheret; Youssef Mroueh; Jarret Ross; Vaibhava Goel

Recently it has been shown that policy-gradient methods for reinforcement learning can be utilized to train deep end-to-end systems directly on non-differentiable metrics for the task at hand. In this paper we consider the problem of optimizing image captioning systems using reinforcement learning, and show that by carefully optimizing our systems using the test metrics of the MSCOCO task, significant gains in performance can be realized. Our systems are built using a new optimization approach that we call self-critical sequence training (SCST). SCST is a form of the popular REINFORCE algorithm that, rather than estimating a baseline to normalize the rewards and reduce variance, utilizes the output of its own test-time inference algorithm to normalize the rewards it experiences. Using this approach, estimating the reward signal (as actor-critic methods must do) and estimating normalization (as REINFORCE algorithms typically do) are avoided, while at the same time the model is harmonized with respect to its test-time inference procedure. Empirically we find that directly optimizing the CIDEr metric with SCST and greedy decoding at test time is highly effective. Our results on the MSCOCO evaluation server establish a new state of the art on the task, improving the best result in terms of CIDEr from 104.9 to 114.7.
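A minimal PyTorch sketch of the SCST loss, assuming rewards (e.g. CIDEr scores) for the sampled and greedily decoded captions are computed externally; the tensor names and shapes below are illustrative, not the paper's implementation.

```python
import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Self-critical sequence training loss (sketch).
    sample_logprobs: (batch, T) log-probs of the sampled caption tokens
    sample_reward:   (batch,) reward (e.g. CIDEr) of the sampled captions
    greedy_reward:   (batch,) reward of the greedy test-time captions
    The greedy reward acts as the baseline, so no learned critic is needed."""
    advantage = (sample_reward - greedy_reward).unsqueeze(1)  # (batch, 1)
    # REINFORCE with the model's own greedy output as baseline
    return -(advantage.detach() * sample_logprobs).sum(dim=1).mean()

logp = torch.randn(4, 12, requires_grad=True)  # stands in for model log-probs
loss = scst_loss(logp, torch.tensor([1.1, 0.8, 0.9, 1.2]), torch.ones(4))
loss.backward()  # pushes up samples that beat the greedy baseline, down otherwise
```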


IEEE Transactions on Speech and Audio Processing | 2004

Segmental minimum Bayes-risk decoding for automatic speech recognition

Vaibhava Goel; Shankar Kumar; William Byrne

Minimum Bayes-risk (MBR) speech recognizers have been shown to yield improvements over conventional maximum a posteriori probability (MAP) decoders through N-best list rescoring and A* search over word lattices. We present a segmental minimum Bayes-risk decoding (SMBR) framework that simplifies the implementation of MBR recognizers through the segmentation of the N-best lists or lattices over which the recognition is to be performed. This paper presents the lattice cutting procedures that underlie SMBR decoding. Two of these procedures are based on a risk minimization criterion, while a third is guided by word-level confidence scores. In conjunction with SMBR decoding, these lattice segmentation procedures give consistent improvements in recognition word error rate (WER) on the Switchboard corpus. We also discuss an application of risk-based lattice cutting to multiple-system SMBR decoding and show that it is related to other system combination techniques such as ROVER. This strategy combines lattices produced by multiple ASR systems and is found to give WER improvements in a Switchboard evaluation system.
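The core idea, running MBR independently within each segment and concatenating the segment-level decisions, can be sketched as follows. The segment structure, posteriors, and simple Hamming-style loss are illustrative stand-ins for the paper's risk-based lattice cutting.

```python
# Sketch of segmental MBR: after lattice cutting, each segment carries its own
# small set of competing word-sequence alternatives with posteriors, and MBR
# is applied independently per segment.

def segment_mbr(segments, loss):
    """segments: list of segments; each segment is a list of (hyp, posterior)
    pairs, where hyp is a tuple of words. Returns the concatenated output."""
    output = []
    for seg in segments:
        def expected_loss(cand):
            return sum(p * loss(hyp, cand) for hyp, p in seg)
        output.extend(min((cand for cand, _ in seg), key=expected_loss))
    return output

# crude word-level loss: mismatches plus length difference
hamming = lambda a, b: sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

segments = [[(("the",), 0.6), (("a",), 0.4)],
            [(("cat", "sat"), 0.7), (("cats", "at"), 0.3)]]
print(segment_mbr(segments, hamming))  # ['the', 'cat', 'sat']
```

Because each segment's alternative set is small, the per-segment minimization is cheap, which is what makes the segmental framework simpler than full-sequence MBR search.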


IEEE Transactions on Audio, Speech, and Language Processing | 2015

Data augmentation for deep neural network acoustic modeling

Xiaodong Cui; Vaibhava Goel; Brian Kingsbury

This paper investigates data augmentation for deep neural network acoustic modeling based on label-preserving transformations to deal with data sparsity. Two data augmentation approaches, vocal tract length perturbation (VTLP) and stochastic feature mapping (SFM), are investigated for both deep neural networks (DNNs) and convolutional neural networks (CNNs). The approaches are focused on increasing speaker and speech variations of the limited training data such that the acoustic models trained with the augmented data are more robust to such variations. In addition, a two-stage data augmentation scheme based on a stacked architecture is proposed to combine VTLP and SFM as complementary approaches. Experiments are conducted on Assamese and Haitian Creole, two development languages of the IARPA Babel program, and improved performance on automatic speech recognition (ASR) and keyword search (KWS) is reported.
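A rough sketch of VTLP-style augmentation, assuming a piecewise-linear warp of the frequency axis of a spectrogram; the paper applies warping inside the filterbank computation, and the bin counts and warp-factor range below are illustrative assumptions.

```python
import numpy as np

def vtlp_warp(spectrogram, alpha, f_hi=0.8):
    """Sketch of vocal tract length perturbation on a (freq_bins x frames)
    spectrogram. alpha in roughly [0.9, 1.1] is the warp factor; a
    piecewise-linear warp keeps the band edges fixed (simplified here)."""
    n_bins = spectrogram.shape[0]
    freqs = np.arange(n_bins)
    boundary = f_hi * n_bins * min(alpha, 1.0) / alpha
    warped = np.where(
        freqs <= boundary,
        freqs * alpha,                                   # scaled low band
        (n_bins - 1) - ((n_bins - 1) - boundary * alpha) # linear map of the
        * ((n_bins - 1) - freqs) / ((n_bins - 1) - boundary),  # high band
    )
    # sample each frame's spectrum at the warped frequency positions
    out = np.empty_like(spectrogram)
    for t in range(spectrogram.shape[1]):
        out[:, t] = np.interp(warped, freqs, spectrogram[:, t])
    return out

augmented = vtlp_warp(np.random.rand(128, 300),
                      alpha=np.random.uniform(0.9, 1.1))
```

Drawing a fresh alpha per utterance (or per speaker) yields label-preserving variants of the same training data, which is the point of the augmentation.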


International Conference on Acoustics, Speech, and Signal Processing | 2015

Deep multimodal learning for Audio-Visual Speech Recognition

Youssef Mroueh; Etienne Marcheret; Vaibhava Goel

In this paper, we present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). First, we study an approach where uni-modal deep networks are trained separately and their final hidden layers are fused to obtain a joint feature space in which another deep network is built. While the audio network alone achieves a phone error rate (PER) of 41% under clean conditions on the IBM large-vocabulary audio-visual studio dataset, this fusion model achieves a PER of 35.83%, demonstrating the tremendous value of the visual channel in phone classification even for audio with a high signal-to-noise ratio. Second, we present a new deep network architecture that uses a bilinear softmax layer to account for class-specific correlations between modalities. We show that combining the posteriors from the bilinear networks with those from the fused model mentioned above results in a further significant phone error rate reduction, yielding a final PER of 34.03%.
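A minimal PyTorch sketch of the feature-fusion architecture described above: separately built uni-modal networks whose final hidden activations are concatenated and fed to a joint network. Layer sizes, activations, and the number of phone classes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Sketch of feature fusion for AV-ASR: uni-modal networks produce final
    hidden layers that are concatenated into a joint feature space on which
    another network is built. All dimensions here are illustrative."""
    def __init__(self, audio_dim=360, video_dim=1024, hidden=1024, n_phones=42):
        super().__init__()
        self.audio = nn.Sequential(nn.Linear(audio_dim, hidden), nn.Sigmoid(),
                                   nn.Linear(hidden, hidden), nn.Sigmoid())
        self.video = nn.Sequential(nn.Linear(video_dim, hidden), nn.Sigmoid(),
                                   nn.Linear(hidden, hidden), nn.Sigmoid())
        self.joint = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Sigmoid(),
                                   nn.Linear(hidden, n_phones))

    def forward(self, audio_feats, video_feats):
        # concatenate the final hidden layers of the two uni-modal networks
        h = torch.cat([self.audio(audio_feats), self.video(video_feats)], dim=1)
        return self.joint(h)  # phone-class logits

logits = FusionNet()(torch.randn(8, 360), torch.randn(8, 1024))
```

In the paper the uni-modal networks are trained first and the joint network is trained on top of their fused features; the sketch only shows the resulting forward topology.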


IEEE Transactions on Speech and Audio Processing | 2005

Subspace constrained Gaussian mixture models for speech recognition

Scott Axelrod; Vaibhava Goel; Ramesh A. Gopinath; Peder A. Olsen; Karthik Visweswariah

A standard approach to automatic speech recognition uses hidden Markov models whose state-dependent distributions are Gaussian mixture models. Each Gaussian can be viewed as an exponential model whose features are linear and quadratic monomials in the acoustic vector. We consider here models in which the weight vectors of these exponential models are constrained to lie in an affine subspace shared by all the Gaussians. This class of models includes Gaussian models with linear constraints placed on the precision (inverse covariance) matrices (such as diagonal covariance, maximum likelihood linear transformation, or extended maximum likelihood linear transformation), as well as the LDA/HLDA models used for feature selection, which tie the part of the Gaussians in the directions not used for discrimination. In this paper, we present algorithms for training these models using a maximum likelihood criterion. We present experiments on both small vocabulary, resource constrained, grammar-based tasks and large vocabulary, unconstrained resource tasks to explore the rather large parameter space of models that fit within our framework. In particular, we demonstrate that significant improvements can be obtained in both word error rate and computational complexity.
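A small sketch of the exponential-model view, assuming a per-Gaussian parameter theta and a shared affine subspace (Lambda, b). It evaluates the unnormalized log-density only, and the random subspace here is purely illustrative; in practice the subspace must keep every precision matrix positive definite.

```python
import numpy as np

def scgmm_loglik(x, theta, Lambda, b, d):
    """Sketch: log-density (up to the normalizer) of one subspace-constrained
    Gaussian. The natural-parameter vector psi = b + Lambda @ theta lies in an
    affine subspace shared by all Gaussians; theta is per-Gaussian. Features
    are the linear and quadratic monomials of the acoustic vector x."""
    psi = b + Lambda @ theta                    # constrained natural parameters
    lin, prec = psi[:d], psi[d:].reshape(d, d)  # (P mu) part and precision P
    # log N(x) = lin . x - 0.5 x' P x + const(psi); P assumed positive definite
    return lin @ x - 0.5 * x @ prec @ x

d, k = 3, 2                                     # acoustic dim, subspace dim
rng = np.random.default_rng(0)
Lambda = rng.standard_normal((d + d * d, k))    # shared subspace basis
b = np.concatenate([np.zeros(d), np.eye(d).ravel()])  # offset: identity precision
print(scgmm_loglik(rng.standard_normal(d), rng.standard_normal(k), Lambda, b, d))
```

Choosing the basis Lambda appropriately recovers the special cases listed in the abstract, e.g. restricting the precision part to diagonal matrices gives diagonal-covariance Gaussians.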


International Conference on Acoustics, Speech, and Signal Processing | 2014

Data augmentation for deep neural network acoustic modeling

Xiaodong Cui; Vaibhava Goel; Brian Kingsbury

Data augmentation using label preserving transformations has been shown to be effective for neural network training to make invariant predictions. In this paper we focus on data augmentation approaches to acoustic modeling using deep neural networks (DNNs) for automatic speech recognition (ASR). We first investigate a modified version of a previously studied approach using vocal tract length perturbation (VTLP) and then propose a novel data augmentation approach based on stochastic feature mapping (SFM) in a speaker adaptive feature space. Experiments were conducted on Bengali and Assamese limited language packs (LLPs) from the IARPA Babel program. Improved recognition performance has been observed after both cross-entropy (CE) and state-level minimum Bayes risk (sMBR) training of DNN models.
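A toy sketch of SFM-style augmentation, assuming speaker-dependent affine feature-space transforms (e.g. fMLLR-like) are available; the transform inventory below is randomly generated for illustration, not estimated in a speaker adaptive feature space as in the paper.

```python
import numpy as np

def stochastic_feature_mapping(feats, speaker_transforms, rng):
    """Sketch of SFM-style augmentation: re-render an utterance's features
    (frames x dim) as a different, randomly chosen "speaker" by applying that
    speaker's affine feature-space transform (A, b)."""
    A, b = speaker_transforms[rng.integers(len(speaker_transforms))]
    return feats @ A.T + b

rng = np.random.default_rng(0)
dim = 40
# stand-in inventory: near-identity affine transforms, one per "speaker"
transforms = [(np.eye(dim) + 0.05 * rng.standard_normal((dim, dim)),
               0.1 * rng.standard_normal(dim)) for _ in range(10)]
augmented = stochastic_feature_mapping(rng.standard_normal((300, dim)),
                                       transforms, rng)
```

The label sequence is unchanged, so each utterance can be replayed under several synthetic speaker identities to enlarge the limited language pack training sets.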


IEEE Automatic Speech Recognition and Understanding Workshop | 1997

Syllable: a promising recognition unit for LVCSR

Aravind Ganapathiraju; Vaibhava Goel; Joseph Picone; Andres Corrada; George R. Doddington; Katrin Kirchhoff; Mark Ordowski; Barbara Wheatley

We present an attempt to model syllable-level acoustic information as a viable alternative to the conventional phone-level acoustic unit for large vocabulary continuous speech recognition. The motivation for this work was the inherent limitations of the phone-based approach, primarily its decompositional nature and lack of larger-scale temporal dependencies. We present preliminary but encouraging results on a syllable-based recognition system that exceeds the performance of a comparable triphone system in terms of both word error rate (WER) and complexity. The WER of the best syllable system reported here was 49.1% on a standard SWITCHBOARD evaluation.


Spoken Language Technology Workshop | 2014

Annealed dropout training of deep networks

Steven J. Rennie; Vaibhava Goel; Samuel Thomas

Recently it has been shown that when training neural networks on a limited amount of data, randomly zeroing, or “dropping out,” a fixed percentage of the outputs of a given layer for each training case can improve test set performance significantly. Dropout training discourages the detectors in the network from co-adapting, which limits the capacity of the network and prevents overfitting. In this paper we show that annealing the dropout rate from a high initial value to zero over the course of training can substantially improve the quality of the resulting model. Since dropout (approximately) implements model aggregation over an exponential number of networks, this procedure effectively initializes the ensemble of models that will be learned during a given iteration of training with an ensemble of models that has a lower average number of neurons per network and higher variance in the number of neurons per network, which regularizes the structure of the final model toward models that avoid unnecessary co-adaptation between neurons. Importantly, this regularization procedure is stochastic, and so promotes the learning of “balanced” networks with neurons that have high average entropy and low variance in their entropy, by smoothly transitioning from “exploration” with high learning rates to “fine-tuning” with full support for co-adaptation between neurons where necessary. Experimental results demonstrate that annealed dropout leads to significant reductions in word error rate over standard dropout training.
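A minimal sketch of one possible annealing schedule, a linear ramp from a high initial dropout rate to zero; the exact schedule used in the paper may differ from this assumption.

```python
def annealed_dropout_rate(epoch, p0=0.5, anneal_epochs=20):
    """Sketch of an annealed dropout schedule: start at a high rate p0 and
    decay linearly to zero over the course of training. The linear ramp and
    the constants are illustrative assumptions."""
    return max(0.0, p0 * (1.0 - epoch / anneal_epochs))

schedule = [annealed_dropout_rate(e) for e in range(25)]
# schedule[0] == 0.5, schedule[10] == 0.25, schedule[20:] are all 0.0;
# the current rate would be fed to the network's dropout layers each epoch
```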


International Conference on Acoustics, Speech, and Signal Processing | 1998

LVCSR rescoring with modified loss functions: a decision theoretic perspective

Vaibhava Goel; William Byrne; Sanjeev Khudanpur

The problem of speech decoding is considered in a decision-theoretic framework, and a modified speech decoding procedure that minimizes the expected risk under a general loss function is formulated. A specific word error rate loss function is considered, and an implementation in an N-best list rescoring procedure is presented. Methods for estimating the parameters of the resulting decision rules are provided for both supervised and unsupervised training. Preliminary experiments on an LVCSR task show small but statistically significant error rate improvements.
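For reference, the expected-risk decision rule that this rescoring procedure approximates can be written as follows (the notation here is ours, not the paper's):

\[
\hat{W} \;=\; \operatorname*{argmin}_{W'} \sum_{W} \ell(W, W')\, P(W \mid A)
\]

where \(A\) is the acoustic observation, \(\ell\) is the task-dependent loss (here a word error rate loss), and in the N-best implementation the sum and the minimization both run over the hypotheses in the N-best list.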
