
Publication


Featured research published by Tara N. Sainath.


IEEE Signal Processing Magazine | 2012

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

Geoffrey E. Hinton; Li Deng; Dong Yu; George E. Dahl; Abdel-rahman Mohamed; Navdeep Jaitly; Andrew W. Senior; Vincent Vanhoucke; Patrick Nguyen; Tara N. Sainath; Brian Kingsbury

Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
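
The hybrid setup described here replaces the GMM likelihoods with network outputs: a feed-forward net reads a spliced window of acoustic frames and emits posterior probabilities over the tied HMM states. The short NumPy sketch below shows that forward pass; the layer sizes are illustrative and the random weights merely stand in for a trained model.

import numpy as np

rng = np.random.default_rng(0)

def relu(x):  # hidden-layer non-linearity (sigmoids were also common in 2012)
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_frames, feat_dim = 11, 40          # +/-5 frames of context, 40-dim features
hidden, n_states = 1024, 3000        # hidden width and number of tied HMM states

# Randomly initialised weights stand in for a trained acoustic model.
W1 = rng.normal(0, 0.01, (n_frames * feat_dim, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.01, (hidden, hidden));              b2 = np.zeros(hidden)
W3 = rng.normal(0, 0.01, (hidden, n_states));            b3 = np.zeros(n_states)

def dnn_state_posteriors(frame_window):
    """frame_window: (n_frames, feat_dim) -> posteriors over HMM states."""
    x = frame_window.reshape(-1)                 # splice the context window
    h = relu(relu(x @ W1 + b1) @ W2 + b2)
    return softmax(h @ W3 + b3)

window = rng.normal(size=(n_frames, feat_dim))   # a stand-in spliced input
p = dnn_state_posteriors(window)
print(p.shape, p.sum())                          # (3000,) and a sum of ~1.0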


International Conference on Acoustics, Speech, and Signal Processing | 2013

Improving deep neural networks for LVCSR using rectified linear units and dropout

George E. Dahl; Tara N. Sainath; Geoffrey E. Hinton

Recently, pre-trained deep neural networks (DNNs) have outperformed traditional acoustic models based on Gaussian mixture models (GMMs) on a variety of large vocabulary speech recognition benchmarks. Deep neural nets have also achieved excellent results on various computer vision tasks using a random “dropout” procedure that drastically improves generalization error by randomly omitting a fraction of the hidden units in all layers. Since dropout helps avoid over-fitting, it has also been successful on a small-scale phone recognition task using larger neural nets. However, training deep neural net acoustic models for large vocabulary speech recognition takes a very long time and dropout is likely to only increase training time. Neural networks with rectified linear unit (ReLU) non-linearities have been highly successful for computer vision tasks and proved faster to train than standard sigmoid units, sometimes also improving discriminative performance. In this work, we show on a 50-hour English Broadcast News task that modified deep neural networks using ReLUs trained with dropout during frame level training provide a 4.2% relative improvement over a DNN trained with sigmoid units, and a 14.4% relative improvement over a strong GMM/HMM system. We were able to obtain our results with minimal human hyper-parameter tuning using publicly available Bayesian optimization code.
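
The recipe the abstract describes combines rectified-linear hidden units with dropout, which randomly omits a fraction of the hidden units during frame-level training. The fragment below is a minimal illustration of one such layer; the 0.5 keep probability and the layer sizes are assumptions for the example, not values from the paper.

import numpy as np

rng = np.random.default_rng(1)

def relu_dropout_layer(x, W, b, keep_prob=0.5, training=True):
    h = np.maximum(0.0, x @ W + b)               # rectified linear units
    if training:
        mask = rng.random(h.shape) < keep_prob   # randomly omit hidden units
        h = h * mask / keep_prob                 # "inverted" dropout scaling
    return h

x = rng.normal(size=(8, 440))                    # a batch of spliced frames
W = rng.normal(0, 0.01, (440, 1024)); b = np.zeros(1024)
h_train = relu_dropout_layer(x, W, b, training=True)    # noisy, for training
h_test = relu_dropout_layer(x, W, b, training=False)    # deterministic, for test
print(h_train.shape, h_test.shape)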


International Conference on Acoustics, Speech, and Signal Processing | 2013

Deep convolutional neural networks for LVCSR

Tara N. Sainath; Abdel-rahman Mohamed; Brian Kingsbury; Bhuvana Ramabhadran

Convolutional Neural Networks (CNNs) are an alternative type of neural network that can be used to reduce spectral variations and model spectral correlations which exist in signals. Since speech signals exhibit both of these properties, CNNs are a more effective model for speech compared to Deep Neural Networks (DNNs). In this paper, we explore applying CNNs to large vocabulary speech tasks. First, we determine the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks. Specifically, we focus on how many convolutional layers are needed, what is the optimal number of hidden units, what is the best pooling strategy, and what is the best input feature type for CNNs. We then explore the behavior of neural network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs. We find that CNNs offer a 13-30% relative improvement over GMMs and a 4-12% relative improvement over DNNs on 400-hour Broadcast News and 300-hour Switchboard tasks.
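
The core architectural idea is to convolve small filters along the frequency axis of log-mel features and then max-pool, so that small spectral shifts (for example, those caused by speaker differences) are absorbed. The sketch below illustrates this with toy sizes; it is not the paper's implementation.

import numpy as np

rng = np.random.default_rng(2)

def conv1d_freq(x, filters, bias):
    """x: (freq, time); filters: (n_filt, width) convolved along frequency."""
    n_filt, width = filters.shape
    freq, time = x.shape
    out = np.zeros((n_filt, freq - width + 1, time))
    for f in range(freq - width + 1):
        patch = x[f:f + width, :]                      # (width, time)
        out[:, f, :] = filters @ patch + bias[:, None]
    return np.maximum(0.0, out)                        # ReLU

def max_pool_freq(x, pool=3):
    """Max-pool the frequency axis in non-overlapping windows of size `pool`."""
    n_filt, freq, time = x.shape
    trimmed = x[:, :freq - freq % pool, :]
    return trimmed.reshape(n_filt, -1, pool, time).max(axis=2)

logmel = rng.normal(size=(40, 11))                     # 40 mel bins, 11 frames
filters = rng.normal(0, 0.01, (64, 9))                 # 64 filters of width 9
feat = max_pool_freq(conv1d_freq(logmel, filters, np.zeros(64)))
print(feat.shape)                                      # (64, 10, 11)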


International Conference on Acoustics, Speech, and Signal Processing | 2015

Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks

Tara N. Sainath; Oriol Vinyals; Andrew W. Senior; Hasim Sak

Both Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) have shown improvements over Deep Neural Networks (DNNs) across a wide variety of speech recognition tasks. CNNs, LSTMs and DNNs are complementary in their modeling capabilities, as CNNs are good at reducing frequency variations, LSTMs are good at temporal modeling, and DNNs are appropriate for mapping features to a more separable space. In this paper, we take advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture. We explore the proposed architecture, which we call CLDNN, on a variety of large vocabulary tasks, varying from 200 to 2,000 hours. We find that the CLDNN provides a 4-6% relative improvement in WER over an LSTM, the strongest of the three individual models.
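
The CLDNN stacks the three models in sequence: a convolutional front-end reduces frequency variation, an LSTM layer models the temporal structure, and fully connected layers map the result to a more separable space. The toy composition below shows the data flow with made-up dimensions and randomly initialised weights; it is only a sketch of the layering, not the published model.

import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_freq(x, filters):
    """x: (time, freq); convolve each frame along frequency, ReLU, flatten."""
    n_filt, width = filters.shape
    t, freq = x.shape
    out = np.stack([np.maximum(0.0, x[:, f:f + width] @ filters.T)
                    for f in range(freq - width + 1)], axis=2)
    return out.reshape(t, -1)

def lstm(xs, dim):
    """A single unidirectional LSTM layer over the time axis."""
    Wx = rng.normal(0, 0.1, (xs.shape[1], 4 * dim))
    Wh = rng.normal(0, 0.1, (dim, 4 * dim))
    h, c, hs = np.zeros(dim), np.zeros(dim), []
    for x in xs:
        i, f, o, g = np.split(x @ Wx + h @ Wh, 4)      # input/forget/output/cell
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        hs.append(h)
    return np.array(hs)

frames = rng.normal(size=(20, 40))                          # 20 frames of 40-dim log-mel
conv_out = conv_freq(frames, rng.normal(0, 0.1, (8, 9)))    # C: convolutional front-end
lstm_out = lstm(conv_out, dim=64)                           # L: temporal modelling
dnn_out = np.maximum(0.0, lstm_out @ rng.normal(0, 0.1, (64, 128)))  # D: fully connected
print(conv_out.shape, lstm_out.shape, dnn_out.shape)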


International Conference on Acoustics, Speech, and Signal Processing | 2013

Low-rank matrix factorization for Deep Neural Network training with high-dimensional output targets

Tara N. Sainath; Brian Kingsbury; Vikas Sindhwani; Ebru Arisoy; Bhuvana Ramabhadran

While Deep Neural Networks (DNNs) have achieved tremendous success for large vocabulary continuous speech recognition (LVCSR) tasks, training of these networks is slow. One reason is that DNNs are trained with a large number of training parameters (i.e., 10-50 million). Because networks are trained with a large number of output targets to achieve good performance, the majority of these parameters are in the final weight layer. In this paper, we propose a low-rank matrix factorization of the final weight layer. We apply this low-rank technique to DNNs for both acoustic modeling and language modeling. We show on three different LVCSR tasks ranging between 50-400 hrs, that a low-rank factorization reduces the number of parameters of the network by 30-50%. This results in roughly an equivalent reduction in training time, without a significant loss in final recognition accuracy, compared to a full-rank representation.
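
The factorization replaces the final hidden-to-output weight matrix (size h x o) with the product of an h x r and an r x o matrix, so the parameter count in that layer drops from h*o to r*(h+o). The few lines below illustrate the arithmetic with assumed sizes (1024 hidden units, 6,000 output targets, rank 256); these numbers are not taken from the paper.

hidden, outputs, rank = 1024, 6000, 256    # illustrative sizes only

full_params = hidden * outputs                       # full-rank final layer
low_rank_params = hidden * rank + rank * outputs     # h x r followed by r x o

print(f"full-rank final layer: {full_params:,} parameters")
print(f"low-rank (r={rank}):   {low_rank_params:,} parameters")
print(f"reduction in the layer: {1 - low_rank_params / full_params:.1%}")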


International Conference on Acoustics, Speech, and Signal Processing | 2011

Deep Belief Networks using discriminative features for phone recognition

Abdel-rahman Mohamed; Tara N. Sainath; George E. Dahl; Bhuvana Ramabhadran; Geoffrey E. Hinton; Michael Picheny

Deep Belief Networks (DBNs) are multi-layer generative models. They can be trained to model windows of coefficients extracted from speech and they discover multiple layers of features that capture the higher-order statistical structure of the data. These features can be used to initialize the hidden units of a feed-forward neural network that is then trained to predict the HMM state for the central frame of the window. Initializing with features that are good at generating speech makes the neural network perform much better than initializing with random weights. DBNs have already been used successfully for phone recognition with input coefficients that are MFCCs or filterbank outputs [1, 2]. In this paper, we demonstrate that they work even better when their inputs are speaker adaptive, discriminative features. On the standard TIMIT corpus, they give phone error rates of 19.6% using monophone HMMs and a bigram language model and 19.4% using monophone HMMs and a trigram language model.
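
DBN pre-training builds the network one layer at a time by training restricted Boltzmann machines (RBMs) with contrastive divergence and then using the learned weights to initialize the feed-forward net. The sketch below shows a single CD-1 update for a binary RBM with toy sizes; real acoustic models use Gaussian visible units for real-valued features, and none of the numbers here come from the paper.

import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.01):
    """One contrastive-divergence (CD-1) step on a batch v0: (batch, n_vis)."""
    ph0 = sigmoid(v0 @ W + b_hid)                        # up: hidden probabilities
    h0 = (rng.random(ph0.shape) < ph0).astype(float)     # sample hidden states
    pv1 = sigmoid(h0 @ W.T + b_vis)                      # down: reconstruction
    ph1 = sigmoid(pv1 @ W + b_hid)                       # up again
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)       # positive minus negative statistics
    b_vis += lr * (v0 - pv1).mean(axis=0)
    b_hid += lr * (ph0 - ph1).mean(axis=0)
    return W, b_vis, b_hid

n_vis, n_hid = 360, 512
W = rng.normal(0, 0.01, (n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
batch = (rng.random((32, n_vis)) < 0.5).astype(float)    # stand-in binary features
W, b_vis, b_hid = cd1_update(batch, W, b_vis, b_hid)
print(W.shape)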


IEEE Automatic Speech Recognition and Understanding Workshop | 2011

Making Deep Belief Networks effective for large vocabulary continuous speech recognition

Tara N. Sainath; Brian Kingsbury; Bhuvana Ramabhadran; Petr Fousek; Petr Novák; Abdel-rahman Mohamed

To date, there has been limited work in applying Deep Belief Networks (DBNs) for acoustic modeling in LVCSR tasks, with past work using standard speech features. However, a typical LVCSR system makes use of both feature and model-space speaker adaptation and discriminative training. This paper explores the performance of DBNs in a state-of-the-art LVCSR system, showing improvements over Multi-Layer Perceptrons (MLPs) and GMM/HMMs across a variety of features on an English Broadcast News task. In addition, we provide a recipe for data parallelization of DBN training, showing that data parallelization can provide linear speed-up in the number of machines, without impacting WER.
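
The data-parallel recipe splits each batch across machines, lets every machine compute gradients on its own shard, and averages the results before a single weight update, which is why the speed-up can scale with the number of machines without changing what is learned. The toy example below mimics the averaging step in-process with a linear model; it is an illustration of the idea, not the paper's recipe or infrastructure.

import numpy as np

rng = np.random.default_rng(5)

def local_gradient(W, x_shard, y_shard):
    """Mean-squared-error gradient for a linear model on one data shard."""
    err = x_shard @ W - y_shard
    return x_shard.T @ err / len(x_shard)

n_workers, dim = 4, 16
W = rng.normal(0, 0.1, (dim, 1))
x = rng.normal(size=(512, dim))
y = rng.normal(size=(512, 1))

shards = zip(np.array_split(x, n_workers), np.array_split(y, n_workers))
grads = [local_gradient(W, xs, ys) for xs, ys in shards]   # computed in parallel in practice
W -= 0.1 * np.mean(grads, axis=0)                          # one averaged update
print(W.shape)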


Neural Networks | 2015

Deep Convolutional Neural Networks for Large-scale Speech Tasks

Tara N. Sainath; Brian Kingsbury; George Saon; Hagen Soltau; Abdel-rahman Mohamed; George E. Dahl; Bhuvana Ramabhadran

Convolutional Neural Networks (CNNs) are an alternative type of neural network that can be used to reduce spectral variations and model spectral correlations which exist in signals. Since speech signals exhibit both of these properties, we hypothesize that CNNs are a more effective model for speech compared to Deep Neural Networks (DNNs). In this paper, we explore applying CNNs to large vocabulary continuous speech recognition (LVCSR) tasks. First, we determine the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks. Specifically, we focus on how many convolutional layers are needed, what is an appropriate number of hidden units, and what is the best pooling strategy. Second, we investigate how to incorporate speaker-adapted features, which cannot directly be modeled by CNNs as they do not obey locality in frequency, into the CNN framework. Third, given the importance of sequence training for speech tasks, we introduce a strategy to use ReLU+dropout during Hessian-free sequence training of CNNs. Experiments on three LVCSR tasks indicate that a CNN with the proposed speaker-adapted and ReLU+dropout ideas allows for a 12%-14% relative improvement in WER over a strong DNN system, achieving state-of-the-art results on these three tasks.


International Conference on Acoustics, Speech, and Signal Processing | 2012

Auto-encoder bottleneck features using deep belief networks

Tara N. Sainath; Brian Kingsbury; Bhuvana Ramabhadran

Neural network (NN) bottleneck (BN) features are typically created by training a NN with a middle bottleneck layer. Recently, an alternative structure was proposed which trains a NN with a constant number of hidden units to predict output targets, and then reduces the dimensionality of these output probabilities through an auto-encoder, to create auto-encoder bottleneck (AE-BN) features. The benefit of placing the BN after the posterior estimation network is that it avoids the loss in frame classification accuracy incurred by networks that place the BN before the softmax. In this work, we investigate the use of pre-training when creating AE-BN features. Our experiments indicate that with the AE-BN architecture, pre-trained and deeper NNs produce better AE-BN features. On a 50-hour English Broadcast News task, the AE-BN features provide over a 1% absolute improvement compared to a state-of-the-art GMM/HMM with a WER of 18.8% and pre-trained NN hybrid system with a WER of 18.4%. In addition, on a larger 430-hour Broadcast News task, AE-BN features provide a 0.5% absolute improvement over a strong GMM/HMM baseline with a WER of 16.0%. Finally, system combination with the GMM/HMM baseline and AE-BN systems provides an additional 0.5% absolute on 430 hours over the AE-BN system alone, yielding a final WER of 15.0%.
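
In the AE-BN setup the bottleneck sits after the posterior-estimation network: the classifier's output distribution over targets is compressed by an auto-encoder, and the activations of the narrow middle layer become the features. The sketch below shows only that compression step, with assumed dimensions (3,000 targets, a 40-dimensional bottleneck) and random, untrained weights.

import numpy as np

rng = np.random.default_rng(6)

n_targets, bottleneck = 3000, 40            # posterior dimension -> feature dimension

W_enc = rng.normal(0, 0.01, (n_targets, bottleneck))
W_dec = rng.normal(0, 0.01, (bottleneck, n_targets))

def ae_bn_features(posteriors):
    """posteriors: (batch, n_targets) from the trained classification network."""
    return np.tanh(posteriors @ W_enc)      # encoder output = bottleneck features

def reconstruction(posteriors):
    return ae_bn_features(posteriors) @ W_dec   # decoder is only needed for training

posteriors = rng.dirichlet(np.ones(n_targets), size=4)   # stand-in network outputs
feats = ae_bn_features(posteriors)
print(feats.shape)                          # (4, 40) features for the GMM/HMM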


IEEE Automatic Speech Recognition and Understanding Workshop | 2013

Improvements to Deep Convolutional Neural Networks for LVCSR

Tara N. Sainath; Brian Kingsbury; Abdel-rahman Mohamed; George E. Dahl; George Saon; Hagen Soltau; Tomas Beran; Aleksandr Y. Aravkin; Bhuvana Ramabhadran

Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNNs), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) of 4-12% relative to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.

