Publication


Featured research published by Suman V. Ravuri.


International Conference on Acoustics, Speech, and Signal Processing | 2010

The IBM 2008 GALE Arabic speech transcription system

Brian Kingsbury; Hagen Soltau; George Saon; Stephen M. Chu; Hong-Kwang Kuo; Lidia Mangu; Suman V. Ravuri; Nelson Morgan; Adam Janin

This paper describes the Arabic broadcast transcription system fielded by IBM in the GALE Phase 3.5 machine translation evaluation. Key advances compared to our Phase 2.5 system include improved discriminative training, the use of Subspace Gaussian Mixture Models (SGMM), neural network acoustic features, variable frame rate decoding, training data partitioning experiments, unpruned n-gram language models and neural network language models. These advances were instrumental in achieving a word error rate of 8.9% on the evaluation test set.


International Conference on Acoustics, Speech, and Signal Processing | 2011

Comparing multilayer perceptron to Deep Belief Network Tandem features for robust ASR

Oriol Vinyals; Suman V. Ravuri

In this paper, we extend the work done on integrating multilayer perceptron (MLP) networks with HMM systems via the Tandem approach. In particular, we explore whether the use of Deep Belief Networks (DBN) adds any substantial gain over MLPs on the Aurora2 speech recognition task under mismatched noise conditions. Our findings suggest that DBNs outperform single-layer MLPs under the clean condition, but the gains diminish as the noise level is increased. Furthermore, using MFCCs in conjunction with the posteriors from DBNs outperforms using the DBN posteriors alone in low to moderate noise conditions. MFCCs, however, do not help in the high-noise settings.
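
For readers unfamiliar with the Tandem approach, the sketch below shows the usual recipe in minimal form: frame-level phone posteriors from a trained network are log-compressed, decorrelated with PCA, and appended to the MFCC stream that feeds the HMM system. The dimensions and the random placeholder data are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def tandem_features(mfcc, posteriors, n_components=20):
    """Log-compress frame-level phone posteriors, decorrelate them with
    PCA, and append them to the MFCC stream fed to the HMM system."""
    log_post = np.log(posteriors + 1e-8)            # log compression
    centered = log_post - log_post.mean(axis=0)     # zero-mean per dimension
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # PCA via SVD
    projected = centered @ vt[:n_components].T      # decorrelated posteriors
    return np.hstack([mfcc, projected])             # MFCCs + Tandem features

# Toy example: 1000 frames, 39-dim MFCCs, 40 phone classes (placeholder data)
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(1000, 39))
posteriors = rng.dirichlet(np.ones(40), size=1000)
print(tandem_features(mfcc, posteriors).shape)      # (1000, 59)
```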


International Conference on Acoustics, Speech, and Signal Processing | 2010

Cover song detection: From high scores to general classification

Suman V. Ravuri; Daniel P. W. Ellis

Existing cover song detection systems require prior knowledge of the number of cover songs in a test set in order to identify cover(s) to a reference song. We describe a system that does not require such prior knowledge. The input to the system is a reference track and a test track, and the output is a binary decision on whether the pair is a reference and a cover or a reference and a non-cover. The system differs from state-of-the-art detectors by calculating multiple input features, performing a novel type of test song normalization to guard against “impostor” tracks, and performing classification using either a support vector machine (SVM) or multi-layer perceptron (MLP). On the covers80 test set, the system achieves an equal error rate of 10%, compared to 21.3% achieved by the 2007 LabROSA cover song detection system.
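
As a rough illustration of the classification stage, the sketch below trains an SVM on pairwise features that have been z-normalized against an "impostor" cohort of unrelated reference tracks. The feature dimensionality, cohort size, and all data are placeholder assumptions; the paper's actual input features are not reproduced here.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder pairwise features for (reference, test) track pairs; the
# 4-dim feature set and all values below are illustrative, not the paper's.
rng = np.random.default_rng(0)
X_pairs = rng.normal(size=(200, 4))      # 200 track pairs, 4 raw features
y = rng.integers(0, 2, size=200)         # 1 = cover pair, 0 = non-cover

def normalize_against_impostors(x, impostor_scores):
    """Z-normalize a pair's features against scores the same test track
    produces with a cohort of unrelated 'impostor' references."""
    mu = impostor_scores.mean(axis=0)
    sigma = impostor_scores.std(axis=0) + 1e-8
    return (x - mu) / sigma

# One impostor cohort per pair (placeholder: 20 impostor score vectors each)
X_norm = np.stack([
    normalize_against_impostors(x, rng.normal(size=(20, 4)))
    for x in X_pairs
])

clf = SVC(kernel="rbf", probability=True).fit(X_norm, y)
print(clf.predict_proba(X_norm[:3])[:, 1])   # cover probability per pair
```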


IEEE Automatic Speech Recognition and Understanding Workshop | 2015

A comparative study of neural network models for lexical intent classification

Suman V. Ravuri; Andreas Stolcke

Domain and intent classification are critical pre-processing steps for many speech understanding and dialog systems, as they allow certain types of utterances to be routed to particular subsystems. In previous work, we explored several neural network (NN) architectures, some feedforward and some recurrent, for lexical intent classification and found that they improved upon more traditional statistical baselines. In this paper we carry out a more comprehensive comparison of NN models, including the recently proposed gated recurrent unit (GRU) network, for two domain/intent classification tasks. Furthermore, whereas the previous work was confined to relatively small and controlled datasets, we now include experiments based on a large set obtained from the Cortana personal assistant application. We compare feedforward, plain recurrent, and gated networks (LSTM and GRU) against each other. On both the ATIS intent task and the much larger Cortana domain classification task, gated networks outperform plain recurrent models, which in turn outperform feedforward networks. We also compare standard word vector models against a representation that encodes words as sets of character n-grams to mitigate the out-of-vocabulary problem, and find that in nearly all cases the standard word vectors win. Best results are obtained by linearly combining scores from NN models with log-likelihood ratios obtained from n-gram language models.
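
The final score-combination step is simple enough to sketch directly: linearly interpolate NN log posteriors with per-class LM log-likelihood ratios, then pick the argmax. The mixing weight and all scores below are placeholder assumptions; in practice the weight would be tuned on held-out data.

```python
import numpy as np

def combine_scores(nn_log_posteriors, lm_log_likelihood_ratios, alpha=0.7):
    """Linearly interpolate NN class scores with n-gram LM log-likelihood
    ratios and pick the highest-scoring class. alpha is a mixing weight;
    the value here is an illustrative assumption, not a tuned setting."""
    combined = alpha * nn_log_posteriors + (1 - alpha) * lm_log_likelihood_ratios
    return combined.argmax(axis=-1)

# Toy example: 3 utterances, 5 intent classes (placeholder scores)
rng = np.random.default_rng(0)
nn = np.log(rng.dirichlet(np.ones(5), size=3))   # NN log posteriors
lm = rng.normal(size=(3, 5))                     # per-class LM LLRs
print(combine_scores(nn, lm))                    # predicted intent per utterance
```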


International Conference on Acoustics, Speech, and Signal Processing | 2016

A comparative study of recurrent neural network models for lexical domain classification

Suman V. Ravuri; Andreas Stolcke

Domain classification is a critical pre-processing step for many speech understanding and dialog systems, as it allows for certain types of utterances to be routed to specialized subsystems. In previous work, we explored various neural network (NN) architectures for binary utterance classification based on lexical features, and found that they improved upon more traditional statistical baselines. In this paper we generalize to an n-way classification task, and test the best-performing NN architectures on a large, real-world dataset from the Cortana personal assistant application. As in the earlier work, we find that recurrent NNs with gated memory units (LSTM and GRU) perform best, beating out state-of-the-art baseline systems based on language models or boosting classifiers. NN classifiers can still benefit from combining their posterior class estimates with traditional language model likelihood ratios, via a logistic regression combiner. We also investigate whether it is better to use an ensemble of binary classifiers or a NN trained for n-way classification, and how each approach performs in combination with the baseline classifiers. The best overall results are obtained by first combining an ensemble of binary GRU-NN classifiers with LM likelihood ratios, followed by picking the highest class posterior estimate.
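
A minimal sketch of the best-performing pipeline described above, on placeholder data: one logistic-regression combiner per domain fuses that domain's binary GRU score with its LM likelihood ratio, and the final decision is the highest combined posterior. (For brevity the combiners are evaluated on their own training data here; a real setup would use held-out data.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder scores for 500 utterances and 4 domains: one binary GRU
# classifier per domain plus per-domain LM log-likelihood ratios.
rng = np.random.default_rng(0)
gru_scores = rng.normal(size=(500, 4))   # binary GRU classifier scores
lm_llrs = rng.normal(size=(500, 4))      # n-gram LM log-likelihood ratios
labels = rng.integers(0, 4, size=500)

# One logistic-regression combiner per domain: fuse that domain's GRU
# score with its LM LLR into a calibrated domain posterior.
combined = np.zeros_like(gru_scores)
for d in range(4):
    X = np.column_stack([gru_scores[:, d], lm_llrs[:, d]])
    y = (labels == d).astype(int)
    combined[:, d] = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

predictions = combined.argmax(axis=1)    # pick highest class posterior
print((predictions == labels).mean())    # accuracy on this toy data
```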


International Conference on Acoustics, Speech, and Signal Processing | 2012

Easy does it: Robust spectro-temporal many-stream ASR without fine tuning streams

Suman V. Ravuri; Nelson Morgan

Previous work has shown that spectro-temporal features reduce the word error rate for automatic speech recognition under noisy conditions. These systems, however, required significant hand-tuning to determine which spectral and temporal modulations should be included in a particular stream. In this work, each stream contains just one spectral and one temporal modulation; each stream is discriminatively trained with a multilayer perceptron, and the resulting posterior probabilities are then combined. We show that this combination structure performs as well as or better than more elaborate methods in which multiple spectral and temporal modulations are hand-picked per stream. In addition, these features outperform standard noise-robust features such as the “Advanced Front End” features, whereas our hand-picked spectro-temporal features do not.
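
The untuned combination can be sketched as a weighted log-average (a normalized product of experts) over per-stream MLP posteriors, with equal weights standing in for the "no fine-tuning" case; the stream count, class count, and data below are illustrative assumptions.

```python
import numpy as np

def combine_streams(stream_posteriors, weights=None):
    """Combine per-stream phone posteriors with a (weighted) log-average,
    i.e. a normalized product of experts over streams. Equal weights
    correspond to the untuned combination."""
    stream_posteriors = np.asarray(stream_posteriors)  # (streams, frames, classes)
    if weights is None:
        weights = np.ones(len(stream_posteriors)) / len(stream_posteriors)
    log_combined = np.tensordot(weights, np.log(stream_posteriors + 1e-8), axes=1)
    combined = np.exp(log_combined)
    return combined / combined.sum(axis=-1, keepdims=True)  # renormalize

# Toy example: 6 streams (one spectral/temporal modulation pair each),
# 100 frames, 40 phone classes of placeholder MLP posteriors.
rng = np.random.default_rng(0)
streams = rng.dirichlet(np.ones(40), size=(6, 100))
print(combine_streams(streams).shape)  # (100, 40)
```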


IEEE Transactions on Audio, Speech, and Language Processing | 2011

Transcribing Mandarin Broadcast Speech Using Multi-Layer Perceptron Acoustic Features

Fabio Valente; Mathew Magimai.-Doss; Christian Plahl; Suman V. Ravuri; Wen Wang

Recently, several multi-layer perceptron (MLP)-based front-ends have been developed and used for Mandarin speech recognition, often showing significant complementary properties to conventional spectral features. Although these front-ends are widely used in multiple Mandarin systems, no systematic comparison of the different approaches, or of their scalability, has been published. The novelty of this correspondence is mainly experimental. In this work, all the MLP front-ends recently developed at multiple sites are described and compared in a systematic manner on a 100-hour setup. The study covers the two main directions along which MLP features have evolved: the use of different input representations to the MLP and the use of more complex MLP architectures beyond the three-layer perceptron. The results are analyzed in terms of confusion matrices, and the paper discusses a number of novel findings that the comparison reveals. Furthermore, the two best front-ends used in the GALE 2008 evaluation, referred to as MLP1 and MLP2, are studied in a more complex LVCSR system in order to investigate their scalability in terms of the amount of training data (from 100 hours to 1600 hours) and the parametric system complexity (maximum likelihood versus discriminative training, speaker-adaptive training, lattice-level combination). Results on 5 hours of evaluation data from the GALE project reveal that the MLP features consistently produce relative improvements in the range of 15%-23% at the different steps of a multipass system when compared to mel-frequency cepstral coefficient (MFCC) and PLP features, suggesting that the improvements scale with the amount of data and with the complexity of the system. The integration of these features into the GALE 2008 evaluation system provides very competitive performance compared to other Mandarin systems.
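
One family of MLP front-ends compared in such studies takes the activations of a narrow "bottleneck" hidden layer as acoustic features instead of the output posteriors. The sketch below shows that idea with random weights; the layer sizes are assumptions chosen purely for illustration, not the MLP1/MLP2 configurations.

```python
import numpy as np

def bottleneck_features(frames, weights, bottleneck_layer=1):
    """Forward a feature MLP and return the activations of the narrow
    bottleneck layer as acoustic features, instead of output posteriors."""
    h = frames
    for i, (W, b) in enumerate(weights):
        h = np.tanh(h @ W + b)
        if i == bottleneck_layer:
            return h                  # stop at the bottleneck layer
    return h

# Toy 5-layer MLP: 351 inputs (9 spliced 39-dim frames) -> 1000 -> 42
# (bottleneck) -> 1000 -> 120 outputs, with random placeholder weights.
rng = np.random.default_rng(0)
sizes = [351, 1000, 42, 1000, 120]
weights = [(rng.normal(scale=0.01, size=(a, b)), np.zeros(b))
           for a, b in zip(sizes[:-1], sizes[1:])]
frames = rng.normal(size=(100, 351))
print(bottleneck_features(frames, weights).shape)  # (100, 42)
```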


IEEE Automatic Speech Recognition and Understanding Workshop | 2015

Hybrid DNN-Latent structured SVM acoustic models for continuous speech recognition

Suman V. Ravuri

In this work, we propose the Deep Neural Network (DNN)-Latent Structured Support Vector Machine (LSSVM) acoustic model as a replacement for more standard sequence-discriminatively trained DNN-HMM hybrid acoustic models. Compared to existing methods, approaches based on margin maximization, such as the one considered in this work, enjoy better theoretical justification. Beyond the max-margin criterion, we also extend the structured SVM model with latent variables to account for uncertainty in state alignments. Introducing latent structure improves sample complexity, often requiring 33% to 66% fewer utterances to converge compared to alternative criteria. On an 8-hour independent test set of conversational speech, the proposed method decreases word error rate by 9% relative to a cross-entropy trained hybrid system, while the best existing system decreases the word error rate by 6.5% relative.
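
The core objective can be sketched at the level of a single utterance: a hinge loss compares cost-augmented competitor scores against the best score over latent state alignments of the reference. The Hamming margin over states, the candidate sets, and the scores below are simplified placeholder assumptions, not the paper's full formulation.

```python
import numpy as np

def latent_structured_hinge(frame_scores, ref_alignments, competitors):
    """Hinge loss for one utterance: the latent variable is which state
    alignment realizes the reference transcript, so the reference score is
    a max over candidate alignments; competitor scores are augmented with
    a Hamming-distance margin (a simplified stand-in for the true cost)."""
    def seq_score(states):
        return sum(frame_scores[t, s] for t, s in enumerate(states))

    ref_score = max(seq_score(a) for a in ref_alignments)   # latent max
    best_ref = max(ref_alignments, key=seq_score)           # for the margin

    violations = [
        seq_score(c) + sum(s != r for s, r in zip(c, best_ref)) - ref_score
        for c in competitors                                # cost-augmented
    ]
    return max(0.0, max(violations))

# Toy example: 5 frames, 3 states; two candidate reference alignments and
# two competing decoded state sequences (all placeholder).
rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 3))
refs = [[0, 0, 1, 2, 2], [0, 1, 1, 2, 2]]
comps = [[0, 1, 2, 2, 2], [1, 1, 1, 2, 2]]
print(latent_structured_hinge(scores, refs, comps))
```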


International Conference on Acoustics, Speech, and Signal Processing | 2016

How neural network features and depth modify statistical properties of HMM acoustic models

Suman V. Ravuri; Steven Wegmann

Tandem neural network features, especially ones trained with more than one hidden layer, have improved word recognition performance, but why these features improve automatic speech recognition systems is not completely understood. In this work, we study how neural network features cope with the mismatch between the underlying stochastic process inherent in speech, and the models we use to represent that process. We use a novel resampling framework, which re-samples test set data to match the conditional independence assumptions of the acoustic model, and measure performance as we break those assumptions. We discover that depth provides modest robustness to data/model mismatch at the state level, and compared to standard MFCC features, neural network features actually fix poor duration modeling assumptions of the HMM. The duration modeling problem is also fixed by the language model, suggesting that the dictionary and language model make very strong implicit assumptions about phone length, which may now need to be revisited.
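
The resampling idea can be illustrated in a few lines: replace each frame with one drawn at random (with replacement) from the real frames aligned to the same HMM state, which preserves the state-conditional marginals while destroying frame-to-frame dependence. This single-level version is a simplified sketch of the framework, and the data and state inventory below are placeholder assumptions.

```python
import numpy as np

def resample_frames_by_state(features, state_alignment, rng):
    """Simulate data that satisfies the HMM's conditional independence
    assumption: for each frame, draw a replacement feature vector at
    random from the pool of real frames aligned to the same state."""
    resampled = np.empty_like(features)
    for state in np.unique(state_alignment):
        idx = np.flatnonzero(state_alignment == state)   # frames of this state
        resampled[idx] = features[rng.choice(idx, size=len(idx))]
    return resampled

# Toy example: 1000 frames of 39-dim features aligned to 10 HMM states.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 39))
align = rng.integers(0, 10, size=1000)
print(resample_frames_by_state(feats, align, rng).shape)  # (1000, 39)
```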


Conference of the International Speech Communication Association | 2016

How Neural Network Depth Compensates for HMM Conditional Independence Assumptions in DNN-HMM Acoustic Models

Suman V. Ravuri; Steven Wegmann

While DNN-HMM acoustic models have replaced GMM-HMMs in the standard ASR pipeline due to performance improvements, one unrealistic assumption that remains in these models is the conditional independence assumption of the Hidden Markov Model (HMM). In this work, we explore the extent to which the depth of neural networks helps compensate for this poor conditional independence assumption. Using a bootstrap resampling framework that allows us to control the amount of data dependence in the test set while still using real observations from the data, we can determine how robust neural networks, and particularly deeper models, are to data dependence. We conclude that if the data matched the conditional independence assumptions of the HMM, there would be little benefit from using deeper models; it is only when data become more dependent that depth improves ASR performance. The fact that performance still degrades substantially as the data become more realistic, however, suggests that better temporal modeling is needed for ASR.

Collaboration


Dive into Suman V. Ravuri's collaborations.

Top Co-Authors

Nelson Morgan, University of California
Oriol Vinyals, University of California
Sherry Y. Zhao, International Computer Science Institute
Steven Wegmann, International Computer Science Institute
Fabio Valente, Idiap Research Institute
Adam Janin, University of California
Andreas Stolcke, International Computer Science Institute