Garimella S. V. S. Sivaram
Johns Hopkins University
Publications
Featured research published by Garimella S. V. S. Sivaram.
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2010
Garimella S. V. S. Sivaram; Sridhar Krishna Nemala; Mounya Elhilali; Trac D. Tran; Hynek Hermansky
This paper proposes a novel feature extraction technique for speech recognition based on the principles of sparse coding. The idea is to express a spectro-temporal pattern of speech as a linear combination of an overcomplete set of basis functions such that the weights of the linear combination are sparse. These weights (features) are subsequently used for acoustic modeling. We learn a set of overcomplete basis functions (dictionary) from the training set by adopting a previously proposed algorithm that iteratively minimizes the reconstruction error and maximizes the sparsity of the weights. Features are then derived using the learned basis functions by applying the well-established principles of compressive sensing. Phoneme recognition experiments show that the proposed features outperform conventional features in both clean and noisy conditions.
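The sparse recovery step lends itself to a short sketch. The NumPy fragment below recovers sparse weights for a spectro-temporal patch given an already-learned dictionary, using ISTA (iterative soft-thresholding) as a stand-in for whatever l1 solver the authors used; the dictionary, patch sizes, and regularization weight are all illustrative, not the paper's.

```python
import numpy as np

def ista_sparse_code(x, D, lam=0.1, n_iter=200):
    """Recover sparse weights a such that x ~= D @ a via ISTA.
    D is an overcomplete dictionary with unit-norm columns;
    lam controls the sparsity of the solution."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)           # gradient of 0.5*||x - D a||^2
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

# Hypothetical sizes: a 200-dim spectro-temporal patch coded with 400 atoms.
rng = np.random.default_rng(0)
D = rng.standard_normal((200, 400))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
x = D[:, :5] @ rng.standard_normal(5)      # synthetic patch built from 5 atoms
features = ista_sparse_code(x, D)          # sparse weights used as ASR features
```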
IEEE Transactions on Audio, Speech, and Language Processing | 2012
Garimella S. V. S. Sivaram; Hynek Hermansky
This paper introduces the sparse multilayer perceptron (SMLP), which jointly learns a sparse feature representation and nonlinear classifier boundaries to optimally discriminate multiple output classes. SMLP learns the transformation from the inputs to the targets as in a multilayer perceptron (MLP), while the outputs of one of the internal hidden layers are forced to be sparse. This is achieved by adding a sparse regularization term to the cross-entropy cost and updating the parameters of the network to minimize the joint cost. On the TIMIT phoneme recognition task, SMLP-based systems trained on individual speech recognition feature streams perform significantly better than the corresponding MLP-based systems. A phoneme error rate of 19.6% is achieved using the combination of SMLP-based systems, a relative improvement of 3.0% over the combination of MLP-based systems.
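A minimal sketch of the joint cost in PyTorch (not the authors' implementation): a standard MLP whose hidden-layer activations receive an L1 penalty added to the cross-entropy loss. Layer sizes, the activation function, and the sparsity weight lam are all hypothetical.

```python
import torch
import torch.nn as nn

class SMLP(nn.Module):
    """MLP whose internal hidden-layer outputs are encouraged to be
    sparse via an L1 penalty added to the cross-entropy cost."""
    def __init__(self, n_in, n_hidden, n_classes):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.out = nn.Linear(n_hidden, n_classes)

    def forward(self, x):
        h = self.hidden(x)            # activations to be made sparse
        return self.out(h), h

model = SMLP(n_in=351, n_hidden=1024, n_classes=49)   # sizes are illustrative
opt = torch.optim.SGD(model.parameters(), lr=0.01)
ce = nn.CrossEntropyLoss()
lam = 1e-4                                            # sparsity weight (hypothetical)

x = torch.randn(32, 351)                              # dummy minibatch of features
y = torch.randint(0, 49, (32,))                       # dummy phoneme targets
opt.zero_grad()
logits, h = model(x)
loss = ce(logits, y) + lam * h.abs().mean()           # joint cost: CE + sparsity
loss.backward()
opt.step()
```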
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2011
Garimella S. V. S. Sivaram; Hynek Hermansky
This paper introduces the sparse multilayer perceptron (SMLP), which learns the transformation from the inputs to the targets as in a multilayer perceptron (MLP), while the outputs of one of the internal hidden layers are forced to be sparse. This is achieved by adding a sparse regularization term to the cross-entropy cost and learning the parameters of the network to minimize the joint cost. On the TIMIT phoneme recognition task, the SMLP-based system trained using perceptual linear prediction (PLP) features performs better than the conventional MLP-based system. Furthermore, their combination yields a phoneme error rate of 21.2%, a relative improvement of 6.2% over the baseline.
ACM Transactions on Multimedia Computing, Communications, and Applications | 2009
Garimella S. V. S. Sivaram; Mohan S. Kankanhalli; K. R. Ramakrishnan
This article addresses the problem of selecting the optimal combination of sensors, and determining their optimal placement in a surveillance region, so as to meet given performance requirements at minimal cost for a multimedia surveillance system. We propose to solve this problem by obtaining a performance vector, with its elements representing the performances of subtasks, for a given input combination of sensors and their placement. We then show that the optimal sensor selection problem can be converted into an Integer Linear Programming (ILP) problem by using a linear model for computing the optimal performance vector corresponding to a sensor combination, i.e., the performance vector achieved under that combination's optimal placement. To demonstrate the utility of our technique, we design and build a surveillance system consisting of PTZ (Pan-Tilt-Zoom) cameras and active motion sensors for capturing faces. Finally, we show experimentally that optimal placement of sensors based on this design maximizes system performance.
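The selection step can be illustrated with a small integer program. The sketch below uses scipy.optimize.milp with made-up costs, a made-up linear per-subtask performance model, and made-up requirement levels; it shows only the general shape of such a formulation, not the paper's actual one.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Hypothetical setup: 6 candidate (sensor, placement) options, 3 subtasks.
cost = np.array([4.0, 2.5, 3.0, 5.0, 1.5, 2.0])      # cost of each option
perf = np.array([[0.6, 0.1, 0.3, 0.8, 0.0, 0.2],     # per-subtask performance
                 [0.2, 0.5, 0.4, 0.1, 0.3, 0.0],     # contributed by each option
                 [0.0, 0.3, 0.2, 0.4, 0.5, 0.6]])    # (assumes a linear model)
req = np.array([0.9, 0.6, 0.7])                      # required performance levels

res = milp(
    c=cost,                                          # minimize total cost
    constraints=LinearConstraint(perf, lb=req, ub=np.inf),
    integrality=np.ones_like(cost),                  # all variables integer
    bounds=Bounds(0, 1),                             # binary selection variables
)
print("selected options:", np.flatnonzero(res.x > 0.5))
```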
IEEE Signal Processing Letters | 2010
Garimella S. V. S. Sivaram; Sridhar Krishna Nemala; Nima Mesgarani; Hynek Hermansky
This paper proposes novel data-driven, feedback-based discriminative spectro-temporal filters for feature extraction in automatic speech recognition (ASR). First, a set of spectro-temporal filters is designed to separate each phoneme from the rest of the phonemes. A hybrid Hidden Markov Model/Multilayer Perceptron (HMM/MLP) phoneme recognition system is trained on the features derived using these filters. As feedback to the feature extraction stage, the top confusions of this system are identified, and a second set of filters is designed specifically to address these confusions. Phoneme recognition experiments on TIMIT show that the features derived from the combined set of discriminative filters outperform conventional speech recognition features and also contain significant complementary information.
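As a rough illustration of a data-driven spectro-temporal filter (a crude stand-in, not the paper's discriminative design), the sketch below builds a filter from the difference of class-mean patches and correlates it against a spectrogram; all shapes and data are synthetic.

```python
import numpy as np
from scipy.signal import correlate2d

def discriminative_filter(target_patches, rest_patches):
    """Crude stand-in for a discriminatively designed spectro-temporal
    filter: difference of class-mean patches (one phoneme vs. the rest),
    normalized to unit energy."""
    w = target_patches.mean(axis=0) - rest_patches.mean(axis=0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(1)
target = rng.standard_normal((50, 15, 25))    # 50 patches: 15 bands x 25 frames
rest = rng.standard_normal((500, 15, 25))     # patches from all other phonemes
w = discriminative_filter(target, rest)

spectrogram = rng.standard_normal((15, 300))  # 15 bands x 300 frames (synthetic)
feature_traj = correlate2d(spectrogram, w, mode='valid')  # filter output over time
```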
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2012
Daniel Garcia-Romero; Xinhui Zhou; Dmitry N. Zotkin; Balaji Vasan Srinivasan; Yuancheng Luo; Sriram Ganapathy; Samuel Thomas; Sridhar Krishna Nemala; Garimella S. V. S. Sivaram; Majid Mirbagheri; Sri Harish Reddy Mallidi; Thomas Janu; Padmanabhan Rajan; Nima Mesgarani; Mounya Elhilali; Hynek Hermansky; Shihab A. Shamma; Ramani Duraiswami
In recent years, there have been significant advances in the field of speaker recognition that have resulted in very robust recognition systems. The primary focus of many recent developments has shifted to the problem of recognizing speakers in adverse conditions, e.g., in the presence of noise or reverberation. In this paper, we present the UMD-JHU speaker recognition system applied to the NIST 2010 SRE task. The novel aspects of our system are: 1) improved performance on trials involving different vocal effort via the use of linear-scale features; 2) expected improved recognition performance in the presence of reverberation and noise via the use of frequency-domain perceptual linear predictor and cortical features; 3) a new discriminative kernel partial least squares (KPLS) framework that complements state-of-the-art back-end systems (JFA and PLDA) to aid in better overall recognition; and 4) acceleration of the JFA, PLDA, and KPLS back-ends via distributed computing. The individual components of the system and the fused system are compared against a baseline JFA system and results reported by SRI and MIT-LL on SRE 2010.
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2011
Balakrishnan; Garimella S. V. S. Sivaram; Sanjeev Khudanpur
In this paper, we present a novel technique for modeling the posterior probability estimates obtained from a neural network directly in the HMM framework using Dirichlet Mixture Models (DMMs). Since posterior probability vectors lie on a probability simplex, their distribution can be modeled using DMMs. Moreover, since DMMs belong to an exponential family, their parameters can be estimated efficiently. Conventional approaches like TANDEM attempt to Gaussianize the posteriors with suitable transforms and model them using Gaussian Mixture Models (GMMs); this requires more parameters, as it does not exploit the fact that the probability vectors lie on a simplex. We demonstrate through TIMIT phoneme recognition experiments that the proposed technique outperforms the conventional TANDEM approach.
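For concreteness, the Dirichlet mixture log-likelihood over posterior vectors can be written in a few lines. The sketch below assumes fixed, made-up mixture parameters and shows only the scoring step, not parameter estimation.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def dmm_loglik(P, alphas, weights):
    """Log-likelihood of posterior vectors P (N x K, rows on the simplex)
    under a Dirichlet mixture with concentration rows `alphas` (M x K)
    and mixture weights `weights` (M,)."""
    logP = np.log(np.clip(P, 1e-10, 1.0))
    comp = []
    for a in alphas:
        # log Dirichlet density: logGamma(sum a) - sum logGamma(a) + (a-1).log p
        norm = gammaln(a.sum()) - gammaln(a).sum()
        comp.append(norm + logP @ (a - 1.0))
    comp = np.stack(comp, axis=1)                 # N x M component log-densities
    return logsumexp(comp + np.log(weights), axis=1)

# Toy check: 3-class posteriors scored under a 2-component DMM (made-up params).
P = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])
alphas = np.array([[8.0, 1.0, 1.0], [1.0, 6.0, 1.0]])
print(dmm_loglik(P, alphas, weights=np.array([0.5, 0.5])))
```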
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2009
Joel Praveen Pinto; Garimella S. V. S. Sivaram; Hynek Hermansky; Mathew Magimai-Doss
We present a framework for applying Volterra series to analyze multilayer perceptrons trained to estimate the posterior probabilities of phonemes in automatic speech recognition. The identified Volterra kernels reveal the spectro-temporal patterns that the trained system learns for each phoneme. To demonstrate the applicability of Volterra series, we analyze a multilayer perceptron trained on Mel filter bank energy features and examine its first-order Volterra kernels.
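When the probe input is white Gaussian noise, the first-order kernel estimate reduces to a cross-correlation between input and output (the Lee-Schetzen method). A minimal sketch, with a toy network standing in for the trained MLP:

```python
import numpy as np

def first_order_kernel(net, n_in, n_samples=20000, sigma=1.0, seed=0):
    """Estimate the first-order Volterra kernel of a black-box `net`
    mapping an n_in-dim input window to a scalar output, by
    cross-correlating white Gaussian probe inputs with the output."""
    rng = np.random.default_rng(seed)
    X = sigma * rng.standard_normal((n_samples, n_in))
    y = net(X)                                  # shape (n_samples,)
    y = y - y.mean()                            # remove the zeroth-order term
    return (y @ X) / (n_samples * sigma**2)     # E[y * x] / sigma^2

# Toy stand-in for one trained MLP output unit (one hidden layer, tanh).
rng = np.random.default_rng(1)
W1, W2 = rng.standard_normal((64, 30)), rng.standard_normal(64)
net = lambda X: np.tanh(X @ W1.T) @ W2
h1 = first_order_kernel(net, n_in=30)           # ~ linearized receptive field
```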
Text, Speech and Dialogue (TSD) | 2008
Garimella S. V. S. Sivaram; Hynek Hermansky
This paper proposes modifications to the Multi-resolution RASTA (MRASTA) feature extraction technique for automatic speech recognition (ASR). By emulating the asymmetries of the temporal receptive field (TRF) profiles of higher-level auditory neurons, we obtain more than an 11.4% relative improvement in word error rate on the OGI-Digits database. Experiments on the TIMIT database confirm that the proposed modifications are indeed useful.
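For reference, the underlying MRASTA bank consists of first and second derivatives of Gaussians at several temporal resolutions, applied to critical-band energy trajectories. The sketch below builds the standard (symmetric) bank with illustrative widths and window length; it leaves out the asymmetric, TRF-inspired modification the paper proposes.

```python
import numpy as np

def mrasta_bank(sigmas_ms=(8, 16, 32, 64), frame_ms=10, half_len=50):
    """Build MRASTA-style temporal filters: first and second derivatives
    of Gaussians at several temporal resolutions (widths in ms)."""
    t = np.arange(-half_len, half_len + 1) * frame_ms
    bank = []
    for s in sigmas_ms:
        g = np.exp(-0.5 * (t / s) ** 2)
        d1 = -t / s**2 * g                      # first derivative of Gaussian
        d2 = (t**2 / s**4 - 1.0 / s**2) * g     # second derivative of Gaussian
        bank.append(d1 / np.abs(d1).sum())      # normalize filter gains
        bank.append(d2 / np.abs(d2).sum())
    return np.array(bank)

filters = mrasta_bank()
energy_traj = np.random.randn(1000)             # one critical band over time
feats = np.array([np.convolve(energy_traj, f, mode='same') for f in filters])
```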
Text, Speech and Dialogue (TSD) | 2008
Joel Praveen Pinto; Garimella S. V. S. Sivaram; Hynek Hermansky
In this work, we investigate the reverse correlation technique for analyzing posterior feature extraction by a multilayer perceptron (MLP) trained on multi-resolution RASTA (MRASTA) features. The filter bank in MRASTA feature extraction is motivated by human auditory modeling, whereas the MLP is trained on an error criterion and is purely data-driven. Here, we analyze the functionality of the combined system using reverse correlation analysis.