S. Umesh
Indian Institute of Technology Madras
Publications
Featured research published by S. Umesh.
IEEE Automatic Speech Recognition and Understanding Workshop | 2013
N. Vishnu Prasad; S. Umesh
Cepstral Mean and Variance Normalization (CMVN) is a computationally efficient normalization technique for noise-robust speech recognition. The performance of CMVN is known to degrade for short utterances, due to insufficient data for parameter estimation and loss of discriminable information as all utterances are forced to have zero mean and unit variance. In this work, we propose to use posterior estimates of the mean and variance in CMVN, instead of the maximum-likelihood estimates. This Bayesian approach, in addition to providing a robust estimate of the parameters, is also shown to preserve discriminable information without increasing the computational cost, making it particularly relevant for Interactive Voice Response (IVR)-based applications. The relative WER reductions of this approach with respect to Cepstral Mean Normalization, CMVN and Histogram Equalization are (i) 40.1%, 27% and 4.3% on the Aurora2 database for all utterances, (ii) 25.7%, 38.6% and 30.4% on the Aurora2 database for short utterances, and (iii) 18.7%, 12.6% and 2.5% on the Aurora4 database.
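For illustration, a minimal NumPy sketch of standard per-utterance CMVN alongside a MAP-style variant that interpolates the utterance statistics with a prior is given below; the prior weight `tau`, the function names and the exact interpolation rule are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def cmvn(feats):
    """Standard CMVN: per-utterance zero mean, unit variance (ML estimates)."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0) + 1e-8
    return (feats - mu) / sigma

def map_cmvn(feats, prior_mu, prior_var, tau=50.0):
    """MAP-style CMVN sketch: blend per-utterance statistics with a prior
    (e.g., estimated on training data) so that short utterances are not
    dominated by poorly estimated utterance-level statistics.
    `tau` is a hypothetical prior weight, not the paper's setting."""
    n = feats.shape[0]
    mu_ml = feats.mean(axis=0)
    var_ml = feats.var(axis=0)
    mu = (n * mu_ml + tau * prior_mu) / (n + tau)
    var = (n * var_ml + tau * prior_var) / (n + tau)
    return (feats - mu) / np.sqrt(var + 1e-8)
```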
IEEE Automatic Speech Recognition and Understanding Workshop | 2013
Vimal Manohar; C. Bhargav Srinivas; S. Umesh
In this paper, we propose a new acoustic modeling technique called Phone-Cluster Adaptive Training. In this approach, the parameters of context-dependent states are obtained by linear interpolation of several monophone cluster models, which are themselves obtained by adaptation using linear transformations of a canonical Gaussian Mixture Model (GMM). This approach is inspired by Cluster Adaptive Training (CAT) for speaker adaptation and by the Subspace Gaussian Mixture Model (SGMM). The parameters of the model are updated in an adaptive training framework. The interpolation vectors implicitly capture the phonetic context information. The proposed approach shows substantial improvement over the Continuous Density Hidden Markov Model (CDHMM) and performance similar to that of the SGMM, while using significantly fewer parameters than both the CDHMM and the SGMM.
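As a rough illustration of the interpolation idea, the sketch below builds a state mean as a weighted combination of monophone-cluster means; in the actual model the clusters are full GMMs adapted from a canonical GMM, so the shapes and function names here are simplifying assumptions.

```python
import numpy as np

def state_mean(cluster_means, interp_vector):
    """CAT-style interpolation sketch: the mean of a context-dependent state is
    a weighted combination of monophone-cluster means.
    cluster_means: (num_clusters, feat_dim); interp_vector: (num_clusters,)."""
    return interp_vector @ cluster_means

# toy example with 3 monophone clusters and 4-dimensional features
rng = np.random.default_rng(0)
cluster_means = rng.normal(size=(3, 4))
v_state = np.array([0.7, 0.2, 0.1])  # implicitly encodes the phonetic context
print(state_mean(cluster_means, v_state))
```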
National Conference on Communications | 2010
A. K. Sarkar; S. P. Rath; S. Umesh
In speaker identification, most of the computational processing time is spent calculating the likelihood of the test utterance of the unknown speaker with respect to the speaker models in the database. When the number of speakers in the database is of the order of 10,000 or more, the computational complexity becomes very high. In this paper, we propose a Maximum Likelihood Linear Regression (MLLR) based fast method to calculate the likelihood from the speaker model using the MLLR matrix. The proposed technique helps to quickly find the best N speakers during identification. The final speaker identification task can then be carried out among the N selected speakers using any conventional method of speaker identification. The proposed method is compared, in terms of processing time, with a state-of-the-art GMM-UBM based system on NIST 2004 SRE. The proposed technique is faster than the GMM-UBM based system, with some degradation in system accuracy.
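A minimal sketch of the N-best shortlisting idea is given below: each speaker is scored by comparing per-mixture test statistics against UBM means adapted by that speaker's MLLR transform, and only the top-N speakers are passed on for conventional rescoring. The proxy score and all names are illustrative assumptions; the paper computes the actual likelihood from sufficient statistics.

```python
import numpy as np

def adapt_means(ubm_means, A, b):
    """Adapt UBM means with a speaker's MLLR transform: mu' = A @ mu + b."""
    return ubm_means @ A.T + b

def shortlist(test_means, speaker_transforms, ubm_means, top_n=5):
    """Rank speakers by a crude proxy score (negative squared distance between
    per-mixture test statistics and MLLR-adapted means) and return the top-N
    candidates for rescoring with a conventional speaker-ID method."""
    scores = {}
    for spk, (A, b) in speaker_transforms.items():
        mu = adapt_means(ubm_means, A, b)
        scores[spk] = -np.sum((test_means - mu) ** 2)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```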
International Conference on Acoustics, Speech, and Signal Processing | 2012
Achintya Kumar Sarkar; S. Umesh; Jean-François Bonastre
In this paper, we propose a computationally efficient method to identify a speaker from a large population of speakers. The proposed method is based on our earlier [1] Fast Maximum Likelihood Linear Regression (MLLR) anchor modeling technique, which provides performance comparable to the conventional anchor modeling system and yet reduces computation time significantly by computing the likelihood efficiently using sufficient statistics of the data and an anchor-specific MLLR matrix. However, both these systems still require a Gaussian Mixture Model-Universal Background Model (GMM-UBM) based back-end system to choose the optimal speaker, which is computationally heavy. In our proposed method, we show that by applying Linear Discriminant Analysis (LDA) and Within-Class Covariance Normalization (WCCN) to the Speaker Characterization Vector (SCV) of our recently proposed Fast-MLLR method, we can combine computational efficiency and discriminant capability to obtain a system that uses a simple cosine-distance measure to identify speakers and yet has significantly superior performance compared to both the full-blown GMM-UBM system and the anchor-model system. More importantly, there is no need for the “back-end” system. Experimental results on NIST 2004 SRE show that the proposed method reduces the identification error rate by an absolute 2%, while taking only 2/3 of the time taken by the efficient Fast-MLLR system and only 20% of the time taken by the stand-alone GMM-UBM system.
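The scoring step can be sketched as below: enrolment and test SCVs are projected through LDA and WCCN matrices (assumed to have been estimated on a development set) and compared with a cosine score; the names and shapes are illustrative assumptions.

```python
import numpy as np

def cosine_score(enroll_scv, test_scv, lda, wccn):
    """Project Speaker Characterization Vectors through LDA and WCCN, then
    score the trial with cosine similarity."""
    e = wccn @ (lda @ enroll_scv)
    t = wccn @ (lda @ test_scv)
    return float(e @ t / (np.linalg.norm(e) * np.linalg.norm(t) + 1e-12))

def identify(test_scv, enrolled, lda, wccn):
    """Pick the enrolled speaker with the highest cosine score; no GMM-UBM
    back-end is needed."""
    return max(enrolled, key=lambda spk: cosine_score(enrolled[spk], test_scv, lda, wccn))
```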
National Conference on Communications | 2014
Neethu Mariam Joy; Basil Abraham; K Navneeth; S. Umesh
Cross-lingual acoustic modeling using the Subspace Gaussian Mixture Model (SGMM) for low-resource languages of Indian origin is investigated. The focus is on building an acoustic model for a low-resource language with limited vocabulary by leveraging resources from another language with comparatively larger resources. Experiments were done on the Bengali and Tamil corpora from the MANDI database, with Tamil having greater resources than Bengali. We observed that the word accuracy of the cross-lingual acoustic model for Bengali was approximately 2.5% above that of its CDHMM model and equivalent to that of its monolingual SGMM model.
International Conference on Acoustics, Speech, and Signal Processing | 2011
Achintya Kumar Sarkar; S. Umesh
Recently, Multiple Background Models (M-BMs) [1, 2] have been shown to be useful in speaker verification, where the M-BMs are formed based on different Vocal Tract Lengths (VTLs) among the population. The speaker models are adapted from the particular Background Model (BM) corresponding to their VTL. During test, the log-likelihood ratio of the test utterance is calculated between the claimant model and the corresponding BM. In this paper, instead of using a different BM for each speaker, we propose the use of a single gender-, channel- and VTL-independent UBM (root-UBM) using the concept of a VTL-dependent mapping function. The proposed concept is inspired by the Feature Mapping (FM) technique used in speaker verification to overcome channel variability. In our proposed method, VTL-specific, gender-independent Gaussian Mixture Models (GMMs) are derived from the root-UBM using Maximum a Posteriori (MAP) adaptation. The mapping relation is then learned between the root-UBM and the VTL-specific GMM. During the training and testing phases, feature vectors are mapped into the root-UBM space using the best VTL-specific model. Speaker models are then adapted from the root-UBM using the mapped features. During test, the log-likelihood ratio is calculated between the target model and the root-UBM. Therefore, unlike the M-BM system, there is no need to switch between different BMs depending on the claimant. Another advantage of the proposed method is that other additional normalization/compensation techniques can be easily applied, since everything is in a single-UBM framework. The experiments are performed on the NIST 2004 SRE core condition, and we show that the performance of the proposed method is close to that of the M-BM system, both with and without score normalization.
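A minimal sketch of the feature-mapping step is shown below, assuming diagonal-covariance GMMs whose components correspond one-to-one with the root-UBM (as they do after MAP adaptation); the data layout and names are illustrative assumptions.

```python
import numpy as np

def feature_map(frames, vtl_gmm, root_gmm):
    """Feature-mapping sketch: for each frame, find the top-scoring Gaussian in
    the VTL-specific GMM and shift/scale the frame to the corresponding
    root-UBM Gaussian, so all further modelling happens in a single root-UBM
    space. Each GMM is a dict with 'weight' (M,), 'mean' (M, D), 'var' (M, D)."""
    mapped = np.empty_like(frames, dtype=float)
    for i, x in enumerate(frames):
        # per-component log-likelihood under the VTL-specific model
        ll = -0.5 * np.sum((x - vtl_gmm["mean"]) ** 2 / vtl_gmm["var"]
                           + np.log(2 * np.pi * vtl_gmm["var"]), axis=1) \
             + np.log(vtl_gmm["weight"])
        g = int(np.argmax(ll))
        scale = np.sqrt(root_gmm["var"][g] / vtl_gmm["var"][g])
        mapped[i] = (x - vtl_gmm["mean"][g]) * scale + root_gmm["mean"][g]
    return mapped
```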
National Conference on Communications | 2017
Seeram Tejaswi; S. Umesh
In this paper, we investigate various training methods for building deep neural network (DNN) based acoustic models for dysarthric speech data. Methods such as multitask learning, knowledge distillation and model adaptation, which overcome data sparsity and model over-fitting problems, are employed to study the merits of each method. In the knowledge distillation framework, privileged information, available only during training in addition to feature-label pairs, is exploited to help the model learn better without using such privileged information during testing [1]; knowledge from one model can be distilled to another, thereby guiding it to learn better [2]. In this work, a DNN acoustic model trained using data pooled from dysarthric speech data and parallel unimpaired data is used as the intelligent teacher, while the student DNN model is trained using only dysarthric speech. The target label for training the student model is a combination of hard aligned labels and those obtained from a forward pass through the teacher model. In addition to this technique, other knowledge sharing techniques like multitask learning were explored for dysarthric speech data and were found to show a relative improvement of 11% over the corresponding baseline models.
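A minimal sketch of such a distillation loss, combining hard aligned labels with temperature-softened teacher posteriors, is given below; the weighting `alpha` and temperature `T` are illustrative values, not the settings used in the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, alpha=0.5, T=2.0):
    """Knowledge-distillation sketch: the student (trained only on dysarthric
    speech) is supervised with a mix of hard aligned labels and soft targets
    from the teacher (trained on pooled dysarthric + unimpaired data)."""
    hard = F.cross_entropy(student_logits, hard_labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```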
Speech Communication | 2017
Basil Abraham; S. Umesh
Recent studies have shown that in the case of under-resourced languages, the use of articulatory features (AF) emerging from an articulatory model results in improved automatic speech recognition (ASR) compared to conventional mel frequency cepstral coefficient (MFCC) features. Articulatory features are more robust to noise and pronunciation variability than conventional acoustic features. One method to extract articulatory features is to take conventional acoustic features like MFCC and build an articulatory classifier that outputs articulatory features (known as pseudo-AF). However, these classifiers require a mapping from phone to different articulatory labels (AL) (e.g., place of articulation and manner of articulation), which is not readily available for many of the under-resourced languages. In this article, we have proposed an automated technique to generate a phone-to-articulatory label (phone-to-AL) mapping for a new target language based on the knowledge of the phone-to-AL mapping of a well-resourced language. The proposed mapping technique is based on the center-phone capturing property of interpolation vectors emerging from the recently proposed phone cluster adaptive training (Phone-CAT) method. Phone-CAT is an acoustic modeling technique that belongs to the broad category of canonical state models (CSM), which includes the subspace Gaussian mixture model (SGMM). In Phone-CAT, the interpolation vector belonging to a particular context-dependent state has maximum weight for the center-phone in the case of monophone clusters, or for the AL of the center-phone in the case of AL clusters. These relationships from the various context-dependent states are used to generate a phone-to-AL mapping. The Phone-CAT technique makes use of all the speech data belonging to a particular context-dependent state. Therefore, multiple segments of speech are used to generate the mapping, which makes it more robust to noise and other variations. In this study, we have obtained a phone-to-AL mapping for three under-resourced Indian languages, namely Assamese, Hindi and Tamil, based on the phone-to-AL mapping available for English. With the generated mappings, articulatory features are extracted for these languages using varying amounts of data in order to build an articulatory classifier. Experiments were also performed in a cross-lingual scenario assuming a small training data set (approximately 2 hours) from each of the Indian languages, with articulatory classifiers built using a large amount of training data (approximately 22 hours) from other languages including English (Switchboard task). Interestingly, cross-lingual performance is comparable to that of an articulatory classifier built with large amounts of native training data. Using articulatory features, more than 30% relative improvement was observed over the conventional MFCC features for all three languages in a DNN framework.
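The mapping-generation step can be sketched as a simple voting scheme: each context-dependent state votes for the articulatory-label cluster with the largest interpolation weight, and each phone takes the majority label among the states for which it is the center-phone. The variable names below are illustrative, not from the paper.

```python
from collections import Counter, defaultdict
import numpy as np

def phone_to_al_mapping(state_center_phone, state_interp_vectors, al_labels):
    """Sketch of deriving a phone-to-AL mapping from Phone-CAT interpolation
    vectors trained with articulatory-label clusters.
    state_center_phone: dict state -> center phone;
    state_interp_vectors: dict state -> interpolation vector over AL clusters;
    al_labels: list mapping cluster index -> articulatory label."""
    votes = defaultdict(Counter)
    for state, phone in state_center_phone.items():
        al = al_labels[int(np.argmax(state_interp_vectors[state]))]
        votes[phone][al] += 1
    return {phone: counts.most_common(1)[0][0] for phone, counts in votes.items()}
```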
Conference of the International Speech Communication Association | 2016
Basil Abraham; S. Umesh; Neethu Mariam Joy
In this paper, we propose two techniques to improve the acoustic model of a low-resource language: (i) Pooling data from closely related languages using a phoneme mapping algorithm to build acoustic models such as the subspace Gaussian mixture model (SGMM), phone cluster adaptive training (Phone-CAT), deep neural network (DNN) and convolutional neural network (CNN); using the low-resource language data, we then adapt the aforementioned models towards that language. (ii) Using models built from high-resource languages, we first borrow subspace model parameters from SGMM/Phone-CAT, or hidden layers from DNN/CNN; the language-specific parameters are then estimated using the low-resource language data. The experiments were performed on four Indian languages, namely Assamese, Bengali, Hindi and Tamil. Relative improvements of 10 to 30% were obtained over the corresponding monolingual models in each case.
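The layer-borrowing idea in (ii) can be sketched as below for the DNN case: hidden layers trained on a high-resource language are reused, and a new language-specific output layer is estimated on the low-resource data. The toy model class, layer sizes and the choice to freeze the borrowed layers are illustrative assumptions.

```python
import torch.nn as nn

class SenoneDNN(nn.Module):
    """Toy DNN acoustic model: shared hidden layers + language-specific output layer."""
    def __init__(self, feat_dim=440, hidden=1024, num_senones=3000):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
        )
        self.output = nn.Linear(hidden, num_senones)

    def forward(self, x):
        return self.output(self.hidden(x))

def borrow_hidden_layers(source_dnn, num_target_senones):
    """Reuse hidden layers trained on a high-resource language; attach a new
    output layer for the low-resource language and estimate only the
    language-specific parameters on the low-resource data."""
    target = SenoneDNN(num_senones=num_target_senones)
    target.hidden = source_dnn.hidden        # borrowed hidden layers
    for p in target.hidden.parameters():
        p.requires_grad = False              # keep borrowed layers fixed (one option)
    return target
```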
Conference of the International Speech Communication Association | 2016
Neethu Mariam Joy; Murali Karthick Baskar; S. Umesh; Basil Abraham
In this paper, we propose the use of deep neural networks (DNN) as a regression model to estimate feature-space maximum likelihood linear regression (FMLLR) features from unnormalized features. During training, pairs of unnormalized features as input and corresponding FMLLR features as target are provided, and the network is optimized to reduce the mean-square error between the output and target FMLLR features. During test, the unnormalized features are passed through this DNN feature extractor to obtain FMLLR-like features without any supervision or first-pass decode. Further, the FMLLR-like features are generated frame by frame, requiring no explicit adaptation data to extract the features, unlike FMLLR or i-vectors. Our proposed approach is therefore suitable for scenarios where there is little adaptation data. The proposed approach provides sizable improvements over basis-FMLLR and conventional FMLLR when normalization is done at the utterance level on the TIMIT and Switchboard 33-hour data sets.
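A minimal PyTorch sketch of such a feature mapper, a feed-forward network trained with an MSE loss to regress FMLLR features from unnormalized features frame by frame, is given below; the layer sizes and training details are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FMLLRRegressor(nn.Module):
    """Feed-forward DNN that maps unnormalized features to FMLLR-like features."""
    def __init__(self, feat_dim=40, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

def train_step(model, optimizer, unnorm_batch, fmllr_batch):
    """One training step: minimize MSE between predicted and target FMLLR
    features; at test time a single forward pass yields FMLLR-like features
    with no first-pass decode or adaptation data."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(unnorm_batch), fmllr_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```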