Srinivasan Umesh
Indian Institute of Technology Kanpur
Publications
Featured research published by Srinivasan Umesh.
IEEE Transactions on Speech and Audio Processing | 1999
Srinivasan Umesh; Leon Cohen; Nenad M. Marinovic; Douglas J. Nelson
In this paper, we study the scale transform of the spectral envelope of speech utterances by different speakers. This study is motivated by the hypothesis that the formant frequencies of different speakers are approximately related by a scaling constant for a given vowel. The scale transform has the fundamental property that the magnitudes of the scale transform of a function X(f) and of its scaled version √α X(αf) are the same. The methods presented here are useful in reducing inter-speaker variations in acoustic features. We show that F-ratio tests indicate better separability of vowels using scale-transform based features than mel-transform based features. The data used in the comparison of the different features consist of 200 utterances of four vowels extracted from the TIMIT database.
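As a rough numerical illustration of the scale-invariance property, the scale (Mellin) transform of X(f) equals the Fourier transform of X(e^u) e^(u/2), so it can be approximated by resampling a spectrum uniformly in log-frequency and taking an FFT. The following Python sketch only checks the invariance on a toy envelope; it is not the paper's feature-extraction pipeline, and the grid sizes are arbitrary.

import numpy as np

def scale_transform_mag(X, f, n_u=2048):
    # D_X(c) is the Fourier transform of X(e^u) * e^(u/2): resample X
    # on an exponentially spaced grid, weight, and take an FFT.
    u = np.linspace(np.log(f[0]), np.log(f[-1]), n_u)
    Xu = np.interp(np.exp(u), f, X)
    return np.abs(np.fft.fft(Xu * np.exp(u / 2)))

# Check: the magnitudes for X(f) and sqrt(a) * X(a f) should agree.
f = np.linspace(0.05, 8.0, 4000)
X = np.exp(-((f - 1.5) / 0.4) ** 2)              # toy spectral envelope
a = 1.3
Xs = np.sqrt(a) * np.interp(a * f, f, X, right=0.0)
m1, m2 = scale_transform_mag(X, f), scale_transform_mag(Xs, f)
print(np.max(np.abs(m1 - m2)) / np.max(m1))      # small relative error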
IEEE Signal Processing Letters | 2002
Srinivasan Umesh; Leon Cohen; Douglas J. Nelson
We present experimental results that show that the scale-factor relating the formant frequencies of different speakers increases with decreasing values of formant frequency. Based on these results, we experimentally obtain a frequency warping function aimed at separating speaker dependencies from the inherent characterization of the sound. We find that the frequency warping function is similar to the Mel scale, and we believe that this is the first time that a Mel-like scale has been obtained using only speech. Our results and methods may therefore explain, from a speech point of view, the Mel scale, which was obtained historically from hearing-based experiments.
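One way to picture this relationship: if a mel-like warp is assumed to turn speaker differences into a fixed translation, then the scale factor implied in linear frequency must grow as the formant frequency falls. A small numerical sketch (the standard mel formula and the shift value are illustrative stand-ins for the paper's empirically derived warp):

import numpy as np

mel = lambda f: 2595 * np.log10(1 + f / 700)     # standard mel formula
imel = lambda m: 700 * (10 ** (m / 2595) - 1)    # its inverse

f = np.array([300.0, 800.0, 1500.0, 2500.0, 3500.0])  # formant-like Hz
shift = 150.0                         # assumed fixed translation in mel
alpha = imel(mel(f) + shift) / f      # implied linear-frequency scale factor
print(alpha)                          # decreases as f increases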
International Conference on Acoustics, Speech, and Signal Processing | 2002
Rohit Sinha; Srinivasan Umesh
We present experimental results that show better speaker normalization using our previously reported frequency warping function that is derived purely from speech data. In our previous work, we numerically computed the frequency warping function for non-uniform scaling, which is similar to the mel-scale, such that spectral envelopes from different speakers enunciating the same sound are similar except for a possible translation factor. In this paper, we perform a maximum-likelihood search for these translation parameters and show that this non-uniform normalization scheme provides about 18% improvement over the normalization method based on the maximum-likelihood estimate of uniform scaling parameters, and about 30% improvement over a mel filterbank cepstral coefficient baseline, for a telephone-based continuous digit recognition task. The other attractive attribute of the proposed method is the simplicity of generating features with different shifts, compared to generating features with different warping factors in earlier methods.
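Schematically, the maximum-likelihood search over translation parameters can be a plain grid search: generate (or transform) the features for each candidate shift and keep the shift that the acoustic model scores highest. A minimal sketch, where features_for_shift and log_likelihood are hypothetical stand-ins for the feature extraction and the recognizer's scoring:

import numpy as np

def ml_shift_search(utterance, shifts, features_for_shift, log_likelihood):
    # features_for_shift(utt, t): features with translation t applied in
    # the warped-frequency domain (hypothetical helper).
    # log_likelihood(F): acoustic-model log-likelihood of feature matrix
    # F, e.g. an HMM forward score (hypothetical helper).
    scores = [log_likelihood(features_for_shift(utterance, t)) for t in shifts]
    return shifts[int(np.argmax(scores))]

# e.g. best = ml_shift_search(utt, np.linspace(-0.2, 0.2, 9), feats, score)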
IEEE Journal of Oceanic Engineering | 1993
Donald W. Tufts; Hongya Ge; Srinivasan Umesh
A computationally efficient fast maximum-likelihood (FML) estimation scheme, which makes use of the shape of the surface of the compressed likelihood function (CLF), is proposed. The scheme uses only multiple one-dimensional searches oriented along appropriate ridges on the surface of the CLF. Simulations indicate that the performances of the proposed estimators match those of the corresponding maximum-likelihood estimators with very high probability. The approach is demonstrated by applying it to two different problems. The first problem involves the estimation of time of arrival and Doppler compression of a wideband hyperbolic frequency modulated (HFM) active sonar signal buried in reverberation. The second problem deals with estimating the frequencies of sinusoids. A threshold analysis of the proposed scheme is carried out to predict the signal-to-noise ratio (SNR) at which large estimation errors begin to occur, i.e., the threshold SNR, and its computational complexity is discussed.
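For the special case of a single sinusoid in white noise, the compressed likelihood function reduces to the periodogram, and the multiple 1-D search idea can be sketched as a coarse FFT peak search followed by a bounded one-dimensional refinement along the frequency axis. A minimal sketch of that pattern, not the paper's sonar implementation:

import numpy as np
from scipy.optimize import minimize_scalar

def clf_power(x, f):
    # Periodogram: the compressed likelihood for one sinusoid in noise.
    n = np.arange(len(x))
    return np.abs(np.sum(x * np.exp(-2j * np.pi * f * n))) ** 2

def fml_frequency(x):
    # Coarse stage: an FFT peak locates the ridge of the CLF.
    k = np.argmax(np.abs(np.fft.fft(x)))
    df = 1.0 / len(x)
    # Fine stage: a 1-D bounded search along the ridge near the peak.
    res = minimize_scalar(lambda f: -clf_power(x, f),
                          bounds=((k - 1) * df, (k + 1) * df),
                          method="bounded")
    return res.x  # frequency in cycles/sample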
International Conference on Acoustics, Speech, and Signal Processing | 1996
Srinivasan Umesh; Douglas J. Nelson
We propose a computationally efficient method for estimating the frequency of a single complex sinusoid at low SNR. This method is motivated by the cross-power spectrum method of Nelson (1993) and the weighted phase averager (WPA) methods of Tretter (1985), Kay (1988), and Lovell et al. (1991). We demonstrate that, with simple preprocessing, we can significantly extend the threshold SNR of the WPA. Further, unlike the WPA, the proposed method can easily be extended to estimate the frequencies of multiple sinusoids that are well separated in frequency. We also derive the variance of the proposed estimator and provide simulation results comparing the proposed method with the WPA.
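For reference, the weighted phase averager of Kay (1988) estimates the frequency of a complex sinusoid from a weighted average of the phase differences between adjacent samples. A minimal sketch of that baseline estimator; the preprocessing proposed in the paper to extend the threshold SNR is not reproduced here:

import numpy as np

def kay_wpa_frequency(z):
    # Kay's parabolic weights sum to one over t = 0 .. N-2.
    N = len(z)
    t = np.arange(N - 1)
    w = 1.5 * N / (N**2 - 1) * (1 - ((t - (N / 2 - 1)) / (N / 2)) ** 2)
    # Weighted average of adjacent-sample phase differences.
    dphi = np.angle(z[1:] * np.conj(z[:-1]))
    return np.sum(w * dphi) / (2 * np.pi)   # cycles/sample

# Usage: a noisy complex exponential at f = 0.12 cycles/sample.
rng = np.random.default_rng(0)
n = np.arange(256)
z = (np.exp(2j * np.pi * 0.12 * n)
     + 0.1 * (rng.standard_normal(256) + 1j * rng.standard_normal(256)))
print(kay_wpa_frequency(z))                 # close to 0.12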
International Conference on Spoken Language Processing | 1996
Srinivasan Umesh; Leon Cohen; Nenad M. Marinovic; Douglas J. Nelson
We present results that indicate that the formant frequencies of different speakers scale differently at different frequencies. Based on our experiments on speech data, we then numerically compute a universal frequency-warping function that makes the scale-factor independent of frequency in the warped domain. The proposed warping function is found to be similar to the mel-scale, which has previously been derived from purely psycho-acoustic experiments. The motivation for the present experiments stems from our proposed use of scale-transform based cepstral coefficients (Umesh et al., 1996) as acoustic features, since they provide better separability of vowels than mel-cepstral coefficients.
International Conference on Acoustics, Speech, and Signal Processing | 1997
Srinivasan Umesh; Leon Cohen; Douglas J. Nelson
We have proposed the use of scale-cepstral coefficients as features in speech recognition. We have developed a corresponding frequency-warping function such that, in the warped domain, the formant envelopes of different speakers are approximately translated versions of one another for any given vowel. These methods were motivated by a desire to achieve speaker normalization. In this paper, we point out very interesting parallels between the various steps in computing the scale-cepstrum and those observed in computing features based on physiological models of the auditory system or on psychoacoustic experiments. A better understanding of the need for these various signal-processing steps may therefore be useful and may lead to the development of more robust recognizers.
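The chain of steps described here can be pictured as: magnitude spectrum, resampling on a warped frequency axis, log compression, then a Fourier-type transform whose magnitude is insensitive to speaker-dependent translations along the warped axis. A minimal sketch, with a mel-like log warp standing in for the derived warping function (the warp constant, grid sizes, and coefficient count are illustrative assumptions):

import numpy as np

def scale_cepstral_features(frame, sr, n_warp=128, n_ceps=13, b=700.0):
    spec = np.abs(np.fft.rfft(frame))                 # envelope proxy
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    g = np.log1p(freqs / b)                           # mel-like warp g(f)
    warped = np.interp(np.linspace(g[0], g[-1], n_warp), g, spec)
    # FFT magnitude along the warped axis: a speaker-dependent
    # translation there changes only the phase, not the magnitude.
    return np.abs(np.fft.rfft(np.log(warped + 1e-10)))[:n_ceps]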
IEEE Transactions on Audio, Speech, and Language Processing | 2007
Srinivasan Umesh; Rohit Sinha
In this paper, we study the effect of filter bank smoothing on the recognition performance of children's speech. Filter bank smoothing of spectra is done during the computation of the mel filter bank cepstral coefficients (MFCCs). We study the effect of smoothing both for the case when there is vocal-tract length normalization (VTLN) as well as for the case when there is no VTLN. The results from our experiments indicate that, unlike the conventional VTLN implementation, it is better not to scale the bandwidths of the filters during VTLN; only the filter center frequencies need be scaled. Our interpretation of this result is that while the formant center frequencies may approximately scale between speakers, the formant bandwidths do not change significantly. Therefore, the scaling of filter bandwidths by a warp-factor during conventional VTLN results in differences in spectral smoothing, leading to degradation in recognition performance. Similarly, results from our experiments indicate that for telephone-based speech, when there is no normalization, it is better to use uniform-bandwidth filters instead of the constant-Q-like filters that are used in the computation of conventional MFCCs. Our interpretation is that with constant-Q filters there is excessive spectral smoothing at higher frequencies, which leads to degradation in performance for children's speech. However, the use of constant-Q filters during VTLN does not create any additional performance degradation. As we show, during VTLN it is only important that the filter bandwidths are not scaled, irrespective of whether we use constant-Q or uniform-bandwidth filters. With our proposed changes in the filter bank implementation, we get comparable performance for adults and about 6% improvement for children, both with and without VTLN, on a telephone-based digit recognition task.
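The proposed modification can be made concrete with a triangular filter bank in which the warp factor moves only the center frequencies, while each filter keeps its unwarped bandwidth. A minimal sketch with mel-spaced centers; the placement details and edge handling are illustrative assumptions:

import numpy as np

def vtln_filterbank(n_filt, n_fft, sr, alpha=1.0):
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    edges = imel(np.linspace(0.0, mel(sr / 2), n_filt + 2))
    left, centers, right = edges[:-2], edges[1:-1], edges[2:]
    fb = np.zeros((n_filt, len(freqs)))
    for i in range(n_filt):
        # Scale the center frequency by alpha, but translate the filter
        # edges with it so the bandwidth stays fixed (the key change).
        shift = centers[i] * (alpha - 1.0)
        lo, c, hi = left[i] + shift, centers[i] + shift, right[i] + shift
        rising = (freqs - lo) / (c - lo)
        falling = (hi - freqs) / (hi - c)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb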
Speech Communication | 2008
Rohit Sinha; Srinivasan Umesh
In this work, we present a speaker-normalization method based on the idea that the speaker-dependent scale-factor can be separated out as a fixed translation factor in an alternate domain. We also introduce a non-linear frequency-scaling model motivated by the analysis of speech data. The proposed shift-based normalization approach is implemented using a maximum-likelihood (ML) search for the translation factor in the alternate domain. The advantage of our approach is that we are able to show the relationship between conventional frequency-warping based vocal-tract length normalization (VTLN) methods and methods based on shifts in the psycho-acoustic scale, thus providing a unifying framework for speaker normalization. Additionally, in our approach it is simple to show that the shifting required for normalization can be expressed as a linear transformation in the cepstral domain. This is important for computational efficiency, since we do not have to recompute the features by re-doing the signal processing for each scale/translation factor, as is usually done in conventional normalization. We present recognition results using our proposed approach on a digit recognition task and show that the non-linear scaling model provides a relative improvement of 4% for adults and 7.5% for children when compared to the linear-scaling model.
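The linear-transformation property can be sketched directly: a translation of the log-spectrum in the warped domain is itself a linear (interpolation) map, so conjugating it with the truncated DCT yields a small matrix that shifts the features without redoing the signal processing. A minimal sketch; the matrix sizes and the interpolation scheme are illustrative assumptions:

import numpy as np
from scipy.fft import dct
from scipy.linalg import pinv

def cepstral_shift_matrix(n_bins, n_ceps, delta):
    # S: translate the warped log-spectrum by delta bins using linear
    # interpolation, clamping at the edges.
    grid = np.clip(np.arange(n_bins) + delta, 0, n_bins - 1)
    lo = np.floor(grid).astype(int)
    hi = np.minimum(lo + 1, n_bins - 1)
    frac = grid - lo
    S = np.zeros((n_bins, n_bins))
    S[np.arange(n_bins), lo] += 1.0 - frac
    S[np.arange(n_bins), hi] += frac
    # C: truncated DCT analysis matrix; pinv(C) maps cepstra back to a
    # smoothed log-spectrum, so A = C S pinv(C) acts on cepstra.
    C = dct(np.eye(n_bins), norm="ortho", axis=0)[:n_ceps]
    return C @ S @ pinv(C)

# Usage: c_shifted = cepstral_shift_matrix(64, 13, delta=1.7) @ c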
International Conference on Acoustics, Speech, and Signal Processing | 2002
Srinivasan Umesh; S. V. Bharath Kumar; M. K. Vinay; Rajesh Sharma; Rohit Sinha
In this paper, we present results of non-uniform vowel normalization and show that the frequency-warping necessary for non-uniform vowel normalization is similar to the mel-scale. We compare our methods to Fant's non-uniform vowel normalization method and show that with the proposed frequency-warping approach we can achieve similar performance without any knowledge of the spoken vowel and the formant number. The proposed approach is motivated by a desire to perform non-uniform speaker normalization in automatic speech recognition systems. We also present results of a more comprehensive study of our earlier work on non-uniform scaling, which again shows that the mel-scale is the appropriate warping function. All the results in this paper are based on data from the Peterson & Barney and Hillenbrand et al. vowel databases.