S. R. M. Prasanna
Indian Institute of Technology Guwahati
Publications
Featured research published by S. R. M. Prasanna.
IEEE Signal Processing Letters | 2007
K.S. Rao; S. R. M. Prasanna; B. Yegnanarayana
This letter proposes a time-efficient method for determining the instants of significant excitation in speech signals. These instants correspond to the instants of glottal closure (epochs) in the case of voiced speech, and to random excitations, such as the onset of a burst, in the case of nonvoiced speech. The proposed method consists of two phases: the first determines the approximate epoch locations using the Hilbert envelope of the linear prediction (LP) residual of the speech signal; the second determines the accurate locations of the instants of significant excitation by computing the group delay around the approximate epoch locations derived from the first phase. The accuracy in determining the instants of significant excitation and the time complexity of the proposed method are compared with those of the group-delay-based approach.
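As a rough illustration of the first phase, the Hilbert envelope of the LP residual can be computed as below. This is a minimal sketch using the autocorrelation method of LP analysis; the function names and the LP order are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import hilbert

def lp_residual(x, order=10):
    # Autocorrelation-method LP analysis: solve the normal equations
    # R a = r for the predictor coefficients a.
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    # Prediction error (residual): x[n] - sum_k a_k x[n-k]
    pred = np.convolve(x, np.concatenate(([0.0], a)))[:len(x)]
    return x - pred

def hilbert_envelope(residual):
    # The Hilbert envelope emphasizes impulse-like excitation instants.
    return np.abs(hilbert(residual))
```

Peaks of this envelope give approximate epoch locations, around which the group delay is then computed in the second phase.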
IEEE Transactions on Speech and Audio Processing | 2005
B. Yegnanarayana; S. R. M. Prasanna; Ramani Duraiswami; Dmitry N. Zotkin
In this paper, we present a method for extracting the time-delay between speech signals collected at two microphone locations. Time-delay estimation from microphone outputs is the first step for many sound localization algorithms, and also for enhancement of speech. For time-delay estimation, speech signals are normally processed using short-time spectral information (magnitude, phase, or both). These spectral features are affected by degradations in speech caused by noise and reverberation. Features corresponding to the excitation source of the speech production mechanism are robust to such degradations. We show that these source features can be extracted reliably from the speech signal. The time-delay estimate can be obtained using features extracted even from short segments (50-100 ms) of speech from a pair of microphones. The proposed method for time-delay estimation is found to perform better than the generalized cross-correlation (GCC) approach. A method for enhancement of speech is also proposed using the knowledge of the time-delay and the excitation source information.
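For reference, the GCC baseline against which the method is compared can be sketched with the common PHAT weighting. This is a standard textbook formulation, not code from the paper, whose proposed method instead uses excitation-source features.

```python
import numpy as np

def gcc_phat(x, y, fs):
    # Estimate the delay of y relative to x (in seconds) via the
    # generalized cross-correlation with phase transform (GCC-PHAT).
    n = len(x) + len(y)
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    R = np.conj(X) * Y
    R /= np.abs(R) + 1e-12                  # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```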
IEEE Transactions on Audio, Speech, and Language Processing | 2011
S. R. M. Prasanna; Gayadhar Pradhan
Vowel-like regions (VLRs) in speech include vowel, semivowel, and diphthong sound units. A VLR can be identified using the vowel-like region onset point (VLROP) event. By production, a VLR has impulse-like excitation, and therefore information about the vocal tract system may be better manifested in it. The VLR is also a relatively high signal-to-noise ratio (SNR) region. Speaker information extracted from such a region may therefore be more speaker-discriminative and relatively less affected by degradations such as noise, reverberation, and sensor mismatch. As a result, better speaker modeling and more reliable testing may be possible. In this paper, VLRs are detected using the knowledge of VLROPs during training and testing. Features from the VLRs are then used for training and testing the speaker models. A significant improvement in performance is reported for speaker verification under degraded conditions.
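The paper detects VLRs from VLROP events derived from the excitation source. As a crude stand-in that exploits the same high-energy, high-SNR property of VLRs, one can threshold frame-wise short-time energy; all names and thresholds below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def high_energy_mask(x, fs, frame_ms=20, hop_ms=10, ratio=0.3):
    # Frame-wise short-time energy; frames above a fraction of the
    # peak energy are flagged as candidate vowel-like regions.
    flen = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - flen) // hop
    energy = np.array([np.sum(x[i * hop:i * hop + flen] ** 2)
                       for i in range(n_frames)])
    return energy > ratio * energy.max()
```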
IEEE Transactions on Speech and Audio Processing | 2005
Vikas C. Raykar; B. Yegnanarayana; S. R. M. Prasanna; Ramani Duraiswami
This paper presents the results of simulation and real-room studies for localization of a moving speaker using information about the excitation source of speech production. The first step in localization is the estimation of time-delay from speech collected by a pair of microphones. Methods for time-delay estimation generally use spectral features that correspond mostly to the shape of the vocal tract during speech production. Spectral features are affected by degradations due to noise and reverberation. This paper proposes a method for localizing a speaker using features that arise from the excitation source during speech production. Experiments were conducted by simulating different noise and reverberation conditions to compare the performance of time-delay estimation and source localization using the proposed method with the results obtained using the spectrum-based generalized cross-correlation (GCC) methods. The results show that the proposed method yields fewer discrepancies in the estimated time-delays. The bias, variance, and root mean square error (RMSE) of the proposed method are consistently equal to or less than those of the GCC methods. The locations of a moving speaker estimated using the time-delays obtained by the proposed method are closer to the actual values than those obtained by the GCC method.
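Once a time-delay is available (from either method), the far-field direction of arrival for a two-microphone pair follows from simple geometry. This is a standard textbook relation, not code from the paper; the microphone spacing and speed of sound below are example values.

```python
import numpy as np

def doa_from_tdoa(tau, mic_spacing, c=343.0):
    # Far-field model: tau = mic_spacing * sin(theta) / c, where theta
    # is the source angle measured from broadside. Clip to guard against
    # |c * tau| slightly exceeding mic_spacing due to estimation noise.
    s = np.clip(c * tau / mic_spacing, -1.0, 1.0)
    return np.degrees(np.arcsin(s))
```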
IEEE Transactions on Audio, Speech, and Language Processing | 2009
P. Krishnamoorthy; S. R. M. Prasanna
This paper presents an approach for the enhancement of reverberant speech by temporal and spectral processing. Temporal processing involves identification and enhancement of high signal-to-reverberation ratio (SRR) regions in the temporal domain. Spectral processing involves removal of late reverberant components in the spectral domain. First, spectral subtraction-based processing is performed to eliminate the late reverberant components, and then the spectrally processed speech is further subjected to excitation source information-based temporal processing to enhance the high-SRR regions. The objective measures segmental SRR and log spectral distance are computed for different cases, namely, reverberant, spectrally processed, temporally processed, and combined temporally and spectrally processed speech signals. The quality of the speech processed by the combined temporal and spectral processing is significantly enhanced compared to the reverberant speech, as well as to the signals processed by the individual temporal and spectral processing methods.
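The spectral stage can be caricatured with classic magnitude spectral subtraction. The sketch below estimates the subtracted term from the leading frames rather than from a late-reverberation model, so it only illustrates the mechanics; the parameters are arbitrary assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs, nperseg=256, lead_frames=10,
                         alpha=1.0, floor=0.02):
    # STFT analysis, magnitude subtraction, then overlap-add synthesis
    # reusing the noisy phase.
    _, _, X = stft(x, fs, nperseg=nperseg)
    mag, phase = np.abs(X), np.angle(X)
    est = mag[:, :lead_frames].mean(axis=1, keepdims=True)
    clean = np.maximum(mag - alpha * est, floor * mag)  # spectral floor
    _, y = istft(clean * np.exp(1j * phase), fs, nperseg=nperseg)
    return y
```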
IEEE Transactions on Audio, Speech, and Language Processing | 2013
Gayadhar Pradhan; S. R. M. Prasanna
This work proposes methods for detecting vowel-like regions (VLRs) and non-vowel-like regions (non-VLRs) using excitation source information. The VLR onset and end points are hypothesized and used in an iterative algorithm for detecting the VLRs. Next, for detection of non-VLRs, the linear prediction (LP) residual samples in the VLRs are attenuated significantly to indirectly emphasize the residual samples in the non-VLRs. The modified LP residual then excites the time-varying all-pole filter to reconstruct non-VLR-enhanced speech, which is used for detecting the non-VLRs. The VLRs and non-VLRs are used independently during training and testing of a speaker verification (SV) system to reduce gross-level mismatch due to sound units and to achieve better compensation of degradation effects by applying different normalizations to these two energy regions. Finally, the scores are combined with higher weight on the VLRs, which are more speaker-specific. Experiments verify that the proposed approach provides improved performance for clean and degraded speech. On the NIST-2003 speaker recognition database, using VLRs and non-VLRs improves the equal error rate from 6.63% to 6% and from 2.29% to 1.89% for a GMM-UBM based and an i-vector based SV system, respectively.
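The attenuate-and-resynthesize step can be sketched as follows. For simplicity this uses a single fixed all-pole filter for the whole signal rather than the paper's time-varying filter, and the function name and parameters are made up for illustration.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def attenuate_region(x, start, end, order=10, gain=0.1):
    # LP analysis (autocorrelation method): solve R a = r.
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    A = np.concatenate(([1.0], -a))      # inverse filter 1 - sum a_k z^-k
    residual = lfilter(A, [1.0], x)      # LP residual
    residual[start:end] *= gain          # attenuate the chosen region
    return lfilter([1.0], A, residual)   # resynthesize via all-pole filter
```

With `gain=1.0` the analysis/synthesis pair is an exact identity, which makes the round trip easy to check.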
Speech Communication | 2011
P. Krishnamoorthy; S. R. M. Prasanna
This paper presents a method for enhancing noisy speech that combines linear prediction (LP) residual weighting in the time domain with spectral processing in the frequency domain, providing both better noise suppression and better enhancement of the speech regions. The noisy speech is initially processed by excitation source (LP residual) based temporal processing, which involves identifying and enhancing the excitation source based speech-specific features present at the gross and fine temporal levels. The gross-level features are identified by estimating the following speech parameters: the sum of the peaks in the discrete Fourier transform (DFT) spectrum, the smoothed Hilbert envelope of the LP residual, and the modulation spectrum values, all from the noisy speech signal. The fine-level features are identified using the knowledge of the instants of significant excitation. A weight function is derived from the gross and fine weight functions to obtain the temporally processed speech signal. The temporally processed speech is further subjected to spectral-domain processing. Spectral processing involves estimation and removal of degrading components, and also identification and enhancement of speech-specific spectral components. The proposed method is evaluated using different objective and subjective quality measures. The quality measures show that the proposed combined temporal and spectral processing method provides better enhancement than either temporal or spectral processing alone.
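The gross/fine combination amounts to forming one final weight function. A minimal sketch follows; the floor value and normalization are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def combine_weights(w_gross, w_fine, floor=0.1):
    # Product of the gross- and fine-level weights, normalized to [0, 1]
    # and floored so low-evidence regions are attenuated, not zeroed out.
    w = w_gross * w_fine
    w = w / (w.max() + 1e-12)
    return floor + (1.0 - floor) * w
```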
national conference on communications | 2011
B C Haris; Gayadhar Pradhan; A. Misra; Sumitra Shukla; Rohit Sinha; S. R. M. Prasanna
In this paper, we present our initial study with the recently collected speech database for developing robust speaker recognition systems in the Indian context. The database contains speech data collected from 200 speakers across different sensors, languages, speaking styles, and environments. The speech data is collected across five different sensors in parallel, in English and multiple Indian languages, in reading and conversational speaking styles, and in office and uncontrolled environments such as laboratories, hostel rooms, and corridors. The collected database is evaluated using an adapted Gaussian mixture model based speaker verification system following the NIST 2003 speaker recognition evaluation protocol, and gives performance comparable to that obtained using NIST data sets. Our initial study exploring the impact of mismatch between training and test conditions finds that mismatches in sensor, speaking style, and environment result in significant degradation in performance compared to the matched case, whereas the degradation for language mismatch is relatively small.
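Performance in such evaluations is typically reported as the equal error rate (EER). A simple threshold-sweep computation is sketched below; this is illustrative, not the NIST scoring tool.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    # Sweep a decision threshold over all observed scores; the EER is
    # the point where false-accept and false-reject rates cross.
    thresholds = np.sort(np.concatenate((target_scores, impostor_scores)))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0
```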
national conference on communications | 2014
Subhadeep Dey; Sujit Barman; Ramesh K. Bhukya; Rohan Kumar Das; B C Haris; S. R. M. Prasanna; Rohit Sinha
In this paper we present the development and implementation of a speech biometric based attendance system. Users access the system by making a call from a few pre-designated mobile phones. An interactive voice response (IVR) system guides a new user through enrollment and an enrolled user through verification. The system uses text-independent speaker verification with MFCC features and i-vector based speaker modeling for authenticating the user. Linear discriminant analysis and within-class covariance normalization are used to normalize the effects of session/environment variations. Simple cosine distance scoring along with score normalization is used as the classifier, and a fixed threshold is used for making the decision. The developed system has been used by a group of 110 students on a regular basis for about two months. The system's recognition rate is found to be 94.2%, and its average response time for test data of 50 seconds duration is about 26 seconds.
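The cosine distance classifier at the core of such a system is straightforward. A sketch follows; the threshold value here is an arbitrary placeholder, and score normalization is omitted.

```python
import numpy as np

def cosine_score(w_test, w_enroll):
    # Cosine similarity between test and enrollment i-vectors.
    num = float(np.dot(w_test, w_enroll))
    den = float(np.linalg.norm(w_test) * np.linalg.norm(w_enroll)) + 1e-12
    return num / den

def verify(w_test, w_enroll, threshold=0.5):
    # Accept the identity claim when the score clears a fixed threshold.
    return cosine_score(w_test, w_enroll) >= threshold
```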
International Journal of Speech Technology | 2013
D. Govind; S. R. M. Prasanna
The objective of the present work is to provide a detailed review of expressive speech synthesis (ESS). Among the various approaches to ESS, the present paper focuses on the development of ESS systems by explicit control. In this approach, ESS is achieved by modifying the parameters of the neutral speech synthesized from text. The paper reviews works addressing various issues in the development of ESS systems by explicit control, covering approaches to text-to-speech synthesis, studies on the analysis and estimation of expressive parameters, and methods for incorporating those parameters. Finally, the review concludes by outlining the scope for future work on ESS by explicit control.