B. Yegnanarayana
International Institute of Information Technology, Hyderabad
Publications
Featured research published by B. Yegnanarayana.
IEEE Signal Processing Letters | 2006
K.S.R. Murty; B. Yegnanarayana
The objective of this letter is to demonstrate the complementary nature of speaker-specific information present in the residual phase in comparison with the information present in the conventional mel-frequency cepstral coefficients (MFCCs). The residual phase is derived from the speech signal by linear prediction analysis. Speaker recognition studies are conducted on the NIST-2003 database using the proposed residual phase and the existing MFCC features. The speaker recognition system based on the residual phase gives an equal error rate (EER) of 22%, and the system using the MFCC features gives an EER of 14%. By combining the evidence from both the residual phase and the MFCC features, an EER of 10.5% is obtained, indicating that speaker-specific excitation information is present in the residual phase. This information is useful because it is complementary to that of the MFCCs.
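A minimal sketch of the two ingredients the abstract names, the LP residual and its phase: LP coefficients from the autocorrelation method, inverse filtering to obtain the residual, and the cosine of the analytic-signal phase as the residual phase. The frame length, LP order, and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.signal import hilbert, lfilter

def lp_coefficients(frame, order=10):
    """LP analysis: autocorrelation method + Levinson-Durbin recursion."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12                      # guard against silent frames
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] += k * a[i - 1::-1]      # Levinson update; a[i] becomes k
        err *= 1.0 - k * k
    return a                                # inverse-filter A(z) coefficients

def residual_phase(frame, order=10):
    """Residual phase: cosine of the analytic-signal phase of the LP residual."""
    a = lp_coefficients(frame, order)
    residual = lfilter(a, [1.0], frame)     # inverse filtering with A(z)
    return np.cos(np.angle(hilbert(residual)))
```

Score-level combination of the residual-phase and MFCC systems could then be a simple weighted sum of the two classifiers' scores; the weight would have to be tuned on development data, since the abstract does not report one.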
IEEE Transactions on Audio, Speech, and Language Processing | 2008
K.S.R. Murty; B. Yegnanarayana
Epoch is the instant of significant excitation of the vocal-tract system during the production of speech. For most voiced speech, the most significant excitation takes place around the instant of glottal closure. Extraction of epochs from speech is a challenging task due to the time-varying characteristics of the source and the system. Most epoch extraction methods attempt to remove the characteristics of the vocal-tract system in order to emphasize the excitation characteristics in the residual. The performance of such methods depends critically on our ability to model the system. In this paper, we propose a method for epoch extraction that does not depend critically on the characteristics of the time-varying vocal-tract system. The method exploits the impulse-like nature of the excitation. The output of the proposed zero resonance frequency filter brings out the epoch locations with high accuracy and reliability. The performance of the method is demonstrated on the CMU-Arctic database, using epoch information from the electroglottograph as the reference. The proposed method performs significantly better than other methods currently available for epoch extraction. An interesting aspect of the results is that epoch extraction by the proposed method appears robust against degradations such as white noise, babble, high-frequency channel, and vehicle noise.
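A hedged sketch of zero-frequency filtering as the abstract outlines it: difference the signal, pass it twice through a resonator with poles at 0 Hz, and remove the resulting polynomial trend by repeated local-mean subtraction; epochs then appear as positive zero crossings. The window length (ideally one to two average pitch periods) and the number of mean-removal passes are assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def zff_epochs(s, fs, win_ms=10.0, passes=3):
    x = np.diff(s, prepend=s[0])                     # remove DC offset
    y = lfilter([1.0], [1.0, -2.0, 1.0], x)          # zero-frequency resonator
    y = lfilter([1.0], [1.0, -2.0, 1.0], y)          # ... cascaded twice
    n = int(win_ms * 1e-3 * fs) | 1                  # odd-length mean window
    w = np.ones(n) / n
    for _ in range(passes):                          # repeated trend removal
        y = y - np.convolve(y, w, mode="same")
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0]  # positive zero crossings
```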
IEEE Transactions on Acoustics, Speech, and Signal Processing | 1979
Tirupattur Ananthapadmanabha; B. Yegnanarayana
In voiced speech analysis, epochal information is useful for accurate estimation of pitch periods and of the frequency response of the vocal tract system. Ideally, the linear prediction (LP) residual should give impulses at epochs. However, there are often ambiguities in the direct use of the LP residual, since samples of either polarity occur around epochs. Further, since the digital inverse filter does not compensate for the phase response of the vocal tract system exactly, there is an uncertainty in the estimated epoch position. In this paper we present an interpretation of the LP residual by considering the effect of the following factors: 1) the shape of the glottal pulses, 2) inaccurate estimation of formants and bandwidths, 3) the phase angles of formants at the instants of excitation, and 4) zeros in the vocal tract system. A method for the unambiguous identification of epochs from the LP residual is then presented. The accuracy of the method is tested by comparing the results with the epochs obtained from estimated glottal pulse shapes for several vowel segments. The method is used to identify the closed glottis interval for the estimation of the true frequency response of the vocal tract system.
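The polarity ambiguity mentioned above is often side-stepped by taking the Hilbert envelope of the LP residual, which peaks at epochs regardless of sample polarity. The sketch below shows that common remedy; it is not necessarily the exact identification procedure of this paper, and the pitch bound is an assumption.

```python
import numpy as np
from scipy.signal import hilbert, find_peaks

def epoch_candidates(residual, fs, max_pitch_hz=400.0):
    env = np.abs(hilbert(residual))       # polarity-independent envelope
    min_gap = int(fs / max_pitch_hz)      # at most one epoch per pitch period
    peaks, _ = find_peaks(env, distance=min_gap)
    return peaks
```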
IEEE Transactions on Speech and Audio Processing | 2000
B. Yegnanarayana; P.S. Murthy
We propose a new method of processing speech degraded by reverberation. The method is based on analysis of short (2 ms) segments of data to enhance the regions in the speech signal having a high signal-to-reverberant component ratio (SRR). The short segment analysis shows that SRR is different in different segments of speech. The processing method involves identifying and manipulating the linear prediction residual signal in three different regions of the speech signal, namely, high SRR region, low SRR region, and only reverberation component region. A weight function is derived to modify the linear prediction residual signal. The weighted residual signal samples are used to excite a time-varying all-pole filter to obtain perceptually enhanced speech. The method is robust to noise present in the recorded speech signal. The performance is illustrated through spectrograms, subjective and objective evaluations.
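A much-simplified sketch of the weighting idea: derive a frame-wise weight from normalized short-time residual energy (a crude stand-in for the paper's SRR-based weight function), scale the residual, and resynthesize through the all-pole filter. The 2 ms segment length follows the abstract; the weight floor is an assumption.

```python
import numpy as np
from scipy.signal import lfilter

def enhance(residual, lpc_a, fs, seg_ms=2.0, floor=0.2):
    """`lpc_a` holds inverse-filter coefficients [1, a1, ..., ap]."""
    n = max(1, int(seg_ms * 1e-3 * fs))
    energy = np.convolve(residual ** 2, np.ones(n) / n, mode="same")
    w = energy / (energy.max() + 1e-12)         # high where excitation is strong
    w = floor + (1.0 - floor) * w               # floor avoids audible gaps
    return lfilter([1.0], lpc_a, residual * w)  # excite the all-pole filter
```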
IEEE Transactions on Speech and Audio Processing | 1995
Roel Smits; B. Yegnanarayana
A new method for determining the instants of significant excitation in speech signals is proposed. In the paper, significant excitation refers primarily to the instant of glottal closure within a pitch period of voiced speech. The method is based on the global phase characteristics of minimum phase signals. The average slope of the unwrapped phase of the short-time Fourier transform of the linear prediction residual is calculated as a function of time. Instants where the phase slope function makes a positive zero-crossing are identified as significant excitations. The method is discussed in a source-filter context of speech production and is not sensitive to the characteristics of the filter. The influence of the type, length, and position of the analysis window is discussed. The method works well for all types of voiced speech, in male as well as female voices, but in all cases under noise-free conditions only.
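A sketch of the phase slope function as the abstract describes it: for each window position, fit the slope of the unwrapped STFT phase of the residual and record it at the window center; positive zero crossings mark candidate epochs. The window length and hop are assumptions.

```python
import numpy as np

def phase_slope_function(residual, fs, win_ms=4.0, hop=1):
    n = int(win_ms * 1e-3 * fs)
    win = np.hanning(n)
    psf = np.zeros(len(residual))
    for start in range(0, len(residual) - n, hop):
        seg = residual[start:start + n] * win
        phase = np.unwrap(np.angle(np.fft.rfft(seg)))
        # least-squares slope of unwrapped phase vs. frequency bin
        psf[start + n // 2] = np.polyfit(np.arange(len(phase)), phase, 1)[0]
    return psf

def positive_zero_crossings(psf):
    return np.where((psf[:-1] < 0) & (psf[1:] >= 0))[0]
```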
Lecture Notes in Computer Science | 2003
Amit A. Kale; Naresh P. Cuntoor; B. Yegnanarayana; A. N. Rajagopalan; Rama Chellappa
Human gait is an attractive modality for recognizing people at a distance. In this paper we adopt an appearance-based approach to the problem of gait recognition. The width of the outer contour of the binarized silhouette of a walking person is chosen as the basic image feature. Different gait features are extracted from the width vector, such as the downsampled and smoothed width vectors and the velocity profile, and sequences of such temporally ordered feature vectors are used to represent a person's gait. We use the dynamic time-warping (DTW) approach for matching, so that non-linear time normalization may be used to deal with the naturally occurring changes in walking speed. The performance of the proposed method is tested on different gait databases.
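A minimal DTW matcher for such sequences of width vectors, with a Euclidean local cost and a length-normalized path cost; the normalization choice is an assumption.

```python
import numpy as np

def dtw_distance(A, B):
    """A, B: (frames x dims) arrays of gait feature vectors."""
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)              # normalize for sequence length

# Recognition: assign a probe sequence to the gallery subject with the
# smallest DTW distance.
```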
Speech Communication | 1995
M. Narendranath; Hema A. Murthy; S. Rajendran; B. Yegnanarayana
In this paper we propose a scheme for developing a voice conversion system that converts the speech signal uttered by a source speaker to a speech signal having the voice characteristics of the target speaker. In particular, we address the issue of transformation of the vocal tract system features from one speaker to another. Formants are used to represent the vocal tract system features and a formant vocoder is used for synthesis. The scheme consists of a formant analysis phase, followed by a learning phase in which the implicit formant transformation is captured by a neural network. The transformed formants together with the pitch contour modified to suit the average pitch of the target speaker are used to synthesize speech with the desired vocal tract system characteristics.
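A toy sketch of the learning phase: a small feed-forward network mapping time-aligned source-speaker formant frames to target-speaker formant frames. The synthetic data, network size, and normalization are assumptions; a real system would train on parallel utterances and drive a formant vocoder with the mapped formants and the rescaled pitch contour.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
src = rng.uniform(300.0, 3500.0, size=(500, 4))           # toy F1-F4 frames (Hz)
tgt = src * 1.1 + rng.normal(0.0, 20.0, size=src.shape)   # toy target formants

net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
net.fit(src / 4000.0, tgt / 4000.0)          # crude scaling into [0, 1]
mapped = net.predict(src / 4000.0) * 4000.0  # formants handed to the vocoder
```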
Neural Networks | 2002
B. Yegnanarayana; S.P. Kishore
The objective in any pattern recognition problem is to capture the characteristics common to each class from feature vectors of the training data. While Gaussian mixture models appear to be general enough to characterize the distribution of the given data, the model is constrained by the fact that the shape of the components of the distribution is assumed to be Gaussian, and the number of mixtures is fixed a priori. In this context, we investigate the potential of non-linear models such as autoassociative neural network (AANN) models, which perform identity mapping of the input space. We show that the training error surface realized by the neural network model in the feature space is useful for studying the characteristics of the distribution of the input data. We also propose a method of obtaining an error surface that matches the distribution of the given data. The distribution-capturing ability of AANN models is illustrated in the context of speaker verification.
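A toy autoassociative sketch: a bottleneck network trained for identity mapping, with the reconstruction error used as the score. The layer sizes and the error-to-score mapping are assumptions, not the paper's architecture.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 12))               # stand-in speaker feature vectors

aann = MLPRegressor(hidden_layer_sizes=(20, 4, 20),  # compression layer of 4
                    max_iter=3000, random_state=1)
aann.fit(X, X)                                # identity mapping: target == input

def score(frames):
    """Low reconstruction error -> frames fit the captured distribution."""
    err = np.mean((aann.predict(frames) - frames) ** 2, axis=1)
    return np.exp(-err)
```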
IEEE Transactions on Acoustics, Speech, and Signal Processing | 1984
B. Yegnanarayana; Dilip Kr. Saikia; T. R. Krishnan
In this paper we discuss the problem of signal reconstruction from spectral magnitude or phase using group delay functions. We define two separate group delay functions for a signal, one is derived from the magnitude and the other from the phase of the Fourier transform of the signal. The group delay functions offer insight into the problem of signal reconstruction and suggest methods for reconstructing signals from partial information such as spectral magnitude or phase. We examine the problem of signal reconstruction from spectral magnitude or phase on the basis of these two group delay functions and derive the conditions for signal reconstruction. Based on existing iterative and noniterative algorithms for signal reconstruction, we propose new algorithms for some special classes of signals. The algorithms are illustrated with several examples. Our study shows that the relative importance of spectral magnitude and phase depends on the nature of signals. Speech signals are used to illustrate the importance of spectral magnitude and picture signals are used to illustrate the importance of phase in signal reconstruction problems. Using the group delay functions, we explain the convergence behavior of the existing iterative algorithms for signal reconstruction.
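For concreteness, a sketch of the classic iterative magnitude-only reconstruction that the existing iterative algorithms mentioned in the abstract build on: alternate between the known spectral magnitude and a finite-support constraint in time. Convergence and uniqueness are not guaranteed (sign, shift, and time-reversal ambiguities remain), and the iteration count is an assumption.

```python
import numpy as np

def reconstruct_from_magnitude(mag, support_len, n_iter=500, seed=0):
    rng = np.random.default_rng(seed)
    X = mag * np.exp(1j * rng.uniform(-np.pi, np.pi, size=mag.shape))
    for _ in range(n_iter):
        x = np.fft.ifft(X).real
        x[support_len:] = 0.0                 # time-domain support constraint
        X = mag * np.exp(1j * np.angle(np.fft.fft(x)))  # re-impose magnitude
    return x[:support_len]
```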
IEEE Transactions on Audio, Speech, and Language Processing | 2006
K.S. Rao; B. Yegnanarayana
Prosody modification involves changing the pitch and duration of speech without affecting the message and naturalness. This paper proposes a method for prosody (pitch and duration) modification using the instants of significant excitation of the vocal tract system during the production of speech. The instants of significant excitation correspond to the instants of glottal closure (epochs) in the case of voiced speech, and to some random excitations, such as the onset of a burst, in the case of nonvoiced speech. Instants of significant excitation are computed from the linear prediction (LP) residual of speech signals by using the average group-delay property of minimum phase signals. The modification of pitch and duration is achieved by manipulating the LP residual with the help of the knowledge of the instants of significant excitation. The modified residual is used to excite the time-varying filter, whose parameters are derived from the original speech signal. The perceptual quality of the synthesized speech is good and free of significant distortion. The proposed method is evaluated using waveforms, spectrograms, and listening tests, and its performance is compared with the linear prediction pitch-synchronous overlap-and-add (LP-PSOLA) method, another method for prosody manipulation based on modification of the LP residual. The original and synthesized speech signals obtained by the proposed method and by the LP-PSOLA method are available for listening at http://speech.cs.iitm.ernet.in/Main/result/prosody.html.
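A deliberately simplified sketch of the residual-manipulation idea: rescale the epoch spacing by the pitch factor, overlap-add each one-period residual segment at its new location, and excite the all-pole filter. Duration drifts as a side effect here, whereas the paper treats duration modification separately; the helper below is an illustrative assumption, not the published algorithm.

```python
import numpy as np
from scipy.signal import lfilter

def modify_pitch(residual, epochs, lpc_a, factor):
    """epochs: ascending sample indices of instants of significant excitation;
    factor > 1 raises pitch. `lpc_a` = [1, a1, ..., ap] inverse-filter coeffs."""
    new_pos = [float(epochs[0])]
    for k in range(1, len(epochs)):
        new_pos.append(new_pos[-1] + (epochs[k] - epochs[k - 1]) / factor)
    new_pos = np.round(new_pos).astype(int)
    out = np.zeros(new_pos[-1] + (len(residual) - epochs[-1]) + 1)
    for k in range(len(epochs) - 1):
        seg = residual[epochs[k]:epochs[k + 1]]        # one pitch period
        end = min(new_pos[k] + len(seg), len(out))
        out[new_pos[k]:end] += seg[:end - new_pos[k]]  # overlap-add
    # A fixed filter is used here for brevity; a real implementation updates
    # the all-pole filter coefficients frame by frame (time-varying filter).
    return lfilter([1.0], lpc_a, out)
```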