Hema A. Murthy
Indian Institute of Technology Madras
Publications
Featured research published by Hema A. Murthy.
Speech Communication | 1995
M. Narendranath; Hema A. Murthy; S. Rajendran; B. Yegnanarayana
In this paper we propose a scheme for developing a voice conversion system that converts the speech signal uttered by a source speaker to a speech signal having the voice characteristics of the target speaker. In particular, we address the issue of transformation of the vocal tract system features from one speaker to another. Formants are used to represent the vocal tract system features and a formant vocoder is used for synthesis. The scheme consists of a formant analysis phase, followed by a learning phase in which the implicit formant transformation is captured by a neural network. The transformed formants together with the pitch contour modified to suit the average pitch of the target speaker are used to synthesize speech with the desired vocal tract system characteristics.
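The learning phase described above can be illustrated with a toy numerical sketch: a small one-hidden-layer network trained to capture an assumed source-to-target formant mapping. The formant ranges, the mapping, and the network size below are all invented for illustration; the paper's actual formant analysis and network configuration are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical first-three-formant vectors (Hz) for a source speaker;
# a real system would obtain these from the formant analysis phase.
src = rng.uniform([300.0, 900.0, 2200.0], [800.0, 1800.0, 3000.0], (200, 3))
# Assumed smooth source-to-target transformation (invented for this toy).
tgt = src * np.array([1.15, 1.10, 1.05]) + 50.0

# Normalise both sides so plain gradient descent is well behaved.
mu_s, sd_s = src.mean(0), src.std(0)
mu_t, sd_t = tgt.mean(0), tgt.std(0)
Xn, Yn = (src - mu_s) / sd_s, (tgt - mu_t) / sd_t

# One-hidden-layer network (the "learning phase"): 3 formants in, 3 out.
W1 = rng.normal(0.0, 0.1, (3, 16)); b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.1, (16, 3)); b2 = np.zeros(3)

lr = 0.02
for _ in range(4000):
    H = np.tanh(Xn @ W1 + b1)          # hidden activations
    P = H @ W2 + b2                    # predicted (normalised) target formants
    G = 2.0 * (P - Yn) / len(Xn)       # dLoss/dP for mean squared error
    gW2, gb2 = H.T @ G, G.sum(0)
    GH = (G @ W2.T) * (1.0 - H ** 2)   # backprop through tanh
    gW1, gb1 = Xn.T @ GH, GH.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

# Transformed formants, mapped back to Hz.
pred = (np.tanh(Xn @ W1 + b1) @ W2 + b2) * sd_t + mu_t
err = np.abs(pred - tgt).mean()        # mean absolute formant error (Hz)
```

In the full system, the predicted formants would drive a formant vocoder together with the pitch contour adjusted to the target speaker's average pitch.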
international conference on acoustics, speech, and signal processing | 2003
Hema A. Murthy; Venkata Ramana Rao Gadde
We explore a new spectral representation of speech signals through group delay functions. Group delay functions by themselves are noisy and difficult to interpret, owing to zeros close to the unit circle in the z-domain that clutter the spectra. A modified group delay function (Yegnanarayana, B. and Murthy, H.A., IEEE Trans. Sig. Processing, vol.40, p.2281-9, 1992) that reduces the effects of zeros close to the unit circle is used. Assuming that this new function is minimum phase, the modified group delay spectrum is converted to a sequence of cepstral coefficients. A preliminary phoneme recogniser is built using features derived from these cepstra, and the results are compared with those obtained from features derived from the traditional mel frequency cepstral coefficients (MFCC). The baseline MFCC performance is 34.7%, while that of the best modified group delay cepstrum is 39.2%. The performance of the composite MFCC feature, which includes the derivatives and double derivatives, is 60.7%, while that of the composite modified group delay feature is 57.3%. When these two composite features are combined, a ~2% improvement in performance is achieved (62.8%). When this new system is combined with linear frequency cepstra (LFC) (Gadde, V.R.R. et al., The SRI SPINE 2001 Evaluation System, http://elazar.itd.nrl.navy.mil/spine/sri2/presentation/sri2001.html, 2001), the system performance improves by another ~0.8% (63.6%).
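The pipeline from frame to cepstral coefficients can be sketched as below: a modified group delay spectrum whose denominator is a cepstrally smoothed spectrum, followed by a DCT. The liftering length and the alpha/gamma exponents are illustrative placeholders, not the paper's tuned values, and the "vowel-like" test frame is synthetic.

```python
import numpy as np

def modified_group_delay(frame, n_fft=512, alpha=0.4, gamma=0.9, lifter=8):
    """Modified group delay spectrum of one windowed frame.

    alpha, gamma and the liftering length are illustrative, not the
    paper's tuned values."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)
    Y = np.fft.rfft(n * frame, n_fft)          # FT of n*x(n)
    mag = np.abs(X)
    # Cepstrally smoothed spectrum |S(w)|: suppresses the zeros close to
    # the unit circle that make the raw group delay spiky.
    c = np.fft.irfft(np.log(mag + 1e-10), n_fft)
    c[lifter:n_fft - lifter] = 0.0             # keep only low quefrencies
    S = np.exp(np.fft.rfft(c).real)
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2.0 * gamma) + 1e-10)
    return np.sign(tau) * np.abs(tau) ** alpha

# Synthetic "vowel-like" frame: two damped resonances.
fs = 8000
t = np.arange(400) / fs
frame = (np.exp(-40 * t) * np.sin(2 * np.pi * 700 * t)
         + 0.7 * np.exp(-60 * t) * np.sin(2 * np.pi * 1800 * t))
mgd = modified_group_delay(frame * np.hamming(len(frame)))

# Cepstral coefficients via a DCT-II of the modified group delay spectrum.
N = len(mgd)
basis = np.cos(np.pi * np.outer(np.arange(13), 2 * np.arange(N) + 1) / (2 * N))
modgd_cep = basis @ mgd
```

In a recogniser, `modgd_cep` would play the role the MFCC vector plays in a conventional front end.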
IEEE Transactions on Signal Processing | 1992
B. Yegnanarayana; Hema A. Murthy
A method of spectrum estimation using group delay functions is proposed. The method exploits the additive property of the Fourier transform (FT) phase to extract spectral information from the signal in the presence of noise. The phase is generally featureless due to random polarity and wrapping, but the group delay function can be processed to derive significant information, such as peaks in the spectral envelope. In the resulting spectral estimates, the resolution properties of the periodogram estimate are preserved while the variance is reduced. Variance caused by sidelobe leakage due to windows and by additive noise is significantly reduced, even in the spectral estimate obtained using a single realization of the observation process. Resolution is primarily dictated by the size of the data window, and the method works even at high noise levels. The results of this procedure are demonstrated through two illustrative examples: estimation of sinusoids in noise and estimation of a narrowband autoregressive process in noise.
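The group delay function at the heart of such methods can be computed without explicit phase unwrapping. The sketch below checks the standard identity tau(w) = Re{X(w) Y*(w)} / |X(w)|^2, where Y is the FT of n*x(n), against a numerical phase derivative on a synthetic damped resonance; the signal is invented for illustration and does not reproduce the paper's estimator.

```python
import numpy as np

# One frame: a single damped resonance (invented for illustration).
N = 1024
n = np.arange(256)
x = np.exp(-0.02 * n) * np.cos(2 * np.pi * 0.1 * n)

X = np.fft.fft(x, N)
Y = np.fft.fft(n * x, N)                     # FT of n*x(n)

# (a) Analytic group delay in samples: no phase unwrapping required.
tau_formula = (X * np.conj(Y)).real / np.abs(X) ** 2

# (b) Numerical group delay: negative first difference of the unwrapped
#     phase, scaled from radians-per-bin to samples.
phase = np.unwrap(np.angle(X))
tau_numeric = -np.diff(phase) * N / (2 * np.pi)

# The two agree up to finite-difference error.
err = np.median(np.abs(tau_formula[:-1] - tau_numeric))
```

The analytic form is what makes group delay processing practical: it sidesteps the wrapping that makes the raw FT phase featureless.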
IEEE Transactions on Audio, Speech, and Language Processing | 2007
Rajesh M. Hegde; Hema A. Murthy; Venkata Ramana Rao Gadde
Spectral representation of speech is complete only when both the Fourier transform magnitude and phase spectra are specified. In conventional speech recognition systems, features are generally derived from the short-time magnitude spectrum. Although the importance of the Fourier transform phase in speech perception has been realized, few attempts have been made to extract features from it, primarily because the resonances of the speech signal, which manifest as transitions in the phase spectrum, are completely masked by phase wrapping. Hence, an alternative to processing the Fourier transform phase for extracting speech features is to process the group delay function, which can be computed directly from the speech signal. The group delay function has been used in earlier efforts to extract pitch and formant information from the speech signal, but in those efforts no attempt was made to derive features for speech recognition applications, primarily because the group delay function fails to capture the short-time spectral structure of speech, owing to zeros close to the unit circle in the z-plane and to pitch periodicity effects. In this paper, the group delay function is modified to overcome these effects. Cepstral features extracted from the modified group delay function are called the modified group delay feature (MODGDF). The MODGDF is used for three speech recognition tasks, namely speaker, language, and continuous-speech recognition. Based on the results of feature and performance evaluation, the significance of the MODGDF as a new feature for speech recognition is discussed.
Speech Communication | 2004
V. Kamakshi Prasad; T. Nagarajan; Hema A. Murthy
In this paper, we present a new algorithm to automatically segment a continuous speech signal into syllable-like segments. The algorithm is based on processing the short-term energy function of the continuous speech signal. The short-term energy function is a positive function and can therefore be processed in a manner similar to the magnitude spectrum; we employ an algorithm based on group delay processing of the magnitude spectrum to determine segment boundaries in the speech signal. Experiments have been carried out on the TIMIT and TIDIGITS databases. The error in segment boundary location is ≤ 20% of the syllable duration for 70% of the syllables. In addition to true segments, an overall 5% rate of insertions and deletions has also been observed.
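The core idea, treating the positive short-term energy contour as if it were a magnitude spectrum and reading boundaries off a minimum-phase group delay function, can be sketched as follows. The three-syllable energy contour and the inversion step (so that energy valleys become spectral peaks) are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def minphase_group_delay(env):
    """Group delay of the minimum-phase signal whose magnitude spectrum
    is the inverted energy contour; energy valleys (candidate syllable
    boundaries) show up as positive peaks."""
    m = 1.0 / (env + 1e-8)                    # invert: valleys -> peaks
    spec = np.concatenate([m, m[::-1]])       # symmetric "magnitude spectrum"
    N = len(spec)
    c = np.fft.ifft(np.log(spec)).real        # real cepstrum
    cmin = np.zeros(N)                        # complex cepstrum of the
    cmin[0] = c[0]                            # minimum-phase equivalent
    cmin[1:N // 2] = 2.0 * c[1:N // 2]
    cmin[N // 2] = c[N // 2]
    # Group delay of a minimum-phase signal: tau(w) = Re{FT(n * cmin)}.
    tau = np.fft.fft(np.arange(N) * cmin).real
    return tau[:N // 2]

# Synthetic energy contour: three "syllables" with valleys near
# indices 100 and 200 (entirely invented for illustration).
t = np.linspace(0, 3, 300)
env = (np.exp(-(t - 0.5) ** 2 / 0.08) + np.exp(-(t - 1.5) ** 2 / 0.08)
       + np.exp(-(t - 2.5) ** 2 / 0.08)) + 0.01
tau = minphase_group_delay(env)

# Boundary estimates: the strongest group delay peak in each search window.
b1 = 50 + int(np.argmax(tau[50:150]))
b2 = 150 + int(np.argmax(tau[150:250]))
```

A real segmenter would peak-pick over the whole contour and convert frame indices back to time.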
Signal Processing | 1991
Hema A. Murthy; B. Yegnanarayana
In this paper we demonstrate the feasibility of processing the Fourier transform (FT) phase of a speech signal to derive the smooth log magnitude spectrum corresponding to the vocal tract system. We exploit the additive property of the group delay function (negative derivative of the FT phase) to process the FT phase. We show that the rapid fluctuations in the log magnitude spectrum and the group delay function are caused by the zeroes of the z-transform of the excitation components of the speech signal. Zeroes close to the unit circle in the z-plane produce large amplitude spikes in the group delay function and mask the group delay information corresponding to the vocal tract system. We propose a technique to extract the vocal tract system component of the group delay function by using the spectral properties of the excitation signal.
international conference on acoustics, speech, and signal processing | 2004
T. Nagarajan; Hema A. Murthy
Automatic spoken language identification (LID) is the task of identifying the language from a short utterance of the speech signal. The most successful approach to LID uses phone recognizers of several languages in parallel, but the basic requirement for building such a parallel phone recognition (PPR) system is annotated corpora. A novel approach is proposed for the LID task that uses parallel syllable-like unit recognizers in a framework similar to the PPR approach in the literature. The difference is that unsupervised syllable models are built from the training data: the data is first segmented into syllable-like units, and the syllable segments are then clustered using an incremental approach, resulting in a set of syllable models for each language. Our initial results on the OGI MLTS corpus show a performance of 69.5%. We further show that if only a subset of syllable models that are unique (in some sense) is considered, the performance improves to 75.9%.
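The incremental clustering step can be sketched in a much-simplified form. The paper clusters variable-length syllable segments into models; the fixed-length feature vectors, Euclidean distance, and single threshold below are simplifying assumptions for illustration only.

```python
import numpy as np

def incremental_cluster(segments, threshold):
    """Incremental clustering sketch: each segment (a fixed-length
    feature vector here) joins the nearest existing cluster if close
    enough, otherwise seeds a new cluster."""
    centroids, counts = [], []
    for s in segments:
        if centroids:
            d = [np.linalg.norm(s - c) for c in centroids]
            j = int(np.argmin(d))
            if d[j] < threshold:
                counts[j] += 1
                # Running-mean centroid update.
                centroids[j] += (s - centroids[j]) / counts[j]
                continue
        centroids.append(s.astype(float).copy())
        counts.append(1)
    return centroids, counts

rng = np.random.default_rng(1)
# Toy data: three well-separated "syllable classes" in feature space.
data = np.concatenate([rng.normal(m, 0.1, (50, 4)) for m in (0.0, 3.0, 6.0)])
rng.shuffle(data)
cents, counts = incremental_cluster(data, threshold=1.0)
```

Each resulting cluster would seed one syllable model per language in the parallel-recognizer framework.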
international conference on acoustics, speech, and signal processing | 2004
Rajesh M. Hegde; Hema A. Murthy; Gadde V. Ramana Rao
In this paper, we explore new methods by which speakers can be identified and discriminated using features derived from the Fourier transform phase. The modified group delay feature (MODGDF), a parameterized form of the modified group delay function, is used as the front-end feature in this study. A Gaussian mixture model (GMM) based speaker identification system is built with the MODGDF as the front-end feature and tested on both clean (TIMIT) and noisy telephone (NTIMIT) speech. The results are compared with those of traditional mel frequency cepstral coefficients (MFCC), which are derived from the Fourier transform magnitude. When MFCC and MODGDF were combined, the performance improved by about 4%, indicating that phase and magnitude contain complementary information. In an earlier paper (Murthy et al., 2003), it was shown that the MODGDF possesses phoneme-specific characteristics; in this paper we show that it also has speaker-specific properties. We also attempt to understand the speaker-discriminating characteristics of the MODGDF using a nonlinear mapping technique based on Sammon mapping (Sammon, 1969) and find that the MODGDF empirically demonstrates a certain level of linear separability among speakers.
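The likelihood-based identification scheme can be sketched with a stripped-down stand-in: one diagonal Gaussian per speaker instead of a full GMM, trained on invented MODGDF-like feature vectors and scored by average log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(2)

def train(feats):
    """Diagonal-Gaussian speaker model (a one-component stand-in for
    the paper's GMM)."""
    return feats.mean(0), feats.var(0) + 1e-6

def loglik(x, model):
    """Per-frame diagonal-Gaussian log-likelihood."""
    mu, var = model
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var,
                         axis=-1)

# Hypothetical 13-dimensional MODGDF-like vectors for four speakers.
speaker_means = [rng.normal(0, 1, 13) for _ in range(4)]
train_feats = [m + rng.normal(0, 0.3, (200, 13)) for m in speaker_means]
models = [train(f) for f in train_feats]

# Identify held-out frames from speaker 2 by average log-likelihood.
test_frames = speaker_means[2] + rng.normal(0, 0.3, (50, 13))
scores = [loglik(test_frames, m).mean() for m in models]
predicted = int(np.argmax(scores))
```

Score-level fusion with an MFCC stream, as in the paper's combined system, would simply add the two per-speaker scores before the argmax.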
EURASIP Journal on Advances in Signal Processing | 2004
T. Nagarajan; Hema A. Murthy
In the development of a syllable-centric automatic speech recognition (ASR) system, segmentation of the acoustic signal into syllabic units is an important stage. Although the short-term energy (STE) function contains useful information about syllable segment boundaries, it has to be processed before segment boundaries can be extracted. This paper presents a subband-based group delay approach to segment spontaneous speech into syllable-like units. This technique exploits the additive property of the Fourier transform phase and the deconvolution property of the cepstrum to smooth the STE function of the speech signal and make it suitable for syllable boundary detection. By treating the STE function as a magnitude spectrum of an arbitrary signal, a minimum-phase group delay function is derived. This group delay function is found to be a better representative of the STE function for syllable boundary detection. Although the group delay function derived from the STE function of the speech signal contains segment boundaries, the boundaries are difficult to determine in the context of long silences, semivowels, and fricatives. In this paper, these issues are specifically addressed and algorithms are developed to improve the segmentation performance. The speech signal is first passed through a bank of three filters, corresponding to three different spectral bands. The STE functions of these signals are computed. Using these three STE functions, three minimum-phase group delay functions are derived. By combining the evidence derived from these group delay functions, the syllable boundaries are detected. Further, a multiresolution-based technique is presented to overcome the problem of shift in segment boundaries during smoothing. Experiments carried out on the Switchboard and OGI-MLTS corpora show that the error in segmentation is at most 25 milliseconds for 67% and 76.6% of the syllable segments, respectively.
Journal of New Music Research | 2014
Preeti Rao; Joe Cheri Ross; Kaustuv Kanti Ganguli; Vedhas Pandit; Vignesh Ishwar; Ashwin Bellur; Hema A. Murthy
Ragas are characterized by their melodic motifs, or catch phrases, that constitute strong cues to the raga identity for both the performer and the listener, and are therefore of great interest in music retrieval and automatic transcription. While the characteristic phrases, or pakads, appear in written notation as a sequence of notes, the musicological rules for interpreting a phrase in performance, in a manner that allows considerable creative expression while not transgressing raga grammar, are not explicitly defined. In this work, machine learning methods are applied to labelled databases of Hindustani and Carnatic vocal concert audio to classify phrases on manually segmented audio. Dynamic time warping and HMM-based classification are applied to time series of detected pitch values that serve as the melodic representation of a phrase. Retrieval experiments on raga-characteristic phrases show promising results while providing interesting insights into the nature of variation in the surface realization of raga-characteristic motifs within and across concerts.
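The dynamic time warping comparison of pitch contours can be sketched minimally. The contours below are invented (a rising motif at two tempi versus a falling one); a real system would use detected pitch tracks and tuned local path constraints.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D pitch
    contours (in cents), with unit-cost steps; a minimal sketch of the
    template matching used for phrase classification."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy contours: the same rising "motif" at two tempi vs. a falling one.
t = np.linspace(0, 1, 60)
motif = 1200 * t                          # rising one octave (in cents)
same_slower = 1200 * np.linspace(0, 1, 90)
different = 1200 * (1 - t)                # falling

d_same = dtw_distance(motif, same_slower)
d_diff = dtw_distance(motif, different)
```

Because DTW absorbs tempo variation, the slower rendition of the same motif scores far closer than the contour of a different shape, which is what makes it suitable for matching raga-characteristic phrases across performances.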