Hemant A. Patil
Dhirubhai Ambani Institute of Information and Communication Technology
Publication
Featured research published by Hemant A. Patil.
international conference on acoustics, speech, and signal processing | 2014
Nirmesh J. Shah; Bhavik Vachhani; Hardik B. Sailor; Hemant A. Patil
In this paper, we use a Viterbi-based algorithm and a spectral transition measure (STM)-based algorithm for speech data labeling. Within the STM framework, we propose using several spectral features for phonetic segmentation: the recently proposed cochlear filter cepstral coefficients (CFCC), perceptual linear prediction cepstral coefficients (PLPCC), and RelAtive SpecTrAl (RASTA)-based PLPCC, in addition to Mel frequency cepstral coefficients (MFCC). Evaluating these segmentation algorithms directly requires accurate, manually labeled phoneme-level data, which is not available for low-resource languages such as Gujarati (one of the official languages of India). To measure the effectiveness of the segmentation algorithms, an HMM-based speech synthesis system (HTS) for Gujarati has therefore been built. Subjective and objective evaluations show that the Viterbi-based and the STM with PLPCC-based segmentation algorithms work better than the other algorithms.
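A minimal sketch of an STM contour may help make the idea concrete. It assumes the commonly cited regression-slope form of the spectral transition measure (the abstract does not give the formula); the function name, the `half_width` parameter, and the edge padding are illustrative choices, not the authors' implementation. Peaks in the returned contour would be candidate segment boundaries.

```python
import numpy as np

def spectral_transition_measure(cepstra, half_width=2):
    """Return one STM value per frame.

    cepstra: array of shape (num_frames, num_coeffs), e.g. MFCC or PLPCC.
    half_width: assumed regression half-window, in frames.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    num_frames = cepstra.shape[0]
    offsets = np.arange(-half_width, half_width + 1)
    denom = np.sum(offsets ** 2)

    # Repeat the edge frames so every frame has a full regression window.
    padded = np.pad(cepstra, ((half_width, half_width), (0, 0)), mode="edge")

    # Least-squares slope of each coefficient trajectory around each frame.
    slopes = np.zeros_like(cepstra)
    for k, n in enumerate(offsets):
        slopes += n * padded[k:k + num_frames]
    slopes /= denom

    # STM is the mean squared slope across coefficients.
    return np.mean(slopes ** 2, axis=1)
```

For a feature sequence that is constant and then jumps to a new value, the contour is zero in the steady regions and peaks at the jump.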
international symposium on chinese spoken language processing | 2014
Anshu Chittora; Hemant A. Patil
In this paper, a feature derived from the modulation spectrogram is proposed for the classification of pathological infant cries. Two pathologies are considered, viz., asthma and hypoxic ischemic encephalopathy (HIE). The modulation spectrogram features are arranged in the form of a tensor, whose dimensions are then reduced using higher-order singular value decomposition (HOSVD). The reduced feature set is used to classify pathological infant cries using a support vector machine (SVM) classifier with a radial basis function (RBF) kernel. The classifier gives a mean classification accuracy of 76.23 % with the proposed feature set. With the same experimental setup, the conventional feature set, i.e., Mel frequency cepstral coefficients (MFCC), gives a classification accuracy of 64.43 %. It is also observed that the proposed approach is robust against signal degradation conditions.
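The tensor reduction step can be illustrated with a truncated HOSVD written in plain NumPy. This is a sketch under assumptions: the tensor layout, the target ranks, and all function names are hypothetical, and the abstract does not say how the authors select the retained dimensions.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize `tensor` so that axis `mode` indexes the rows."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply `tensor` along axis `mode` by `matrix` (new_dim x old_dim)."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(out, 0, mode)

def truncated_hosvd(tensor, ranks):
    """Return (core, factors), keeping `ranks[m]` left singular vectors per mode."""
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)  # project onto the retained subspace
    return core, factors

def reconstruct(core, factors):
    """Map a core tensor back to the original space."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_product(out, u, mode)
    return out
```

With full ranks the decomposition reconstructs the input exactly; with smaller ranks the core tensor is a compressed representation whose entries could be flattened into a feature vector for a classifier.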
international conference on asian language processing | 2014
Purushotam G. Radadia; Hemant A. Patil
Singer IDentification (SID) is a challenging problem in Music Information Retrieval (MIR) systems. Instrumental accompaniment, the quality of the recording apparatus, and other singing voices (in chorus) make SID a difficult research problem. In this paper, we propose an SID system on a large database of 500 Hindi (Bollywood) songs using state-of-the-art Mel Frequency Cepstral Coefficients (MFCC) and Cepstral Mean Subtracted (CMS) features. We compare the performance of a 3rd order polynomial classifier and a Gaussian Mixture Model (GMM). With the 3rd order polynomial classifier, we achieved SID accuracy of 78 % and 89.5 % (and Equal Error Rate (EER) of 6.75 % and 6.42 %) for MFCC and CMSMFCC, respectively. Furthermore, score-level fusion of MFCC and CMSMFCC reduced EER by 0.95 % compared with MFCC alone. On the other hand, the GMM gave SID accuracy of 70.75 % for both MFCC and CMSMFCC. Finally, we found that CMS-based features are effective in alleviating the album effect in the SID problem.
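Cepstral mean subtraction itself is a one-line operation, sketched below. A fixed convolutional channel (for example, a recording chain shared by all tracks of an album) shows up as an additive offset in the cepstral domain, so removing the per-recording mean cancels it. The function name and the choice to normalize over the whole recording are assumptions; the abstract does not specify the normalization window.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-coefficient mean over all frames of one recording.

    cepstra: array of shape (num_frames, num_coeffs), e.g. MFCC.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

Adding any constant vector to every frame leaves the output unchanged, which is the property that makes the features insensitive to a stationary channel.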
international symposium on chinese spoken language processing | 2014
Hardik B. Sailor; Hemant A. Patil
This paper analyzes distance-based objective measures, which are the commonly used objective measures, for the evaluation of Text-to-Speech (TTS) systems. We discuss some aspects of evaluating the quality of synthesized speech, including limitations and issues of subjective evaluation and the importance of objective measures. A traditional objective measure based on Dynamic Time Warping (DTW) distance is used in this work. We use magnitude- and phase-based features as well as auditory features to check the effectiveness of objective measures for predicting the quality of a TTS voice. In particular, Mel Frequency Cepstral Coefficients (MFCC) and phase-based Modified Group Delay-based Cepstral Coefficients (MGDCC) alone do not correlate well with subjective scores. However, feature-level fusion of MFCC and MGDCC gives better correlation than all other feature sets. With this fusion, we obtained correlation coefficients of -0.3 and -0.32 for the Blizzard Challenge 2010 and 2011 databases, respectively. The results also show the significance of phase-based features for objective measures when used along with magnitude-based features. The experimental results show that distance-based measures still do not work well with the Blizzard Challenge databases and that more general objective measures are needed for measuring the quality of a TTS voice.
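As an illustration of the distance-based measure, the following sketch computes a basic DTW distance between two feature sequences. The Euclidean local cost, the three-way step pattern, and the lack of path-length normalization are assumptions; the abstract does not state which DTW variant was used.

```python
import numpy as np

def dtw_distance(ref, syn):
    """Accumulated DTW cost between two sequences of feature vectors.

    ref, syn: arrays of shape (num_frames, num_dims), e.g. natural and
    synthesized speech features. Frame counts may differ.
    """
    ref = np.asarray(ref, dtype=float)
    syn = np.asarray(syn, dtype=float)
    n, m = len(ref), len(syn)

    # local[i, j] is the Euclidean distance between ref frame i and syn frame j.
    local = np.linalg.norm(ref[:, None, :] - syn[None, :, :], axis=2)

    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best_prev
    return acc[n, m]
```

A sequence and a time-stretched copy of it have zero distance, which is why DTW is suited to comparing utterances of different durations. A lower distance is meant to indicate closer agreement with natural speech, so a useful measure would correlate negatively with subjective scores.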
international conference on asian language processing | 2014
Mohammadi Zaki; Nirmesh J. Shah; Hemant A. Patil
Phonetic segmentation plays a key role in developing various speech applications. In this work, we use various features for automatic phonetic segmentation with forced Viterbi alignment and compare their effectiveness. We propose novel multiscale fractal dimension (FD)-based features concatenated with Mel-Frequency Cepstral Coefficients (MFCC). The novel features are expected to capture additional nonlinearities in speech production, which should improve segmentation performance. However, evaluating these segmentation algorithms directly requires accurate, manually labeled phoneme-level data, which is not available for low-resource languages such as Gujarati (one of the official languages of India). To measure the effectiveness of the segmentation algorithms, an HMM-based speech synthesis system (HTS) for Gujarati has therefore been built. Subjective and objective evaluations show that FD-based features for segmentation work moderately better than other state-of-the-art features such as MFCC, Perceptual Linear Prediction Cepstral Coefficients (PLP-CC), Cochlear Filter Cepstral Coefficients (CFCC), and RelAtive SpecTrAl (RASTA)-based PLP-CC. The Mean Opinion Score (MOS) and the Degraded-MOS, which are measures of naturalness, indicate an improvement of 9.69% with the proposed features over MFCC-based features (found to be the best among the other features).
international conference on asian language processing | 2014
Bhavik Vachhani; Kewal D. Malde; Maulik C. Madhavi; Hemant A. Patil
Obstruents are key landmark events in the speech signal. In this paper, we propose the use of the spectral transition measure (STM) to locate obstruents in continuous speech. The proposed approach does not take into account any prior information (such as the phonetic sequence, the speech transcription, or the number of obstruents in the speech). Hence, the approach is unsupervised and unconstrained. We use state-of-the-art Mel Frequency Cepstral Coefficients (MFCC)-based features to capture spectral transitions for the obstruent detection task, since more spectral transition is expected in the vicinity of obstruents. The entire experimental setup is developed on the TIMIT database. The detection efficiency and estimated probability are around 77 % and 0.77, respectively (with a 30 ms agreement duration and an STM threshold of 0.4).
international symposium on chinese spoken language processing | 2014
Mohammadi Zaki; Nirmesh J. Shah; Hemant A. Patil
We propose to use multiscale fractal dimension (MFD) as components of feature vectors for automatic speech recognition (ASR), especially in low-resource languages. Speech, which is known to be a nonlinear process, can be efficiently represented by extracting nonlinear properties, such as fractal dimension (FD), from the speech segment. During speech production, vortices (generated due to the presence of separated airflow) may travel along the vocal tract and excite vocal tract resonators at the epiglottis, velum, palate, teeth, lips, etc. By Kolmogorov's law, the gradient in energy levels between these vortices produces turbulence. This ruggedness and, in effect, the embedded features of different phoneme classes can be captured by the invariant property of FD. Furthermore, speech is multifractal, which justifies the use of multiscale fractal dimension as feature components for speech. In this paper, we describe the multifractal nature of the speech signal and use this property for the automatic phonetic segmentation task. The results show a significant decrease in % EER (≈ 4.2 % from traditional MFCC-based features and ≈ 2.5 % from MFCC appended with 1-D fractal dimension). The DET curves clearly show improved performance with the new multiscale fractal dimension-based features for the low-resource language under consideration.
international symposium on chinese spoken language processing | 2014
Nirmesh J. Shah; Hemant A. Patil; Maulik C. Madhavi; Hardik B. Sailor; Tanvina B. Patel
The generalized statistical framework of the Hidden Markov Model (HMM) has been successfully carried over from speech recognition to speech synthesis. In this work, we apply the HMM-based Speech Synthesis System (HTS) method to the Gujarati language, covering both adaptation and evaluation. The HTS system built using Gujarati data is evaluated in terms of naturalness and speech intelligibility. In addition, a conventional EM algorithm-based HTS and the recently proposed Deterministic Annealing EM (DAEM) algorithm-based HTS have been applied to Gujarati (a low-resource language) and compared. HTS in Gujarati is found to have very high intelligibility. An AB-test showed that the DAEM-based HTS was preferred over the EM-based HTS developed for Gujarati in 70.5 % of cases.
international symposium on chinese spoken language processing | 2014
Maulik C. Madhavi; Hemant A. Patil
In this paper, we address the voice biometrics problem using only the humming signal rather than normal speech. We use a new feature extraction technique that applies the Variable length Teager Energy Operator (VTEO) to the subband-filtered signals of a Mel filterbank. This feature modifies the structure of the state-of-the-art feature set, viz., Mel Frequency Cepstral Coefficients (MFCC). In particular, a new energy measure, viz., VTEO, is employed to compute the subband energies of the different time-domain subband signals. The features derived from MFCC that capture magnitude and phase spectrum information via VTEO are termed MFCC-VTMP. A discriminatively trained polynomial classifier with 2nd order approximation is used as the basis for all experiments. The MFCC-VTMP feature set is found to be better than MFCC across various evaluation factors such as the order of the polynomial classifier, the dimension of the feature vector, signal degradation conditions, and class separability. The % EER of MFCC and MFCC-VTMP is found to be 12.20% and 12.01%, respectively, using 2nd order polynomial classification.
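The energy operator at the core of this feature can be sketched in a few lines. The form below, x[n]^2 - x[n-k] x[n+k], is the commonly cited definition of a Teager energy operator with a variable dependency index k (k = 1 gives the classical operator); the abstract does not state the formula, so treat it as an assumption. The function name and the decision to drop the k edge samples on each side are illustrative.

```python
import numpy as np

def vteo(x, k=1):
    """Teager energy with dependency index k: x[n]**2 - x[n-k] * x[n+k].

    Returns len(x) - 2*k values; the k samples at each edge are dropped.
    """
    x = np.asarray(x, dtype=float)
    if k < 1 or len(x) <= 2 * k:
        raise ValueError("k must be >= 1 and the signal longer than 2*k samples")
    return x[k:-k] ** 2 - x[:-2 * k] * x[2 * k:]
```

For a pure tone A cos(ωn + φ) the output is the constant A² sin²(kω), so the operator responds to both amplitude and frequency. In a feature pipeline of the kind described above, each Mel subband signal would be passed through the operator before its energy is computed.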
international conference on asian language processing | 2014
Anshu Chittora; Hemant A. Patil
In this paper, features extracted from the modulation spectrogram are used to classify phonemes in the Gujarati language. The modulation spectrogram, which is a 2-dimensional (2-D) feature vector, is reduced to a smaller feature dimension using the proposed feature extraction method. A Gujarati database was manually segmented into 31 phoneme classes. These phonemes are then classified using a support vector machine (SVM) classifier. The phoneme classification accuracy is 94.5 %, compared with 92.74 % for the state-of-the-art feature set, Mel frequency cepstral coefficients (MFCC). Classification accuracy for broad phoneme classes, viz., vowels, stops, nasals, semivowels, affricates, and fricatives, is also determined. Classification of phonemes into their respective broad classes is 95.03 % correct with the proposed feature set. Fusion of MFCC with the proposed feature set performs even better, giving a phoneme classification accuracy of 95.7%. With the fusion of features, classification of phonemes into sonorant and obstruent classes is found to be 97.01 % accurate.