Maulik C. Madhavi
Dhirubhai Ambani Institute of Information and Communication Technology
Publication
Featured researches published by Maulik C. Madhavi.
international conference on asian language processing | 2012
Hemant A. Patil; Maulik C. Madhavi; Kewal D. Malde; Bhavik Vachhani
This paper addresses phonetic transcription related issues in Gujarati and Marathi (Indian languages). Ad hoc approaches to fixing the relationship between general alphabetical symbols and phonetic symbols may not always work. Hence, research issues such as the ambiguity between frication and aspirated plosives are addressed in this paper. The anusvara in both of these languages is produced based on the immediately following consonant; the implication of this finding for the problem of phonetic transcription is presented. Furthermore, the effect of dialectal variations on phonetic transcription is also analyzed for Marathi. Finally, some examples of phonetic transcription for sentences of these two languages are presented.
international conference on biometrics | 2012
Hemant A. Patil; Maulik C. Madhavi
In this paper, recognition of persons is attempted from their hum. This kind of application can be useful for designing a humming-based biometric system or a person-dependent Query-by-Humming (QBH) system and hence can play an important role in music information retrieval (MIR) systems. This paper develops a new feature extraction technique to exploit phase spectrum information along with magnitude spectrum information from the hum signal. In particular, the structure of the state-of-the-art feature set, viz., Mel Frequency Cepstral Coefficients (MFCC), is modified to capture the phase spectrum information. In addition, a new energy measure, viz., the Variable length Teager Energy Operator (VTEO), is employed to compute subband energies of different time-domain subband signals (i.e., the outputs of the 24 triangular-shaped filters used in the Mel filterbank). A discriminatively-trained polynomial classifier with 2nd order approximation is used as the basis for the recognition experiments.
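The Teager Energy Operator and its variable-length generalization are standard signal-processing formulas; a minimal sketch of how VTEO could yield a subband energy measure (the framing and lag choice here are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def vteo(x, m=1):
    """Variable-length Teager Energy Operator with lag m.

    psi_m[x](n) = x(n)^2 - x(n - m) * x(n + m);
    m = 1 gives the classical Teager Energy Operator.
    The first and last m samples are left at zero.
    """
    x = np.asarray(x, dtype=float)
    psi = np.zeros_like(x)
    psi[m:-m] = x[m:-m] ** 2 - x[:-2 * m] * x[2 * m:]
    return psi

def subband_energy(frame, m=1):
    """Mean VTEO value of one frame of one filterbank output."""
    return float(np.mean(vteo(frame, m)))
```

The subband energies computed this way would replace the conventional squared-magnitude energies inside the MFCC pipeline.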
international conference oriental cocosda held jointly with conference on asian spoken language research and evaluation | 2013
Kewal D. Malde; Bhavik Vachhani; Maulik C. Madhavi; Nirav H. Chhayani; Hemant A. Patil
There has been growing interest in using speech technology for rural areas. In this context, this paper describes the development of speech corpora in Indian languages (viz., Gujarati and Marathi, collected from remote villages) for the task of phonetic transcription, along with related analysis of the transcriptions. Manual phonetic transcription was done for the two languages for 8 hours of speech data field-recorded in real-life settings. Dialectal variations are also analyzed using spectrograms and phonetic transcription. In addition, it was found that among consonant sounds, plosives have the largest coverage within the broad phonetic categories. The collected speech corpora can be very useful for speech and speaker recognition tasks.
Computer Speech & Language | 2017
Maulik C. Madhavi; Hemant A. Patil
The Query-by-Example approach to spoken content retrieval has gained much attention because of its feasibility in the absence of speech recognition and its applicability in multilingual matching scenarios. This approach to retrieving spoken content is referred to as Query-by-Example Spoken Term Detection (QbE-STD). The state-of-the-art QbE-STD system performs matching between the frame sequences of the query and the test utterance via the Dynamic Time Warping (DTW) algorithm. In realistic scenarios, there is a need to retrieve queries that do not appear exactly in the spoken document; the instance that does appear might have a different suffix, prefix, or word order. The DTW algorithm monotonically aligns the two sequences and hence is not suitable for partial matching between the frame sequences of query and test utterance. In this paper, we propose a novel partial matching approach between spoken query and utterance using a modified DTW algorithm in which multiple warping paths are constructed for each query and test utterance pair. Next, we address the search complexity of DTW and suggest two approaches, namely, a feature reduction approach and a Bag-of-Acoustic-Words (BoAW) model. In the feature reduction approach, the number of feature vectors is reduced by averaging across consecutive frames within phonetic boundaries. Fewer feature vectors require fewer comparisons, and hence the DTW search is sped up: the search computation time is reduced by 46-49% with a slight degradation in performance compared to the no-feature-reduction case. In the BoAW model, we construct term frequency-inverse document frequency (tf-idf) vectors at the segment level to retrieve audio documents. The proposed segment-level BoAW model matches a test utterance with a query using tf-idf vectors, and the resulting scores are used to rank the test utterances.
The BoAW model gave more than 80% recall when the top 70% of utterances were retrieved. To re-score the detections, we further employ the DTW search or the modified DTW search to retrieve the spoken query from the utterances selected by the BoAW model. QbE-STD experiments are conducted on different international benchmarks, namely, the MediaEval Spoken Web Search (SWS 2013) and MediaEval Query-by-Example Search on Speech (QUESST 2014) tasks.
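The feature reduction step described above, averaging consecutive frame vectors within phonetic boundaries, can be sketched as follows; the boundary indices would come from a separate segmentation step, which is assumed given here:

```python
import numpy as np

def reduce_features(frames, boundaries):
    """Average consecutive feature vectors within each segment.

    frames     : (T, D) array, one row per frame (e.g. posteriors or MFCCs).
    boundaries : sorted frame indices delimiting segments, starting at 0
                 and ending at T, e.g. [0, 5, 12, T].
    Returns one averaged vector per segment, shape (len(boundaries)-1, D),
    so DTW then has far fewer vectors to compare.
    """
    frames = np.asarray(frames, dtype=float)
    return np.stack([frames[s:e].mean(axis=0)
                     for s, e in zip(boundaries[:-1], boundaries[1:])])
```

Since DTW cost grows with the product of the two sequence lengths, halving both sequences roughly quarters the number of frame comparisons.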
international conference on signal processing | 2016
Maulik C. Madhavi; Hemant A. Patil
Query-by-Example Spoken Term Detection (QbE-STD) under low-resource settings is the task of retrieving spoken content using an audio example as the query. The search phase involves highly computationally intensive Dynamic Time Warping (DTW)-based matching. Reducing the search space is therefore important to lower the computational complexity. In this paper, to perform DTW faster, consecutive features are averaged without overlap. However, information is lost during this feature reduction; for instance, posterior features on either side of phone boundaries exhibit distinct characteristics, so one such loss is introduced by merging feature vectors in the vicinity of phoneme boundaries. To overcome this, we merge features only after taking the phoneme boundaries (detected using the spectral transition measure) into account. The QbE-STD task is performed on the MediaEval SWS 2013 dataset. The presented approach reduces the computation time by 46.15% to 49.16% with very low performance degradation, i.e., 0.017-0.023 in Maximum Term Weighted Value (MTWV) with respect to no feature reduction.
PerMIn'12 Proceedings of the First Indo-Japan conference on Perception and Machine Intelligence | 2012
Hemant A. Patil; Maulik C. Madhavi; Rahul Jain; Alok K. Jain
In this paper, the hum of a person is used to identify the speaker with the help of a machine. In addition, novel temporal features (such as zero-crossing rate and short-time energy) and spectral features (such as spectral centroid and spectral flux) are proposed for the person recognition task. Feature-level fusion of each of these features with the state-of-the-art spectral feature set, viz., Mel Frequency Cepstral Coefficients (MFCC), is found to give better recognition performance than MFCC alone. In addition, it is shown that the person identification rate is competitive with the baseline MFCC. Furthermore, a reduction in equal error rate (EER) of 1.46% is obtained when a feature-level fusion system is employed combining evidence from MFCC, temporal, and the proposed spectral features.
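Two of the features named above have standard textbook definitions; a minimal sketch (frame length, windowing, and normalization conventions are assumptions, not the paper's settings):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent-sample pairs whose signs differ."""
    s = np.sign(frame)
    return float(np.mean(s[:-1] != s[1:]))

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency (Hz) of the frame's spectrum."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return float(np.sum(freqs * mag) / np.sum(mag))
```

In a fusion setup such features would be computed per frame and concatenated with the MFCC vector before classification.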
european signal processing conference | 2015
Maulik C. Madhavi; Hemant A. Patil; Bhavik Vachhani
Obstruents are very important acoustical events (i.e., abrupt-consonantal landmarks) in the speech signal. This paper presents the use of the novel Spectral Transition Measure (STM) to locate obstruents in continuous speech. The problem of obstruent detection involves detecting the phonetic boundaries associated with obstruent sounds. In this paper, we propose using STM information derived from the state-of-the-art Mel Frequency Cepstral Coefficients (MFCC) feature set and from the newly developed MFCC-TMP feature set (which uses the Teager Energy Operator (TEO) to implicitly exploit magnitude and phase information in the MFCC framework) for obstruent detection. The key idea is to exploit the capability of STM to capture the highly dynamic transitional characteristics associated with obstruent sounds. The experimental setup is developed on the entire TIMIT database. For a 20 ms agreement (tolerance) duration, the obstruent detection rate is found to be 97.59% with 17.65% false acceptance using MFCC-STM and 96.42% with 12.88% false acceptance using MFCC-TMP-STM. Finally, the STM-based features along with the static representation (i.e., MFCC-STM and MFCC-TMP-STM) are evaluated for a phone recognition task.
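A common formulation of the spectral transition measure is the mean squared delta (regression) slope of the cepstral trajectories; a sketch under that assumption (the window half-length K and the MFCC front end are illustrative choices, not the paper's exact settings):

```python
import numpy as np

def spectral_transition_measure(C, K=2):
    """STM per frame from a (T, D) matrix of cepstral (e.g. MFCC) frames.

    slope_d(n) = sum_{k=-K..K} k * C[n+k, d] / sum_k k^2   (delta regression)
    STM(n)     = (1/D) * sum_d slope_d(n)^2
    Frames within K of either edge are left at zero.
    """
    C = np.asarray(C, dtype=float)
    T, _ = C.shape
    denom = float(sum(k * k for k in range(-K, K + 1)))
    stm = np.zeros(T)
    for n in range(K, T - K):
        slope = sum(k * C[n + k] for k in range(-K, K + 1)) / denom
        stm[n] = np.mean(slope ** 2)
    return stm
```

Peaks of the STM curve above a threshold would then mark candidate obstruent boundaries, since spectra change fastest there.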
international conference on asian language processing | 2014
Bhavik Vachhani; Kewal D. Malde; Maulik C. Madhavi; Hemant A. Patil
Obstruents are key landmark events found in the speech signal. In this paper, we propose using the spectral transition measure (STM) to locate obstruents in continuous speech. The proposed approach does not take into account any prior information (such as the phonetic sequence, speech transcription, or number of obstruents in the speech); hence, it is an unsupervised and unconstrained approach. We use state-of-the-art Mel Frequency Cepstral Coefficients (MFCC)-based features to capture spectral transitions for the obstruent detection task, since more spectral transition is expected in the vicinity of obstruents. The entire experimental setup is developed on the TIMIT database. The detection efficiency and estimated probability are around 77% and 0.77, respectively (with a 30 ms agreement duration and a 0.4 STM threshold).
Computer Speech & Language | 2017
Hemant A. Patil; Maulik C. Madhavi
Most state-of-the-art speaker recognition systems use natural speech signals (i.e., real, spontaneous, or contextual speech) from the subjects. In this paper, recognition of a person is attempted from his or her hum with the help of machines. This kind of application can be useful for designing a person-dependent Query-by-Humming (QBH) system and hence plays an important role in music information retrieval (MIR) systems. In addition, it can also be useful for other speech technology applications such as human-computer interaction, speech prosody analysis of disordered speech, and speaker forensics. This paper develops a new feature extraction technique to exploit perceptually meaningful phase spectrum information (due to mel frequency warping, which imitates the human hearing process) along with magnitude spectrum information from the hum signal. In particular, the structure of the state-of-the-art feature set, namely, Mel Frequency Cepstral Coefficients (MFCCs), is modified to capture the phase spectrum information. In addition, a new energy measure, namely, the Variable length Teager Energy Operator (VTEO), is employed to compute subband energies of different time-domain subband signals (i.e., the outputs of the 24 triangular-shaped filters used in the mel filterbank). We refer to this proposed feature set as MFCC-VTMP (i.e., mel frequency cepstral coefficients capturing perceptually meaningful magnitude and phase information via VTEO). A polynomial classifier (in principle similar to other discriminatively-trained classifiers, such as a support vector machine (SVM) with a polynomial kernel) is used as the basis for all the experiments.
The effectiveness of the proposed feature set is evaluated and consistently found to be better than the MFCC feature set across several evaluation factors, such as comparison with other phase-based features, the order of the polynomial classifier, the person (speaker) modeling approach (e.g., GMM-UBM and i-vector), the dimension of the feature vector, robustness under signal degradation conditions, static vs. dynamic features, feature discrimination measures, and intersession variability.
international symposium on chinese spoken language processing | 2014
Nirmesh J. Shah; Hemant A. Patil; Maulik C. Madhavi; Hardik B. Sailor; Tanvina B. Patel
The generalized statistical framework of the Hidden Markov Model (HMM) has been successfully carried over from speech recognition to speech synthesis. In this work, we apply the HMM-based Speech Synthesis System (HTS) method to the Gujarati language, adapting and evaluating HTS for Gujarati. The HTS system built using Gujarati data is evaluated in terms of naturalness and speech intelligibility. In addition, both the conventional EM algorithm-based HTS and the recently proposed Deterministic Annealing EM (DAEM) algorithm-based HTS are applied to Gujarati (a low-resourced language) and compared. It is found that HTS in Gujarati has very high intelligibility, and an AB test verified that the DAEM-based HTS was preferred over the EM-based HTS 70.5% of the time.