Debadatta Pati
Indian Institute of Technology Guwahati
Publication
Featured research published by Debadatta Pati.
National Conference on Communications | 2010
Debadatta Pati; S. R. M. Prasanna
The objective of this work is to demonstrate that significant speaker information is present in the subband energies of the linear prediction (LP) residual. The LP residual mostly contains excitation source information. Subband energies extracted using a mel filterbank, followed by cepstral analysis, provide a compact representation. The resulting cepstral values are termed residual mel frequency cepstral coefficients (R-MFCC). Speaker identification studies conducted using R-MFCC features and a Gaussian mixture model (GMM) on a subset of 30 speakers from NIST-1999 yield 87% accuracy. MFCC extracted directly from speech also yield 87% accuracy. Further, combining the two yields 90% accuracy, indicating that R-MFCC captures a different aspect of speaker information.
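The LP residual referred to in this abstract can be illustrated with a minimal sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion and a default predictor order of 10. The function names and the order are illustrative choices, not details taken from the paper, and the mel filterbank and cepstral stages are not shown.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one analysis frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations.

    Returns predictor coefficients a[1..order] such that the predicted
    sample is s_hat[n] = sum_k a[k] * s[n - k].
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or perfectly predicted frame: stop early
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]


def lp_residual(frame, order=10):
    """Residual e[n] = s[n] - s_hat[n] (order=10 is an assumed default)."""
    r = autocorrelation(frame, order)
    a = levinson_durbin(r, order)
    residual = []
    for n in range(len(frame)):
        # Samples before the start of the frame are treated as zero.
        pred = sum(a[k] * frame[n - 1 - k]
                   for k in range(order) if n - 1 - k >= 0)
        residual.append(frame[n] - pred)
    return residual
```

For a decaying exponential such as `[0.5 ** n for n in range(40)]`, a first-order predictor removes almost everything after the first sample, which is the sense in which the residual retains mainly what the vocal tract model does not explain.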
IEEE Region 10 Conference | 2008
Debadatta Pati; S. R. M. Prasanna
The objective of this work is to demonstrate the feasibility of using excitation source information, obtained by non-parametric vector quantization (VQ), for the speaker recognition task. The linear prediction (LP) residual is used as the representation of excitation source information. The LP residual is subjected to non-parametric VQ during training, and codebooks are built for different codebook sizes. Testing these codebooks with the LP residual of the test speech data demonstrates that a codebook of sufficiently large size uniquely represents the speaker and provides appreciable performance. A speaker recognition system built using conventional mel frequency cepstral coefficients (MFCCs), which represent vocal tract information, combines well with the proposed system based on excitation source information to provide improved performance. On a set of 30 randomly chosen speakers from the TIMIT database, the proposed system provides 75%, the MFCC-based system provides 95%, and the combined system provides 98.33%.
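The codebook idea can be sketched as follows. This is a minimal illustration that assumes k-means style training with squared Euclidean distance and identification by minimum average distortion. The abstract does not state the training algorithm or the distance measure, so both, along with all function names, are assumptions.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train_codebook(vectors, size, iterations=20):
    """Build a VQ codebook with k-means (an assumed training method)."""
    if len(vectors) < size:
        raise ValueError("need at least as many training vectors as codewords")
    # Deterministic initialisation: evenly spaced training vectors.
    step = len(vectors) // size
    codebook = [list(vectors[i * step]) for i in range(size)]
    for _ in range(iterations):
        clusters = [[] for _ in range(size)]
        for v in vectors:
            nearest = min(range(size), key=lambda c: sq_dist(v, codebook[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous codeword
                dim = len(members[0])
                codebook[c] = [sum(m[d] for m in members) / len(members)
                               for d in range(dim)]
    return codebook


def average_distortion(vectors, codebook):
    """Mean distance from each vector to its nearest codeword."""
    return sum(min(sq_dist(v, c) for c in codebook)
               for v in vectors) / len(vectors)


def identify(test_vectors, codebooks):
    """Return the speaker label whose codebook gives the lowest distortion."""
    return min(codebooks,
               key=lambda name: average_distortion(test_vectors, codebooks[name]))
```

In use, each enrolled speaker gets one codebook trained on vectors cut from that speaker's LP residual, and `identify` is called with vectors from the test utterance. The codebook `size` is the parameter the abstract varies.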
National Conference on Communications | 2015
Rohan Kumar Das; Debadatta Pati; S. R. Mahadeva Prasanna
Limited data speaker verification has shown its significance in practical, system-oriented applications. The paper shows the importance of different aspects of the voice source feature for the limited test data scenario. A baseline speaker verification system using the conventional mel frequency cepstral coefficients (MFCC) feature is developed, and its performance under limited test data conditions (≤10 s) is evaluated. A parallel system based on the source feature mel power difference of spectrum in subband (M-PDSS) is developed in the i-vector based speaker verification framework. The two systems were fused at the score level for short segments of test speech, which demonstrated that the source feature becomes more important as the test data duration is reduced. A comparative study of the M-PDSS feature is then made against our earlier work using the discrete cosine transform of the integrated linear prediction residual (DCTILPR) feature, followed by fusion of the two source features, M-PDSS and DCTILPR, with the MFCC feature. An absolute improvement of 5.19% is obtained for 2 s of test data, which conveys the significance of multiple source features in limited data speaker verification, as they carry different aspects of source information.
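Score-level fusion of the kind described here can be sketched as a weighted sum of per-system scores. The sketch below assumes min-max normalisation of each system's scores before a linear combination. The abstract does not state the normalisation or the weights used, so these choices and the function names are hypothetical.

```python
def min_max_normalise(scores):
    """Map one system's trial scores onto the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # no spread: scores carry no ranking
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(system_scores, weights):
    """Weighted-sum fusion of several systems' scores over the same trials.

    system_scores: one list of trial scores per system, in the same trial order.
    weights: one non-negative weight per system (assumed, e.g. tuned on
             held-out data); they are renormalised to sum to 1.
    """
    if len(system_scores) != len(weights):
        raise ValueError("one weight per system is required")
    normalised = [min_max_normalise(s) for s in system_scores]
    total = sum(weights)
    n_trials = len(system_scores[0])
    return [sum(w * scores[t] for w, scores in zip(weights, normalised)) / total
            for t in range(n_trials)]
```

For example, `fuse_scores([mfcc_scores, mpdss_scores, dctilpr_scores], [0.6, 0.2, 0.2])` would combine three systems. The variable names and weights in that call are placeholders only.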
IETE Technical Review | 2010
Debadatta Pati; S. R. Mahadeva Prasanna
This paper gives a survey of different explorations carried out using speaker information present in the excitation source of speech for speaker recognition. The paper begins with an overview of the speaker recognition task. This is followed by a discussion of the different kinds of speaker information present in speech, feature extraction methods, and types of excitation sources for speech production. Detailed descriptions of different explorations that exploit the speaker information in the excitation source are then given. These include methods based on pitch contour, jitter, shimmer, glottal flow derivative, linear prediction (LP) residual, LP residual phase, LP residual cepstrum, harmonic structure of the LP residual spectrum, and time-frequency analysis of the LP residual. A comparative study of all these methods is then carried out to highlight their merits and demerits. The paper concludes by mentioning a future direction for speaker recognition from the excitation source perspective.
International Journal of Speech Technology | 2015
Dipanjan Nandi; Debadatta Pati; K. Sreenivasa Rao
In the present work, the robustness of excitation source features is analyzed for the language identification (LID) task. The raw samples of the linear prediction (LP) residual signal and its magnitude and phase components are processed at sub-segmental, segmental and supra-segmental levels to capture robust language-specific phonotactic information. The LID study is carried out on 27 Indian languages from the Indian Institute of Technology Kharagpur-Multi Lingual Indian Language Speech Corpus (IITKGP-MLILSC). Gaussian mixture models are used to develop the LID systems using robust language-specific excitation source information. The robustness of the excitation source information is demonstrated with respect to (i) background noise, (ii) varying amounts of training data and (iii) varying lengths of test samples. Finally, the robustness of the proposed excitation source features is compared with that of well-known spectral features using LID performance obtained on the IITKGP-MLILSC database. Segmental level excitation source features obtained from raw samples of the LP residual signal and its phase component perform better at low SNR levels than the vocal tract features.
Computer Speech & Language | 2017
Dipanjan Nandi; Debadatta Pati; K. Sreenivasa Rao
Excitation source information is explored for language identification. Implicit relations present in the LP residual samples are examined for the LID task. The magnitude component of the LP residual is explored for discriminating languages. The phase information present among LP residual samples is explored for the LID task. Combined LID systems are developed using source features to enhance LID accuracy.
The present work explores excitation source information for the language identification (LID) task. In this work, excitation source information is captured by implicit processing of the linear prediction (LP) residual signal for discriminating between languages. Raw samples of the LP residual signal and its magnitude and phase components are processed independently at sub-segmental, segmental and suprasegmental levels to extract language-specific excitation source information. The LID studies are carried out using 27 Indian languages from the Indian Institute of Technology Kharagpur-Multi Lingual Indian Language Speech Corpus (IITKGP-MLILSC) and 11 international languages from the OGI-MLTS corpus. Gaussian mixture models (GMMs) are used to model the language-specific excitation source information for the LID task. The experimental results show that features extracted at the segmental level yield better identification accuracy (50.92%) than those from the sub-segmental (47.77%) and suprasegmental (43.88%) levels. Further, the evidence from all three levels is combined to obtain the complete excitation source information. Finally, we investigate the existence of non-overlapping language-specific information in the excitation source and vocal tract features.
International Journal of Speech Technology | 2015
Debadatta Pati; S. R. M. Prasanna
In this work, the linear prediction (LP) residual is processed in the spectral and cepstral domains to model speaker-specific excitation information. In the spectral domain, the excitation energy information is modeled from subband energies (SBE). The excitation periodicity information is modeled by the power differences of spectrum in subband (PDSS) measure. This work refines the existing methods of extracting SBE and PDSS by exploiting the nature of the excitation spectrum. The SBE and PDSS values are computed from the mel-warped residual subband spectrum and are called residual mel subband energies (R-MSE) and mel power differences of subband spectra (M-PDSS), respectively. Speaker recognition studies performed using the NIST-99 and NIST-03 databases demonstrate that the R-MSE and M-PDSS features represent good speaker information. It is also demonstrated that the excitation energy information can be better modeled in the cepstral domain by residual mel frequency cepstral coefficients (R-MFCC). Further, the evidence provided by the M-PDSS and R-MFCC features is different, combines well, and provides improved recognition performance. The combined evidence from M-PDSS and R-MFCC together with the vocal tract information further improves the performance. Finally, a comparative study on processing the LP residual in the temporal, spectral and cepstral domains demonstrates that, with a small compromise in recognition performance, processing the LP residual in the spectral and cepstral domains provides a compact and effective way of representing the excitation information compared with temporal processing.
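The two spectral-domain measures can be illustrated with a short sketch. It assumes that subband energy is the sum of power-spectrum bins in a band, and that PDSS takes the commonly cited form of one minus the ratio of the geometric mean to the arithmetic mean of the power spectrum within a band. Both definitions are assumptions here, since the abstract does not give formulas. The band edges are passed in as bin indices by the caller, and the mel warping and the refinements the paper describes are not reproduced.

```python
import math


def subband_energies(power_spectrum, band_edges):
    """Sum of power-spectrum bins in each band [edge_i, edge_{i+1})."""
    return [sum(power_spectrum[lo:hi])
            for lo, hi in zip(band_edges[:-1], band_edges[1:])]


def pdss(power_spectrum, band_edges, floor=1e-12):
    """Assumed PDSS form: 1 - geometric_mean / arithmetic_mean per band.

    Values near 0 indicate a flat (noise-like) band; values near 1 indicate
    a peaky band, as produced by strong harmonic structure.
    """
    values = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = [p + floor for p in power_spectrum[lo:hi]]  # avoid log(0)
        arithmetic = sum(band) / len(band)
        geometric = math.exp(sum(math.log(p) for p in band) / len(band))
        values.append(1.0 - geometric / arithmetic)
    return values
```

Because the measure is a ratio of means within each band, it is insensitive to the overall level of the band, which is why it is paired with an energy feature rather than used in place of one.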
Computer Speech & Language | 2017
Dipanjan Nandi; Debadatta Pati; K. Sreenivasa Rao
Excitation source information is explored for language identification. RMFCC and MPDSS features represent segmental level language-specific information. GFD parameters capture sub-segmental level language-specific information. PC and ESC represent supra-segmental level excitation source information. Complementary information from source and system features is examined.
In this work, the linear prediction (LP) residual signal is parameterized to capture excitation source information for a language identification (LID) study. The LP residual signal is processed at three different levels (sub-segmental, segmental and supra-segmental) to demonstrate different aspects of language-specific excitation source information. The proposed excitation source features are evaluated on 27 Indian languages from the Indian Institute of Technology Kharagpur-Multi Lingual Indian Language Speech Corpus (IITKGP-MLILSC), and on the Oregon Graduate Institute Multi-Language Telephone-based Speech (OGI-MLTS) and National Institute of Standards and Technology Language Recognition Evaluation (NIST LRE) 2011 corpora. LID systems were developed using Gaussian mixture model (GMM) and i-vector based approaches. Experimental results show that segmental level parametric features provide better identification accuracy (62%) than sub-segmental (40%) and supra-segmental level (34%) features. Excitation source features obtained from the three levels show distinct language-specific evidence. Therefore, the scores from all three levels are combined to obtain the complete excitation source information for the LID task. The LID performance achieved with the excitation source features is compared with that of the vocal tract system features. Finally, the scores obtained by processing the vocal tract and excitation source features are combined to further improve LID accuracy. The best recognition accuracies obtained from stage-IV integrated LID systems I, II and III are 69%, 70% and 72%, respectively.
National Conference on Communications | 2017
Madhusudan Singh; Jagabandhu Mishra; Debadatta Pati
Spoofing an automatic speaker verification system with pre-recorded speech samples is called a replay attack. The availability of high-quality recording and replay devices (e.g. smartphones) has made replay attacks easy to mount, even with minimal or no specific speech processing knowledge. This work demonstrates the usefulness of the linear prediction (LP) residual signal for the development of a replay attack detection system. The level of discriminatory information available in the LP residual signal is investigated and compared with a recently proposed playback detection algorithm (PAD) that relies on the information present in speech signal spectrograms. We observed that LP residual spectra are comparatively distinguishable. A comparative study of speech and LP residual signals is performed through speaker verification experiments under replay attacks. Results show that the information present in the LP residual is relatively more robust and effective in reducing the false acceptance rate. We conclude that LP residual signals may be equally useful for the development of a replay attack detection system.
International Journal of Speech Technology | 2011
Debadatta Pati; S. R. M. Prasanna