
Publications


Featured research published by Subhadeep Dey.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2015

Employment of Subspace Gaussian Mixture Models in speaker recognition

Petr Motlicek; Subhadeep Dey; Srikanth R. Madikeri; Lukas Burget

This paper presents the Subspace Gaussian Mixture Model (SGMM) approach, employed as a probabilistic generative model to estimate speaker vector representations that are subsequently used in the speaker verification task. SGMMs have already been shown to significantly outperform traditional HMM/GMMs in Automatic Speech Recognition (ASR) applications. An extension to the basic SGMM framework allows low-dimensional speaker vectors to be robustly estimated and exploited for speaker adaptation. We propose a speaker verification framework based on low-dimensional speaker vectors estimated using SGMMs, trained in an ASR manner using manual transcriptions. To test the robustness of the system, we evaluate the proposed approach against the state-of-the-art i-vector extractor on the NIST SRE 2010 evaluation set under four utterance-length conditions: 3-10 sec, 10-30 sec, 30-60 sec, and full (untruncated) utterances. Experimental results reveal that while the i-vector system performs better on the truncated 3-10 sec and 10-30 sec utterances, noticeable improvements are observed with SGMMs, especially on full-length utterances. Finally, the proposed SGMM approach exhibits complementary properties and can thus be efficiently fused with an i-vector-based speaker verification system.
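The complementarity noted in the abstract is typically exploited by score-level fusion. A minimal sketch of such a fusion is shown below; the z-normalization and the fixed weight `alpha` are illustrative assumptions (in practice the weight would be tuned, e.g. by logistic regression on a development set), not the paper's actual fusion recipe:

```python
import numpy as np

def fuse_scores(sgmm_scores, ivec_scores, alpha=0.5):
    """Linearly fuse per-trial scores from two complementary systems.

    alpha is a hypothetical fusion weight; in practice it would be
    tuned on a held-out development set.
    """
    sgmm = np.asarray(sgmm_scores, dtype=float)
    ivec = np.asarray(ivec_scores, dtype=float)
    # Z-normalize each system's scores so the fusion weight is not
    # dominated by scale differences between the two systems.
    sgmm = (sgmm - sgmm.mean()) / sgmm.std()
    ivec = (ivec - ivec.mean()) / ivec.std()
    return alpha * sgmm + (1.0 - alpha) * ivec
```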


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2016

Deep neural network based posteriors for text-dependent speaker verification

Subhadeep Dey; Srikanth R. Madikeri; Marc Ferras; Petr Motlicek

The i-vector and Joint Factor Analysis (JFA) systems for text-dependent speaker verification use sufficient statistics computed from a speech utterance to estimate speaker models. These statistics average the acoustic information over the utterance, thereby losing all sequence information. In this paper, we study explicit content matching using Dynamic Time Warping (DTW) and present the best achievable error rates for speaker-dependent and speaker-independent content matching. For this purpose, a Deep Neural Network/Hidden Markov Model Automatic Speech Recognition (DNN/HMM ASR) system is used to extract content-related posterior probabilities. This approach outperforms systems using Gaussian mixture model posteriors by at least 50% in Equal Error Rate (EER) on the RSR2015 database in content-mismatch trials. DNN posteriors are also used in i-vector and JFA systems, obtaining EERs as low as 0.02%.
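The content-matching step can be sketched as a standard DTW alignment over sequences of per-frame posterior vectors. The Euclidean frame distance and path-length normalization below are simplifying assumptions, not necessarily the local distance used in the paper:

```python
import numpy as np

def dtw_distance(seq_a, seq_b, frame_dist=None):
    """Dynamic Time Warping distance between two feature sequences.

    seq_a, seq_b: arrays of shape (T, D), e.g. per-frame posterior
    vectors from a DNN/HMM ASR system. frame_dist defaults to the
    Euclidean distance between frames.
    """
    if frame_dist is None:
        frame_dist = lambda x, y: np.linalg.norm(x - y)
    n, m = len(seq_a), len(seq_b)
    # D[i, j] = minimal accumulated cost aligning the first i frames
    # of seq_a with the first j frames of seq_b.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_dist(seq_a[i - 1], seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Normalize by the sequence lengths so utterances of different
    # durations yield comparable distances.
    return D[n, m] / (n + m)
```

Identical sequences score zero, and insertions of matching frames are absorbed by the warping path.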


Speech Communication | 2017

Template-matching for text-dependent speaker verification

Subhadeep Dey; Petr Motlicek; Srikanth R. Madikeri; Marc Ferras

In the last decade, i-vector and Joint Factor Analysis (JFA) approaches to speaker modeling have become ubiquitous in the area of automatic speaker recognition. Both of these techniques involve the computation of posterior probabilities, using either Gaussian Mixture Models (GMM) or Deep Neural Networks (DNN), as a prior step to estimating i-vectors or speaker factors. GMMs focus on implicitly modeling phonetic information of acoustic features while DNNs focus on explicitly modeling phonetic/linguistic units. For text-dependent speaker verification, DNN-based systems have considerably outperformed GMM for fixed-phrase tasks. However, both approaches ignore phone sequence information. In this paper, we aim at exploiting this information by using Dynamic Time Warping (DTW) with speaker-informative features. These features are obtained from i-vector models extracted over short speech segments, also called online i-vectors. Probabilistic Linear Discriminant Analysis (PLDA) is further used to project online i-vectors onto a speaker-discriminative subspace. The proposed DTW approach obtained at least 74% relative improvement in equal error rate on the RSR corpus over other state-of-the-art approaches, including i-vector and JFA.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2016

Information theoretic clustering for unsupervised domain-adaptation

Subhadeep Dey; Srikanth R. Madikeri; Petr Motlicek

The aim of the domain-adaptation task for speaker verification is to exploit unlabelled target domain data by using the labelled source domain data effectively. The i-vector based Probabilistic Linear Discriminant Analysis (PLDA) framework approaches this task by clustering the target domain data and using each cluster as a unique speaker to estimate PLDA model parameters. These parameters are then combined with the PLDA parameters from the source domain. Typically, agglomerative clustering with cosine distance measure is used. In tasks such as speaker diarization that also require unsupervised clustering of speakers, information-theoretic clustering measures have been shown to be effective. In this paper, we employ the Information Bottleneck (IB) clustering technique to find speaker clusters in the target domain data. This is achieved by optimizing the IB criterion that minimizes the information loss during the clustering process. The greedy optimization of the IB criterion involves agglomerative clustering using the Jensen-Shannon divergence as the distance metric. Our experiments in the domain-adaptation task indicate that the proposed system outperforms the baseline by about 14% relative in terms of equal error rate.
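The greedy IB optimization described above reduces to agglomerative clustering with the Jensen-Shannon divergence as merge criterion. The sketch below is a simplified stand-in, assuming a fixed target cluster count as the stopping criterion and a weighted-average cluster representative; the full IB objective (with its trade-off parameter) is not reproduced here:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def agglomerative_js(distributions, n_clusters):
    """Greedily merge the pair of clusters with the smallest
    Jensen-Shannon divergence until n_clusters remain."""
    clusters = [[i] for i in range(len(distributions))]
    reps = [np.asarray(d, dtype=float) for d in distributions]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = js_divergence(reps[i], reps[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into i: representative is the
        # size-weighted average of the two distributions.
        wi, wj = len(clusters[i]), len(clusters[j])
        reps[i] = (wi * reps[i] + wj * reps[j]) / (wi + wj)
        clusters[i] += clusters[j]
        del clusters[j], reps[j]
    return clusters
```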


IEEE Signal Processing Letters | 2016

A Large-Scale Open-Source Acoustic Simulator for Speaker Recognition

Marc Ferras; Srikanth R. Madikeri; Petr Motlicek; Subhadeep Dey

State-of-the-art speaker-recognition systems suffer from significant performance loss on degraded speech conditions and acoustic mismatch between enrolment and test phases. Past international evaluation campaigns, such as the NIST speaker recognition evaluation (SRE), have partly addressed these challenges in some evaluation conditions. This work aims at further assessing and compensating for the effect of a wide variety of speech-degradation processes on speaker-recognition performance. We present an open-source simulator generating degraded telephone, VoIP, and interview-speech recordings using a comprehensive list of narrow-band, wide-band, and audio codecs, together with a database of over 60 h of environmental noise recordings and over 100 impulse responses collected from publicly available data. We provide speaker-verification results obtained with an i-vector-based system using either a clean or degraded PLDA back-end on a NIST SRE subset of data corrupted by the proposed simulator. While error rates increase considerably under degraded speech conditions, large relative equal error rate (EER) reductions were observed when using a PLDA model trained with a large number of degraded sessions per speaker.
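The core degradation pipeline of such a simulator can be sketched as reverberation by impulse-response convolution followed by additive noise at a target SNR. The function below is a minimal illustration, assuming single-channel float signals and a fixed SNR; it is not the released tool's actual API:

```python
import numpy as np

def degrade(speech, noise, impulse_response, snr_db=10.0):
    """Degrade a clean speech signal: convolve with a room impulse
    response, then add noise scaled to a target SNR.

    All names and the default SNR are illustrative assumptions.
    """
    speech = np.asarray(speech, dtype=float)
    # Reverberation: convolve with the impulse response and
    # truncate back to the original length.
    reverbed = np.convolve(speech, impulse_response)[:len(speech)]
    # Loop or truncate the noise to match the signal length.
    noise = np.resize(np.asarray(noise, dtype=float), len(reverbed))
    # Scale the noise so that 10*log10(P_signal / P_noise) == snr_db.
    p_sig = np.mean(reverbed ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_sig / (10.0 ** (snr_db / 10.0))
    noise = noise * np.sqrt(target_p_noise / p_noise)
    return reverbed + noise
```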


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2017

Intra-class covariance adaptation in PLDA back-ends for speaker verification

Srikanth R. Madikeri; Marc Ferras; Petr Motlicek; Subhadeep Dey

Multi-session training conditions are becoming increasingly common in recent benchmark datasets for both text-independent and text-dependent speaker verification. In the state-of-the-art i-vector framework for speaker verification, such conditions are addressed by simple techniques such as averaging the individual i-vectors, averaging scores, or modifying the Probabilistic Linear Discriminant Analysis (PLDA) scoring hypothesis for multi-session enrollment. The aforementioned techniques fail to exploit the speaker variabilities observed in the enrollment data for target speakers. In this paper, we propose to exploit the multi-session training data by estimating a speaker-dependent covariance matrix and updating the intra-speaker covariance during PLDA scoring for each target speaker. The proposed method is further extended by combining covariance adaptation and score averaging. In this method, the individual examples of the target speaker are compared against the test data as opposed to an averaged i-vector, and the scores obtained are then averaged. The proposed methods are evaluated on the NIST SRE 2012 dataset. Relative improvements of up to 29% in equal error rate are obtained.
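The speaker-dependent covariance update can be sketched as an interpolation between the global within-speaker covariance and one estimated from the target speaker's enrollment i-vectors. The linear interpolation form and the weight `alpha` below are assumptions for illustration, not the exact adaptation rule from the paper:

```python
import numpy as np

def adapt_within_covariance(W_global, enrol_ivectors, alpha=0.5):
    """Interpolate the global within-speaker covariance with a
    speaker-dependent covariance estimated from that speaker's
    enrollment i-vectors.

    alpha is a hypothetical interpolation weight.
    """
    X = np.asarray(enrol_ivectors, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    # Speaker-dependent covariance from the enrollment sessions
    # (maximum-likelihood estimate, dividing by the session count).
    W_spk = centered.T @ centered / len(X)
    return alpha * W_global + (1.0 - alpha) * W_spk
```

The adapted matrix would then replace the global within-class covariance in PLDA scoring for that target speaker.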


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2017

Exploiting sequence information for text-dependent Speaker Verification

Subhadeep Dey; Petr Motlicek; Srikanth R. Madikeri; Marc Ferras

Model-based approaches to Speaker Verification (SV), such as Joint Factor Analysis (JFA), i-vector and relevance Maximum-a-Posteriori (MAP), have been shown to provide state-of-the-art performance for text-dependent systems with fixed phrases. The performance of i-vector and JFA models has been further enhanced by estimating posteriors from a Deep Neural Network (DNN) instead of a Gaussian Mixture Model (GMM). While both DNNs and GMMs aim at incorporating phonetic information of the phrase with these posteriors, model-based SV approaches ignore the sequence information of the phonetic units of the phrase. In this paper, we tackle this issue by applying dynamic time warping using speaker-informative features. We propose to use i-vectors computed from short segments of each speech utterance, also called online i-vectors, as feature vectors. The proposed approach is evaluated on the RedDots database and provides an improvement of 75% relative equal error rate over the best model-based SV baseline system in a content-mismatch condition.


Conference of the International Speech Communication Association (Interspeech) | 2016

Inter-Task System Fusion for Speaker Recognition.

Marc Ferras; Srikanth R. Madikeri; Subhadeep Dey; Petr Motlicek

Fusion is a common approach to improving the performance of speaker recognition systems. Multiple systems using different data, features or algorithms tend to bring complementary contributions to the final decisions being made. It is known that factors such as native language or accent contribute to speaker identity. In this paper, we explore inter-task fusion approaches to incorporating side information from accent and language identification systems to improve the performance of a speaker verification system. We explore both score level and model level approaches, linear logistic regression and linear discriminant analysis respectively, reporting significant gains on accented and multi-lingual data sets of the NIST Speaker Recognition Evaluation 2008 data. Equal error rate and expected rank metrics are reported for speaker verification and speaker identification tasks.
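The score-level approach described above, linear logistic regression over verification and side-information scores, can be sketched as follows. The plain gradient-descent trainer and its hyperparameters are illustrative assumptions, not the calibration tooling actually used in the paper:

```python
import numpy as np

def train_fusion(scores, side_scores, labels, lr=0.1, epochs=500):
    """Train a linear logistic-regression fusion of speaker-verification
    scores with side-information scores (e.g. accent/language ID).

    Plain batch gradient descent on the logistic loss; lr and epochs
    are illustrative.
    """
    X = np.column_stack([scores, side_scores, np.ones(len(scores))])
    y = np.asarray(labels, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # sigmoid of fused score
        w -= lr * X.T @ (p - y) / len(y)   # logistic-loss gradient step
    return w

def fuse(w, scores, side_scores):
    """Apply the trained fusion weights, returning fused log-odds scores."""
    X = np.column_stack([scores, side_scores, np.ones(len(scores))])
    return X @ w
```

On held-out trials the fused score is simply the learned weighted sum of the subsystem scores plus a bias.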


Archive | 2016

Implementation of the Standard I-vector System for the Kaldi Speech Recognition Toolkit

Srikanth R. Madikeri; Subhadeep Dey; Petr Motlicek; Marc Ferras


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2018

DNN based speaker embedding using content information for text-dependent speaker verification

Subhadeep Dey; Takafumi Koshinaka; Petr Motlicek; Srikanth R. Madikeri

Collaboration


Dive into Subhadeep Dey's collaborations.

Top Co-Authors

Petr Motlicek (Idiap Research Institute)

Marc Ferras (Idiap Research Institute)

Srikanth R. Madikeri (Indian Institute of Technology Madras)

Ivan Himawan (Idiap Research Institute)

Lukas Burget (Brno University of Technology)