Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Marc Ferras is active.

Publication


Featured research published by Marc Ferras.


Spoken Language Technology Workshop | 2012

Speaker diarization and linking of large corpora

Marc Ferras; Hervé Bourlard

Performing speaker diarization of a collection of recordings, where speakers are uniquely identified across the database, is a challenging task. In this context, compensating for inter-session variability and keeping computation times reasonable are essential. In this paper we propose a two-stage system, composed of speaker diarization and speaker linking modules, that performs dataset-wide speaker diarization while handling both large volumes of data and inter-session variability compensation. The speaker linking module agglomeratively clusters speaker factor posterior distributions, obtained within the Joint Factor Analysis (JFA) framework, that model the speaker clusters output by a standard speaker diarization system. The technique therefore inherently compensates for channel variability effects from recording to recording within the database. A threshold is used to obtain meaningful speaker clusters by cutting the dendrogram produced by the agglomerative clustering. We show that the Hotelling T-squared statistic is an effective distance measure for this task and input data, obtaining the best results and stability. The system is evaluated using three subsets of the AMI corpus involving different speaker and channel variabilities. We use within-recording and across-recording diarization error rates (DER), cluster purity and cluster coverage to measure the performance of the proposed system. Across-recording DERs as low as within-recording DERs are obtained for some system setups.
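The linking stage described above can be sketched as agglomerative clustering under a distance threshold. The following is a minimal illustration, not the paper's implementation: it uses a two-sample Hotelling T-squared statistic as the cluster distance and stops merging once the smallest distance exceeds a threshold (the dendrogram cut); the ridge term and the brute-force merging loop are simplifying assumptions.

```python
import numpy as np

def hotelling_t2(x, y):
    """Two-sample Hotelling T-squared distance between two clusters
    of vectors (pooled covariance, small ridge for invertibility)."""
    nx, ny = len(x), len(y)
    diff = x.mean(axis=0) - y.mean(axis=0)
    pooled = ((nx - 1) * np.cov(x, rowvar=False)
              + (ny - 1) * np.cov(y, rowvar=False)) / (nx + ny - 2)
    pooled = pooled + 1e-6 * np.eye(pooled.shape[0])
    return (nx * ny) / (nx + ny) * float(diff @ np.linalg.solve(pooled, diff))

def link_speakers(clusters, threshold):
    """Agglomerative linking: repeatedly merge the closest pair of
    clusters until the smallest distance exceeds the threshold
    (equivalent to cutting the dendrogram at that height)."""
    clusters = [np.asarray(c, dtype=float) for c in clusters]
    while len(clusters) > 1:
        best, pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = hotelling_t2(clusters[i], clusters[j])
                if best is None or d < best:
                    best, pair = d, (i, j)
        if best > threshold:   # dendrogram cut: no pair close enough
            break
        i, j = pair
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```

Clusters drawn from the same speaker model merge, while well-separated ones survive the threshold cut as distinct speakers.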


International Conference on Acoustics, Speech, and Signal Processing | 2013

MLP-based factor analysis for tandem speech recognition

Marc Ferras

In recent years, latent variable models such as factor analysis, probabilistic principal component analysis and subspace Gaussian mixture models have become almost ubiquitous in speech technologies. The key to their success is the joint modeling of multiple effects in the speech signal. In this paper, we propose a novel approach that uses phone and speaker variabilities together to estimate phone posterior probabilities in a tandem speech recognition system. A Multilayer Perceptron (MLP) with five layers and a central linear bottleneck layer is used as a basic processing block that mimics the processing performed by factor analysis. To model multiple factors, a phone MLP and a speaker MLP are merged at the bottleneck level to obtain better estimates of the phone posterior probabilities used in the ASR system. Experiments on the WSJ corpus show that joint phone-speaker modeling can significantly outperform phone modeling alone in terms of Frame Error and Word Error Rates.
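A minimal sketch of the bottleneck merging idea, assuming toy dimensions and random weights (the MLPs in the paper are trained; everything below is purely illustrative): each network is run up to its central linear bottleneck, the two bottleneck activations are concatenated, and a softmax output layer maps the merged vector to phone posteriors.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_bottleneck(x, w_hidden, w_bottleneck):
    """Forward pass up to the central linear bottleneck: one sigmoid
    hidden layer followed by a linear bottleneck layer (no bias terms)."""
    h = 1.0 / (1.0 + np.exp(-(x @ w_hidden)))
    return h @ w_bottleneck

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

# hypothetical sizes: 20-dim input, 50 hidden units, 10-dim bottleneck
x = rng.normal(size=20)                        # one input frame
phone_bn = mlp_bottleneck(x, rng.normal(size=(20, 50)),
                          rng.normal(size=(50, 10)))
speaker_bn = mlp_bottleneck(x, rng.normal(size=(20, 50)),
                            rng.normal(size=(50, 10)))

# merge at the bottleneck: concatenate both activations and map the
# merged vector to phone posteriors through a softmax output layer
merged = np.concatenate([phone_bn, speaker_bn])
posteriors = softmax(merged @ rng.normal(size=(20, 40)))   # 40 phone classes
```

The merged posteriors then feed the tandem ASR front-end in place of posteriors from a single phone MLP.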


International Conference on Acoustics, Speech, and Signal Processing | 2016

Deep neural network based posteriors for text-dependent speaker verification

Subhadeep Dey; Srikanth R. Madikeri; Marc Ferras; Petr Motlicek

The i-vector and Joint Factor Analysis (JFA) systems for text-dependent speaker verification use sufficient statistics computed from a speech utterance to estimate speaker models. These statistics average the acoustic information over the utterance, thereby losing all sequence information. In this paper, we study explicit content matching using Dynamic Time Warping (DTW) and present the best achievable error rates for speaker-dependent and speaker-independent content matching. For this purpose, a Deep Neural Network/Hidden Markov Model Automatic Speech Recognition (DNN/HMM ASR) system is used to extract content-related posterior probabilities. This approach outperforms systems using Gaussian mixture model posteriors by at least 50% relative Equal Error Rate (EER) on the RSR2015 database in content-mismatch trials. DNN posteriors are also used in i-vector and JFA systems, obtaining EERs as low as 0.02%.
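Content matching with DTW, as studied above, can be illustrated with the standard dynamic-programming recursion. This sketch assumes cosine distance as the local cost between per-frame posterior (or feature) vectors and path-length normalization; it is not the paper's exact configuration.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """DTW alignment cost between two sequences of feature vectors,
    with cosine distance as the local cost and path-length normalization."""
    seq_a = [np.asarray(v, dtype=float) for v in seq_a]
    seq_b = [np.asarray(v, dtype=float) for v in seq_b]

    def cos_dist(u, v):
        return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = cos_dist(seq_a[i - 1], seq_b[j - 1])
            # classic step pattern: insertion, deletion, or match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)
```

A small alignment cost indicates that the two utterances share the same content (and, with speaker-informative features, the same speaker).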


IEEE Transactions on Audio, Speech, and Language Processing | 2016

Speaker Diarization and Linking of Meeting Data

Marc Ferras; Srikanth R. Madikeri

Finding who spoke when in a collection of recordings, with speakers being uniquely identified across the database, is a challenging task. In this scenario, reasonable computing times and acoustic variation across recordings remain two major concerns to address in state-of-the-art speaker diarization systems. This paper extends prior work on diarizing large speech datasets using algorithms that scale well with increasing amounts of data while compensating for across-recording variability. We follow a two-stage approach performing speaker diarization and speaker linking, the former focusing on local within-recording speaker changes and the latter on global speaker changes across the database. In this study, we explore how these two modules interact with each other, while proposing a diarization fusion approach that prevents diarization errors from propagating to the linking stage. We further explore diarization fusion for speaker linking using different linking strategies and speaker modeling variants. Evaluations performed on single distant microphone data from the Augmented Multi-party Interaction (AMI) corpus show the effectiveness of the fusion approach after speaker linking and inter-session variability modeling via joint factor analysis.


Speech Communication | 2017

Template-matching for text-dependent speaker verification

Subhadeep Dey; Petr Motlicek; Srikanth R. Madikeri; Marc Ferras

In the last decade, i-vector and Joint Factor Analysis (JFA) approaches to speaker modeling have become ubiquitous in the area of automatic speaker recognition. Both of these techniques involve the computation of posterior probabilities, using either Gaussian Mixture Models (GMM) or Deep Neural Networks (DNN), as a prior step to estimating i-vectors or speaker factors. GMMs implicitly model the phonetic information of acoustic features, while DNNs explicitly model phonetic/linguistic units. For text-dependent speaker verification, DNN-based systems have considerably outperformed GMM-based ones on fixed-phrase tasks. However, both approaches ignore phone sequence information. In this paper, we aim to exploit this information by using Dynamic Time Warping (DTW) with speaker-informative features. These features are obtained from i-vector models extracted over short speech segments, also called online i-vectors. Probabilistic Linear Discriminant Analysis (PLDA) is further used to project online i-vectors onto a speaker-discriminative subspace. The proposed DTW approach obtained at least 74% relative improvement in equal error rate on the RSR corpus over other state-of-the-art approaches, including i-vector and JFA.


International Conference on Acoustics, Speech, and Signal Processing | 2016

System fusion and speaker linking for longitudinal diarization of TV shows

Marc Ferras; Srikanth R. Madikeri; Petr Motlicek; Hervé Bourlard

Performing speaker diarization while uniquely identifying the speakers in a collection of audio recordings is a challenging task. Based on our previous work on speaker diarization and linking, we developed a system for diarizing longitudinal TV show data sets based on the fusion of speaker diarization system outputs and speaker linking. Agreement between multiple diarization outputs is found prior to speaker linking, largely reducing the diarization error rate at the expense of keeping some speech data unlabelled. To deal with noisy clusters, a linear prediction based technique was used to label speakers after linking. Considerable gains for both fusion and labelling are reported. Despite the challenges of the longitudinal diarization task, this system obtained similar performance for linked and non-linked tasks under moderate session variability, highlighting the viability of a linking approach to longitudinal diarization of speech in the presence of noise, music and special audio effects.


IEEE Signal Processing Letters | 2016

A Large-Scale Open-Source Acoustic Simulator for Speaker Recognition

Marc Ferras; Srikanth R. Madikeri; Petr Motlicek; Subhadeep Dey

State-of-the-art speaker-recognition systems suffer significant performance loss on degraded speech conditions and acoustic mismatch between enrolment and test phases. Past international evaluation campaigns, such as the NIST speaker recognition evaluation (SRE), have partly addressed these challenges in some evaluation conditions. This work aims at further assessing and compensating for the effect of a wide variety of speech-degradation processes on speaker-recognition performance. We present an open-source simulator generating degraded telephone, VoIP, and interview-speech recordings using a comprehensive list of narrow-band, wide-band, and audio codecs, together with a database of over 60 h of environmental noise recordings and over 100 impulse responses collected from publicly available data. We provide speaker-verification results obtained with an i-vector-based system using either a clean or degraded PLDA back-end on a NIST SRE subset of data corrupted by the proposed simulator. While error rates increase considerably under degraded speech conditions, large relative equal error rate (EER) reductions were observed when using a PLDA model trained with a large number of degraded sessions per speaker.
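One basic degradation such a simulator applies, additive environmental noise at a target SNR, can be sketched as follows. The gain formula is the standard one; looping/trimming the noise segment to the speech length is an implementation assumption, and this is not the simulator's actual code.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Add a noise segment to a speech signal at a target
    signal-to-noise ratio given in dB."""
    noise = np.resize(noise, speech.shape)     # loop or trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # gain so that p_speech / (gain^2 * p_noise) == 10^(snr_db/10)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Reverberation and codec passes would be applied similarly as further stages of the degradation pipeline.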


International Conference on Acoustics, Speech, and Signal Processing | 2017

Intra-class covariance adaptation in PLDA back-ends for speaker verification

Srikanth R. Madikeri; Marc Ferras; Petr Motlicek; Subhadeep Dey

Multi-session training conditions are becoming increasingly common in recent benchmark datasets for both text-independent and text-dependent speaker verification. In the state-of-the-art i-vector framework for speaker verification, such conditions are addressed by simple techniques such as averaging the individual i-vectors, averaging scores, or modifying the Probabilistic Linear Discriminant Analysis (PLDA) scoring hypothesis for multi-session enrollment. The aforementioned techniques fail to exploit the speaker variabilities observed in the enrollment data for target speakers. In this paper, we propose to exploit the multi-session training data by estimating a speaker-dependent covariance matrix and updating the intra-speaker covariance during PLDA scoring for each target speaker. The proposed method is further extended by combining covariance adaptation and score averaging. In this method, the individual examples of the target speaker are compared against the test data as opposed to an averaged i-vector, and the scores obtained are then averaged. The proposed methods are evaluated on the NIST SRE 2012 dataset. Relative improvements of up to 29% in equal error rate are obtained.
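A simplified sketch of the adaptation idea, assuming a plain interpolation weight alpha and Mahalanobis-style scoring in place of full PLDA (both are assumptions made for brevity, not the paper's method): the global within-speaker covariance is interpolated with a speaker-dependent covariance estimated from the multi-session enrollment i-vectors.

```python
import numpy as np

def adapt_within_cov(global_cov, enrol_ivectors, alpha=0.5):
    """Interpolate the global within-speaker covariance with a
    speaker-dependent covariance estimated from enrollment i-vectors."""
    spk_cov = np.cov(np.asarray(enrol_ivectors, dtype=float), rowvar=False)
    return alpha * spk_cov + (1.0 - alpha) * global_cov

def verification_score(test_iv, enrol_ivectors, global_cov, alpha=0.5):
    """Score a test i-vector against a target speaker: negative Mahalanobis
    distance to the enrollment mean under the adapted covariance."""
    enrol = np.asarray(enrol_ivectors, dtype=float)
    mean = enrol.mean(axis=0)
    cov = adapt_within_cov(global_cov, enrol, alpha)
    cov = cov + 1e-6 * np.eye(cov.shape[0])   # ridge for invertibility
    d = np.asarray(test_iv, dtype=float) - mean
    return -float(d @ np.linalg.solve(cov, d))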


International Conference on Acoustics, Speech, and Signal Processing | 2017

Exploiting sequence information for text-dependent speaker verification

Subhadeep Dey; Petr Motlicek; Srikanth R. Madikeri; Marc Ferras

Model-based approaches to Speaker Verification (SV), such as Joint Factor Analysis (JFA), i-vector and relevance Maximum-a-Posteriori (MAP), have been shown to provide state-of-the-art performance for text-dependent systems with fixed phrases. The performance of i-vector and JFA models has been further enhanced by estimating posteriors from a Deep Neural Network (DNN) instead of a Gaussian Mixture Model (GMM). While both DNNs and GMMs aim at incorporating phonetic information of the phrase with these posteriors, model-based SV approaches ignore the sequence information of the phonetic units of the phrase. In this paper, we tackle this issue by applying dynamic time warping using speaker-informative features. We propose to use i-vectors computed from short segments of each speech utterance, also called online i-vectors, as feature vectors. The proposed approach is evaluated on the RedDots database and provides a 75% relative improvement in equal error rate over the best model-based SV baseline system in a content-mismatch condition.


Conference of the International Speech Communication Association | 2016

Inter-Task System Fusion for Speaker Recognition

Marc Ferras; Srikanth R. Madikeri; Subhadeep Dey; Petr Motlicek

Fusion is a common approach to improving the performance of speaker recognition systems. Multiple systems using different data, features or algorithms tend to bring complementary contributions to the final decisions being made. It is known that factors such as native language or accent contribute to speaker identity. In this paper, we explore inter-task fusion approaches that incorporate side information from accent and language identification systems to improve the performance of a speaker verification system. We explore both score-level and model-level approaches, using linear logistic regression and linear discriminant analysis respectively, reporting significant gains on accented and multilingual subsets of the NIST Speaker Recognition Evaluation 2008 data. Equal error rate and expected rank metrics are reported for speaker verification and speaker identification tasks.
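Score-level fusion by linear logistic regression, as mentioned above, can be sketched with a small gradient-descent trainer over stacked per-system scores. The learning rate, iteration count and toy data below are assumptions; production calibration tools use more robust optimizers.

```python
import numpy as np

def train_fusion(scores, labels, lr=0.1, iters=2000):
    """Learn linear-logistic-regression fusion weights (one weight per
    subsystem plus a bias) by batch gradient descent on cross-entropy."""
    X = np.hstack([np.asarray(scores, dtype=float),
                   np.ones((len(scores), 1))])          # append bias column
    y = np.asarray(labels, dtype=float)                  # 1 = target trial
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))               # sigmoid of fused score
        w -= lr * (X.T @ (p - y)) / len(y)               # cross-entropy gradient
    return w

def fuse(scores, w):
    """Fused log-odds score for each trial."""
    X = np.hstack([np.asarray(scores, dtype=float),
                   np.ones((len(scores), 1))])
    return X @ w
```

Side-information scores (e.g. from accent or language identification systems) are simply stacked as extra columns alongside the speaker verification scores before training.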

Collaboration


Dive into Marc Ferras's collaborations.

Top Co-Authors

Petr Motlicek
Idiap Research Institute

Subhadeep Dey
Idiap Research Institute

Srikanth R. Madikeri
Indian Institute of Technology Madras

Koichi Shinoda
Tokyo Institute of Technology

Claude Barras
Centre national de la recherche scientifique

Sadaoki Furui
Tokyo Institute of Technology

Pranay Dighe
Idiap Research Institute

Sangeeta Biswas
Tokyo Institute of Technology