Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Nicolas Scheffer is active.

Publication


Featured research published by Nicolas Scheffer.


International Conference on Acoustics, Speech, and Signal Processing | 2014

A novel scheme for speaker recognition using a phonetically-aware deep neural network

Yun Lei; Nicolas Scheffer; Luciana Ferrer; Mitchell McLaren

We propose a novel framework for speaker recognition in which extraction of sufficient statistics for the state-of-the-art i-vector model is driven by a deep neural network (DNN) trained for automatic speech recognition (ASR). Specifically, the DNN replaces the standard Gaussian mixture model (GMM) to produce frame alignments. The use of an ASR-DNN system in the speaker recognition pipeline is attractive as it integrates information from the speech content directly into the statistics, allowing the standard backends to remain unchanged. Improvements from the proposed framework over a state-of-the-art system are 30% relative at the equal error rate when evaluated on the telephone conditions of the 2012 NIST speaker recognition evaluation (SRE). The proposed framework is an effective way to leverage transcribed data for speaker recognition, opening up a wide spectrum of research directions.
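To make the alignment swap concrete, here is a minimal sketch of how frame-level posteriors drive the zeroth- and first-order Baum-Welch statistics: in the classic recipe the posteriors come from a UBM-GMM, while in this framework they come from the ASR DNN's senone softmax. Shapes, names, and the random inputs are illustrative only.

```python
import numpy as np

def sufficient_stats(features, posteriors):
    """Zeroth- and first-order statistics for i-vector extraction.

    features:   (T, D) acoustic frames (e.g., MFCCs)
    posteriors: (T, C) per-frame class posteriors; a GMM supplies these
                in the classic recipe, an ASR DNN's senone softmax here.
    """
    N = posteriors.sum(axis=0)           # (C,)   zeroth-order stats
    F = posteriors.T @ features          # (C, D) first-order stats
    return N, F

rng = np.random.default_rng(0)
T, D, C = 300, 20, 50                    # frames, feature dim, classes
feats = rng.normal(size=(T, D))
logits = rng.normal(size=(T, C))
post = np.exp(logits - logits.max(axis=1, keepdims=True))
post /= post.sum(axis=1, keepdims=True)  # softmax, as a DNN would output

N, F = sufficient_stats(feats, post)     # the i-vector backend is unchanged
```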


International Conference on Acoustics, Speech, and Signal Processing | 2012

Towards noise-robust speaker recognition using probabilistic linear discriminant analysis

Yun Lei; Lukas Burget; Luciana Ferrer; Martin Graciarena; Nicolas Scheffer

This work addresses the problem of speaker verification where additive noise is present in the enrollment and testing utterances. We show how the current state-of-the-art framework can be effectively used to mitigate this effect. We first look at the degradation a standard speaker verification system suffers when presented with noisy speech waveforms. We designed and generated a corpus with noisy conditions, based on the NIST SRE 2008 and 2010 data, built using open-source tools and freely available noise samples. We then show how adding noisy training data to the current i-vector-based approach followed by probabilistic linear discriminant analysis (PLDA) can bring significant gains in accuracy at various signal-to-noise ratio (SNR) levels. We demonstrate that this improvement is not feature-specific, as we present positive results for three disparate sets of features: standard Mel frequency cepstral coefficients, prosodic polynomial coefficients, and maximum likelihood linear regression (MLLR) transforms.
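A minimal sketch of the data-design step, assuming mono waveforms as float arrays: scale a noise sample so the mixture hits a target SNR, then replicate each training utterance at several SNR levels. Function and variable names are illustrative, not the paper's actual tooling.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled to a target signal-to-noise ratio."""
    noise = np.resize(noise, speech.shape)       # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12        # guard against silence
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(1)
clean = rng.normal(size=16000)                   # stand-in for 1 s at 16 kHz
noise = rng.normal(size=8000)                    # stand-in for a noise sample
multi_condition = [mix_at_snr(clean, noise, snr) for snr in (20, 15, 8)]
```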


International Conference on Acoustics, Speech, and Signal Processing | 2009

The SRI NIST 2008 speaker recognition evaluation system

Sachin S. Kajarekar; Nicolas Scheffer; Martin Graciarena; Elizabeth Shriberg; Andreas Stolcke; Luciana Ferrer; Tobias Bocklet

The SRI speaker recognition system for the 2008 NIST speaker recognition evaluation (SRE) incorporates a variety of models and features, both cepstral and stylistic. We highlight the improvements made to specific subsystems and analyze the performance of various subsystem combinations in different data conditions. We show the importance of language and nativeness conditioning, as well as the role of ASR for speaker verification.
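Subsystem combination of this kind is commonly done at the score level; the sketch below fuses per-subsystem scores with logistic regression as one plausible recipe, not SRI's exact calibration, and all data here is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_trials, n_subsystems = 1000, 4                 # e.g., cepstral + stylistic
scores = rng.normal(size=(n_trials, n_subsystems))
labels = rng.integers(0, 2, size=n_trials)       # 1 = target trial

fuser = LogisticRegression().fit(scores, labels)
fused = fuser.decision_function(scores)          # one fused score per trial
```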


International Conference on Acoustics, Speech, and Signal Processing | 2013

A noise-robust i-vector extractor using vector Taylor series for speaker recognition

Yun Lei; Lukas Burget; Nicolas Scheffer

We propose a novel approach for noise-robust speaker recognition, where a model of the distortions caused by additive and convolutive noise is integrated into the i-vector extraction framework. The model is based on a vector Taylor series (VTS) approximation widely successful in noise-robust speech recognition. It allows for extracting “cleaned-up” i-vectors that can be used in a standard i-vector backend. We evaluate the proposed framework on the PRISM corpus, a NIST-SRE-like corpus where noisy conditions were created by artificially adding babble noise to clean speech segments. Results show that VTS i-vectors yield significant improvements in all noisy conditions compared to a state-of-the-art baseline speaker recognition system. More importantly, the proposed framework is robust to noise, as the improvements are maintained when the system is trained on clean data.
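For reference, the VTS model in noise-robust speech recognition relates clean and noisy log-Mel spectra through the mismatch function below; linearizing it per Gaussian via its Jacobian is what makes the first-order Taylor expansion tractable. This sketch shows only that standard building block, not the paper's integration into i-vector extraction.

```python
import numpy as np

def vts_mismatch(x, n, h):
    """Noisy log-Mel spectrum y from clean x, additive noise n, channel h."""
    return x + h + np.log1p(np.exp(n - x - h))

def vts_jacobian(x, n, h):
    """dy/dx, used to linearize the model around (mu_x, mu_n, mu_h)."""
    g = 1.0 / (1.0 + np.exp(n - x - h))
    return np.diag(g)

x = np.full(24, -2.0)         # clean log-Mel energies (24 bands, made up)
n = np.full(24, -3.0)         # noise estimate
h = np.zeros(24)              # channel offset
y = vts_mismatch(x, n, h)     # distorted observation
J = vts_jacobian(x, n, h)     # (24, 24) diagonal Jacobian
```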


International Conference on Acoustics, Speech, and Signal Processing | 2013

Improving speaker identification robustness to highly channel-degraded speech through multiple system fusion

Mitchell McLaren; Nicolas Scheffer; Martin Graciarena; Luciana Ferrer; Yun Lei

This article describes our submission to the speaker identification (SID) evaluation for the first phase of the DARPA Robust Automatic Transcription of Speech (RATS) program. The evaluation focuses on speech data heavily degraded by channel effects. We show how we designed a robust system from multiple streams of noise-robust features, combined at a late stage in an i-vector framework. For all channels of interest, our combination strategy yields up to a 41% relative improvement in miss rate at a 4% false alarm rate with respect to the best-performing single-stream system.
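One simple late-combination strategy in an i-vector framework is to length-normalize each stream's i-vector and stack them for a joint backend; the sketch below shows only that variant, with hypothetical stream names, since the paper evaluates several fusion strategies.

```python
import numpy as np

rng = np.random.default_rng(3)
streams = ["mfcc", "plp", "rasta"]                     # hypothetical streams
ivectors = {s: rng.normal(size=400) for s in streams}  # one i-vector per stream

# Length-normalize per stream, then concatenate for a joint backend
# (e.g., LDA + PLDA trained on the stacked vectors).
stacked = np.concatenate([v / np.linalg.norm(v) for v in ivectors.values()])
assert stacked.shape == (400 * len(streams),)
```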


International Conference on Acoustics, Speech, and Signal Processing | 2012

iVector-based prosodic system for language identification

David Martinez; Lukas Burget; Luciana Ferrer; Nicolas Scheffer

Prosody is the aspect of speech that reflects rhythm, stress, and intonation. In language identification tasks, these characteristics are assumed to be language-dependent, so the language can be identified from them. In this paper, we build an automatic language recognition system that extracts prosodic information from speech and makes decisions about the language with a generative classifier based on iVectors. The system is tested on the NIST LRE09 dataset. The results are not yet comparable to state-of-the-art acoustic and phonotactic systems, but they are promising, and fusing the new approach with an iVector-based acoustic system brings further improvements over the latter.
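A common form of generative iVector classifier for language ID fits one Gaussian mean per language with a shared within-class covariance; the sketch below assumes that form (the paper's backend may differ in detail) and uses synthetic data.

```python
import numpy as np

def train(ivecs, labels):
    langs = sorted(set(labels))
    means = {l: ivecs[labels == l].mean(axis=0) for l in langs}
    centered = np.vstack([ivecs[labels == l] - means[l] for l in langs])
    prec = np.linalg.inv(np.cov(centered, rowvar=False))  # shared covariance
    return means, prec

def score(x, means, prec):
    """Per-language log-likelihoods up to a shared constant."""
    return {l: -0.5 * (x - m) @ prec @ (x - m) for l, m in means.items()}

rng = np.random.default_rng(4)
labels = np.repeat(np.array(["en", "es", "zh"]), 50)
ivecs = rng.normal(size=(150, 30))                        # toy 30-d iVectors
means, prec = train(ivecs, labels)
best = max(score(ivecs[0], means, prec).items(), key=lambda kv: kv[1])[0]
```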


International Conference on Acoustics, Speech, and Signal Processing | 2010

A comparison of approaches for modeling prosodic features in speaker recognition

Luciana Ferrer; Nicolas Scheffer; Elizabeth Shriberg

Prosodic information has been used successfully for speaker recognition for more than a decade. The best-performing prosodic system to date has been one based on features extracted over syllables obtained automatically from speech recognition output. The features are transformed using a Fisher kernel, and speaker models are trained using support vector machines (SVMs). Recently, a simpler version of these features, based on pseudo-syllables, was shown to perform well when modeled using joint factor analysis (JFA). In this work, we study the two modeling techniques for the simpler set of features. We show that, for these features, a combination of JFA systems for different sequence lengths greatly outperforms both original modeling methods. Furthermore, the combination of both methods gives significant improvements over the best single system. Overall, a performance improvement of 30% in the detection cost function (DCF) over the two previously published methods is achieved using very simple strategies.
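The "simpler" features referenced here are polynomial approximations of pitch and energy contours over pseudo-syllable regions; as a rough sketch under that assumption, each region's contour is reduced to a handful of fit coefficients (the paper's exact polynomial basis and normalization may differ).

```python
import numpy as np

def contour_coeffs(contour, order=5):
    """Fit a polynomial to one region's contour; the coefficients
    become that region's prosodic feature vector."""
    t = np.linspace(-1.0, 1.0, len(contour))
    return np.polyfit(t, contour, deg=order)

rng = np.random.default_rng(5)
# A fake 40-frame pitch contour for one pseudo-syllable.
pitch = 120 + 10 * np.sin(np.linspace(0, 3, 40)) + rng.normal(scale=1.0, size=40)
feats = contour_coeffs(pitch)        # 6 coefficients for a 5th-order fit
```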


IEEE Transactions on Audio, Speech, and Language Processing | 2016

Study of senone-based deep neural network approaches for spoken language recognition

Luciana Ferrer; Yun Lei; Mitchell McLaren; Nicolas Scheffer

This paper compares different approaches for using deep neural networks (DNNs) trained to predict senone posteriors for the task of spoken language recognition (SLR). These approaches have recently been found to outperform various baseline systems on different datasets, but they have not yet been compared to each other or to a common baseline. Two of these approaches use the DNNs to generate feature vectors which are then processed in different ways to predict the score of each language given a test sample. The features are extracted either from a bottleneck layer in the DNN or from the output layer. In the third approach, the standard i-vector extraction procedure is modified to use the senones as classes and the DNN to predict the zeroth order statistics. We compare these three approaches and conclude that the approach based on bottleneck features followed by i-vector modeling outperforms the other two approaches. We also show that score-level fusion of some of these approaches leads to gains over using a single approach for short-duration test samples. Finally, we demonstrate that fusing systems that use DNNs trained with several languages leads to improvements in performance over the best single system, and we propose an adaptation procedure for DNNs trained with languages with less available data. Overall, we show improvements between 40% and 70% relative to a state-of-the-art Gaussian mixture model (GMM) i-vector system on test durations from 3 seconds to 120 seconds on two significantly different tasks: the NIST 2009 language recognition evaluation task and the DARPA RATS language identification task.
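To illustrate the bottleneck variant, the toy NumPy forward pass below reads features off a narrow hidden layer of a senone-classification network; layer sizes and weights are invented here, and a real system would use a trained ASR DNN.

```python
import numpy as np

rng = np.random.default_rng(6)
dims = [60, 1024, 80, 1024, 3000]     # input, hidden, bottleneck, hidden, senones
W = [rng.normal(scale=0.01, size=(a, b)) for a, b in zip(dims[:-1], dims[1:])]
BOTTLENECK = 2                        # stop after the 80-d layer

def bottleneck_features(x):
    h = x
    for i, w in enumerate(W, start=1):
        h = np.maximum(h @ w, 0.0)    # ReLU hidden layers
        if i == BOTTLENECK:
            return h                  # (T, 80) bottleneck features
    return h

frames = rng.normal(size=(200, 60))   # stacked-context input frames
bn = bottleneck_features(frames)      # then: i-vector modeling on `bn`
assert bn.shape == (200, 80)
```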


International Conference on Acoustics, Speech, and Signal Processing | 2014

Effective use of DCTs for contextualizing features for speaker recognition

Mitchell McLaren; Nicolas Scheffer; Luciana Ferrer; Yun Lei

This article proposes a new approach for contextualizing features for speaker recognition through the discrete cosine transform (DCT). Specifically, we apply a 2D-DCT on the Mel filterbank outputs to replace the common Mel frequency cepstral coefficients (MFCCs) appended with deltas and double deltas. A thorough comparison of algorithms for delta computation and DCT-based contextualization for speaker recognition is provided, and the effect of varying the size of the analysis window in each case is considered. Selecting 2D-DCT coefficients with a zig-zag approach permits an arbitrary feature dimension using the most energized coefficients. We show that 60 coefficients computed using our approach outperform the standard MFCCs appended with double deltas by up to 25% relative on the NIST 2012 speaker recognition evaluation (SRE) corpus in both Cprimary and equal error rate (EER), while additional coefficients increase system robustness to noise.
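A minimal sketch of the 2D-DCT contextualization: for each frame, take a time-by-band patch of log-Mel outputs, apply a separable 2D-DCT, and keep the lowest coefficients in zig-zag order. The 60-coefficient budget mirrors the abstract, but the window size, band count, and implementation details are illustrative.

```python
import numpy as np
from scipy.fftpack import dct

def zigzag_indices(rows, cols):
    """(row, col) pairs ordered by anti-diagonal: lowest 'frequencies' first."""
    idx = [(r, c) for r in range(rows) for c in range(cols)]
    return sorted(idx, key=lambda rc: (rc[0] + rc[1], rc[0]))

def dct2_features(logmel, win=20, keep=60):
    """Per frame: 2D-DCT of the surrounding (win x bands) patch,
    keeping the `keep` lowest zig-zag coefficients."""
    T, B = logmel.shape
    order = zigzag_indices(win, B)[:keep]
    pad = np.pad(logmel, ((win // 2, win // 2), (0, 0)), mode="edge")
    out = np.empty((T, keep))
    for t in range(T):
        patch = pad[t:t + win]                       # (win, B) context window
        coeffs = dct(dct(patch, axis=0, norm="ortho"), axis=1, norm="ortho")
        out[t] = [coeffs[r, c] for r, c in order]
    return out

rng = np.random.default_rng(7)
feats = dct2_features(rng.normal(size=(100, 24)))    # (100, 60) features
```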


IEEE Odyssey: The Speaker and Language Recognition Workshop | 2006

UBM-GMM Driven Discriminative Approach for Speaker Verification

Nicolas Scheffer; Jean-François Bonastre

In the past few years, discriminative approaches to speaker detection have shown good results and attracted increasing interest. Among these methods, SVM-based systems have many advantages, especially their ability to deal with a high-dimensional feature space. Generative systems such as UBM-GMM systems show the best performance in speaker verification tasks. Combining generative and discriminative approaches is not a new idea and has been studied several times by mapping a whole speech utterance onto a fixed-length vector. This paper presents a straightforward, low-cost method to combine the two approaches, using only a UBM model to drive the experiment. We show that the use of the TFLLR kernel, while closely related to a reduced form of the Fisher mapping, yields performance close to a standard GMM/UBM-based speaker detection system. Moreover, we show that a combination of both outperforms either system taken independently.
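One common form of the TFLLR expansion maps an utterance to its normalized Gaussian occupation counts, each scaled by the inverse square root of the corresponding UBM weight, and trains a linear SVM on those vectors. The sketch below assumes that form with synthetic posteriors; the paper gives the exact kernel.

```python
import numpy as np
from sklearn.svm import LinearSVC

def tfllr_vector(posteriors, ubm_weights):
    """posteriors: (T, C) frame-level UBM component posteriors."""
    f = posteriors.mean(axis=0)          # relative occupation frequencies
    return f / np.sqrt(ubm_weights)      # TFLLR-style scaling

rng = np.random.default_rng(8)
C = 64
w = rng.dirichlet(np.ones(C))            # stand-in UBM mixture weights
utts = np.vstack([tfllr_vector(rng.dirichlet(np.ones(C), size=200), w)
                  for _ in range(40)])   # 40 mapped utterances
labels = np.arange(40) % 2               # target vs. impostor (toy labels)
svm = LinearSVC().fit(utts, labels)      # linear SVM on the expansion
```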

Collaboration


Dive into Nicolas Scheffer's collaborations.

Top Co-Authors

Luciana Ferrer
University of Buenos Aires

Yun Lei
University of Texas at Dallas

Elizabeth Shriberg
Ludwig Maximilian University of Munich