Douglas E. Sturim
Massachusetts Institute of Technology
Publications
Featured research published by Douglas E. Sturim.
IEEE Signal Processing Letters | 2006
William M. Campbell; Douglas E. Sturim; Douglas A. Reynolds
Gaussian mixture models (GMMs) have proven extremely successful for text-independent speaker recognition. The standard training method for GMM models is to use MAP adaptation of the means of the mixture components based on speech from a target speaker. Recent methods in compensation for speaker and channel variability have proposed the idea of stacking the means of the GMM model to form a GMM mean supervector. We examine the idea of using the GMM supervector in a support vector machine (SVM) classifier. We propose two new SVM kernels based on distance metrics between GMM models. We show that these SVM kernels produce excellent classification accuracy in a NIST speaker recognition evaluation task.
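The supervector construction described above can be sketched in a few lines. This is an illustrative sketch, not the evaluated system: the sqrt(weight)/sigma scaling shown here is one common normalization under which a plain dot product between supervectors approximates a distance between the underlying GMMs and can serve as a linear SVM kernel.

```python
import numpy as np

def gmm_mean_supervector(means, weights, covars):
    """Stack the scaled component means of a diagonal-covariance GMM
    into a single supervector.  Each mean is scaled by sqrt(w_i)/sigma_i
    (an assumed, commonly used normalization)."""
    scaled = [np.sqrt(w) * m / np.sqrt(c)
              for m, w, c in zip(means, weights, covars)]
    return np.concatenate(scaled)

def supervector_kernel(sv_a, sv_b):
    """Linear kernel between two GMM supervectors, usable in an SVM."""
    return float(np.dot(sv_a, sv_b))

# Toy example: two 2-component GMMs over 3-dimensional features.
rng = np.random.default_rng(0)
weights = np.array([0.4, 0.6])
covars = np.ones((2, 3))          # diagonal covariances, one row per component
sv1 = gmm_mean_supervector(rng.normal(size=(2, 3)), weights, covars)
sv2 = gmm_mean_supervector(rng.normal(size=(2, 3)), weights, covars)
k = supervector_kernel(sv1, sv2)  # scalar kernel value for the SVM
```

In practice each supervector comes from MAP-adapting the UBM means to one utterance, so the kernel compares utterances (or speaker models) directly in supervector space.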
international conference on acoustics, speech, and signal processing | 2006
William M. Campbell; Douglas E. Sturim; Douglas A. Reynolds; Alex Solomonoff
Gaussian mixture models with universal backgrounds (UBMs) have become the standard method for speaker recognition. Typically, a speaker model is constructed by MAP adaptation of the means of the UBM. A GMM supervector is constructed by stacking the means of the adapted mixture components. A recent discovery is that latent factor analysis of this GMM supervector is an effective method for variability compensation. We consider this GMM supervector in the context of support vector machines. We construct a support vector machine kernel using the GMM supervector. We show similarities based on this kernel between the method of SVM nuisance attribute projection (NAP) and the recent results in latent factor analysis. Experiments on a NIST SRE 2005 corpus demonstrate the effectiveness of the new technique.
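The core of NAP is a projection that removes a low-dimensional nuisance (e.g. channel) subspace from each supervector before SVM training. A minimal sketch, assuming the nuisance basis U has already been estimated (for instance from multi-session recordings of the same speakers):

```python
import numpy as np

def nap_projection(U):
    """Build the NAP projection P = I - U U^T, where the columns of U
    form an orthonormal basis for the nuisance subspace."""
    d = U.shape[0]
    return np.eye(d) - U @ U.T

# Toy example: remove a 2-dimensional nuisance subspace
# from 5-dimensional supervectors.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # orthonormal nuisance basis
P = nap_projection(U)

sv = rng.normal(size=5)        # a raw GMM supervector
sv_clean = P @ sv              # compensated supervector
# After projection, nothing of sv_clean remains in the nuisance subspace.
residual = np.linalg.norm(U.T @ sv_clean)
```

The connection the paper draws is that this projection and latent factor analysis both model unwanted variability as a low-rank subspace of the supervector space; they differ in how that subspace is used at scoring time.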
international conference on acoustics, speech, and signal processing | 2001
Douglas E. Sturim; Douglas A. Reynolds; Elliot Singer; Joseph P. Campbell
This paper introduces the technique of anchor modeling for speaker detection and speaker indexing. The anchor modeling algorithm is refined by pruning the number of models needed. The system is applied to the speaker detection problem, where its performance is shown to fall short of the state-of-the-art Gaussian mixture model with universal background model (GMM-UBM) system. However, it is further shown that its computational efficiency lends itself to speaker indexing for searching large audio databases for desired speakers, where excessive computation may prohibit the use of the GMM-UBM recognition system. Finally, the paper presents a method for cascading anchor model and GMM-UBM detectors for speaker indexing. This approach benefits from the efficiency of anchor modeling and the high accuracy of GMM-UBM recognition.
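The cascade idea reduces to a two-stage filter: a cheap first-stage score prunes the database, and the expensive detector only scores the survivors. The sketch below is hypothetical (the scoring functions and thresholds stand in for real anchor-model and GMM-UBM detectors):

```python
def cascade_search(segments, fast_score, slow_score,
                   fast_threshold, final_threshold):
    """Two-stage cascade: a cheap detector prunes the candidate list,
    then an expensive detector makes the final decision."""
    candidates = [s for s in segments if fast_score(s) >= fast_threshold]
    return [s for s in candidates if slow_score(s) >= final_threshold]

# Toy scores standing in for anchor-model and GMM-UBM detectors.
anchor_scores = {"seg1": 0.9, "seg2": 0.2, "seg3": 0.7}
gmm_ubm_scores = {"seg1": 0.95, "seg3": 0.3}  # only scored if they survive

hits = cascade_search(["seg1", "seg2", "seg3"],
                      lambda s: anchor_scores[s],
                      lambda s: gmm_ubm_scores[s],
                      fast_threshold=0.5,
                      final_threshold=0.5)
```

The savings come from the first stage: the GMM-UBM detector is never run on segments the anchor models have already rejected.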
international conference on acoustics, speech, and signal processing | 2010
Pedro A. Torres-Carrasquillo; Elliot Singer; Terry P. Gleason; Alan McCree; Douglas A. Reynolds; Fred Richardson; Douglas E. Sturim
This paper presents a description of the MIT Lincoln Laboratory language recognition system submitted to the NIST 2009 Language Recognition Evaluation (LRE). This system consists of a fusion of three core recognizers, two based on spectral similarity and one based on tokenization. The 2009 LRE differed from previous ones in that test data included narrowband segments from worldwide Voice of America broadcasts as well as conventional recorded conversational telephone speech. Results are presented for the 23-language closed-set and open-set detection tasks at the 30, 10, and 3 second durations along with a discussion of the language-pair task. On the 30 second 23-language closed set detection task, the system achieved a 1.64 average error rate.
international conference on acoustics, speech, and signal processing | 2005
Douglas E. Sturim; Douglas A. Reynolds
We discuss an extension to the widely used score normalization technique of test normalization (Tnorm) for text-independent speaker verification. A new method, speaker-adaptive Tnorm, is presented that improves on standard Tnorm by adapting the cohort speaker set to the target model. Examples of this improvement using the 2004 NIST SRE data are also presented.
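The difference between standard and adaptive Tnorm can be sketched as follows. This is a schematic illustration under an assumed setup: the similarity measure between cohort models and the target, and the selection size, are placeholders for whatever the deployed system uses.

```python
import numpy as np

def tnorm(raw_score, cohort_scores):
    """Standard Tnorm: normalize a test score by the mean and standard
    deviation of the same test utterance scored against cohort models."""
    cohort_scores = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort_scores.mean()) / cohort_scores.std()

def adaptive_tnorm(raw_score, cohort_scores, similarity_to_target, n_select):
    """Speaker-adaptive Tnorm (sketch): keep only the n_select cohort
    models most similar to the target model, then Tnorm as usual."""
    idx = np.argsort(similarity_to_target)[::-1][:n_select]
    return tnorm(raw_score, np.asarray(cohort_scores, dtype=float)[idx])

# Toy example: one test score against six cohort models.
raw = 2.0
cohort = [0.5, 1.0, -0.3, 0.8, 1.2, 0.1]       # cohort-model scores
sim = [0.9, 0.2, 0.1, 0.8, 0.7, 0.3]           # similarity to target model
base = tnorm(raw, cohort)
adapted = adaptive_tnorm(raw, cohort, sim, n_select=3)
```

The intuition is that a cohort matched to the target speaker yields normalization statistics that better reflect the impostor score distribution for that particular model.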
international conference on acoustics, speech, and signal processing | 2002
Douglas E. Sturim; Douglas A. Reynolds; Robert B. Dunn; Thomas F. Quatieri
In this paper we present an approach to close the gap between text-dependent and text-independent speaker verification performance. Text-constrained GMM-UBM systems are created using word segmentations produced by an LVCSR system on conversational speech, allowing the system to focus on speaker differences over a constrained set of acoustic units. Results on the 2001 NIST extended data task show this approach can be used to produce an equal error rate of less than 1%.
international conference on acoustics, speech, and signal processing | 2005
Douglas A. Reynolds; William M. Campbell; Terry T. Gleason; Carl Quillen; Douglas E. Sturim; Pedro A. Torres-Carrasquillo; André Gustavo Adami
The MIT Lincoln Laboratory submission for the 2004 NIST speaker recognition evaluation (SRE) was built upon seven core systems using speaker information from short-term acoustics, pitch and duration prosodic behavior, and phoneme and word usage. These different levels of information were modeled and classified using Gaussian mixture models, support vector machines and N-gram language models and were combined using a single layer perceptron fuser. The 2004 SRE used a new multi-lingual, multi-channel speech corpus that provided a challenging speaker detection task for the above systems. We describe the core systems used and provide an overview of their performance on the 2004 SRE detection tasks.
international conference on acoustics, speech, and signal processing | 2005
Nicolas Malyska; Thomas F. Quatieri; Douglas E. Sturim
A dysphonia, or disorder of the mechanisms of phonation in the larynx, can create time-varying amplitude fluctuations in the voice. A model for band-dependent analysis of this amplitude modulation (AM) phenomenon in dysphonic speech is developed from a traditional communications engineering perspective. This perspective challenges current dysphonia analysis methods that analyze AM in the time-domain signal. An automatic dysphonia recognition system is designed to exploit AM in voice using a biologically inspired model of the inferior colliculus. This system, built upon a Gaussian-mixture-model (GMM) classification backend, recognizes the presence of dysphonia in the voice signal. Recognition experiments using data obtained from the Kay Elemetrics voice disorders database suggest that the system provides complementary information to state-of-the-art mel-cepstral features. We present dysphonia recognition as an approach to developing features that capture glottal source differences in normal speech.
Odyssey 2016 | 2016
Pedro A. Torres-Carrasquillo; Najim Dehak; Elizabeth Godoy; Douglas A. Reynolds; Fred Richardson; Stephen Shum; Elliot Singer; Douglas E. Sturim
In this paper we describe the most recent MIT Lincoln Laboratory language recognition system, developed for the NIST 2015 Language Recognition Evaluation (LRE). The submission features a fusion of five core classifiers, with most systems developed in the context of an i-vector framework. The 2015 evaluation presented new paradigms. First, the evaluation included fixed training and open training tracks for the first time; second, language classification performance was measured across 6 language clusters using 20 language classes instead of an N-way language task; and third, performance was measured across a nominal 3-30 second range. Results are presented for the overall performance across the six language clusters for both the fixed and open training tasks. On the 6-cluster metric the Lincoln system achieved overall costs of 0.173 and 0.168 for the fixed and open tasks respectively.
international conference on acoustics, speech, and signal processing | 2007
William M. Campbell; Douglas E. Sturim; Wade Shen; Douglas A. Reynolds; Jiri Navratil
Many powerful methods for speaker recognition have been introduced in recent years - high-level features, novel classifiers, and channel compensation methods. A common arena for evaluating these methods has been the NIST speaker recognition evaluation (SRE). In the NIST SRE from 2002-2005, a popular approach was to fuse multiple systems based upon cepstral features and different linguistic tiers of high-level features. With enough enrollment data, this approach produced dramatic error rate reductions and showed conceptually that better performance was attainable. A drawback in this approach is that many high-level systems were being run independently requiring significant computational complexity and resources. In 2006, MIT Lincoln Laboratory focused on a new system architecture which emphasized reduced complexity. This system was a carefully selected mixture of high-level techniques, new classifier methods, and novel channel compensation techniques. This new system has excellent accuracy and has substantially reduced complexity. The performance and computational aspects of the system are detailed on a NIST 2006 SRE task.