Fred Richardson
Massachusetts Institute of Technology
Publications
Featured research published by Fred Richardson.
IEEE Signal Processing Letters | 2015
Fred Richardson; Douglas A. Reynolds; Najim Dehak
The impressive gains in performance obtained using deep neural networks (DNNs) for automatic speech recognition (ASR) have motivated the application of DNNs to other speech technologies such as speaker recognition (SR) and language recognition (LR). Prior work has shown performance gains for separate SR and LR tasks using DNNs for direct classification or for feature extraction. In this work we present the application of a single DNN for both SR and LR using the 2013 Domain Adaptation Challenge speaker recognition (DAC13) and the NIST 2011 language recognition evaluation (LRE11) benchmarks. Using a single DNN trained for ASR on Switchboard data, we demonstrate large gains in performance on both benchmarks: a 55% reduction in EER for the DAC13 out-of-domain condition and a 48% reduction in Cavg on the LRE11 30 s test condition. It is also shown that further gains are possible using score or feature fusion, leading to the possibility of a single i-vector extractor producing state-of-the-art SR and LR performance.
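The shared-DNN idea above can be sketched as replacing GMM component posteriors with DNN senone posteriors when accumulating the sufficient statistics that feed a single i-vector extractor. The snippet below is a minimal illustration on random stand-in data; the function name and dimensions are assumptions, not the paper's implementation.

```python
import numpy as np

def baum_welch_stats(posteriors, features):
    """Accumulate Baum-Welch statistics for an i-vector extractor.

    posteriors: (T, C) frame-level senone posteriors from an ASR DNN
                (these replace GMM component posteriors).
    features:   (T, D) acoustic feature frames.
    Returns zeroth-order (C,) and first-order (C, D) statistics.
    """
    n = posteriors.sum(axis=0)       # soft frame counts per senone
    f = posteriors.T @ features      # posterior-weighted feature sums
    return n, f

# Toy data: 100 frames, 8 senones, 20-dimensional features.
rng = np.random.default_rng(0)
post = rng.random((100, 8))
post /= post.sum(axis=1, keepdims=True)  # normalize rows to valid posteriors
feats = rng.standard_normal((100, 20))
n, f = baum_welch_stats(post, feats)
```

Because the statistics have the same form regardless of whether posteriors come from a GMM or a DNN, the same i-vector extractor can serve both the SR and LR tasks downstream.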
international conference on acoustics, speech, and signal processing | 2010
Pedro A. Torres-Carrasquillo; Elliot Singer; Terry P. Gleason; Alan McCree; Douglas A. Reynolds; Fred Richardson; Douglas E. Sturim
This paper presents a description of the MIT Lincoln Laboratory language recognition system submitted to the NIST 2009 Language Recognition Evaluation (LRE). This system consists of a fusion of three core recognizers, two based on spectral similarity and one based on tokenization. The 2009 LRE differed from previous ones in that test data included narrowband segments from worldwide Voice of America broadcasts as well as conventional recorded conversational telephone speech. Results are presented for the 23-language closed-set and open-set detection tasks at the 30, 10, and 3 second durations, along with a discussion of the language-pair task. On the 30 second 23-language closed-set detection task, the system achieved a 1.64% average error rate.
international conference on acoustics, speech, and signal processing | 2008
Fred Richardson; William M. Campbell
One commonly used approach for language recognition is to convert the input speech into a sequence of tokens such as words or phones and then to use these token sequences to determine the target language. The language classification is typically performed by extracting N-gram statistics from the token sequences and then using an N-gram language model or support vector machine (SVM) to perform the classification. One problem with these approaches is that the number of N-grams grows exponentially as the order N is increased. This is especially problematic for an SVM classifier as each utterance is represented as a distinct N-gram vector. In this paper we propose a novel approach for modeling higher order N-grams using an SVM via an alternating filter-wrapper feature selection method. We demonstrate the effectiveness of this technique on the NIST 2007 language recognition task.
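The N-gram explosion described above is easy to quantify: over a token inventory of size |V| there are |V|^N possible N-grams, and each utterance's SVM feature vector has one dimension per N-gram. A toy sketch with a hypothetical 3-phone inventory and utterance (not evaluation data):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the N-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

phones = ["a", "b", "c"]                    # toy 3-phone inventory
utterance = ["a", "b", "a", "c", "b", "a"]  # toy phone-token sequence

bigrams = ngram_counts(utterance, 2)

# Dimensionality of the full N-gram space grows as |V|**N for N = 1..4:
dims = [len(phones) ** n for n in range(1, 5)]  # [3, 9, 27, 81]
```

With a realistic phone inventory of a few dozen symbols, the 4-gram space already runs into the millions of dimensions, which is what motivates the paper's filter-wrapper feature selection.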
international conference on acoustics, speech, and signal processing | 2007
William M. Campbell; Fred Richardson; Douglas A. Reynolds
Language recognition is typically performed with methods that exploit phonotactics - a phone recognition language modeling (PRLM) system. A PRLM system converts speech to a lattice of phones and then scores a language model. A standard extension to this scheme is to use multiple parallel phone recognizers (PPRLM). In this paper, we modify this approach in two distinct ways. First, we replace the phone tokenizer by a powerful speech-to-text system. Second, we use a discriminative support vector machine for language modeling. Our goals are twofold. First, we explore the ability of a single speech-to-text system to distinguish multiple languages. Second, we fuse the new system with an SVM PRLM system to see if it complements current approaches. Experiments on the 2005 NIST language recognition corpus show the new word system accomplishes these goals and has significant potential for language recognition.
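As a rough sketch of the PRLM scoring step described above: once speech is tokenized, each language's phonotactic model scores the token sequence and the highest-scoring language wins. The bigram log-probabilities and backoff value below are invented toy numbers, not trained models.

```python
def score(tokens, bigram_logprobs, backoff=-6.0):
    """Sum bigram log-probabilities over a token sequence,
    falling back to a flat backoff score for unseen bigrams."""
    return sum(bigram_logprobs.get(bg, backoff)
               for bg in zip(tokens, tokens[1:]))

# Toy per-language phonotactic bigram models (made-up values).
models = {
    "english": {("th", "ah"): -0.5, ("ah", "b"): -1.0},
    "spanish": {("th", "ah"): -3.0, ("ah", "b"): -2.5},
}

tokens = ["th", "ah", "b"]  # toy tokenizer output
best = max(models, key=lambda lang: score(tokens, models[lang]))
```

The paper's two modifications slot into this picture at either end: a speech-to-text system replaces the phone tokenizer, and a discriminative SVM replaces the generative N-gram scoring.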
ieee automatic speech recognition and understanding workshop | 2007
Timothy J. Hazen; Fred Richardson; Anna Margolis
In this paper, we investigate the problem of topic identification from audio documents using features extracted from speech recognition lattices. We are particularly interested in the difficult case where the training material is minimally annotated with only topic labels. Under this scenario, the lexical knowledge that is useful for topic identification may not be available, and automatic methods for extracting linguistic knowledge useful for distinguishing between topics must be relied upon. Towards this goal we investigate the problem of topic identification on conversational telephone speech from the Fisher corpus under a variety of increasingly difficult constraints. We contrast the performance of systems that have knowledge of the lexical units present in the audio data, against systems that rely entirely on phonetic processing.
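One common way to use lattice-derived features for topic identification, sketched below with toy numbers, is to score posterior-weighted (expected) word counts against per-topic unigram models. This is an illustrative multinomial naive Bayes stand-in, not the paper's exact system.

```python
import numpy as np

def topic_scores(expected_counts, topic_log_unigrams):
    """Score each topic by the log-likelihood of soft word counts.

    expected_counts:    (V,) posterior-weighted counts from a lattice.
    topic_log_unigrams: (K, V) per-topic log word probabilities.
    """
    return topic_log_unigrams @ expected_counts

# Toy vocabulary and two toy topics (probabilities are made up).
vocab = ["game", "score", "bank", "loan"]
log_p = np.log(np.array([[0.4, 0.4, 0.1, 0.1],    # "sports" topic
                         [0.1, 0.1, 0.4, 0.4]]))  # "finance" topic

counts = np.array([2.3, 1.1, 0.2, 0.1])  # soft counts from a lattice
best_topic = int(np.argmax(topic_scores(counts, log_p)))
```

Using expected counts from the lattice, rather than counts from the 1-best hypothesis, keeps recognition uncertainty in the feature representation, which matters most in the phonetic-only condition the paper studies.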
Odyssey 2016 | 2016
Pedro A. Torres-Carrasquillo; Najim Dehak; Elizabeth Godoy; Douglas A. Reynolds; Fred Richardson; Stephen Shum; Elliot Singer; Douglas E. Sturim
In this paper we describe the most recent MIT Lincoln Laboratory language recognition system developed for the NIST 2015 Language Recognition Evaluation (LRE). The submission features a fusion of five core classifiers, with most systems developed in the context of an i-vector framework. The 2015 evaluation presented new paradigms. First, the evaluation included fixed training and open training tracks for the first time; second, language classification performance was measured across 6 language clusters using 20 language classes instead of an N-way language task; and third, performance was measured across a nominal 3-30 second range. Results are presented for the overall performance across the six language clusters for both the fixed and open training tasks. On the 6-cluster metric the Lincoln system achieved overall costs of 0.173 and 0.168 for the fixed and open tasks, respectively.
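Score-level fusion of the kind used to combine multiple core classifiers can be sketched as a weighted linear combination of per-system scores. The weights and bias below are made-up values; a real backend would calibrate them on development data.

```python
def fuse(scores_per_system, weights, bias=0.0):
    """Linear score fusion: weighted sum of subsystem scores plus a bias."""
    return bias + sum(w * s for w, s in zip(weights, scores_per_system))

# Toy scores from three hypothetical subsystems for one trial,
# with illustrative (uncalibrated) fusion weights.
subsystem_scores = [1.2, -0.3, 0.8]
fused = fuse(subsystem_scores, weights=[0.5, 0.2, 0.3])
```

In practice the weights are trained (e.g. by logistic regression) so that the fused score is well calibrated for the evaluation's cost metric.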
international conference on acoustics, speech, and signal processing | 1998
George Zavaliagkos; John W. McDonough; David R. Miller; Amro El-Jaroudi; Jayadev Billa; Fred Richardson; Kristine W. Ma; Man-Hung Siu; Herbert Gish
This paper presents the 1997 BBN Byblos large vocabulary speech recognition (LVCSR) system. We give an outline of the algorithms and procedures used to train the system, describe the recognizer configuration, and present the major technological innovations that led to the performance improvements. The major testbed for our results is the Switchboard corpus, where current word error rates vary from 27% to 34% depending on the test set. In addition, we present results on the CallHome Spanish and Arabic tests, where we demonstrate that technology developed on English corpora is very much portable to other problems and languages.
international conference on acoustics, speech, and signal processing | 2011
Douglas E. Sturim; William M. Campbell; Najim Dehak; Zahi N. Karam; Alan McCree; Douglas A. Reynolds; Fred Richardson; Pedro A. Torres-Carrasquillo; Stephen Shum
Research in the speaker recognition community has continued to address methods of mitigating variational nuisances. Telephone and auxiliary-microphone recorded speech emphasize the need for a robust way of dealing with unwanted variation. The design of recent 2010 NIST-SRE Speaker Recognition Evaluation (SRE) reflects this research emphasis. In this paper, we present the MIT submission applied to the tasks of the 2010 NIST-SRE with two main goals—language-independent scalable modeling and robust nuisance mitigation. For modeling, exclusive use of inner product-based and cepstral systems produced a language-independent computationally-scalable system. For robustness, systems that captured spectral and prosodic information, modeled nuisance subspaces using multiple novel methods, and fused scores of multiple systems were implemented. The performance of the system is presented on a subset of the NIST SRE 2010 core tasks.
Odyssey 2016 | 2016
Fred Richardson; Brian E. Nemsick; Douglas A. Reynolds
Over several decades, speaker recognition performance has steadily improved for applications using telephone speech. A big part of this improvement has been the availability of large quantities of speaker-labeled data from telephone recordings. For new data applications, such as audio from room microphones, we would like to effectively use existing telephone data to build systems with high accuracy while maintaining good performance on existing telephone tasks. In this paper we compare and combine approaches to compensate model parameters and features for this purpose. For model adaptation we explore MAP adaptation of hyper-parameters, and for feature compensation we examine the use of denoising DNNs. On a multi-room, multi-microphone speaker recognition experiment we show a reduction of 61% in EER with a combination of these approaches while slightly improving performance on telephone data.
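The feature-compensation setup can be illustrated with a toy stand-in: learn a mapping from microphone-degraded features back to telephone-like features. The paper uses a denoising DNN; the least-squares linear map below is only a minimal sketch of the same noisy-to-clean regression setup, on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "clean" (telephone-like) features and a degraded version:
# a mild linear channel distortion plus additive noise.
clean = rng.standard_normal((500, 10))
channel = 0.1 * rng.standard_normal((10, 10))
noisy = clean + clean @ channel + 0.05 * rng.standard_normal((500, 10))

# Fit a linear noisy->clean map by least squares (stand-in for the
# denoising DNN, which would learn a nonlinear version of this mapping).
W, *_ = np.linalg.lstsq(noisy, clean, rcond=None)
denoised = noisy @ W

err_before = np.mean((noisy - clean) ** 2)
err_after = np.mean((denoised - clean) ** 2)
```

The compensated features can then be fed to an unchanged telephone-trained system, which is the appeal of feature-side compensation over retraining.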
international conference on acoustics, speech, and signal processing | 2011
Reva Schwartz; Joseph P. Campbell; Wade Shen; Douglas E. Sturim; William M. Campbell; Fred Richardson; Robert B. Dunn; Robert Granville
The United States Secret Service (USSS) teamed with MIT Lincoln Laboratory (MIT/LL) in the US National Institute of Standards and Technology's 2010 Speaker Recognition Evaluation of Human Assisted Speaker Recognition (HASR). We describe our qualitative and automatic speaker comparison processes and our fusion of these processes, which are adapted from USSS casework. The USSS-MIT/LL 2010 HASR results are presented. We also present post-evaluation results. The results are encouraging within the resolving power of the evaluation, which was limited to enable reasonable levels of human effort. Future ideas and efforts are discussed, including new features and capitalizing on naïve listeners.