
Publication


Featured research published by Pedro A. Torres-Carrasquillo.


Computer Speech & Language | 2006

Support vector machines for speaker and language recognition

William M. Campbell; Joseph P. Campbell; Douglas A. Reynolds; Elliot Singer; Pedro A. Torres-Carrasquillo

Support vector machines (SVMs) have proven to be a powerful technique for pattern classification. SVMs map inputs into a high-dimensional space and then separate classes with a hyperplane. A critical aspect of using SVMs successfully is the design of the inner product, the kernel, induced by the high-dimensional mapping. We consider the application of SVMs to speaker and language recognition. A key part of our approach is the use of a kernel that compares sequences of feature vectors and produces a measure of similarity. Our sequence kernel is based upon generalized linear discriminants. We show that this strategy has several important properties. First, the kernel uses an explicit expansion into SVM feature space—this property makes it possible to collapse all support vectors into a single model vector and have low computational complexity. Second, the SVM builds upon a simpler mean-squared error classifier to produce a more accurate system. Finally, the system is competitive and complementary to other approaches, such as Gaussian mixture models (GMMs). We give results for the 2003 NIST speaker and language evaluations of the system and also show fusion with the traditional GMM approach.
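The "collapse all support vectors into a single model vector" property the abstract describes can be illustrated numerically. The sketch below uses made-up support vectors and dual weights (not values from the paper's system) to show that, with an explicit linear expansion, scoring via one inner product per support vector and scoring via a single collapsed weight vector give the same result:

```python
import numpy as np

# Hypothetical support vectors and dual weights (alpha_i * y_i) from a
# trained linear-kernel SVM; illustrative values only.
sv = np.array([[0.5, -1.2, 0.3],
               [-0.7, 0.8, 1.1],
               [1.4, 0.2, -0.6]])
alpha_y = np.array([0.9, -1.3, 0.4])
b = -0.1

x = np.array([0.2, 0.5, -0.4])  # a new expansion vector to score

# Kernel-space scoring: one inner product per support vector.
score_kernel = alpha_y @ (sv @ x) + b

# Collapsed scoring: fold all SVs into one model vector w first, then
# score any utterance with a single inner product.
w = alpha_y @ sv
score_collapsed = w @ x + b

assert np.isclose(score_kernel, score_collapsed)
```

The collapsed form is what makes the sequence-kernel approach computationally cheap at test time: cost is independent of the number of support vectors.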


International Conference on Acoustics, Speech, and Signal Processing | 2005

Approaches and applications of audio diarization

Douglas A. Reynolds; Pedro A. Torres-Carrasquillo

Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization has utility in making automatic transcripts more readable and in searching and indexing audio archives. In this paper, we provide an overview of current audio diarization approaches and discuss performance and potential applications. We outline the general framework of diarization systems and present the performance of current systems as measured in the DARPA EARS Rich Transcription Fall 2004 (RT-04F) speaker diarization evaluation. Lastly, we look at future challenges and directions for diarization research.


International Conference on Acoustics, Speech, and Signal Processing | 2002

Language identification using Gaussian mixture model tokenization

Pedro A. Torres-Carrasquillo; Douglas A. Reynolds; John R. Deller

Phone tokenization followed by n-gram language modeling has consistently provided good results for the task of language identification. In this paper, this technique is generalized by using Gaussian mixture models as the basis for tokenizing. Performance results are presented for a system employing a GMM tokenizer in conjunction with multiple language processing and score combination techniques. On the 1996 CallFriend LID evaluation set, a 12-way closed set error rate of 17% was obtained.
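The pipeline the abstract generalizes, tokenization followed by n-gram language modeling, can be sketched on toy data. The snippet below is a minimal illustration, not the paper's system: "GMM tokenization" is reduced to a hard nearest-component decision over made-up 1-D component means, and each toy "language" gets an add-one-smoothed bigram model over its token stream:

```python
import numpy as np

# Three hypothetical GMM component means (a real system fits a GMM on
# speech feature vectors; these 1-D values are illustrative only).
means = np.array([[-2.0], [0.0], [2.0]])

def tokenize(frames):
    # Map each frame to the index of the nearest component (hard decision).
    return np.argmin(np.abs(frames[:, None] - means.ravel()), axis=1)

def bigram_logprobs(tokens, n_sym=3):
    # Add-one-smoothed bigram model over the token stream.
    counts = np.ones((n_sym, n_sym))
    for a, b in zip(tokens[:-1], tokens[1:]):
        counts[a, b] += 1
    return np.log(counts / counts.sum(axis=1, keepdims=True))

# "Training" streams for two toy languages with different token dynamics.
rng = np.random.default_rng(1)
lang_a = tokenize(np.sin(np.linspace(0, 20, 300)) * 2)   # smooth dynamics
lang_b = tokenize(rng.normal(0, 2, 300))                 # noisy dynamics
models = [bigram_logprobs(lang_a), bigram_logprobs(lang_b)]

def identify(frames):
    toks = tokenize(frames)
    scores = [lp[toks[:-1], toks[1:]].sum() for lp in models]
    return int(np.argmax(scores))

# A smoothly varying test signal scores highest under language A's model.
assert identify(np.sin(np.linspace(0, 20, 300)) * 2) == 0
```

The point of the generalization is that the tokenizer needs no phone transcriptions to train, unlike a phone recognizer; the n-gram stage is unchanged.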


International Conference on Acoustics, Speech, and Signal Processing | 2010

The MITLL NIST LRE 2009 language recognition system

Pedro A. Torres-Carrasquillo; Elliot Singer; Terry P. Gleason; Alan McCree; Douglas A. Reynolds; Fred Richardson; Douglas E. Sturim

This paper presents a description of the MIT Lincoln Laboratory language recognition system submitted to the NIST 2009 Language Recognition Evaluation (LRE). This system consists of a fusion of three core recognizers, two based on spectral similarity and one based on tokenization. The 2009 LRE differed from previous ones in that test data included narrowband segments from worldwide Voice of America broadcasts as well as conventional recorded conversational telephone speech. Results are presented for the 23-language closed-set and open-set detection tasks at the 30, 10, and 3 second durations along with a discussion of the language-pair task. On the 30 second 23-language closed set detection task, the system achieved a 1.64 average error rate.


International Conference on Acoustics, Speech, and Signal Processing | 2005

The 2004 MIT Lincoln Laboratory speaker recognition system

Douglas A. Reynolds; William M. Campbell; Terry P. Gleason; Carl Quillen; Douglas E. Sturim; Pedro A. Torres-Carrasquillo; André Gustavo Adami

The MIT Lincoln Laboratory submission for the 2004 NIST speaker recognition evaluation (SRE) was built upon seven core systems using speaker information from short-term acoustics, pitch and duration prosodic behavior, and phoneme and word usage. These different levels of information were modeled and classified using Gaussian mixture models, support vector machines and N-gram language models and were combined using a single layer perceptron fuser. The 2004 SRE used a new multi-lingual, multi-channel speech corpus that provided a challenging speaker detection task for the above systems. We describe the core systems used and provide an overview of their performance on the 2004 SRE detection tasks.
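The "single layer perceptron fuser" mentioned above combines per-system scores into one detection score. The sketch below is a toy stand-in with synthetic scores (the real fuser was trained on development scores from seven subsystems): a logistic unit learns weights over three fake subsystem scores by gradient descent:

```python
import numpy as np

# Synthetic trials: 1 = target speaker, 0 = impostor (illustrative data).
rng = np.random.default_rng(2)
n = 400
labels = rng.integers(0, 2, n)
# Three fake subsystem scores, each noisily correlated with the label
# (different noise levels mimic subsystems of different quality).
scores = labels[:, None] * 2.0 + rng.normal(0.0, [1.0, 1.5, 2.0], (n, 3))

# Single logistic unit trained by batch gradient descent.
w = np.zeros(3)
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(scores @ w + b)))  # sigmoid output
    grad = p - labels                            # cross-entropy gradient
    w -= 0.1 * scores.T @ grad / n
    b -= 0.1 * grad.mean()

# The fused score is the learned weighted combination of system scores.
fused = scores @ w + b
acc = ((fused > 0) == labels).mean()
assert acc > 0.75  # fusion separates the classes on this toy data
```

In practice the fuser's value is that it weights each subsystem by its reliability on held-out development data rather than averaging scores uniformly.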


2006 IEEE Odyssey - The Speaker and Language Recognition Workshop | 2006

Advanced Language Recognition using Cepstra and Phonotactics: MITLL System Performance on the NIST 2005 Language Recognition Evaluation

William M. Campbell; Terry P. Gleason; Jiri Navratil; Douglas A. Reynolds; Wade Shen; Elliot Singer; Pedro A. Torres-Carrasquillo

This paper presents a description of the MIT Lincoln Laboratory submissions to the 2005 NIST Language Recognition Evaluation (LRE05). As was true in 2003, the 2005 submissions were combinations of core cepstral and phonotactic recognizers whose outputs were fused to generate final scores. For the 2005 evaluation, Lincoln Laboratory had five submissions built upon fused combinations of six core systems. Major improvements included the generation of phone streams using lattices, SVM-based language models using lattice-derived phonotactics, and binary tree language models. In addition, a development corpus was assembled that was designed to test robustness to unseen languages and sources. Language recognition trends based on NIST evaluations conducted since 1996 show a steady improvement in language recognition performance.


Odyssey 2016 | 2016

The MITLL NIST LRE 2015 Language Recognition System

Pedro A. Torres-Carrasquillo; Najim Dehak; Elizabeth Godoy; Douglas A. Reynolds; Fred Richardson; Stephen Shum; Elliot Singer; Douglas E. Sturim

In this paper we describe the most recent MIT Lincoln Laboratory language recognition system developed for the NIST 2015 Language Recognition Evaluation (LRE). The submission features a fusion of five core classifiers, with most systems developed in the context of an i-vector framework. The 2015 evaluation presented new paradigms. First, the evaluation included fixed training and open training tracks for the first time; second, language classification performance was measured across 6 language clusters using 20 language classes instead of an N-way language task; and third, performance was measured across a nominal 3-30 second range. Results are presented for the overall performance across the six language clusters for both the fixed and open training tasks. On the 6-cluster metric the Lincoln system achieved overall costs of 0.173 and 0.168 for the fixed and open tasks respectively.


International Conference on Acoustics, Speech, and Signal Processing | 2011

The MIT LL 2010 speaker recognition evaluation system: Scalable language-independent speaker recognition

Douglas E. Sturim; William M. Campbell; Najim Dehak; Zahi N. Karam; Alan McCree; Douglas A. Reynolds; Fred Richardson; Pedro A. Torres-Carrasquillo; Stephen Shum

Research in the speaker recognition community has continued to address methods of mitigating variational nuisances. Telephone and auxiliary-microphone recorded speech emphasize the need for a robust way of dealing with unwanted variation. The design of the recent 2010 NIST Speaker Recognition Evaluation (SRE) reflects this research emphasis. In this paper, we present the MIT submission applied to the tasks of the 2010 NIST SRE with two main goals—language-independent scalable modeling and robust nuisance mitigation. For modeling, exclusive use of inner product-based and cepstral systems produced a language-independent, computationally scalable system. For robustness, systems that captured spectral and prosodic information, modeled nuisance subspaces using multiple novel methods, and fused scores of multiple systems were implemented. The performance of the system is presented on a subset of the NIST SRE 2010 core tasks.


Wiley Encyclopedia of Electrical and Electronics Engineering | 2007

Automatic Language Identification

Pedro A. Torres-Carrasquillo; Marc A. Zissman

The sections in this article are: 1. Language-Identification Cues; 2. Language-Identification Systems; 3. Evaluations; 4. Conclusions; 5. Acknowledgment. Keywords: phone recognition; spectral similarity; speech-to-text systems; speech recognition


International Conference on Acoustics, Speech, and Signal Processing | 2011

Informative dialect recognition using context-dependent pronunciation modeling

Nancy F. Chen; Wade Shen; Joseph P. Campbell; Pedro A. Torres-Carrasquillo

We propose an informative dialect recognition system that learns phonetic transformation rules, and uses them to identify dialects. A hidden Markov model is used to align reference phones with dialect-specific pronunciations to characterize when and how often substitutions, insertions, and deletions occur. Decision tree clustering is used to find context-dependent phonetic rules. We ran recognition tasks on 4 Arabic dialects. Not only do the proposed systems perform well on their own, but when fused with baselines they improve performance by 21–36% relative. In addition, our proposed decision-tree system beats the baseline monophone system in recovering phonetic rules by 21% relative. Pronunciation rules learned by our proposed system quantify the occurrence frequency of known rules, and suggest rule candidates for further linguistic studies.
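The alignment step described above, characterizing when substitutions, insertions, and deletions occur between reference phones and a dialect-specific pronunciation, can be illustrated with a plain edit-distance dynamic program (the paper itself uses an HMM for this alignment; the DP below is a simplified stand-in on made-up phone strings):

```python
def align_counts(ref, hyp):
    """Count substitutions, insertions, and deletions in a minimum-cost
    alignment of reference phones `ref` to observed pronunciation `hyp`."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # match/sub
                d[i][j - 1] + 1,                               # insertion
                d[i - 1][j] + 1,                               # deletion
            )
    # Backtrace the table to tally which edit operations were used.
    counts = {"sub": 0, "ins": 0, "del": 0}
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            counts["sub"] += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            counts["ins"] += 1
            j -= 1
        else:
            counts["del"] += 1
            i -= 1
    return counts

# Hypothetical example: one vowel substitution and one inserted vowel.
counts = align_counts(["k", "a", "t", "a", "b"],
                      ["k", "e", "t", "a", "b", "a"])
assert counts == {"sub": 1, "ins": 1, "del": 0}
```

Aggregating such counts over many utterances, conditioned on phonetic context, is what lets the system turn raw alignments into interpretable pronunciation rules.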

Collaboration


Dive into Pedro A. Torres-Carrasquillo's collaborations.

Top Co-Authors

Douglas A. Reynolds (Massachusetts Institute of Technology)
Douglas E. Sturim (Massachusetts Institute of Technology)
William M. Campbell (Massachusetts Institute of Technology)
Elliot Singer (Massachusetts Institute of Technology)
Fred Richardson (Massachusetts Institute of Technology)
Najim Dehak (Massachusetts Institute of Technology)
Alan McCree (Massachusetts Institute of Technology)
Terry P. Gleason (Massachusetts Institute of Technology)
Joseph P. Campbell (Massachusetts Institute of Technology)
Réda Dehak (École Normale Supérieure)