Stephen Shum
Massachusetts Institute of Technology
Publications
Featured research published by Stephen Shum.
IEEE Transactions on Audio, Speech, and Language Processing | 2013
Stephen Shum; Najim Dehak; Réda Dehak; James R. Glass
In speaker diarization, standard approaches typically perform speaker clustering on some initial segmentation before refining the segment boundaries in a re-segmentation step to obtain a final diarization hypothesis. In this paper, we integrate an improved clustering method with an existing re-segmentation algorithm and, in iterative fashion, optimize both speaker cluster assignments and segmentation boundaries jointly. For clustering, we extend our previous research using factor analysis for speaker modeling. In continuing to take advantage of the effectiveness of factor analysis as a front-end for extracting speaker-specific features (i.e., i-vectors), we develop a probabilistic approach to speaker clustering by applying a Bayesian Gaussian Mixture Model (GMM) to principal component analysis (PCA)-processed i-vectors. We then utilize information at different temporal resolutions to arrive at an iterative optimization scheme that, in alternating between clustering and re-segmentation steps, demonstrates the ability to improve both speaker cluster assignments and segmentation boundaries in an unsupervised manner. Our proposed methods attain results that are comparable to those of a state-of-the-art benchmark set on the multi-speaker CallHome telephone corpus. We further compare our system with a Bayesian nonparametric approach to diarization and attempt to reconcile their differences in both methodology and performance.
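As a rough illustration of the clustering step described above, the sketch below applies PCA followed by a Dirichlet-process Bayesian GMM to segment-level i-vectors using scikit-learn. The dimensionality, component cap, and covariance settings are illustrative assumptions, not the paper's configuration.

```python
# Sketch: cluster PCA-projected i-vectors with a Bayesian GMM
# (illustrative settings only; not the paper's exact configuration).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture

def cluster_ivectors(ivectors, max_speakers=10, pca_dim=3, seed=0):
    """Assign a speaker-cluster label to each segment i-vector.

    ivectors: (num_segments, ivector_dim) array of segment i-vectors.
    """
    # Length-normalize, then reduce dimensionality with PCA.
    ivectors = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    projected = PCA(n_components=pca_dim).fit_transform(ivectors)

    # A Dirichlet-process-style Bayesian GMM lets unneeded components
    # collapse, so max_speakers is only an upper bound on cluster count.
    bgmm = BayesianGaussianMixture(
        n_components=max_speakers,
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="diag",
        random_state=seed,
    )
    return bgmm.fit_predict(projected)
```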
Odyssey 2016 | 2016
Pedro A. Torres-Carrasquillo; Najim Dehak; Elizabeth Godoy; Douglas A. Reynolds; Fred Richardson; Stephen Shum; Elliot Singer; Douglas E. Sturim
In this paper, we describe the most recent MIT Lincoln Laboratory language recognition system developed for the NIST 2015 Language Recognition Evaluation (LRE). The submission features a fusion of five core classifiers, with most systems developed in the context of an i-vector framework. The 2015 evaluation presented new paradigms. First, the evaluation included fixed training and open training tracks for the first time; second, language classification performance was measured across 6 language clusters using 20 language classes instead of an N-way language task; and third, performance was measured across a nominal 3-30 second range. Results are presented for the overall performance across the six language clusters for both the fixed and open training tasks. On the 6-cluster metric, the Lincoln system achieved overall costs of 0.173 and 0.168 for the fixed and open tasks, respectively.
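The submission's exact fusion recipe is not reproduced here; the sketch below shows only generic linear score-level fusion, in which each core classifier contributes a matrix of per-language scores and the fused score is a weighted sum with calibration parameters trained on development data. All names and shapes are assumptions for illustration.

```python
# Generic sketch of linear score-level fusion across several core
# classifiers (not the submission's actual fusion recipe).
import numpy as np

def fuse_scores(per_system_scores, weights, offset):
    """per_system_scores: list of (num_trials, num_languages) score matrices,
    one per core classifier; weights and offset are calibration parameters
    trained on development data (e.g., with a multiclass logistic objective)."""
    fused = np.full_like(per_system_scores[0], offset, dtype=float)
    for w, scores in zip(weights, per_system_scores):
        fused += w * scores            # weighted sum of per-system scores
    return fused
```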
International Conference on Acoustics, Speech, and Signal Processing | 2011
Douglas E. Sturim; William M. Campbell; Najim Dehak; Zahi N. Karam; Alan McCree; Douglas A. Reynolds; Fred Richardson; Pedro A. Torres-Carrasquillo; Stephen Shum
Research in the speaker recognition community has continued to address methods of mitigating variational nuisances. Telephone and auxiliary-microphone recorded speech emphasize the need for a robust way of dealing with unwanted variation. The design of the recent 2010 NIST Speaker Recognition Evaluation (SRE) reflects this research emphasis. In this paper, we present the MIT submission applied to the tasks of the 2010 NIST SRE, with two main goals: language-independent, scalable modeling and robust nuisance mitigation. For modeling, exclusive use of inner product-based and cepstral systems produced a language-independent, computationally scalable system. For robustness, systems that captured spectral and prosodic information, modeled nuisance subspaces using multiple novel methods, and fused scores of multiple systems were implemented. The performance of the system is presented on a subset of the NIST SRE 2010 core tasks.
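As a toy illustration of the inner product-based scoring mentioned above, the following computes a cosine similarity between length-normalized i-vectors; a real submission would first apply nuisance compensation (e.g., LDA or WCCN), which is omitted here.

```python
# Toy illustration of inner-product (cosine) scoring of i-vectors;
# real systems apply nuisance compensation (e.g., LDA/WCCN) beforehand.
import numpy as np

def cosine_score(enroll_ivector, test_ivector):
    """Cosine similarity between an enrollment and a test i-vector."""
    e = enroll_ivector / np.linalg.norm(enroll_ivector)
    t = test_ivector / np.linalg.norm(test_ivector)
    return float(np.dot(e, t))
```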
International Conference on Acoustics, Speech, and Signal Processing | 2013
Stephen Shum; William M. Campbell; Douglas A. Reynolds
We consider the use of community detection algorithms to perform speaker clustering on content graphs built from large audio corpora. We survey the application of agglomerative hierarchical clustering, modularity optimization methods, and spectral clustering, as well as two random walk algorithms: Markov clustering and Infomap. Our results on graphs built from the NIST 2005+2006 and 2008+2010 Speaker Recognition Evaluations (SREs) provide insight into both the structure of the speakers present in the data and the intricacies of the clustering methods. In particular, we introduce an additional parameter to Infomap that improves its clustering performance on all graphs. Lastly, we develop an automatic technique to purify the neighbors of each node by pruning away unnecessary edges.
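A minimal sketch of this kind of pipeline, assuming utterance-level i-vectors are available: build a content graph by thresholding pairwise cosine similarities, then cluster it with greedy modularity optimization (one of the surveyed methods) via NetworkX. The threshold and similarity measure are placeholders, not the paper's settings.

```python
# Sketch: build a content graph from pairwise i-vector similarities and
# cluster it with greedy modularity optimization (one of several methods
# surveyed; the threshold and similarity choice are placeholders).
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def cluster_speakers(ivectors, sim_threshold=0.5):
    # Cosine similarity between all pairs of length-normalized i-vectors.
    X = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    sim = X @ X.T

    # Nodes are utterances; edges connect sufficiently similar pairs.
    graph = nx.Graph()
    graph.add_nodes_from(range(len(X)))
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if sim[i, j] > sim_threshold:
                graph.add_edge(i, j, weight=float(sim[i, j]))

    # Each detected community is treated as one speaker cluster.
    return [set(c) for c in greedy_modularity_communities(graph, weight="weight")]
```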
IEEE Transactions on Audio, Speech, and Language Processing | 2016
Stephen Shum; David F. Harwath; Najim Dehak; James R. Glass
In this paper, we explore the use of large-scale acoustic unit discovery for language recognition. The deep neural network-based approaches that have achieved recent success in this task require transcribed speech and pronunciation dictionaries, which may be limited in availability and expensive to obtain. We aim to replace the need for such supervision via the unsupervised discovery of acoustic units. In this work, we present a parallelized version of a Bayesian nonparametric model from previous work and use it to learn acoustic units from a few hundred hours of multilingual data. These unit (or senone) sequences are then used as targets to train a deep neural network-based i-vector language recognition system. We find that a score-level fusion of our unsupervised system with an acoustic baseline can significantly shrink the gap between the baseline and a supervised benchmark system built using transcribed English. Subsequent experiments also show that an improved acoustic representation of the data can yield substantial performance gains and that language specificity is important for discovering meaningful acoustic units. We validate the generalizability of our proposed approach by presenting state-of-the-art results that exhibit similar trends on the NIST Language Recognition Evaluations from 2011 and 2015.
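In DNN (or unit-based) i-vector systems of this kind, per-frame posteriors over the discovered units stand in for GMM component posteriors when accumulating Baum-Welch statistics. The sketch below shows only that accumulation step; the array shapes and function name are assumptions for illustration.

```python
# Sketch: accumulate zeroth- and first-order (Baum-Welch) statistics using
# per-frame unit posteriors, as done in DNN/senone i-vector systems.
# Shapes are assumptions for illustration.
import numpy as np

def accumulate_stats(features, posteriors):
    """features:   (num_frames, feat_dim) acoustic features for one utterance.
    posteriors: (num_frames, num_units) per-frame posteriors over the
                discovered acoustic units (rows sum to 1)."""
    # Zeroth-order stats: soft frame counts per unit.
    n = posteriors.sum(axis=0)          # (num_units,)
    # First-order stats: posterior-weighted feature sums per unit.
    f = posteriors.T @ features         # (num_units, feat_dim)
    return n, f
```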
Conference of the International Speech Communication Association | 2011
Stephen Shum; Najim Dehak; Ekapol Chuangsuwanich; Douglas A. Reynolds; James R. Glass
Archive | 2014
Daniel Garcia-Romero; Alan McCree; Stephen Shum; Carlos Vaquero
Archive | 2014
Stephen Shum; Douglas A. Reynolds; Daniel Garcia-Romero; Alan McCree
Odyssey | 2010
Stephen Shum; Najim Dehak; Réda Dehak; James R. Glass
Conference of the International Speech Communication Association | 2012
Stephen Shum; Najim Dehak; James R. Glass