Publications


Featured research published by Mitchell McLaren.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2015

Advances in deep neural network approaches to speaker recognition

Mitchell McLaren; Yun Lei; Luciana Ferrer

The recent application of deep neural networks (DNN) to speaker identification (SID) has resulted in significant improvements over the current state of the art on telephone speech. In this work, we report a similar achievement in DNN-based SID performance on microphone speech. We consider two approaches to DNN-based SID: one that uses the DNN to extract features, and another that uses the DNN during feature modeling. Modeling is conducted using the DNN/i-vector framework, in which the traditional universal background model is replaced with a DNN. The recently proposed use of bottleneck features extracted from a DNN is also evaluated. Systems are first compared with a conventional universal background model (UBM) Gaussian mixture model (GMM) i-vector system on the clean conditions of the NIST 2012 speaker recognition evaluation corpus, where a lack of robustness to microphone speech is found. Several methods of DNN feature processing are then applied to bring significantly greater robustness to microphone speech. To direct future research, the DNN-based systems are also evaluated in the context of audio degradations including noise and reverberation.
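
To make the DNN/i-vector idea above concrete, here is a minimal Python/NumPy sketch of how per-frame posteriors enter the zeroth- and first-order Baum-Welch statistics used for i-vector extraction; the function and variable names are illustrative, not from the paper, and the only change in the DNN/i-vector variant is where the posteriors come from.

import numpy as np

def baum_welch_stats(features, posteriors):
    # features:   (T, D) acoustic feature frames.
    # posteriors: (T, C) per-frame class posteriors; a GMM-UBM supplies
    # them in the classic recipe, a senone DNN in the DNN/i-vector variant.
    N = posteriors.sum(axis=0)      # zeroth-order statistics, shape (C,)
    F = posteriors.T @ features     # first-order statistics, shape (C, D)
    return N, F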


IEEE Transactions on Audio, Speech, and Language Processing | 2016

Study of senone-based deep neural network approaches for spoken language recognition

Luciana Ferrer; Yun Lei; Mitchell McLaren; Nicolas Scheffer

This paper compares different approaches for using deep neural networks (DNNs) trained to predict senone posteriors for the task of spoken language recognition (SLR). These approaches have recently been found to outperform various baseline systems on different datasets, but they have not yet been compared to each other or to a common baseline. Two of these approaches use the DNNs to generate feature vectors which are then processed in different ways to predict the score of each language given a test sample. The features are extracted either from a bottleneck layer in the DNN or from the output layer. In the third approach, the standard i-vector extraction procedure is modified to use the senones as classes and the DNN to predict the zeroth order statistics. We compare these three approaches and conclude that the approach based on bottleneck features followed by i-vector modeling outperforms the other two approaches. We also show that score-level fusion of some of these approaches leads to gains over using a single approach for short-duration test samples. Finally, we demonstrate that fusing systems that use DNNs trained on several languages leads to improvements in performance over the best single system, and we propose an adaptation procedure for DNNs trained on languages with less available data. Overall, we show improvements between 40% and 70% relative to a state-of-the-art Gaussian mixture model (GMM) i-vector system on test durations from 3 seconds to 120 seconds on two significantly different tasks: the NIST 2009 language recognition evaluation task and the DARPA RATS language identification task.
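
A short note on the score-level fusion mentioned above: a common realization is a weighted linear combination of the per-system score matrices, with weights trained on held-out data (often via logistic regression). The sketch below is a generic illustration in Python, not the paper's fusion setup.

import numpy as np

def fuse_scores(system_scores, weights, bias=0.0):
    # system_scores: list of (n_trials, n_languages) arrays, one per system.
    # weights: one scalar per system, trained on a development set.
    return sum(w * s for w, s in zip(weights, system_scores)) + bias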


Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge | 2014

The SRI AVEC-2014 Evaluation System

Vikramjit Mitra; Elizabeth Shriberg; Mitchell McLaren; Andreas Kathol; Colleen Richey; Dimitra Vergyri; Martin Graciarena

Though depression is a common mental health problem with significant impact on human society, it often goes undetected. We explore a diverse set of features based only on spoken audio to understand which features correlate with self-reported depression scores according to the Beck depression rating scale. These features, many of which are novel for this task, include (1) estimated articulatory trajectories during speech production, (2) acoustic characteristics, (3) acoustic-phonetic characteristics and (4) prosodic features. Features are modeled using a variety of approaches, including support vector regression, a Gaussian backend and decision trees. We report results on the AVEC-2014 depression dataset and find that individual systems range from 9.18 to 11.87 in root mean squared error (RMSE), and from 7.68 to 9.99 in mean absolute error (MAE). Initial fusion brings further improvement; fusion and feature selection work is still in progress.
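
As a rough illustration of one of the modeling approaches named above, the following scikit-learn sketch fits a support vector regressor and reports the two metrics used in the paper (RMSE and MAE); the data here is synthetic and the hyperparameters are placeholders, not the paper's configuration.

import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))        # stand-in audio feature vectors
y = rng.uniform(0, 45, size=200)      # stand-in depression scores

model = SVR(kernel="rbf", C=1.0).fit(X[:150], y[:150])
pred = model.predict(X[150:])
rmse = np.sqrt(mean_squared_error(y[150:], pred))
mae = mean_absolute_error(y[150:], pred)
print(f"RMSE={rmse:.2f}  MAE={mae:.2f}")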


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2016

Exploring the role of phonetic bottleneck features for speaker and language recognition

Mitchell McLaren; Luciana Ferrer; Aaron Lawson

Using bottleneck features extracted from a deep neural network (DNN) trained to predict senone posteriors has resulted in new, state-of-the-art technology for language and speaker identification. For language identification, the features' dense phonetic information is believed to enable improved performance by better representing language-dependent phone distributions. For speaker recognition, the role of these features is less clear, given that a bottleneck layer near the DNN output layer is thought to contain limited speaker information. In this article, we analyze the role of bottleneck features in these identification tasks by varying the DNN layer from which they are extracted, under the hypothesis that speaker information is traded for dense phonetic information as the layer moves toward the DNN output layer. Experiments support this hypothesis under certain conditions, and highlight the benefit of using a bottleneck layer close to the DNN output layer when DNN training data is matched to the evaluation conditions, and a layer more central to the DNN otherwise.
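
The experiment described above amounts to tapping a trained senone DNN at different depths. A minimal PyTorch sketch of that idea follows; the layer sizes, bottleneck placement, and dimensions are assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn

layers = nn.ModuleList([
    nn.Sequential(nn.Linear(440, 1200), nn.ReLU()),   # hidden layer 1
    nn.Sequential(nn.Linear(1200, 1200), nn.ReLU()),  # hidden layer 2
    nn.Sequential(nn.Linear(1200, 80), nn.ReLU()),    # bottleneck layer
    nn.Sequential(nn.Linear(80, 1200), nn.ReLU()),    # hidden layer 4
    nn.Linear(1200, 3000),                            # senone outputs
])

def extract_at_layer(x, k):
    # Run frames through the network, returning layer k's activations.
    with torch.no_grad():
        for i, layer in enumerate(layers):
            x = layer(x)
            if i == k:
                return x
    return x

frames = torch.randn(100, 440)       # 100 context-expanded input frames
bnf = extract_at_layer(frames, k=2)  # tap the bottleneck; vary k to probe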


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2015

Softsad: Integrated frame-based speech confidence for speaker recognition

Mitchell McLaren; Martin Graciarena; Yun Lei

In this paper we propose softSAD: the direct integration of speech posteriors into a speaker recognition system as an alternative to using speech activity detection (SAD). Motivated by the need to use audio from short recordings more efficiently, softSAD removes the need to discard audio based on thresholded speech/non-speech decisions, as is done with SAD. Instead, softSAD explicitly integrates into the Baum-Welch statistics a speech posterior for each frame. We compare softSAD and SAD in mismatched conditions by evaluating a system developed for the National Institute of Standards and Technology (NIST) 2012 speaker recognition evaluation (SRE) on the short test conditions of the channel-degraded Robust Automatic Transcription of Speech (RATS) speaker identification task (and vice versa). We demonstrate that softSAD provides benefit over SAD for short test audio in mismatched conditions.
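
The core of softSAD is replacing a hard keep/drop decision per frame with a soft weight when accumulating statistics. A minimal NumPy sketch of both behaviors, with names of our own choosing, is given below.

import numpy as np

def sad_weighted_stats(features, gmm_post, speech_post, hard=False, thr=0.5):
    # features:    (T, D) acoustic frames.
    # gmm_post:    (T, C) per-frame component posteriors.
    # speech_post: (T,) per-frame speech posteriors from a SAD model.
    # hard=True mimics conventional SAD (threshold, then keep or drop);
    # hard=False weights every frame by its speech posterior (softSAD).
    w = (speech_post > thr).astype(float) if hard else speech_post
    post = gmm_post * w[:, None]
    N = post.sum(axis=0)        # zeroth-order Baum-Welch statistics
    F = post.T @ features       # first-order Baum-Welch statistics
    return N, F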


IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2015

Improving robustness against reverberation for automatic speech recognition

Vikramjit Mitra; Julien van Hout; Wen Wang; Martin Graciarena; Mitchell McLaren; Horacio Franco; Dimitra Vergyri

Reverberation is a phenomenon observed in almost all enclosed environments. Human listeners rarely experience problems in comprehending speech in reverberant environments, but automatic speech recognition (ASR) systems often suffer increased error rates under such conditions. In this work, we explore the role of robust acoustic features motivated by human speech perception studies for building ASR systems robust to reverberation effects. Using the dataset distributed for the Automatic Speech Recognition In Reverberant Environments (ASpIRE-2015) challenge organized by IARPA, we explore Gaussian mixture models (GMMs), deep neural networks (DNNs) and convolutional deep neural networks (CDNNs) as candidate acoustic models for recognizing continuous speech in reverberant environments. We demonstrate that DNN-based systems trained with robust features offer significant reductions in word error rate (WER) compared to systems trained with baseline mel-filterbank features. We present a novel time-frequency convolution neural net (TFCNN) framework that performs convolution on the feature space across both the time and frequency scales, which we found to consistently outperform the CDNN systems for all feature sets across all testing conditions. Finally, we show that further WER reduction is achievable through system fusion of n-best lists from multiple systems.
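
One way to read the TFCNN idea above is as two parallel convolutional streams, one with kernels spanning only time and one spanning only frequency, merged before the following layers. The PyTorch sketch below illustrates that reading; kernel sizes and channel counts are guesses, not the paper's configuration.

import torch
import torch.nn as nn

class TFConvBlock(nn.Module):
    # Input: (batch, 1, n_mel, n_frames) time-frequency patches.
    def __init__(self, ch=32):
        super().__init__()
        self.time_conv = nn.Conv2d(1, ch, kernel_size=(1, 9), padding=(0, 4))
        self.freq_conv = nn.Conv2d(1, ch, kernel_size=(9, 1), padding=(4, 0))

    def forward(self, x):
        t = torch.relu(self.time_conv(x))  # convolve along time only
        f = torch.relu(self.freq_conv(x))  # convolve along frequency only
        return torch.cat([t, f], dim=1)    # merge the two streams

x = torch.randn(8, 1, 40, 100)  # 8 patches, 40 mel bands, 100 frames
y = TFConvBlock()(x)            # shape: (8, 64, 40, 100)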


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2015

Improved speaker recognition using DCT coefficients as features

Mitchell McLaren; Yun Lei

We recently proposed the use of coefficients extracted from the 2D discrete cosine transform (DCT) of log Mel filter bank energies to improve speaker recognition over the traditional Mel frequency cepstral coefficients (MFCC) with appended deltas and double deltas (MFCC/deltas). Selection of relevant coefficients was shown to be crucial, resulting in the proposal of a zig-zag parsing strategy. While 2D-DCT coefficients provided significant gains over MFCC/deltas, the parsing strategy remains sensitive to the number of filter bank outputs and the analysis window size. In this work, we analyze this sensitivity and propose two new data-driven methods of utilizing DCT coefficients for speaker recognition: rankDCT and pcaDCT. The first, rankDCT, is an automated coefficient selection strategy based on the highest average intra-frame energy rank. The alternate method, pcaDCT, avoids the need for selection and instead projects DCT coefficients to the desired dimensionality via principal component analysis (PCA). All features including MFCC/deltas are tuned on a subset of the PRISM database to subsequently highlight any parameter sensitivities of each feature. Evaluated on the recent NIST SRE12 corpus, pcaDCT consistently outperforms both rankDCT and the zig-zag-parsed (zzDCT) features and offers an average 20% relative improvement over MFCC/deltas across conditions.
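
To show roughly what pcaDCT computes, here is a Python sketch: a 2D-DCT over sliding windows of log Mel energies, flattened and projected with PCA. Window length and output dimensionality are illustrative; in practice the PCA would be trained on held-out data rather than on the utterance being processed.

import numpy as np
from scipy.fft import dctn
from sklearn.decomposition import PCA

def pca_dct_features(logmel, win=20, n_out=60):
    # logmel: (T, n_filters) log Mel filter bank energies.
    T = logmel.shape[0]
    blocks = [dctn(logmel[t:t + win].T, norm="ortho").ravel()
              for t in range(T - win + 1)]
    X = np.array(blocks)              # (T - win + 1, n_filters * win)
    return PCA(n_components=n_out).fit_transform(X)

feats = pca_dct_features(np.random.randn(300, 24))   # (281, 60)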


Odyssey: The Speaker and Language Recognition Workshop | 2016

Analyzing the Effect of Channel Mismatch on the SRI Language Recognition Evaluation 2015 System

Mitchell McLaren; Diego Castán; Luciana Ferrer

We present the work done by our group for the 2015 language recognition evaluation (LRE) organized by the National Institute of Standards and Technology (NIST), along with an extended post-evaluation analysis. The focus of this evaluation was the development of language recognition systems for clusters of closely related languages using training data released by NIST. This training data contained a highly imbalanced sample from the languages of interest. The SRI team submitted several systems to LRE’15. Major components included (1) bottleneck features extracted from Deep Neural Networks (DNNs) trained to predict English senones, with multiple DNNs trained using a variety of acoustic features; (2) data-driven Discrete Cosine Transform (DCT) contextualization of features for traditional Universal Background Model (UBM) i-vector extraction and for input to a DNN for bottleneck feature extraction; (3) adaptive Gaussian backend scoring; (4) a newly developed multiresolution neural network backend; and (5) cluster-specific N-way fusion of scores. We compare results on our development dataset with those on the evaluation data and find significantly different conclusions about which techniques were useful for each dataset. This difference was due mostly to a large unexpected mismatch in acoustic and channel conditions between the two datasets. We provide a post-evaluation analysis revealing that the successful approaches for this evaluation included the use of bottleneck features, and a well-defined development dataset appropriate for mismatched conditions.
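
For readers unfamiliar with Gaussian backend scoring (component 3 above, in its non-adaptive form): each language is modeled by a Gaussian with its own mean and a covariance shared across languages, and test i-vectors are scored by log-likelihood. A minimal sketch of that baseline technique, not SRI's adaptive variant, follows.

import numpy as np
from scipy.stats import multivariate_normal

def gaussian_backend_scores(train_x, train_y, test_x):
    # train_x: (N, D) i-vectors; train_y: (N,) language labels.
    classes = sorted(set(train_y))
    means = {c: train_x[train_y == c].mean(axis=0) for c in classes}
    centered = np.vstack([train_x[train_y == c] - means[c] for c in classes])
    cov = np.cov(centered, rowvar=False)   # covariance shared by all classes
    return np.column_stack([
        multivariate_normal.logpdf(test_x, mean=means[c], cov=cov)
        for c in classes])                 # (n_test, n_languages) scores

x = np.random.randn(60, 10)
y = np.array(["eng", "fra", "spa"] * 20)
scores = gaussian_backend_scores(x[:45], y[:45], x[45:])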


Conference of the International Speech Communication Association (Interspeech) | 2014

Application of convolutional neural networks to speaker recognition in noisy conditions

Mitchell McLaren; Yun Lei; Nicolas Scheffer; Luciana Ferrer


Conference of the International Speech Communication Association (Interspeech) | 2013

A Noise-Robust System for NIST 2012 Speaker Recognition Evaluation

Luciana Ferrer; Mitchell McLaren; Nicolas Scheffer; Yun Lei; Martin Graciarena; Vikramjit Mitra

Collaboration


Dive into Mitchell McLaren's collaborations.

Top Co-Authors

Yun Lei

University of Texas at Dallas

Luciana Ferrer

University of Buenos Aires
