Maarten Van Segbroeck
University of Southern California
Publications
Featured research published by Maarten Van Segbroeck.
Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge | 2014
Rahul Gupta; Nikolaos Malandrakis; Bo Xiao; Tanaya Guha; Maarten Van Segbroeck; Matthew P. Black; Alexandros Potamianos; Shrikanth Narayanan
Depression is one of the most common mood disorders. Technology has the potential to assist in screening and treating people with depression by robustly modeling and tracking the complex behavioral cues associated with the disorder (e.g., speech, language, facial expressions, head movement, body language). Similarly, robust affect recognition is another challenge that stands to benefit from modeling such cues. The Audio/Visual Emotion Challenge (AVEC) aims toward understanding the two phenomena and modeling their correlation with observable cues across several modalities. In this paper, we use multimodal signal processing methodologies to address the two problems using data from human-computer interactions. We develop separate systems for predicting depression levels and affective dimensions, experimenting with several methods for combining the multimodal information. The proposed depression prediction system uses a feature selection approach based on audio, visual, and linguistic cues to predict depression scores for each session. Similarly, we use multiple systems trained on audio and visual cues to predict the affective dimensions in continuous time. Our affect recognition system accounts for context during the frame-wise inference and performs a linear fusion of outcomes from the audio and visual systems. For both problems, our proposed systems outperform the video-feature-based baseline systems. As part of this work, we analyze the role played by each modality in predicting the target variable and provide analytical insights.
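A minimal sketch of the kind of frame-wise linear fusion with temporal context described above. The function name, weights, and window size are illustrative assumptions, not the paper's actual system or values:

```python
import numpy as np

def fuse_affect_predictions(audio_pred, visual_pred, w_audio=0.5, context=5):
    """Linearly fuse frame-wise affect predictions from two modalities,
    then smooth over a local context window (illustrative only)."""
    audio_pred = np.asarray(audio_pred, dtype=float)
    visual_pred = np.asarray(visual_pred, dtype=float)
    fused = w_audio * audio_pred + (1.0 - w_audio) * visual_pred
    # Simple moving-average smoothing as a stand-in for contextual inference.
    kernel = np.ones(context) / context
    return np.convolve(fused, kernel, mode="same")

# Example: two toy per-frame arousal trajectories.
audio = np.array([0.1, 0.3, 0.5, 0.4, 0.2])
visual = np.array([0.2, 0.2, 0.6, 0.5, 0.1])
print(fuse_affect_predictions(audio, visual, w_audio=0.6, context=3))
```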
International Conference on Acoustics, Speech, and Signal Processing | 2015
Samuel Thomas; George Saon; Maarten Van Segbroeck; Shrikanth Narayanan
In this paper we describe improvements to the IBM speech activity detection (SAD) system for the third phase of the DARPA RATS program. The progress during this final phase comes from jointly training convolutional and regular deep neural networks with rich time-frequency representations of speech. With these additions, the phase 3 system reduces the equal error rate (EER) significantly on both of the program's development sets (relative improvements of 20% on dev1 and 7% on dev2) compared to an earlier phase 2 system. For the final program evaluation, the newly developed system also performs well past the program target of 3% Pmiss at 1% Pfa, with a performance of 1.2% Pmiss at 1% Pfa and 0.3% Pfa at 3% Pmiss.
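For readers unfamiliar with the metrics quoted above, the sketch below shows how miss rate (Pmiss), false-alarm rate (Pfa), and the equal error rate are computed from detector scores. It is a generic illustration, not the IBM SAD system:

```python
import numpy as np

def detection_errors(scores, labels):
    """Compute (Pmiss, Pfa) over all thresholds from detector scores.
    labels: 1 for speech frames, 0 for non-speech (illustrative)."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, int)
    thresholds = np.sort(np.unique(scores))
    pmiss, pfa = [], []
    for t in thresholds:
        decisions = scores >= t
        pmiss.append(np.mean(~decisions[labels == 1]))  # speech missed
        pfa.append(np.mean(decisions[labels == 0]))      # non-speech accepted
    return np.array(pmiss), np.array(pfa)

def equal_error_rate(scores, labels):
    """EER is the operating point where Pmiss and Pfa are (nearly) equal."""
    pmiss, pfa = detection_errors(scores, labels)
    idx = np.argmin(np.abs(pmiss - pfa))
    return (pmiss[idx] + pfa[idx]) / 2.0

scores = np.array([0.9, 0.8, 0.4, 0.35, 0.2, 0.1])
labels = np.array([1, 1, 1, 0, 0, 0])
print(equal_error_rate(scores, labels))
```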
International Conference on Acoustics, Speech, and Signal Processing | 2013
Ming Li; Andreas Tsiartas; Maarten Van Segbroeck; Shrikanth Narayanan
This paper presents a simplified and supervised i-vector modeling framework applied to the task of robust and efficient speaker verification. First, by concatenating the mean supervector and the i-vector factor loading matrix with the label vector and the linear classifier matrix, respectively, the traditional i-vectors are extended to label-regularized supervised i-vectors. These supervised i-vectors are optimized not only to reconstruct the mean supervectors well but also to minimize the mean squared error between the original and the reconstructed label vectors, such that they become more discriminative. Second, factor analysis (FA) can be performed on the pre-normalized centered GMM first-order statistics supervector to ensure that the Gaussian statistics sub-vector of each Gaussian component is treated equally in the FA, which reduces the computational cost significantly. Experimental results are reported on the female part of the NIST SRE 2010 task, common condition 5. The proposed supervised i-vector approach outperforms the i-vector baseline by a relative 12% and 7% in terms of equal error rate (EER) and normalized old minDCF, respectively.
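The label-regularized idea can be pictured as one joint reconstruction problem. The sketch below is a heavily simplified least-squares view with made-up toy dimensions; the paper uses probabilistic factor analysis rather than this plain regression:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: supervector size D, i-vector size R, L class labels.
D, R, L = 20, 5, 3

T = rng.normal(size=(D, R))      # total-variability (factor loading) matrix
W = rng.normal(size=(L, R))      # linear classifier mapping i-vectors to labels
m = rng.normal(size=D)           # centered first-order statistics supervector
y = np.array([1.0, 0.0, 0.0])    # one-hot label vector for the utterance
lam = 0.5                        # weight on the label-regularization term (assumed)

# Stack the supervector reconstruction and label reconstruction into one
# least-squares problem:  minimize ||m - T w||^2 + lam * ||y - W w||^2
A = np.vstack([T, np.sqrt(lam) * W])
b = np.concatenate([m, np.sqrt(lam) * y])
w_sup, *_ = np.linalg.lstsq(A, b, rcond=None)
print("supervised i-vector:", w_sup)
```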
Spoken Language Technology Workshop | 2012
Fabrizio Morbini; Kartik Audhkhasi; Ron Artstein; Maarten Van Segbroeck; Kenji Sagae; Panayiotis G. Georgiou; David R. Traum; Shrikanth Narayanan
We address the challenge of interpreting spoken input in a conversational dialogue system with an approach that aims to exploit the close relationship between the tasks of speech recognition and language understanding through joint modeling of these two tasks. Instead of using a standard pipeline approach where the output of a speech recognizer is the input of a language understanding module, we merge multiple speech recognition and utterance classification hypotheses into one list to be processed by a joint reranking model. We obtain substantially improved performance in language understanding in experiments with thousands of user utterances collected from a deployed spoken dialogue system.
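As a rough illustration of merging recognition and understanding hypotheses into one reranked list, here is a sketch with a hand-rolled linear scorer. The class, field names, and weights are hypothetical; the paper learns a joint reranking model rather than using fixed weights:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str           # recognized word sequence
    intent: str         # utterance-level interpretation
    asr_score: float    # score from the speech recognizer
    nlu_score: float    # score from the utterance classifier

def rerank(hypotheses, w_asr=1.0, w_nlu=1.0):
    """Rerank a merged list of ASR/understanding hypotheses with a
    linear combination of their scores (weights are illustrative)."""
    return sorted(hypotheses,
                  key=lambda h: w_asr * h.asr_score + w_nlu * h.nlu_score,
                  reverse=True)

merged = [
    Hypothesis("book a flight", "flight_booking", asr_score=-2.1, nlu_score=0.9),
    Hypothesis("book a fight", "unknown", asr_score=-1.8, nlu_score=0.2),
]
best = rerank(merged, w_asr=0.5, w_nlu=2.0)[0]
print(best.text, "->", best.intent)
```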
IEEE Transactions on Audio, Speech, and Language Processing | 2015
Maarten Van Segbroeck; Ruchir Travadi; Shrikanth Narayanan
A critical challenge to automatic language identification (LID) is achieving accurate performance with the shortest possible speech segment in a rapid fashion. The accuracy of correctly identifying the spoken language is highly sensitive to the duration of speech and is bounded by the amount of information available. The proposed approach for rapid language identification transforms the utterances to a low-dimensional i-vector representation upon which language classification methods are applied. In order to meet the challenges involved in rapidly making reliable decisions about the spoken language, a highly accurate and computationally efficient framework of i-vector extraction is proposed. The LID framework integrates universal background model (UBM) fused total variability modeling, which yields a single, more discriminant i-vector space and is computationally more efficient than system-level fusion. A further reduction in equal error rate is achieved by training the i-vector model on long-duration speech utterances and by deploying a robust feature extraction scheme that aims to capture the relevant language cues under various acoustic conditions. Evaluation results on the DARPA RATS data corpus suggest the potential of performing successful automated language identification on one second of speech or even shorter durations.
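Once utterances are mapped to i-vectors, a common way to classify the language is to score against per-language model vectors. The sketch below uses simple cosine scoring with random toy data as a stand-in for the paper's classification back-end:

```python
import numpy as np

def cosine_score(ivec, lang_means):
    """Score an utterance i-vector against per-language mean i-vectors
    by cosine similarity; return the best language and all scores."""
    ivec = ivec / np.linalg.norm(ivec)
    scores = {lang: float(ivec @ (mean / np.linalg.norm(mean)))
              for lang, mean in lang_means.items()}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(1)
lang_means = {"eng": rng.normal(size=10), "spa": rng.normal(size=10)}
utt_ivec = lang_means["eng"] + 0.1 * rng.normal(size=10)  # toy test utterance
print(cosine_score(utt_ivec, lang_means)[0])
```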
International Conference on Acoustics, Speech, and Signal Processing | 2013
Maarten Van Segbroeck; Shrikanth Narayanan
The sensitivity of Automatic Speech Recognition (ASR) systems to background noise in the speaking environment remains a challenging problem. Extracting noise-robust features to compensate for speech degradation due to noise has regained popularity in recent years. This paper contributes to this trend by proposing a cost-efficient denoising method that can serve as a preprocessing stage in any feature extraction scheme to boost its ASR performance. Recognition results on Aurora2 show that a noise-robust front-end is obtained when the method is combined with noise masking and feature normalization. Without requiring high computational cost, the method achieves recognition results comparable to other state-of-the-art noise compensation methods.
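To make the preprocessing idea concrete, here is a toy sketch of noise subtraction with a spectral floor (a crude form of noise masking) followed by mean/variance feature normalization. The noise estimate from leading frames and all parameter values are assumptions for illustration, not the paper's method:

```python
import numpy as np

def denoise_and_normalize(log_spec, noise_frames=10, floor=1e-3):
    """Toy preprocessing: estimate noise from the leading frames,
    subtract it in the power domain with a spectral floor, then
    apply per-dimension mean/variance normalization."""
    power = np.exp(log_spec)                   # back to power domain
    noise = power[:noise_frames].mean(axis=0)  # leading frames assumed noise-only
    cleaned = np.maximum(power - noise, floor * power)
    feats = np.log(cleaned)
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

rng = np.random.default_rng(2)
log_spec = rng.normal(size=(100, 40))          # frames x frequency bins (toy data)
print(denoise_and_normalize(log_spec).shape)
```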
International Conference on Acoustics, Speech, and Signal Processing | 2014
Naveen Kumar; Maarten Van Segbroeck; Kartik Audhkhasi; Peter Drotár; Shrikanth Narayanan
We present a framework for combining different denoising front-ends for robust speech enhancement for recognition in noisy conditions. This is contrasted against results of optimally fusing diverse parameter settings for a single denoising algorithm. All front-ends in the latter case exploit the same denoising algorithm, which combines harmonic decomposition with noise estimation and spectral subtraction. The set of associated parameters involved in these steps depends on the noise conditions. Rather than explicitly tuning them, we suggest a strategy that accounts for the trade-off between average word error rate and diversity to find an optimal subset of these parameter settings. We present results on the Aurora4 database and compare against traditional speech enhancement methods such as Wiener filtering and spectral subtraction; a sketch of the selection strategy follows.
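A minimal sketch of greedy subset selection under a WER/diversity trade-off, as described above. The objective weighting, the pairwise-disagreement diversity measure, and the toy numbers are assumptions; the paper's actual selection criterion may differ:

```python
import numpy as np

def select_frontends(wer, disagreement, k=3, alpha=0.5):
    """Greedily pick k front-end parameter settings, trading off low
    average WER against diversity (mean pairwise disagreement)."""
    chosen = [int(np.argmin(wer))]              # start from the best single system
    while len(chosen) < k:
        best, best_obj = None, -np.inf
        for cand in range(len(wer)):
            if cand in chosen:
                continue
            subset = chosen + [cand]
            avg_wer = np.mean([wer[i] for i in subset])
            div = np.mean([disagreement[i][j]
                           for i in subset for j in subset if i != j])
            obj = alpha * div - (1 - alpha) * avg_wer
            if obj > best_obj:
                best, best_obj = cand, obj
        chosen.append(best)
    return chosen

wer = np.array([0.20, 0.22, 0.25, 0.21])        # toy WER per parameter setting
disagreement = np.array([[0, .1, .3, .2],       # toy pairwise disagreement rates
                         [.1, 0, .2, .1],
                         [.3, .2, 0, .3],
                         [.2, .1, .3, 0]])
print(select_frontends(wer, disagreement, k=2))
```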
Conference of the International Speech Communication Association | 2013
Maarten Van Segbroeck; Andreas Tsiartas; Shrikanth Narayanan
Conference of the International Speech Communication Association | 2013
Andreas Tsiartas; Theodora Chaspari; Nassos Katsamanis; Ming Li; Maarten Van Segbroeck; Alexandros Potamianos; Shrikanth Narayanan
Conference of the International Speech Communication Association | 2014
Maarten Van Segbroeck; Ruchir Travadi; Colin Vaz; Jangwon Kim; Matthew P. Black; Alexandros Potamianos; Shrikanth Narayanan