David Imseng
Idiap Research Institute
Publications
Featured research published by David Imseng.
International Conference on Acoustics, Speech, and Signal Processing | 2014
Ngoc Thang Vu; David Imseng; Daniel Povey; Petr Motlicek; Tanja Schultz
This paper presents a study on multilingual deep neural network (DNN) based acoustic modeling and its application to new languages. We investigate the effect of phone merging on multilingual DNNs in the context of rapid language adaptation. Moreover, the combination of multilingual DNNs with Kullback-Leibler divergence based acoustic modeling (KL-HMM) is explored. Using ten different languages from the GlobalPhone database, our studies reveal that crosslingual acoustic model transfer through multilingual DNNs is superior to unsupervised RBM pre-training and greedy layer-wise supervised training. We also found that KL-HMM based decoding consistently outperforms conventional hybrid decoding, especially in low-resource scenarios. Furthermore, the experiments indicate that multilingual DNN training benefits equally from simple phoneset concatenation and manually derived universal phonesets.
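The KL-HMM decoding referenced above replaces the usual likelihood computation with a divergence between DNN posteriors and per-state distributions. Below is a minimal sketch of that local score, assuming the common KL-HMM formulation in which each HMM state holds a categorical distribution over the phoneme classes; the clipping constant is an illustrative numerical guard, not a value from the paper.

```python
import numpy as np

def kl_local_score(y_d, z_t, eps=1e-10):
    """KL(y_d || z_t): cost of emitting DNN posterior frame z_t in an
    HMM state parameterized by the categorical distribution y_d."""
    y = np.clip(y_d, eps, 1.0)   # guard against log(0)
    z = np.clip(z_t, eps, 1.0)
    return np.sum(y * np.log(y / z))

# During decoding, lower scores mean a better match between the state's
# multilingual phoneme distribution and the observed posterior frame.
```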
International Conference on Acoustics, Speech, and Signal Processing | 2012
David Imseng; Philip N. Garner
Setting out from the point of view that automatic speech recognition (ASR) ought to benefit from data in languages other than the target language, we propose a novel Kullback-Leibler (KL) divergence based method that is able to exploit multilingual information in the form of universal phoneme posterior probabilities conditioned on the acoustics. We formulate a means to train a recognizer on several different languages, and subsequently recognize speech in a target language for which only a small amount of data is available. Taking the Greek SpeechDat(II) data as an example, we show that the proposed formulation is sound and that it is able to outperform a current state-of-the-art HMM/GMM system. We also use a hybrid Tandem-like system to further understand the source of the benefit.
IEEE Transactions on Audio, Speech, and Language Processing | 2012
Gerald Friedland; Adam Janin; David Imseng; Xavier Anguera Miro; Luke R. Gottlieb; Marijn Huijbregts; Mary Tai Knox; Oriol Vinyals
The speaker diarization system developed at the International Computer Science Institute (ICSI) has played a prominent role in the speaker diarization community, and many researchers in the rich transcription community have adopted methods and techniques developed for the ICSI speaker diarization engine. Although there have been many related publications over the years, previous articles only presented changes and improvements rather than a description of the full system. Attempting to replicate the ICSI speaker diarization system as a complete entity would require an extensive literature review, and might ultimately fail due to component description version mismatches. This paper therefore presents the first full conceptual description of the ICSI speaker diarization system as presented to the National Institute of Standards and Technology Rich Transcription 2009 (NIST RT-09) evaluation, which consists of online and offline subsystems, multi-stream and single-stream implementations, and audio and audio-visual approaches. Some of the components, such as the online system, have not been previously described. The paper also includes all necessary preprocessing steps, such as Wiener filtering, speech activity detection, and beamforming.
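At the heart of such agglomerative diarization systems is a BIC-style merge test between cluster pairs. The sketch below illustrates the parameter-matched variant associated with this line of work, in which the merged GMM receives the combined Gaussian count so the BIC penalty term cancels; the component counts and the use of scikit-learn are illustrative assumptions, not the system's actual configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def merge_score(feats_a, feats_b, n_components=5):
    """Delta-BIC style score for two clusters of MFCC frames; a positive
    value suggests the clusters are the same speaker and should merge."""
    merged = np.vstack([feats_a, feats_b])
    gmm_a = GaussianMixture(n_components).fit(feats_a)
    gmm_b = GaussianMixture(n_components).fit(feats_b)
    # Parameter-matched merge: the joint model gets the combined
    # Gaussian count, so the usual BIC penalty term cancels out.
    gmm_ab = GaussianMixture(2 * n_components).fit(merged)
    ll_separate = (gmm_a.score(feats_a) * len(feats_a)
                   + gmm_b.score(feats_b) * len(feats_b))
    ll_joint = gmm_ab.score(merged) * len(merged)
    return ll_joint - ll_separate
```

At each iteration the pair with the highest positive score is merged, and clustering stops when no pair scores above zero.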
IEEE Automatic Speech Recognition and Understanding Workshop | 2011
David Imseng; Ramya Rasipuram; Mathew Magimai.-Doss
One of the main challenges in non-native speech recognition is how to handle the acoustic variability present in multi-accented non-native speech with a limited amount of training data. In this paper, we investigate an approach that addresses this challenge by using Kullback-Leibler divergence based hidden Markov models (KL-HMM). More precisely, the acoustic variability in the multi-accented speech is handled by using multilingual phoneme posterior probabilities, estimated by a multilayer perceptron trained on auxiliary data, as input features for the KL-HMM system. With limited training data, we then build better acoustic models by exploiting the fact that the KL-HMM system has fewer parameters. On the HIWIRE corpus, the proposed approach yields a word error rate (WER) of 1.9% with 149 minutes of training data and a WER of 5.5% with only 2 minutes of training data.
Speech Communication | 2014
David Imseng; Petr Motlicek; Philip N. Garner
Under-resourced speech recognizers may benefit from data in languages other than the target language. In this paper, we report how to boost the performance of an Afrikaans automatic speech recognition system by using already available Dutch data. We successfully exploit available multilingual resources through (1) posterior features, estimated by multilayer perceptrons (MLP), and (2) subspace Gaussian mixture models (SGMMs). Both the MLPs and the SGMMs can be trained on out-of-language data. We use three different acoustic modeling techniques, namely Tandem, Kullback-Leibler divergence based HMMs (KL-HMM), and SGMMs, and show that the proposed multilingual systems yield a 12% relative improvement compared to a conventional monolingual HMM/GMM system trained only on Afrikaans. We also show that KL-HMMs are extremely powerful for under-resourced languages: using only six minutes of Afrikaans data (in combination with out-of-language data), KL-HMM yields about a 30% relative improvement compared to conventional maximum likelihood linear regression and maximum a posteriori based acoustic model adaptation.
IEEE Transactions on Audio, Speech, and Language Processing | 2010
David Imseng; Gerald Friedland
This paper investigates a typical speaker diarization system regarding its robustness against initialization parameter variation and presents a method that significantly reduces the manual tuning of these values. The behavior of an agglomerative hierarchical clustering system is studied to determine which initialization parameters impact accuracy most. We show that the accuracy of typical systems is indeed very sensitive to the values chosen for the initialization parameters and to factors such as the duration of speech in the recording. We then present a solution that reduces the sensitivity to the initialization values and therefore significantly reduces the need for manual tuning, while at the same time increasing the accuracy of the system. For short meetings extracted from the previous (2006, 2007, and 2009) National Institute of Standards and Technology (NIST) Rich Transcription (RT) evaluation data, the decrease in the diarization error rate is up to 50% relative. The approach consists of a novel initialization parameter estimation method for speaker diarization that uses agglomerative clustering with the Bayesian information criterion (BIC) and Gaussian mixture models (GMMs) of frame-based cepstral features (MFCCs). The estimation method balances the relationship between the optimal number of seconds of speech data per Gaussian and the duration of the speech data, and is combined with a novel nonuniform initialization method. This approach results in a system that performs better than the current ICSI baseline engine on datasets of the NIST RT evaluations of the years 2006, 2007, and 2009.
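A hedged sketch of the duration-dependent initialization idea: instead of fixing the initial model complexity by hand, derive the number of Gaussians from the amount of detected speech, then derive the initial cluster count from it. The constants below are illustrative placeholders, not the values estimated in the paper.

```python
def estimate_init_params(speech_seconds,
                         secs_per_gaussian=2.5,    # assumed target ratio
                         gaussians_per_cluster=5):  # assumed cluster size
    """Map the detected speech duration to initialization parameters
    for agglomerative speaker clustering."""
    n_gaussians = max(1, round(speech_seconds / secs_per_gaussian))
    n_init_clusters = max(2, n_gaussians // gaussians_per_cluster)
    return n_init_clusters, n_gaussians

# e.g. a meeting with 200 s of detected speech:
# estimate_init_params(200) -> (16, 80)
```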
Spoken Language Technology Workshop | 2012
David Imseng; Holger Caesar; Philip N. Garner; Gwénolé Lecorvé; Alexandre Nanchen
MediaParl is a Swiss accented bilingual database containing recordings in both French and German as they are spoken in Switzerland. The data were recorded at the Valais Parliament. Valais is a bilingual Swiss canton with many local accents and dialects. The database therefore contains data with high variability and is suitable for studying multilingual, accented, and non-native speech recognition as well as language identification and language switch detection. We also define monolingual and mixed language automatic speech recognition and language identification tasks and evaluate baseline systems. The database is publicly available for download.
IEEE Automatic Speech Recognition and Understanding Workshop | 2009
David Imseng; Gerald Friedland
We investigate a state-of-the-art speaker diarization system regarding its behavior on meetings that are much shorter (from 500 seconds down to 100 seconds) than those typically analyzed in speaker diarization benchmarks. First, the problems inherent to this task are analyzed. Then, we propose an approach that consists of a novel initialization parameter estimation method for typical state-of-the-art diarization approaches. The estimation method balances the relationship between the optimal duration of speech data per Gaussian and the duration of the speech data, which is verified experimentally for the first time in this article. As a result, the diarization error rate for short meetings extracted from the 2006, 2007, and 2009 NIST RT evaluation data is decreased by up to 50% relative.
International Conference on Acoustics, Speech, and Signal Processing | 2011
David Imseng; Mathew Magimai.-Doss; John Dines
This paper presents a new approach to estimate “universal” phoneme posterior probabilities for mixed language speech recognition. More specifically, we propose a new theoretical framework to combine phoneme class posterior probabilities in a principled way by using (statistical) evidence about the language identity. We investigate the proposed approach in a mixed language environment (SpeechDat(II)) consisting of five European languages. Our studies show that the proposed approach can yield significant improvements on a mixed language task, while maintaining the performance on monolingual tasks. Additionally, through a case study, we also demonstrate the potential benefits of the proposed approach for non-native speech recognition.
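One plausible reading of such a combination is a marginalization of language-conditional phoneme posteriors over the language identity posterior; the sketch below states the principle, not necessarily the paper's exact estimator:

```latex
% Universal posterior for phoneme class \phi_k at acoustic frame x_t,
% combining per-language posteriors weighted by language evidence:
\[
  P(\phi_k \mid x_t) \;=\; \sum_{l=1}^{L} P(\phi_k \mid x_t, \lambda_l)\, P(\lambda_l \mid x_t)
\]
% \lambda_l denotes the l-th language; P(\lambda_l \mid x_t) is the
% (statistical) language identity evidence mentioned in the abstract.
```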
International Conference on Acoustics, Speech, and Signal Processing | 2015
Ivan Himawan; Petr Motlicek; David Imseng; Blaise Potard; Nam-hoon Kim; Jae-won Lee
Automatic speech recognition from distant microphones is a difficult task because recordings are affected by reverberation and background noise. First, the application of deep neural network (DNN)/hidden Markov model (HMM) hybrid acoustic models to the distant speech recognition task is investigated using the AMI meeting corpus. This paper then proposes a feature transformation for removing reverberation and background noise artefacts from bottleneck features, using a DNN trained to learn the mapping between distant-talking speech features and close-talking speech bottleneck features. Experimental results on the AMI meeting corpus reveal that the mismatch between close-talking and distant-talking conditions is largely reduced, with about a 16% relative improvement over a conventional bottleneck system (trained on close-talking speech). If the feature mapping is applied to close-talking speech, a minor degradation of 4% relative is observed.
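A minimal sketch of the feature-mapping idea in PyTorch: a small regression network trained to map context-expanded distant-talking features to close-talking bottleneck targets. The layer sizes, sigmoid activations, input/output dimensions, and MSE objective are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 40-dim features with +/-5 frames of context
# in, 40-dim close-talk bottleneck features out.
mapper = nn.Sequential(
    nn.Linear(440, 1024),
    nn.Sigmoid(),
    nn.Linear(1024, 1024),
    nn.Sigmoid(),
    nn.Linear(1024, 40),
)
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(distant_feats, close_bn_feats):
    """One gradient step on a minibatch of parallel distant/close frames."""
    optimizer.zero_grad()
    loss = loss_fn(mapper(distant_feats), close_bn_feats)
    loss.backward()
    optimizer.step()
    return loss.item()

# The mapped outputs would then replace the bottleneck features fed to
# the DNN/HMM back-end when decoding distant-talking speech.
```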