Philip N. Garner
Idiap Research Institute
Publications
Featured research published by Philip N. Garner.
IEEE Transactions on Audio, Speech, and Language Processing | 2012
Thomas Hain; Lukas Burget; John Dines; Philip N. Garner; Frantisek Grezl; Asmaa El Hannani; Marijn Huijbregts; Martin Karafiát; Mike Lincoln; Vincent Wan
In this paper, we give an overview of the AMIDA systems for transcription of conference and lecture room meetings. The systems were developed for participation in the Rich Transcription evaluations conducted by the National Institute of Standards and Technology in the years 2007 and 2009 and can process close-talking and far-field microphone recordings. The paper first discusses fundamental properties of meeting data, with special focus on the AMI/AMIDA corpora. This is followed by a description and analysis of improved processing and modeling, focusing on techniques that specifically address meeting transcription issues such as multi-room recordings or domain variability. In 2007 and 2009, two different system-building strategies were followed. While in 2007 we used our traditional system design based on cross-adaptation, the 2009 systems were constructed semi-automatically, supported by improved decoders and a new method for system representation. Overall, these changes gave a 6%-13% relative reduction in word error rate compared to our 2007 results, while at the same time requiring less training material and reducing the real-time factor by a factor of five. The meeting transcription systems are available at www.webasr.org.
IEEE Transactions on Audio, Speech, and Language Processing | 2009
Kenichi Kumatani; John W. McDonough; Barbara Rauch; Dietrich Klakow; Philip N. Garner; Weifeng Li
In this paper, we address a beamforming application based on the capture of far-field speech data from a single speaker in a real meeting room. After the position of the speaker is estimated by a speaker tracking system, we construct a subband-domain beamformer in generalized sidelobe canceller (GSC) configuration. In contrast to conventional practice, we then optimize the active weight vectors of the GSC so as to obtain an output signal with maximum negentropy (MN). This implies that the beamformer output should be as non-Gaussian as possible. For calculating negentropy, we consider the Gamma and the generalized Gaussian (GG) pdfs. After MN beamforming, Zelinski postfiltering is performed to further enhance the speech by removing residual noise. Our beamforming algorithm can suppress noise and reverberation without the signal cancellation problems encountered in conventional beamforming algorithms. We demonstrate this fact through a set of acoustic simulations. Moreover, we show the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on the Multi-Channel Wall Street Journal Audio Visual Corpus (MC-WSJ-AV), a corpus of data captured with real far-field sensors, in a realistic acoustic environment, and spoken by real speakers. On the MC-WSJ-AV evaluation data, the simple delay-and-sum beamformer provided a word error rate (WER) of 17.8%, which postfiltering reduced to 16.5%. MN beamforming with the Gamma pdf achieved a 15.8% WER, which was further reduced to 13.2% with the GG pdf. To the best of our knowledge, no lower error rates have been reported in the literature on this automatic speech recognition (ASR) task.
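As a rough illustration of the non-Gaussianity criterion, the toy single-subband sketch below builds a GSC and tunes its active weights to maximize an empirical negentropy proxy. It substitutes a FastICA-style log-cosh contrast for the paper's Gamma and GG pdfs, and the steering vector, data and variable names are all assumptions for illustration, not the authors' implementation.

```python
# Toy maximum-negentropy GSC for one subband (illustrative only).
import numpy as np
from scipy.linalg import null_space
from scipy.optimize import minimize

rng = np.random.default_rng(0)
M, T = 4, 2000                              # microphones, subband snapshots
d = np.exp(-2j * np.pi * rng.random(M))     # toy steering vector

# Super-Gaussian target plus Gaussian diffuse noise.
s = rng.laplace(size=T) * np.exp(2j * np.pi * rng.random(T))
X = np.outer(d, s) + 0.5 * (rng.standard_normal((M, T))
                            + 1j * rng.standard_normal((M, T)))

w_q = d / np.real(d.conj() @ d)             # quiescent (delay-and-sum) weights
B = null_space(d.conj()[None, :])           # blocking matrix, shape (M, M-1)
g_ref = np.mean(np.log(np.cosh(rng.standard_normal(10 * T))))  # Gaussian reference

def cost(wa_ri):                            # negative negentropy proxy
    wa = wa_ri[:M - 1] + 1j * wa_ri[M - 1:]
    y = np.abs((w_q - B @ wa).conj() @ X)   # GSC output magnitude
    y = (y - y.mean()) / y.std()            # contrast must be scale-free
    return -(np.mean(np.log(np.cosh(y))) - g_ref) ** 2

res = minimize(cost, np.zeros(2 * (M - 1)), method="Nelder-Mead")
print("negentropy proxy after optimisation:", -res.fun)
```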
International Conference on Acoustics, Speech, and Signal Processing | 2012
David Imseng; Philip N. Garner
Setting out from the point of view that automatic speech recognition (ASR) ought to benefit from data in languages other than the target language, we propose a novel Kullback-Leibler (KL) divergence based method that is able to exploit multilingual information in the form of universal phoneme posterior probabilities conditioned on the acoustics. We formulate a means to train a recognizer on several different languages, and subsequently recognize speech in a target language for which only a small amount of data is available. Taking the Greek SpeechDat(II) data as an example, we show that the proposed formulation is sound and that it outperforms a current state-of-the-art HMM/GMM system. We also use a hybrid Tandem-like system to further understand the source of the benefit.
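For illustration, the local score at the heart of such a KL divergence based decoder could look like the sketch below: each HMM state holds a reference multinomial over universal phonemes, and emitting an MLP posterior vector in that state costs the KL divergence between the two distributions. The names and toy numbers are assumptions.

```python
import numpy as np

def kl_local_score(y_state, z_t, eps=1e-10):
    """KL(y_state || z_t) between a state's reference multinomial and an
    MLP posterior vector over universal phonemes (lower cost = better match)."""
    y = np.clip(y_state, eps, 1.0)
    z = np.clip(z_t, eps, 1.0)
    return float(np.sum(y * np.log(y / z)))

# Toy example: a state dominated by phoneme 0 scored against two frames.
y_i = np.array([0.8, 0.1, 0.1])
print(kl_local_score(y_i, np.array([0.7, 0.2, 0.1])))  # small cost: good match
print(kl_local_score(y_i, np.array([0.1, 0.2, 0.7])))  # large cost: poor match
```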
Joint Workshop on Hands-free Speech Communication and Microphone Arrays | 2011
Mohammad Javad Taghizadeh; Philip N. Garner; Hamid Reza Abutalebi; Afsaneh Asaei
Two major challenges in microphone array based adaptive beamforming, speech enhancement and distant speech recognition are robust and accurate source localization and voice activity detection. This paper introduces a spatial gradient steered response power using the phase transform (SRP-PHAT) method that is capable of localizing competing speakers in overlapping conditions. We further investigate the behavior of the SRP function and theoretically characterize a fixed point in its search space for the diffuse noise field. We call this fixed point the null position in the SRP search space. Building on this evidence, we propose a technique for multichannel voice activity detection (MVAD) based on detecting a maximum power corresponding to the null position. The gradient SRP-PHAT in tandem with the MVAD forms an integrated framework for multi-source localization and voice activity detection. Experiments carried out on real data recordings show that this framework is very effective in practical applications of hands-free communication.
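The gradient-steered variant and the null-position MVAD are beyond a short sketch, but the underlying SRP-PHAT functional they build on might be computed per frame roughly as follows; the geometry convention and the candidate grid are assumptions for illustration.

```python
import numpy as np

C = 343.0  # speed of sound, m/s

def srp_phat(frames, mic_pos, cand_pos, fs):
    """SRP-PHAT for one frame.  frames: (n_mics, n_samples);
    mic_pos: (n_mics, 3); cand_pos: iterable of candidate 3-D positions."""
    n_mics, n = frames.shape
    F = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    power = np.zeros(len(cand_pos))
    for p, x in enumerate(cand_pos):
        # propagation delay from candidate position to each microphone
        tau = np.linalg.norm(mic_pos - x, axis=1) / C
        steer = np.exp(2j * np.pi * freqs[None, :] * tau[:, None])
        for i in range(n_mics):
            for j in range(i + 1, n_mics):
                g = F[i] * np.conj(F[j])
                g /= np.abs(g) + 1e-12                 # PHAT weighting
                power[p] += np.real(np.sum(g * steer[i] * np.conj(steer[j])))
    return power  # argmax over cand_pos gives the location estimate
```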
International Conference on Acoustics, Speech, and Signal Processing | 2008
Kenichi Kumatani; John W. McDonough; S. Schachl; Dietrich Klakow; Philip N. Garner; Weifeng Li
This paper presents new filter bank design methods for subband adaptive beamforming. In this work, we design analysis and synthesis prototypes for modulated filter banks so as to minimize each aliasing term individually. We then drive the total response error to null by constraining these prototypes to be Nyquist(M) filters. Thereafter, those modulated filter banks are applied to a speech separation system which extracts a target speech signal. In our system, speech signals are first transformed into the subband domain with our filter banks, and the subband components are then processed with a beamforming algorithm. Following beamforming, post-filtering and binary masking are further performed to remove residual noise. We show that our filter banks can suppress the residual aliasing distortion better than conventional ones. Furthermore, we demonstrate the effectiveness of our design techniques through a set of automatic speech recognition experiments on the multi-channel speech data from the PASCAL Speech Separation Challenge. The experimental results show that our beamforming system with the proposed filter banks achieves the best recognition performance, a 39.6% word error rate (WER), with half the computation of the conventional filter banks, whereas the perfect reconstruction filter banks provided a 44.4% WER.
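As a small illustration of the Nyquist(M) constraint that drives the total response error to null, the sketch below checks the property on a windowed-sinc prototype (every M-th tap zero except the centre one). The prototype itself is just an example of a filter satisfying the constraint, not the paper's optimized design.

```python
import numpy as np

M = 8                          # number of subbands
L = 8 * M + 1                  # odd prototype length, centre tap at L // 2
n = np.arange(L) - L // 2
h = np.sinc(n / M) / M         # zero at every nonzero multiple of M
h *= np.hamming(L)             # windowing preserves those zero crossings

taps = h[L // 2 :: M]          # centre tap, then every M-th tap after it
print(taps[0], np.max(np.abs(taps[1:])))   # ~1/M and ~0: Nyquist(M) holds
```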
IEEE Signal Processing Letters | 2013
Philip N. Garner; Milos Cernak; Petr Motlicek
Recent work in text-to-speech synthesis has pointed to the benefit of using a continuous pitch estimate; that is, one that records pitch even when voicing is not present. Such an approach typically requires interpolation. The purpose of this letter is to show that a continuous pitch estimate is available from a combination of otherwise well-known techniques. Further, in the case of an autocorrelation based estimate, the continuous requirement negates the need for other heuristics to correct for common errors. An algorithm is suggested, illustrated, and demonstrated using a parametric vocoder.
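A minimal sketch of the continuous-output aspect, assuming a plain autocorrelation estimator: every frame yields an f0 value whether or not it is voiced, so no interpolation step is needed downstream. The letter's actual algorithm combines further techniques not reproduced here, and the frame sizes and search range below are assumptions.

```python
import numpy as np

def continuous_f0(x, fs, frame=0.04, hop=0.01, fmin=60.0, fmax=400.0):
    """Return one f0 value per frame, voiced or not (no voicing decision)."""
    n, h = int(frame * fs), int(hop * fs)
    lo, hi = int(fs / fmax), int(fs / fmin)       # lag search range
    track = []
    for start in range(0, len(x) - n, h):
        w = x[start:start + n] * np.hanning(n)
        r = np.correlate(w, w, mode="full")[n - 1:]   # autocorrelation
        lag = lo + np.argmax(r[lo:hi])                # best lag in range
        track.append(fs / lag)                        # f0 even if unvoiced
    return np.array(track)

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 120 * t)          # toy 120 Hz tone
print(continuous_f0(x, fs)[:5])          # approximately 120 Hz per frame
```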
International Conference on Acoustics, Speech, and Signal Processing | 2010
Lakshmi Saheer; Philip N. Garner; John Dines; Hui Liang
The advent of statistical speech synthesis has enabled the unification of the basic techniques used in speech synthesis and recognition. Adaptation techniques that have been successfully used in recognition systems can now be applied to synthesis systems to improve the quality of the synthesized speech. The application of vocal tract length normalization (VTLN) for synthesis is explored in this paper. VTLN based adaptation requires the estimation of a single warping factor, which can be estimated accurately from very little adaptation data and gives improvements additive to those of CMLLR adaptation. The challenge of estimating accurate warping factors from higher-order features is solved by initializing the warping factor estimation with values calculated from lower-order features.
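In practice, such VTLN adaptation often reduces to a grid search for the single warping factor, along the lines of the hypothetical sketch below; warp_features stands in for a real warped front end, and the diagonal-Gaussian scorer for a real acoustic model.

```python
import numpy as np

def loglik(feats, mean, var):
    """Total log-likelihood under one diagonal Gaussian (toy model)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var)
                         + (feats - mean) ** 2 / var)

def estimate_alpha(utterance, warp_features, mean, var,
                   grid=np.arange(0.80, 1.21, 0.02)):
    """Pick the warping factor whose warped features score best.
    warp_features(utterance, alpha) is a placeholder front end."""
    scores = [loglik(warp_features(utterance, a), mean, var) for a in grid]
    return grid[int(np.argmax(scores))]
```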
Speech Communication | 2014
David Imseng; Petr Motlicek; Philip N. Garner
Under-resourced speech recognizers may benefit from data in languages other than the target language. In this paper, we report how to boost the performance of an Afrikaans automatic speech recognition system by using already available Dutch data. We successfully exploit available multilingual resources through (1) posterior features, estimated by multilayer perceptrons (MLP) and (2) subspace Gaussian mixture models (SGMMs). Both the MLPs and the SGMMs can be trained on out-of-language data. We use three different acoustic modeling techniques, namely Tandem, Kullback-Leibler divergence based HMMs (KL-HMM) as well as SGMMs and show that the proposed multilingual systems yield 12% relative improvement compared to a conventional monolingual HMM/GMM system only trained on Afrikaans. We also show that KL-HMMs are extremely powerful for under-resourced languages: using only six minutes of Afrikaans data (in combination with out-of-language data), KL-HMM yields about 30% relative improvement compared to conventional maximum likelihood linear regression and maximum a posteriori based acoustic model adaptation.
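As an illustration of the Tandem technique mentioned above, MLP posteriors can be logged, decorrelated with PCA, and appended to the cepstral features before conventional HMM/GMM training; the dimensions in this sketch are assumptions.

```python
import numpy as np

def tandem_features(cepstra, posteriors, keep=25, eps=1e-10):
    """cepstra: (T, d); posteriors: (T, n_phones) from an MLP.
    Returns (T, d + keep) concatenated Tandem features."""
    logp = np.log(posteriors + eps)
    logp -= logp.mean(axis=0)                      # centre before PCA
    _, _, vt = np.linalg.svd(logp, full_matrices=False)
    reduced = logp @ vt[:keep].T                   # PCA projection
    return np.hstack([cepstra, reduced])           # concatenated features
```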
Speech Communication | 2011
Philip N. Garner
Cepstral normalisation in automatic speech recognition is investigated in the context of robustness to additive noise. In this paper, it is argued that such normalisation leads naturally to a speech feature based on signal to noise ratio rather than absolute energy (or power). Explicit calculation of this SNR-cepstrum by means of a noise estimate is shown to have theoretical and practical advantages over the usual (energy based) cepstrum. The relationship between the SNR-cepstrum and the articulation index, known in psycho-acoustics, is discussed. Experiments are presented suggesting that the combination of the SNR-cepstrum with the well known perceptual linear prediction method can be beneficial in noisy environments.
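A minimal sketch of such an SNR-cepstrum front end, assuming a deliberately crude noise estimate (the mean of the first few frames): the power spectrum is replaced by a per-bin SNR before the usual mel, log and DCT stages.

```python
import numpy as np
from scipy.fft import dct

def snr_cepstrum(frames, mel_fbank, n_noise=10, n_ceps=13, eps=1e-10):
    """frames: (T, n_fft) windowed frames; mel_fbank: (n_mels, n_bins).
    Returns (T, n_ceps) SNR-based cepstra."""
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    noise = spec[:n_noise].mean(axis=0) + eps      # crude noise PSD estimate
    snr = spec / noise                             # per-bin SNR features
    logmel = np.log(mel_fbank @ snr.T + eps)       # mel filter bank + log
    return dct(logmel, type=2, axis=0, norm="ortho")[:n_ceps].T
```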
Spoken Language Technology Workshop | 2012
David Imseng; Holger Caesar; Philip N. Garner; Gwénolé Lecorvé; Alexandre Nanchen
MediaParl is a Swiss accented bilingual database containing recordings in both French and German as they are spoken in Switzerland. The data were recorded at the Valais Parliament. Valais is a bilingual Swiss canton with many local accents and dialects. The database therefore contains data with high variability and is suitable for studying multilingual, accented and non-native speech recognition, as well as language identification and language switch detection. We also define monolingual and mixed-language automatic speech recognition and language identification tasks and evaluate baseline systems. The database is publicly available for download.