
Publication


Featured research published by John W. McDonough.


Archive | 2009

Distant speech recognition

Matthias Wölfel; John W. McDonough

Table of contents:

Foreword. Preface.
1 Introduction. 1.1 Research and Applications in Academia and Industry. 1.2 Challenges in Distant Speech Recognition. 1.3 System Evaluation. 1.4 Fields of Speech Recognition. 1.5 Robust Perception. 1.6 Organizations, Conferences and Journals. 1.7 Useful Tools, Data Resources and Evaluation Campaigns. 1.8 Organization of this Book. 1.9 Principal Symbols used Throughout the Book. 1.10 Units used Throughout the Book.
2 Acoustics. 2.1 Physical Aspect of Sound. 2.2 Speech Signals. 2.3 Human Perception of Sound. 2.4 The Acoustic Environment. 2.5 Recording Techniques and Sensor Configuration. 2.6 Summary and Further Reading. 2.7 Principal Symbols.
3 Signal Processing and Filtering Techniques. 3.1 Linear Time-Invariant Systems. 3.2 The Discrete Fourier Transform. 3.3 Short-Time Fourier Transform. 3.4 Summary and Further Reading. 3.5 Principal Symbols.
4 Bayesian Filters. 4.1 Sequential Bayesian Estimation. 4.2 Wiener Filter. 4.3 Kalman Filter and Variations. 4.4 Particle Filters. 4.5 Summary and Further Reading. 4.6 Principal Symbols.
5 Speech Feature Extraction. 5.1 Short-Time Spectral Analysis. 5.2 Perceptually Motivated Representation. 5.3 Spectral Estimation and Analysis. 5.4 Cepstral Processing. 5.5 Comparison between Mel Frequency, Perceptual LP and Warped MVDR Cepstral Coefficient Frontends. 5.6 Feature Augmentation. 5.7 Feature Reduction. 5.8 Feature-Space Minimum Phone Error. 5.9 Summary and Further Reading. 5.10 Principal Symbols.
6 Speech Feature Enhancement. 6.1 Noise and Reverberation in Various Domains. 6.2 Two Principal Approaches. 6.3 Direct Speech Feature Enhancement. 6.4 Schematics of Indirect Speech Feature Enhancement. 6.5 Estimating Additive Distortion. 6.6 Estimating Convolutional Distortion. 6.7 Distortion Evolution. 6.8 Distortion Evaluation. 6.9 Distortion Compensation. 6.10 Joint Estimation of Additive and Convolutional Distortions. 6.11 Observation Uncertainty. 6.12 Summary and Further Reading. 6.13 Principal Symbols.
7 Search: Finding the Best Word Hypothesis. 7.1 Fundamentals of Search. 7.2 Weighted Finite-State Transducers. 7.3 Knowledge Sources. 7.4 Fast On-the-Fly Composition. 7.5 Word and Lattice Combination. 7.6 Summary and Further Reading. 7.7 Principal Symbols.
8 Hidden Markov Model Parameter Estimation. 8.1 Maximum Likelihood Parameter Estimation. 8.2 Discriminative Parameter Estimation. 8.3 Summary and Further Reading. 8.4 Principal Symbols.
9 Feature and Model Transformation. 9.1 Feature Transformation Techniques. 9.2 Model Transformation Techniques. 9.3 Acoustic Model Combination. 9.4 Summary and Further Reading. 9.5 Principal Symbols.
10 Speaker Localization and Tracking. 10.1 Conventional Techniques. 10.2 Speaker Tracking with the Kalman Filter. 10.3 Tracking Multiple Simultaneous Speakers. 10.4 Audio-Visual Speaker Tracking. 10.5 Speaker Tracking with the Particle Filter. 10.6 Summary and Further Reading. 10.7 Principal Symbols.
11 Digital Filter Banks. 11.1 Uniform Discrete Fourier Transform Filter Banks. 11.2 Polyphase Implementation. 11.3 Decimation and Expansion. 11.4 Noble Identities. 11.5 Nyquist(M) Filters. 11.6 Filter Bank Design of de Haan et al. 11.7 Filter Bank Design with the Nyquist(M) Criterion. 11.8 Quality Assessment of Filter Bank Prototypes. 11.9 Summary and Further Reading. 11.10 Principal Symbols.
12 Blind Source Separation. 12.1 Channel Quality and Selection. 12.2 Independent Component Analysis. 12.3 BSS Algorithms based on Second-Order Statistics. 12.4 Summary and Further Reading. 12.5 Principal Symbols.
13 Beamforming. 13.1 Beamforming Fundamentals. 13.2 Beamforming Performance Measures. 13.3 Conventional Beamforming Algorithms. 13.4 Recursive Algorithms. 13.5 Nonconventional Beamforming Algorithms. 13.6 Array Shape Calibration. 13.7 Summary and Further Reading. 13.8 Principal Symbols.
14 Hands On. 14.1 Example Room Configurations. 14.2 Automatic Speech Recognition Engines. 14.3 Word Error Rate. 14.4 Single-Channel Feature Enhancement Experiments. 14.5 Acoustic Speaker-Tracking Experiments. 14.6 Audio-Video Speaker-Tracking Experiments. 14.7 Speaker-Tracking Performance vs Word Error Rate. 14.8 Single-Speaker Beamforming Experiments. 14.9 Speech Separation Experiments. 14.10 Filter Bank Experiments. 14.11 Summary and Further Reading.
Appendices. A List of Abbreviations. B Useful Background. B.1 Discrete Cosine Transform. B.2 Matrix Inversion Lemma. B.3 Cholesky Decomposition. B.4 Distance Measures. B.5 Super-Gaussian Probability Density Functions. B.6 Entropy. B.7 Relative Entropy. B.8 Transformation Law of Probabilities. B.9 Cascade of Warping Stages. B.10 Taylor Series. B.11 Correlation and Covariance. B.12 Bessel Functions. B.13 Proof of the Nyquist-Shannon Sampling Theorem. B.14 Proof of Equations (11.31-11.32). B.15 Givens Rotations. B.16 Derivatives with Respect to Complex Vectors. B.17 Perpendicular Projection Operators.
Bibliography. Index.


EURASIP Journal on Advances in Signal Processing | 2006

Kalman filters for time delay of arrival-based source localization

Ulrich Klee; Tobias Gehrig; John W. McDonough

In this work, we propose an algorithm for acoustic source localization based on time delay of arrival (TDOA) estimation. In earlier work by other authors, an initial closed-form approximation was first used to estimate the true position of the speaker, followed by a Kalman filtering stage to smooth the time series of estimates. In the proposed algorithm, this closed-form approximation is eliminated by employing a Kalman filter to directly update the speaker's position estimate based on the observed TDOAs. In particular, the TDOAs comprise the observation associated with an extended Kalman filter whose state corresponds to the speaker's position. We tested our algorithm on a data set consisting of seminars held by actual speakers. Our experiments revealed that the proposed algorithm provides source localization accuracy superior to the standard spherical and linear intersection techniques. Moreover, the proposed algorithm, although relying on an iterative optimization scheme, proved efficient enough for real-time operation.
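
The heart of the method fits in a few lines. Below is a minimal NumPy sketch of one extended Kalman filter step of this kind: the state is the speaker position, the observation is the vector of TDOAs over microphone pairs, and the dynamics are a simple random walk. The microphone geometry, noise covariances, and speed of sound are illustrative assumptions, not values from the paper.

    import numpy as np

    C = 343.0  # speed of sound in m/s (assumed)

    def tdoa_model(x, mic_pairs):
        """Predicted TDOAs for source position x over microphone pairs (mi, mj)."""
        return np.array([(np.linalg.norm(x - mi) - np.linalg.norm(x - mj)) / C
                         for mi, mj in mic_pairs])

    def tdoa_jacobian(x, mic_pairs):
        """Jacobian of the TDOA observation with respect to the source position."""
        rows = [(x - mi) / (C * np.linalg.norm(x - mi))
                - (x - mj) / (C * np.linalg.norm(x - mj))
                for mi, mj in mic_pairs]
        return np.vstack(rows)

    def ekf_update(x, P, tau_obs, mic_pairs, Q, R):
        """One predict/update step: random-walk dynamics, TDOA observations."""
        P_pred = P + Q                       # predict under identity state transition
        H = tdoa_jacobian(x, mic_pairs)      # linearize the observation about x
        S = H @ P_pred @ H.T + R             # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
        x_new = x + K @ (tau_obs - tdoa_model(x, mic_pairs))
        P_new = (np.eye(len(x)) - K @ H) @ P_pred
        return x_new, P_new

Iterating ekf_update over successive TDOA estimates yields the smoothed position track directly, with no intermediate closed-form localization step.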


International Conference on Acoustics, Speech, and Signal Processing | 1997

Speaker adaptive training: a maximum likelihood approach to speaker normalization

Tasos Anastasakos; John W. McDonough; John Makhoul

This paper describes the speaker adaptive training (SAT) approach for speaker-independent (SI) speech recognizers as a method for joint speaker normalization and estimation of the parameters of the SI acoustic models. In SAT, speaker characteristics are modeled explicitly as linear transformations of the SI acoustic parameters. The effect of inter-speaker variability in the training data is reduced, leading to parsimonious acoustic models that more accurately represent the phonetically relevant information in the speech signal. The proposed training method is applied to the Wall Street Journal (WSJ) corpus, which contains data from multiple training speakers. Experimental results in the context of batch supervised adaptation demonstrate the effectiveness of the proposed method in large-vocabulary speech recognition tasks and show that significant reductions in word error rate can be achieved over the common pooled speaker-independent paradigm.
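
To make the model concrete, here is a minimal sketch (under assumed notation, not the authors' code) of the SAT re-estimation of a single Gaussian mean when each training speaker contributes a linear transform (A, b), so that the speaker-dependent mean is A @ mu + b, together with posterior-weighted statistics:

    import numpy as np

    def sat_mean_update(stats, Sigma_inv):
        """Re-estimate the speaker-independent mean of one Gaussian under SAT.

        stats: iterable of per-speaker tuples (A, b, gamma, obs_sum), where
          A, b     the speaker's linear transform of the SI mean,
          gamma    total occupancy of this Gaussian for the speaker,
          obs_sum  posterior-weighted sum of the speaker's observations.
        Solves the normal equations of the transformed-mean log-likelihood.
        """
        d = Sigma_inv.shape[0]
        lhs = np.zeros((d, d))
        rhs = np.zeros(d)
        for A, b, gamma, obs_sum in stats:
            lhs += gamma * A.T @ Sigma_inv @ A
            rhs += A.T @ Sigma_inv @ (obs_sum - gamma * b)
        return np.linalg.solve(lhs, rhs)

In full SAT training this step alternates with re-estimation of the per-speaker transforms until the joint likelihood converges.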


International Conference on Acoustics, Speech, and Signal Processing | 1994

Approaches to topic identification on the switchboard corpus

John W. McDonough; Kenney Ng; Philippe Jeanrenaud; Herbert Gish; Jan Robin Rohlicek

Topic identification (TID) is the automatic classification of speech messages into one of a known set of possible topics. The TID task can be viewed as having three principal components: 1) event generation, 2) keyword event selection, and 3) topic modeling. Using data from the Switchboard corpus, the authors present experimental results for various approaches to the TID problem and compare the relative effectiveness of each. In addition, they examine the effect of keyword set size on identification accuracy and gauge the loss in performance when mismatched topic modeling and keyword selection schemes are used.
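
As a rough illustration of these components, the toy Python classifier below builds a smoothed unigram keyword model per topic and scores a new message by keyword log-likelihood. The smoothing scheme and data layout are assumptions made for the sketch, not details taken from the paper.

    import math
    from collections import Counter

    def train_topic_models(messages_by_topic, keywords, alpha=1.0):
        """Per-topic unigram models over a fixed keyword set, add-alpha smoothed."""
        models = {}
        for topic, messages in messages_by_topic.items():
            counts = Counter(w for msg in messages for w in msg if w in keywords)
            total = sum(counts.values()) + alpha * len(keywords)
            models[topic] = {w: (counts[w] + alpha) / total for w in keywords}
        return models

    def classify(message, models):
        """Assign the topic that maximizes the keyword log-likelihood."""
        def score(topic):
            return sum(math.log(models[topic][w]) for w in message if w in models[topic])
        return max(models, key=score)

Varying the size of the keyword set in such a setup is exactly the kind of experiment the abstract describes.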


IEEE Signal Processing Magazine | 2005

Minimum variance distortionless response spectral estimation

Matthias Wölfel; John W. McDonough

In this article, we concentrate on spectral estimation techniques that are useful in extracting the features used by automatic speech recognition (ASR) systems. As an aid to understanding the spectral estimation process for speech signals, we adopt the source-filter model of speech production as presented in X. Huang et al. (2001), wherein speech is divided into two broad classes: voiced and unvoiced. Voiced speech is quasi-periodic, consisting of a fundamental frequency corresponding to the pitch of the speaker, as well as its harmonics. Unvoiced speech is stochastic in nature and is best modeled as white noise convolved with an infinite impulse response filter.
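
The MVDR envelope itself has a compact closed form, S(omega) = 1 / (v(omega)^H R^-1 v(omega)), where R is the Toeplitz autocorrelation matrix of the frame and v(omega) is a complex sinusoid of the model order. The sketch below evaluates that formula directly; the model order and FFT length are arbitrary illustrative choices, and in practice the envelope is usually computed more cheaply from linear prediction coefficients.

    import numpy as np
    from scipy.linalg import toeplitz

    def mvdr_spectrum(frame, order=20, nfft=512):
        """MVDR spectral envelope of one (windowed) speech frame."""
        # Biased autocorrelation sequence of the frame
        r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        R_inv = np.linalg.inv(toeplitz(r[:order + 1]))
        omegas = 2 * np.pi * np.arange(nfft // 2 + 1) / nfft
        spec = np.empty(len(omegas))
        for i, w in enumerate(omegas):
            v = np.exp(1j * w * np.arange(order + 1))  # complex sinusoid vector
            spec[i] = 1.0 / np.real(v.conj() @ R_inv @ v)
        return spec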


IEEE Signal Processing Magazine | 2012

Microphone Array Processing for Distant Speech Recognition: From Close-Talking Microphones to Far-Field Sensors

Kenichi Kumatani; John W. McDonough; Bhiksha Raj

Distant speech recognition (DSR) holds the promise of the most natural human-computer interface because it enables man-machine interaction through speech without the necessity of donning intrusive body- or head-mounted microphones. Recognizing distant speech robustly, however, remains a challenge. This contribution provides a tutorial overview of DSR systems based on microphone arrays. In particular, we present recent work on acoustic beamforming for DSR, along with experimental results verifying the effectiveness of the various algorithms described here; beginning from a word error rate (WER) of 14.3% with a single microphone of a linear array, our state-of-the-art DSR system achieved a WER of 5.3%, comparable to the 4.2% obtained with a lapel microphone. Moreover, we present an emerging technology in the area of far-field audio and speech processing based on spherical microphone arrays. Performance comparisons of spherical and linear arrays reveal that a spherical array with a diameter of 8.4 cm can provide recognition accuracy comparable to or better than that obtained with a large linear array with an aperture length of 126 cm.
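
A standard building block of such microphone-array systems is the delay-and-sum beamformer: delay each microphone signal so that the desired source adds coherently, then average. A minimal frequency-domain sketch, assuming far-field propagation and a known look direction (both assumptions of this illustration):

    import numpy as np

    SPEED_OF_SOUND = 343.0  # m/s

    def delay_and_sum(snapshots, mic_positions, look_dir, freqs):
        """Frequency-domain delay-and-sum beamformer.

        snapshots:     array (num_freqs, num_mics) of subband samples X(f).
        mic_positions: array (num_mics, 3) of sensor coordinates in meters.
        look_dir:      unit vector pointing toward the far-field source.
        Returns one beamformed subband sample per frequency bin.
        """
        delays = mic_positions @ look_dir / SPEED_OF_SOUND  # per-mic delay
        out = np.empty(len(freqs), dtype=complex)
        for i, f in enumerate(freqs):
            w = np.exp(-2j * np.pi * f * delays) / len(delays)  # align, average
            out[i] = w.conj() @ snapshots[i]
        return out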


Workshop on Applications of Signal Processing to Audio and Acoustics | 2005

Kalman filters for audio-video source localization

Tobias Gehrig; Kai Nickel; Hazim Kemal Ekenel; Ulrich Klee; John W. McDonough

In prior work, we proposed using an extended Kalman filter to directly update position estimates in a speaker localization system based on time delays of arrival. We found that such a scheme provided superior tracking quality as compared with the conventional closed-form approximation methods. In this work, we enhance our audio localizer with video information. We propose an algorithm to incorporate detected face positions in different camera views into the Kalman filter without doing any explicit triangulation. This approach yields a robust source localizer that functions reliably both for segments in which the speaker is silent, which would be detrimental for an audio-only tracker, and for segments in which many faces appear, which would confuse a video-only tracker. We tested our algorithm on a data set consisting of seminars held by actual speakers. Our experiments revealed that the audio-video localizer functioned better than a localizer based solely on audio or solely on video features.
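
One way to read "without doing any explicit triangulation" is that each detected face simply contributes an additional nonlinear observation, the camera's projection of the 3D state, to the same extended Kalman filter that consumes the TDOAs. A hypothetical sketch of such an observation model (pinhole camera and finite-difference Jacobian, both assumptions of this illustration):

    import numpy as np

    def project(x, R, t, f):
        """Pinhole projection of the 3D state x into a camera with rotation R,
        translation t, and focal length f (all illustrative parameters)."""
        Xc = R @ x + t
        return f * Xc[:2] / Xc[2]

    def numerical_jacobian(h, x, eps=1e-6):
        """Finite-difference Jacobian of an observation function h at x."""
        hx = h(x)
        J = np.zeros((len(hx), len(x)))
        for k in range(len(x)):
            dx = np.zeros(len(x))
            dx[k] = eps
            J[:, k] = (h(x + dx) - hx) / eps
        return J

Stacking this projected observation and its Jacobian with the TDOA observation from the audio-only sketch above yields a single fused Kalman update; the filter never intersects camera rays explicitly.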


IEEE Transactions on Audio, Speech, and Language Processing | 2009

Beamforming With a Maximum Negentropy Criterion

Kenichi Kumatani; John W. McDonough; Barbara Rauch; Dietrich Klakow; Philip N. Garner; Weifeng Li

In this paper, we address a beamforming application based on the capture of far-field speech data from a single speaker in a real meeting room. After the position of the speaker is estimated by a speaker tracking system, we construct a subband-domain beamformer in generalized sidelobe canceller (GSC) configuration. In contrast to conventional practice, we then optimize the active weight vectors of the GSC so as to obtain an output signal with maximum negentropy (MN). This implies the beamformer output should be as non-Gaussian as possible. For calculating negentropy, we consider the Gamma and the generalized Gaussian (GG) pdfs. After MN beamforming, Zelinski postfiltering is performed to further enhance the speech by removing residual noise. Our beamforming algorithm can suppress noise and reverberation without the signal cancellation problems encountered in conventional beamforming algorithms. We demonstrate this fact through a set of acoustic simulations. Moreover, we show the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on the Multi-Channel Wall Street Journal Audio Visual Corpus (MC-WSJ-AV), a corpus of data captured with real far-field sensors, in a realistic acoustic environment, and spoken by real speakers. On the MC-WSJ-AV evaluation data, the delay-and-sum beamformer with postfiltering achieved a word error rate (WER) of 16.5%. MN beamforming with the Gamma pdf achieved a 15.8% WER, which was further reduced to 13.2% with the GG pdf, whereas the simple delay-and-sum beamformer provided a WER of 17.8%. To the best of our knowledge, no lower error rates have been reported to date in the literature on this automatic speech recognition (ASR) task.
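
The GSC structure makes the optimization concrete: the subband output is Y = (w_q - B w_a)^H X, and only the active weights w_a are adjusted to maximize the negentropy of Y. The sketch below shows the output computation together with a crude negentropy proxy; the exponential-magnitude stand-in pdf is purely illustrative, whereas the paper uses the Gamma and generalized Gaussian pdfs.

    import numpy as np

    def gsc_output(snapshots, w_q, B, w_a):
        """GSC output for one subband: Y(t) = (w_q - B w_a)^H X(t).

        snapshots: array (num_frames, num_mics); w_q the quiescent weights
        (e.g., delay-and-sum toward the speaker); B the blocking matrix whose
        columns are orthogonal to w_q; w_a the active weights to be optimized.
        """
        w = w_q - B @ w_a
        return snapshots @ w.conj()

    def negentropy_proxy(y, eps=1e-12):
        """Entropy of a circular complex Gaussian with the sample variance of y,
        minus an empirical entropy under an exponential-magnitude model.
        A stand-in for the Gamma/GG negentropy used in the paper."""
        var = np.mean(np.abs(y) ** 2)
        h_gauss = np.log(np.pi * np.e * var)
        r = np.maximum(np.abs(y), eps)
        scale = np.mean(r)
        h_model = np.mean(r / scale + np.log(2 * np.pi * scale * r))
        return h_gauss - h_model

In the paper, w_a is found by numerically optimizing the negentropy of Y under the chosen pdf, subject to the GSC's distortionless constraint.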


International Conference on Acoustics, Speech, and Signal Processing | 1996

Maximum a posteriori adaptation for large scale HMM recognizers

George Zavaliagkos; Richard M. Schwartz; John W. McDonough

We present a framework for maximum a posteriori (MAP) adaptation of large-scale HMM recognizers. First, we review standard MAP adaptation for Gaussian mixtures. We then show how MAP can be used to estimate transformations which are shared across many parameters. Finally, we combine both techniques: each of the HMM models is adapted based on an interpolation of MAP estimates obtained under varying degrees of sharing. We evaluate this algorithm for adaptation of a continuous-density HMM with 96K Gaussians and show that very satisfactory improvements can be achieved, especially for adaptation of non-native speakers of American English.
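
For a single Gaussian mean, standard MAP adaptation has a simple closed form that interpolates between the prior (speaker-independent) mean and the adaptation data, and the combination across degrees of sharing is then just a weighted average of such estimates. A minimal sketch, with the prior weight tau and the interpolation weights as assumed tuning constants:

    import numpy as np

    def map_mean(mu_prior, gamma_sum, obs_sum, tau=16.0):
        """MAP update of a Gaussian mean:
        mu = (tau * mu_prior + sum_t gamma_t o_t) / (tau + sum_t gamma_t).
        tau controls how strongly the prior mean is trusted; 16 is a
        typical but arbitrary choice."""
        return (tau * mu_prior + obs_sum) / (tau + gamma_sum)

    def interpolate_estimates(estimates, weights):
        """Combine MAP estimates obtained under different degrees of sharing
        (e.g., per-Gaussian, per-phone-class, global transform)."""
        w = np.asarray(weights, dtype=float)
        w /= w.sum()
        return sum(wi * ei for wi, ei in zip(w, estimates))

With little adaptation data (small gamma_sum) the estimate stays near the SI mean; with more data it moves toward the speaker's sample mean.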


IEEE Transactions on Audio, Speech, and Language Processing | 2007

Adaptive Beamforming With a Minimum Mutual Information Criterion

Kenichi Kumatani; Tobias Gehrig; Uwe Mayer; Emilian Stoimenov; John W. McDonough; Matthias Wölfel

In this paper, we consider an acoustic beamforming application where two speakers are simultaneously active. We construct one subband-domain beamformer in generalized sidelobe canceller (GSC) configuration for each source. In contrast to normal practice, we then jointly optimize the active weight vectors of both GSCs to obtain two output signals with minimum mutual information (MMI). Assuming that the subband snapshots are Gaussian-distributed, this MMI criterion reduces to the requirement that the cross-correlation coefficient of the subband outputs of the two GSCs vanishes. We also compare separation performance under the Gaussian assumption with that obtained from super-Gaussian probability density functions (pdfs), including the Laplace pdf. Our proposed technique provides effective nulling of the undesired source, but without the signal cancellation problems seen in conventional beamforming. Moreover, our technique does not suffer from the source permutation and scaling ambiguities encountered in conventional blind source separation algorithms. We demonstrate the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on data from the PASCAL Speech Separation Challenge (SSC). On the SSC development data, the simple delay-and-sum beamformer achieves a word error rate (WER) of 70.4%. The MMI beamformer under a Gaussian assumption achieves a 55.2% WER, which is further reduced to 52.0% with a super-Gaussian pdf, whereas the WER for data recorded with a close-talking microphone is 21.6%.
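
Under the Gaussian assumption mentioned above, the mutual information between the two beamformer outputs depends only on their complex cross-correlation coefficient, I(Y1; Y2) = -log(1 - |rho|^2), which is why minimizing MI amounts to driving the cross-correlation to zero. A small empirical sketch of that quantity:

    import numpy as np

    def empirical_mmi(y1, y2):
        """Mutual information of two zero-mean circular complex Gaussian
        outputs, estimated from subband samples: -log(1 - |rho|^2)."""
        rho = np.mean(y1 * np.conj(y2)) / np.sqrt(
            np.mean(np.abs(y1) ** 2) * np.mean(np.abs(y2) ** 2))
        return -np.log(1.0 - np.abs(rho) ** 2)

Jointly tuning the two GSCs' active weight vectors to minimize this quantity separates the sources without the permutation ambiguity of generic blind source separation.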

Collaboration


Dive into John W. McDonough's collaboration.

Top Co-Authors

Matthias Wölfel, Karlsruhe Institute of Technology
Bhiksha Raj, Carnegie Mellon University
Tobias Gehrig, Karlsruhe Institute of Technology
Emilian Stoimenov, Karlsruhe Institute of Technology
Herbert Gish, Hong Kong University of Science and Technology
Alex Waibel, Karlsruhe Institute of Technology
Kai Nickel, Karlsruhe Institute of Technology