Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Kenichi Kumatani is active.

Publication


Featured research published by Kenichi Kumatani.


IEEE Signal Processing Magazine | 2012

Microphone Array Processing for Distant Speech Recognition: From Close-Talking Microphones to Far-Field Sensors

Kenichi Kumatani; John W. McDonough; Bhiksha Raj

Distant speech recognition (DSR) holds the promise of the most natural human-computer interface because it enables man-machine interactions through speech, without the necessity of donning intrusive body- or head-mounted microphones. Recognizing distant speech robustly, however, remains a challenge. This contribution provides a tutorial overview of DSR systems based on microphone arrays. In particular, we present recent work on acoustic beamforming for DSR, along with experimental results verifying the effectiveness of the various algorithms described here; beginning from a word error rate (WER) of 14.3% with a single microphone of a linear array, our state-of-the-art DSR system achieved a WER of 5.3%, which was comparable to that of 4.2% obtained with a lapel microphone. Moreover, we present an emerging technology in the area of far-field audio and speech processing based on spherical microphone arrays. Performance comparisons of spherical and linear arrays reveal that a spherical array with a diameter of 8.4 cm can provide recognition accuracy comparable to or better than that obtained with a large linear array with an aperture length of 126 cm.
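
To make the beamforming discussed above concrete, here is a minimal sketch of frequency-domain delay-and-sum beamforming for a uniform linear array. The geometry, spacing, sampling rate, and function names are illustrative assumptions, not details taken from the article.

```python
import numpy as np

def delay_and_sum_weights(freqs, mic_positions, doa, c=343.0):
    """Steering weights for a far-field source at angle `doa` (radians from
    broadside) and a linear array with sensor coordinates `mic_positions` (m)."""
    delays = mic_positions * np.sin(doa) / c                  # per-microphone delay (s)
    steering = np.exp(-2j * np.pi * np.outer(freqs, delays))  # (F, M) steering matrix
    return steering / len(mic_positions)                      # uniform 1/M weighting

def beamform(X, W):
    """X: (F, T, M) multichannel STFT; W: (F, M) weights -> (F, T) output y = W^H x."""
    return np.einsum('ftm,fm->ft', X, W.conj())

# Illustrative usage with placeholder sizes and data:
F, T, M = 257, 100, 8
rng = np.random.default_rng(0)
X = rng.standard_normal((F, T, M)) + 1j * rng.standard_normal((F, T, M))
freqs = np.linspace(0, 8000, F)           # Hz, assuming a 16 kHz sampling rate
mics = 0.04 * np.arange(M)                # 4 cm spacing, assumed
Y = beamform(X, delay_and_sum_weights(freqs, mics, doa=0.3))
```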


IEEE Transactions on Audio, Speech, and Language Processing | 2009

Beamforming With a Maximum Negentropy Criterion

Kenichi Kumatani; John W. McDonough; Barbara Rauch; Dietrich Klakow; Philip N. Garner; Weifeng Li

In this paper, we address a beamforming application based on the capture of far-field speech data from a single speaker in a real meeting room. After the position of the speaker is estimated by a speaker tracking system, we construct a subband-domain beamformer in generalized sidelobe canceller (GSC) configuration. In contrast to conventional practice, we then optimize the active weight vectors of the GSC so as to obtain an output signal with maximum negentropy (MN). This implies the beamformer output should be as non-Gaussian as possible. For calculating negentropy, we consider the Gamma and the generalized Gaussian (GG) pdfs. After MN beamforming, Zelinski postfiltering is performed to further enhance the speech by removing residual noise. Our beamforming algorithm can suppress noise and reverberation without the signal cancellation problems encountered in conventional beamforming algorithms. We demonstrate this fact through a set of acoustic simulations. Moreover, we show the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on the Multi-Channel Wall Street Journal Audio Visual Corpus (MC-WSJ-AV), a corpus of data captured with real far-field sensors, in a realistic acoustic environment, and spoken by real speakers. On the MC-WSJ-AV evaluation data, the delay-and-sum beamformer with postfiltering achieved a word error rate (WER) of 16.5%. MN beamforming with the Gamma pdf achieved a 15.8% WER, which was further reduced to 13.2% with the GG pdf, whereas the simple delay-and-sum beamformer provided a WER of 17.8%. To the best of our knowledge, no lower error rates have been reported to date in the literature on this automatic speech recognition (ASR) task.
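
The core of the method is choosing the GSC's active weights so that the output is maximally non-Gaussian. The sketch below illustrates that idea on a single subband; note that it swaps in Hyvarinen's classic log-cosh negentropy approximation in place of the paper's Gamma/GG-pdf estimators, and the snapshot data, quiescent weights, and blocking matrix are all placeholders.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import null_space

rng = np.random.default_rng(0)
M, T = 8, 2000                                   # microphones, subband snapshots
X = rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))  # placeholder data
w_q = np.ones(M, dtype=complex) / M              # quiescent (delay-and-sum) weights
B = null_space(w_q[None, :].conj())              # blocking matrix, so that B^H w_q = 0

def negentropy(y):
    """Hyvarinen's log-cosh approximation on the variance-normalized real part;
    0.3746 ~= E[log cosh(nu)] for standard-normal nu."""
    yr = y.real / (y.real.std() + 1e-12)
    return (np.mean(np.log(np.cosh(yr))) - 0.3746) ** 2

def objective(wa_ri):
    wa = wa_ri[:M - 1] + 1j * wa_ri[M - 1:]      # pack real/imag into a complex vector
    y = (w_q - B @ wa).conj() @ X                # GSC output over all snapshots
    return -negentropy(y)                        # minimize the negative -> maximize

res = minimize(objective, np.zeros(2 * (M - 1)), method='L-BFGS-B')
w_a = res.x[:M - 1] + 1j * res.x[M - 1:]         # optimized active weight vector
```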


IEEE Transactions on Audio, Speech, and Language Processing | 2007

Adaptive Beamforming With a Minimum Mutual Information Criterion

Kenichi Kumatani; Tobias Gehrig; Uwe Mayer; Emilian Stoimenov; John W. McDonough; Matthias Wölfel

In this paper, we consider an acoustic beamforming application where two speakers are simultaneously active. We construct one subband-domain beamformer in generalized sidelobe canceller (GSC) configuration for each source. In contrast to normal practice, we then jointly optimize the active weight vectors of both GSCs to obtain two output signals with minimum mutual information (MMI). Assuming that the subband snapshots are Gaussian-distributed, this MMI criterion reduces to the requirement that the cross-correlation coefficient of the subband outputs of the two GSCs vanishes. We also compare separation performance under the Gaussian assumption with that obtained from several super-Gaussian probability density functions (pdfs), namely, the Laplace and pdfs. Our proposed technique provides effective nulling of the undesired source, but without the signal cancellation problems seen in conventional beamforming. Moreover, our technique does not suffer from the source permutation and scaling ambiguities encountered in conventional blind source separation algorithms. We demonstrate the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on data from the PASCAL Speech Separation Challenge (SSC). On the SSC development data, the simple delay-and-sum beamformer achieves a word error rate (WER) of 70.4%. The MMI beamformer under a Gaussian assumption achieves a 55.2% WER, which is further reduced to 52.0% with a pdf, whereas the WER for data recorded with a close-talking microphone is 21.6%.
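
Under the Gaussian assumption, the MMI criterion reduces to driving the cross-correlation coefficient rho of the two GSC outputs to zero, since the mutual information of two jointly Gaussian variables is -1/2 log(1 - |rho|^2). The sketch below illustrates that reduction with toy steering weights and placeholder subband snapshots; it is not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import null_space

rng = np.random.default_rng(1)
M, T = 8, 2000
X = rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))  # placeholder snapshots

def make_gsc(phase):
    w_q = np.exp(1j * phase * np.arange(M)) / M    # toy steering weights per source
    return w_q, null_space(w_q[None, :].conj())    # blocking matrix with B^H w_q = 0

(wq1, B1), (wq2, B2) = make_gsc(0.0), make_gsc(0.9)
n = M - 1

def mutual_information(wa_ri):
    wa1 = wa_ri[:n] + 1j * wa_ri[n:2 * n]
    wa2 = wa_ri[2 * n:3 * n] + 1j * wa_ri[3 * n:]
    y1 = (wq1 - B1 @ wa1).conj() @ X               # output of GSC 1
    y2 = (wq2 - B2 @ wa2).conj() @ X               # output of GSC 2
    rho = np.mean(y1 * y2.conj()) / np.sqrt(
        np.mean(np.abs(y1) ** 2) * np.mean(np.abs(y2) ** 2))
    return -0.5 * np.log1p(-np.abs(rho) ** 2)      # Gaussian MI; zero iff rho = 0

res = minimize(mutual_information, np.zeros(4 * n), method='L-BFGS-B')
```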


2011 Joint Workshop on Hands-free Speech Communication and Microphone Arrays | 2011

Channel selection based on multichannel cross-correlation coefficients for distant speech recognition

Kenichi Kumatani; John W. McDonough; Jill Fain Lehman; Bhiksha Raj

In theory, beamforming performance can be improved by using as many microphones as possible, but in practice it has been shown that using all possible channels does not always improve speech recognition performance [1, 2, 3, 4, 5]. In this work, we present a new channel selection method in order to increase the computational efficiency of beamforming for distant speech recognition (DSR) without sacrificing performance.
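
A natural statistic for this kind of selection is the multichannel cross-correlation coefficient, which for time-aligned channels can be written as rho^2 = 1 - det(R_tilde), with R_tilde the normalized covariance (correlation) matrix of the channels. The sketch below ranks channels greedily by that statistic; the greedy loop is an illustrative strategy, not necessarily the paper's exact procedure.

```python
import numpy as np

def mccc(signals):
    """Squared MCCC of time-aligned channels: rho^2 = 1 - det(R_tilde),
    where R_tilde is the channels' correlation matrix."""
    return 1.0 - np.linalg.det(np.corrcoef(signals))

def greedy_select(signals, k):
    """Greedily grow a channel subset of size k that maximizes the MCCC.
    signals: (M, T) array of time-aligned channel waveforms."""
    chosen = [int(np.argmax(np.var(signals, axis=1)))]   # seed with strongest channel
    while len(chosen) < k:
        rest = [m for m in range(signals.shape[0]) if m not in chosen]
        gains = [mccc(signals[chosen + [m]]) for m in rest]
        chosen.append(rest[int(np.argmax(gains))])
    return chosen
```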


IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2008

Filter bank design based on minimization of individual aliasing terms for minimum mutual information subband adaptive beamforming

Kenichi Kumatani; John W. McDonough; S. Schachl; Dietrich Klakow; Philip N. Garner; Weifeng Li

This paper presents new filter bank design methods for subband adaptive beamforming. In this work, we design analysis and synthesis prototypes for modulated filter banks so as to minimize each aliasing term individually. We then drive the total response error to null by constraining these prototypes to be Nyquist(M) filters. These modulated filter banks are then applied to a speech separation system which extracts a target speech signal. In our system, speech signals are first transformed into the subband domain with our filter banks, and the subband components are then processed with a beamforming algorithm. Following beamforming, post-filtering and binary masking are performed to remove residual noise. We show that our filter banks can suppress the residual aliasing distortion better than conventional ones. Furthermore, we demonstrate the effectiveness of our design techniques through a set of automatic speech recognition experiments on the multi-channel speech data from the PASCAL Speech Separation Challenge. The experimental results show that our beamforming system with the proposed filter banks achieves the best recognition performance, a 39.6% word error rate (WER), with half the computation required by the conventional filter banks, while the perfect reconstruction filter banks provided a 44.4% WER.
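
The Nyquist(M) constraint mentioned above requires the prototype to vanish at all nonzero multiples of M, which is what drives the total response error to null. As a quick illustration, the snippet below verifies that property for a textbook windowed-sinc prototype (a stand-in for the paper's optimized designs; M and the prototype length are illustrative).

```python
import numpy as np

M, m = 8, 4                                    # subbands, prototype half-length in frames
n = np.arange(-m * M, m * M + 1)
h = np.sinc(n / M) * np.hamming(n.size)        # windowed ideal lowpass prototype

taps = h[::M]                                  # prototype samples at multiples of M
assert np.allclose(np.delete(taps, m), 0.0, atol=1e-12)  # zero except at the center
print("Nyquist(M) property holds; center tap =", taps[m])
```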


Machine Learning for Multimodal Interaction (MLMI) | 2006

The ISL RT-06S speech-to-text system

Christian Fügen; Shajith Ikbal; Florian Kraft; Kenichi Kumatani; Kornel Laskowski; John W. McDonough; Mari Ostendorf; Sebastian Stüker; Matthias Wölfel

This paper describes the 2006 lecture and conference meeting speech-to-text system developed at the Interactive Systems Laboratories (ISL), for the individual head-mounted microphone (IHM), single distant microphone (SDM), and multiple distant microphone (MDM) conditions, which was evaluated in the RT-06S Rich Transcription Meeting Evaluation sponsored by the US National Institute of Standards and Technology (NIST). We describe the principal differences between our current system and those submitted in previous years, namely improved acoustic and language models, cross adaptation between systems with different front-ends and phoneme sets, and the use of various automatic speech segmentation algorithms.


2008 Hands-Free Speech Communication and Microphone Arrays | 2008

Adaptive Beamforming with a Maximum Negentropy Criterion

Kenichi Kumatani; John W. McDonough; Dietrich Klakow; Philip N. Garner; Weifeng Li

In this paper, we address an adaptive beamforming application in realistic acoustic conditions. After the position of a speaker is estimated by a speaker tracking system, we construct a subband-domain beamformer in generalized sidelobe canceller (GSC) configuration. In contrast to conventional practice, we then optimize the active weight vectors of the GSC so as to obtain an output signal with maximum negentropy (MN). This implies the beamformer output should be as non-Gaussian as possible. For calculating negentropy, we consider the Gamma and the generalized Gaussian (GG) pdfs. After MN beamforming, Zelinski post-filtering is performed to further enhance the speech by removing residual noise. Our beamforming algorithm can suppress noise and reverberation without the signal cancellation problems encountered in the conventional adaptive beamforming algorithms. We demonstrate the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on the Multi-Channel Wall Street Journal Audio Visual Corpus (MC-WSJ-AV). On the MC-WSJ-AV evaluation data, the delay-and-sum beamformer with post-filtering achieved a word error rate (WER) of 16.5%. MN beamforming with the Gamma pdf achieved a 15.8% WER, which was further reduced to 13.2% with the GG pdf, whereas the simple delay-and-sum beamformer provided a WER of 17.8%.


IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Speaker tracking with spherical microphone arrays

John W. McDonough; Kenichi Kumatani; Takayuki Arakawa; Kazumasa Yamamoto; Bhiksha Raj

In prior work, we investigated the application of a spherical microphone array to a distant speech recognition task. In that work, the relative positions of a fixed loudspeaker and the spherical array required for beamforming were measured with an optical tracking device. In the present work, we investigate how these relative positions can be determined automatically for real, human speakers based solely on acoustic evidence. We first derive an expression for the complex pressure field of a plane wave scattering from a rigid sphere. We then use this theoretical field as the predicted observation in an extended Kalman filter whose state is the speaker's current position, namely, the direction of arrival of the plane wave. By minimizing the squared error between the predicted pressure field and that actually recorded, we are able to infer the position of the speaker.
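
The estimation loop the abstract describes is a standard extended Kalman filter over the direction of arrival. The sketch below shows one predict/update step with a scalar DOA state; for brevity it substitutes a free-field plane-wave observation model for the paper's rigid-sphere scattering expansion, and all constants and names are illustrative assumptions.

```python
import numpy as np

def predict_pressure(theta, mic_angles, kr=2.0):
    """Toy free-field plane-wave phase at sensors on a circle (a stand-in
    for the rigid-sphere scattering field derived in the paper)."""
    return np.exp(1j * kr * np.cos(mic_angles - theta))

def ekf_step(theta, P, z, mic_angles, q=1e-4, r=1e-2, eps=1e-6):
    """One predict/update for a scalar DOA state under a random-walk model.
    z: complex pressure observed at the sensors for the current frame."""
    theta_pred, P_pred = theta, P + q                    # random-walk prediction
    h = predict_pressure(theta_pred, mic_angles)
    dh = (predict_pressure(theta_pred + eps, mic_angles) - h) / eps
    H = np.concatenate([dh.real, dh.imag])               # Jacobian of stacked obs.
    innov = np.concatenate([(z - h).real, (z - h).imag])
    S = P_pred * np.outer(H, H) + r * np.eye(H.size)     # innovation covariance
    K = P_pred * H @ np.linalg.inv(S)                    # Kalman gain (scalar state)
    return theta_pred + K @ innov, (1.0 - K @ H) * P_pred
```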


IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2011

Maximum kurtosis beamforming with a subspace filter for distant speech recognition

Kenichi Kumatani; John W. McDonough; Bhiksha Raj

This paper presents a new beamforming method for distant speech recognition (DSR). The dominant mode subspace is considered in order to efficiently estimate the active weight vectors for maximum kurtosis (MK) beamforming with the generalized sidelobe canceller (GSC). We demonstrated in [1], [2], [3] that the beamforming method based on the maximum kurtosis criterion can remove reverberation and noise effects without the signal cancellation encountered in conventional beamforming algorithms. The MK beamforming algorithm, however, required a relatively large amount of data to reliably estimate the active weight vector because it relies on a numerical optimization algorithm. In order to achieve efficient estimation, we propose to cascade the subspace (eigenspace) filter [4, §6.8] with the active weight vector. The subspace filter can decompose the output of the blocking matrix into directional signals and ambient noise components. The ambient noise components are then averaged and subtracted from the beamformer's output, which leads to reliable estimation as well as a significant reduction in computation. We show the effectiveness of our method through a set of distant speech recognition experiments on real microphone array data captured in a realistic environment. Our new beamforming algorithm provided the best recognition performance among conventional beamforming techniques, a word error rate (WER) of 5.3%, which is comparable to the WER of 4.2% obtained with a close-talking microphone. Moreover, it achieved better recognition performance with less adaptation data than the conventional MK beamformer.
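
Two ingredients of the method lend themselves to short illustrations: the empirical kurtosis objective for the beamformer output, and the eigenspace split of the blocking-matrix outputs into dominant (directional) and ambient-noise subspaces. The sketch below shows both in isolation; the array sizes and the split are illustrative, not the paper's exact estimator.

```python
import numpy as np

def excess_kurtosis(y):
    """Empirical excess kurtosis of a complex sequence: E|y|^4 / (E|y|^2)^2 - 2,
    which is zero for circular complex Gaussian data."""
    p = np.mean(np.abs(y) ** 2)
    return np.mean(np.abs(y) ** 4) / (p ** 2 + 1e-20) - 2.0

def dominant_mode_split(Z, n_dominant):
    """Z: (M-1, T) blocking-matrix outputs. Eigendecompose the sample covariance
    and split eigenvectors into dominant (directional) and ambient-noise subspaces;
    downstream, the ambient components can be averaged and subtracted."""
    R = Z @ Z.conj().T / Z.shape[1]
    w, V = np.linalg.eigh(R)                     # eigenvalues in ascending order
    return V[:, -n_dominant:], V[:, :-n_dominant]
```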


2011 Joint Workshop on Hands-free Speech Communication and Microphone Arrays | 2011

Improving hands-free speech recognition in a car through audio-visual voice activity detection

Friedrich Faubel; Munir Georges; Kenichi Kumatani; Andrés Bruhn; Dietrich Klakow

In this work, we show how speech recognition performance in a noisy car environment can be improved by combining audio-visual voice activity detection (VAD) with microphone array processing techniques. This is accomplished by enhancing the multi-channel audio signal in the speaker localization step through per-channel power spectral subtraction, whose noise estimates are obtained from the non-speech segments identified by the VAD. This noise reduction step improves the accuracy of the estimated speaker positions and thereby the quality of the beamformed signal in the subsequent array processing step. Audio-visual voice activity detection has the advantage of being more robust in acoustically demanding environments. This claim is substantiated through speech recognition experiments on the AVICAR corpus, where the proposed localization framework gave a WER of 7.1% in combination with delay-and-sum beamforming. This compares to a WER of 8.9% for speaker localization with audio-only VAD, 11.6% without VAD, and 15.6% for a single distant channel.
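
The enhancement step described above is classical power spectral subtraction, applied per channel with the noise estimate taken from VAD-flagged non-speech frames. The sketch below shows one hedged version of that step; the oversubtraction factor, the spectral floor, and the assumption that the VAD mask is given (and flags at least some non-speech frames) are all illustrative.

```python
import numpy as np

def spectral_subtraction(X, is_speech, alpha=2.0, floor=0.01):
    """X: (F, T) STFT of one channel; is_speech: (T,) boolean VAD mask.
    Returns an enhanced STFT with the noisy phase retained."""
    noise_psd = np.mean(np.abs(X[:, ~is_speech]) ** 2, axis=1, keepdims=True)
    power = np.abs(X) ** 2
    clean = np.maximum(power - alpha * noise_psd, floor * power)  # oversubtract + floor
    return np.sqrt(clean) * np.exp(1j * np.angle(X))
```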

Collaboration


Dive into Kenichi Kumatani's collaborations.

Top Co-Authors

John W. McDonough (Carnegie Mellon University)
Bhiksha Raj (Carnegie Mellon University)
Weifeng Li (École Polytechnique Fédérale de Lausanne)
Matthias Wölfel (Karlsruhe Institute of Technology)
Emilian Stoimenov (Karlsruhe Institute of Technology)
Rainer Stiefelhagen (Karlsruhe Institute of Technology)
Tobias Gehrig (Karlsruhe Institute of Technology)