Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Atsunori Ogawa is active.

Publication


Featured research published by Atsunori Ogawa.


IEEE Transactions on Audio, Speech, and Language Processing | 2012

Low-Latency Real-Time Meeting Recognition and Understanding Using Distant Microphones and Omni-Directional Camera

Takaaki Hori; Shoko Araki; Takuya Yoshioka; Masakiyo Fujimoto; Shinji Watanabe; Takanobu Oba; Atsunori Ogawa; Kazuhiro Otsuka; Dan Mikami; Keisuke Kinoshita; Tomohiro Nakatani; Atsushi Nakamura; Junji Yamato

This paper presents our real-time meeting analyzer for monitoring conversations in an ongoing group meeting. The goal of the system is to automatically recognize “who is speaking what” in an online manner for meeting assistance. Our system continuously captures the utterances and face poses of each speaker using a microphone array and an omni-directional camera positioned at the center of the meeting table. Through a series of advanced audio processing operations, an overlapping speech signal is enhanced and its components are separated into individual speaker channels. Then the utterances are sequentially transcribed by our speech recognizer with low latency. In parallel with speech recognition, the activity of each participant (e.g., speaking, laughing, watching someone) and the circumstances of the meeting (e.g., topic, activeness, casualness) are detected and displayed on a browser together with the transcripts. In this paper, we describe our techniques and our attempt to achieve low-latency monitoring of meetings, and we show our experimental results for real-time meeting transcription.


international conference on acoustics, speech, and signal processing | 1998

Balancing acoustic and linguistic probabilities

Atsunori Ogawa; Kazuya Takeda; Fumitada Itakura

N-gram language modeling, being based on local probabilities, does not take the length of the word sequence into account. As a consequence, the optimal values of the language weight and word insertion penalty used to balance acoustic and linguistic probabilities are affected by the length of the word sequence. To deal with this problem, a new language model is developed based on the Bernoulli trial model, which takes the length of the word sequence into account. Recognition experiments confirm that the proposed method achieves not only better recognition accuracy but also more robust balancing with the acoustic probability than the normal n-gram model.
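As background for the balancing problem described above, a decoder typically combines the acoustic and linguistic scores in the log domain with a language weight and a word insertion penalty. The minimal sketch below (the function name and numbers are illustrative, not taken from the paper) shows why the optimal settings depend on hypothesis length: the n-gram log-probability term grows in magnitude with the number of words, while the acoustic term grows with the number of frames.

```python
def combined_score(acoustic_logprob, lm_logprob, num_words,
                   language_weight=10.0, word_insertion_penalty=-0.5):
    """Standard log-domain combination of acoustic and linguistic scores.

    Because lm_logprob becomes more negative as the hypothesis gets longer,
    a single (language_weight, word_insertion_penalty) setting that is
    optimal for short hypotheses may over- or under-penalize long ones,
    which is the imbalance the paper addresses.
    """
    return (acoustic_logprob
            + language_weight * lm_logprob
            + word_insertion_penalty * num_words)

# Toy comparison of a short and a long hypothesis for the same utterance.
print(combined_score(-1200.0, -35.0, 7))
print(combined_score(-1180.0, -60.0, 12))
```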


Computer Speech & Language | 2013

Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds

Marc Delcroix; Keisuke Kinoshita; Tomohiro Nakatani; Shoko Araki; Atsunori Ogawa; Takaaki Hori; Shinji Watanabe; Masakiyo Fujimoto; Takuya Yoshioka; Takanobu Oba; Yotaro Kubo; Mehrez Souden; Seong-Jun Hahm; Atsushi Nakamura

Research on noise robust speech recognition has mainly focused on dealing with relatively stationary noise that may differ from the noise conditions in most living environments. In this paper, we introduce a recognition system that can recognize speech in the presence of multiple rapidly time-varying noise sources, as found in a typical family living room. To deal with such severe noise conditions, our recognition system exploits all available information about speech and noise; that is, spatial (directional), spectral, and temporal information. This is realized with a model-based speech enhancement pre-processor, which consists of two complementary elements: a multi-channel speech-noise separation method that exploits spatial and spectral information, followed by a single-channel enhancement algorithm that uses the long-term temporal characteristics of speech obtained from clean speech examples. Moreover, to compensate for any mismatch that may remain between the enhanced speech and the acoustic model, our system employs an adaptation technique that combines conventional maximum likelihood linear regression with dynamic adaptive compensation of the variances of the Gaussians of the acoustic model. Our proposed system approaches human performance levels by greatly improving the audible quality of speech and substantially improving the keyword recognition accuracy.


spoken language technology workshop | 2010

Real-time meeting recognition and understanding using distant microphones and omni-directional camera

Takaaki Hori; Shoko Araki; Takuya Yoshioka; Masakiyo Fujimoto; Shinji Watanabe; Takanobu Oba; Atsunori Ogawa; Kazuhiro Otsuka; Dan Mikami; Keisuke Kinoshita; Tomohiro Nakatani; Atsushi Nakamura; Junji Yamato

This paper presents our newly developed real-time meeting analyzer for monitoring conversations in an ongoing group meeting. The goal of the system is to automatically recognize “who is speaking what” in an online manner for meeting assistance. Our system continuously captures the utterances and the face pose of each speaker using a distant microphone array and an omni-directional camera at the center of the meeting table. Through a series of advanced audio processing operations, an overlapping speech signal is enhanced and its components are separated into individual speaker channels. Then the utterances are sequentially transcribed by our speech recognizer with low latency. In parallel with speech recognition, the activity of each participant (e.g., speaking, laughing, watching someone) and the situation of the meeting (e.g., topic, activeness, casualness) are detected and displayed on a browser together with the transcripts. In this paper, we describe our techniques and our attempt to achieve low-latency monitoring of meetings, and we show our experimental results for real-time meeting transcription.


international conference on acoustics, speech, and signal processing | 2008

Weighted distance measures for efficient reduction of Gaussian mixture components in HMM-based acoustic model

Atsunori Ogawa; Satoshi Takahashi

In this paper, two weighted distance measures, the weighted K-L divergence and the Bayesian criterion-based distance measure, are proposed to efficiently reduce the number of Gaussian mixture components in an HMM-based acoustic model. Conventional distance measures such as the K-L divergence and the Bhattacharyya distance consider only the distribution parameters (i.e., the mean and variance vectors of the Gaussian pdfs), while other measures consider only the mixture weights. In contrast, the two proposed distance measures consider both the distribution parameters and the mixture weights. Experimental results showed that the component-reduced acoustic models created using the proposed distance measures were more compact and computationally efficient than those created using conventional distance measures.
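For illustration, a weighted distance of the kind referred to above can be sketched as the K-L divergence between two diagonal-covariance Gaussians scaled by their mixture weights. The exact weighting used in the paper may differ; treat the sketch below as an assumption-laden illustration rather than the paper's definition.

```python
import numpy as np

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """K-L divergence between two diagonal-covariance Gaussians."""
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q
                        - 1.0)

def weighted_kl(w_p, mu_p, var_p, w_q, mu_q, var_q):
    """Illustrative weighted distance: the plain K-L divergence scaled by
    the mixture weights of the two components, so that components carrying
    little weight are cheaper to merge. The weighting form is an assumption
    for this sketch, not necessarily the paper's exact definition."""
    return (w_p + w_q) * kl_diag_gauss(mu_p, var_p, mu_q, var_q)

# Toy example: two 2-dimensional mixture components.
d = weighted_kl(0.3, np.array([0.0, 1.0]), np.array([1.0, 0.5]),
                0.1, np.array([0.2, 0.9]), np.array([0.8, 0.6]))
print(d)
```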


international conference on acoustics, speech, and signal processing | 2016

Spatial correlation model based observation vector clustering and MVDR beamforming for meeting recognition

Shoko Araki; Masahiro Okada; T. Higuchi; Atsunori Ogawa; Tomohiro Nakatani

This paper addresses a minimum variance distortionless response (MVDR) beamforming based speech enhancement approach for meeting speech recognition. In a meeting situation, speaker overlaps and noise signals are not negligible. To handle these issues, we employ MVDR beamforming, where accurate estimation of the steering vector is paramount. We recently found that steering vector estimation by clustering the time-frequency components of microphone observation vectors performs well as regards real-world noise reduction. The clustering is performed by taking a cue from the spatial correlation matrix of each speaker, which is realized by modeling the time-frequency components of the observation vectors with a complex Gaussian mixture model (CGMM). Experimental results with real recordings show that the proposed MVDR scheme outperforms conventional null-beamformer based speech enhancement in a meeting situation.
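For reference, given an estimated steering vector d and a noise spatial covariance matrix R_n for one frequency bin, the MVDR weights are w = R_n^{-1} d / (d^H R_n^{-1} d). The sketch below applies these weights to one time-frequency observation vector; the variable names and toy data are illustrative, and the CGMM-based clustering that estimates the steering vector is not shown.

```python
import numpy as np

def mvdr_weights(noise_cov, steering_vec):
    """MVDR beamformer weights: w = R_n^{-1} d / (d^H R_n^{-1} d).
    noise_cov: (M, M) noise spatial covariance for one frequency bin.
    steering_vec: (M,) steering vector toward the target speaker."""
    rn_inv_d = np.linalg.solve(noise_cov, steering_vec)
    return rn_inv_d / (steering_vec.conj() @ rn_inv_d)

def enhance_bin(obs, weights):
    """Apply the beamformer to one time-frequency observation vector."""
    return weights.conj() @ obs

# Toy 3-microphone example for a single frequency bin.
rng = np.random.default_rng(0)
M = 3
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
noise_cov = A @ A.conj().T + np.eye(M)            # Hermitian positive definite
steering = np.exp(1j * rng.uniform(0, np.pi, M))  # placeholder steering vector
w = mvdr_weights(noise_cov, steering)
y = enhance_bin(rng.standard_normal(M) + 1j * rng.standard_normal(M), w)
print(abs(y))
```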


international conference on acoustics, speech, and signal processing | 2012

Error type classification and word accuracy estimation using alignment features from word confusion network

Atsunori Ogawa; Takaaki Hori; Atsushi Nakamura

This paper addresses error type classification in continuous speech recognition (CSR). In CSR, errors are classified into three types, namely, the substitution, insertion and deletion errors, by making an alignment between a recognized word sequence and its reference transcription with a dynamic programming (DP) procedure. We propose a method for deriving such alignment features from a word confusion network (WCN) without using the reference transcription. We show experimentally that the WCN-based alignment features steadily improve the performance of error type classification. They also improve the performance of out-of-vocabulary (OOV) word detection, since OOV word utterances are highly correlated with a particular alignment pattern. In addition, we show that the word accuracy can be estimated from the WCN-based alignment features and more accurately from the error type classification result without using the reference transcription.
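As context, the reference-based alignment that defines the error types is the standard DP (edit-distance) alignment. The sketch below counts C/S/I/D labels from a reference and a hypothesis; it is illustrative only, since the paper's contribution is deriving comparable alignment features from a WCN without the reference transcription.

```python
def align_counts(ref, hyp):
    """Standard DP alignment between a reference and a hypothesis word
    sequence, returning counts of correct (C), substitution (S),
    insertion (I) and deletion (D) labels."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum edit cost aligning ref[:i] with hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back through the cost matrix to collect the C/S/I/D labels.
    counts = {"C": 0, "S": 0, "I": 0, "D": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            counts["C" if ref[i - 1] == hyp[j - 1] else "S"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["D"] += 1
            i -= 1
        else:
            counts["I"] += 1
            j -= 1
    return counts

print(align_counts("the cat sat on the mat".split(),
                   "the cat sat mat on a".split()))
```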


international conference on acoustics, speech, and signal processing | 2016

Context adaptive deep neural networks for fast acoustic model adaptation in noisy conditions

Marc Delcroix; Keisuke Kinoshita; Chengzhu Yu; Atsunori Ogawa; Takuya Yoshioka; Tomohiro Nakatani

Deep neural network (DNN) based acoustic models have greatly improved the performance of automatic speech recognition (ASR) for various tasks. Further performance improvements have been reported when making DNNs aware of the acoustic context (e.g., speaker or environment), for example by adding auxiliary features such as noise estimates or speaker i-vectors to the input. We have recently proposed a context adaptive DNN (CA-DNN), which is another approach to exploiting acoustic context information within a DNN. A CA-DNN is a DNN that has one or several factorized layers, i.e., layers that use a different set of parameters to process each acoustic context class. The output of a factorized layer is obtained as the weighted sum of the contributions of the different context classes, given weights over those classes. In our previous work, the class weights were computed independently of the recognizer. In this paper, we extend our previous work by introducing joint training of the CA-DNN parameters and the class weight computation. Consequently, the class weights and the associated class definitions can be optimized for ASR. We report experimental results on the AURORA4 noisy speech recognition task showing the potential of our approach for fast unsupervised adaptation.
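A factorized layer of the kind described above can be sketched as follows: each context class has its own affine transform, and the layer output is the class-weighted sum of the per-class outputs. The shapes, the ReLU nonlinearity, and all names below are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def factorized_layer(x, weights, biases, class_weights):
    """One factorized layer in the CA-DNN sense: each acoustic context
    class k has its own parameters (W_k, b_k), and the layer output is
    the weighted sum of the per-class outputs.
    Illustrative shapes: x (d_in,), weights (K, d_out, d_in),
    biases (K, d_out), class_weights (K,) summing to 1."""
    per_class = np.einsum('kij,j->ki', weights, x) + biases   # (K, d_out)
    h = class_weights @ per_class                             # (d_out,)
    return np.maximum(h, 0.0)                                 # ReLU (assumed)

# Toy example: 3 context classes, 5 inputs, 4 hidden units.
rng = np.random.default_rng(0)
K, d_in, d_out = 3, 5, 4
out = factorized_layer(rng.standard_normal(d_in),
                       rng.standard_normal((K, d_out, d_in)),
                       rng.standard_normal((K, d_out)),
                       np.array([0.7, 0.2, 0.1]))
print(out)
```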


Speech Communication | 2012

Joint estimation of confidence and error causes in speech recognition

Atsunori Ogawa; Atsushi Nakamura

Speech recognition errors are essentially unavoidable under the severe conditions found in real fields, and so confidence estimation, which scores the reliability of a recognition result, plays a critical role in the development of speech recognition based real-field application systems. However, if we are to develop an application system that provides a high-quality service, in addition to achieving accurate confidence estimation, we also need to extract and exploit further supplementary information from a speech recognition engine. As a first step in this direction, in this paper we propose a method for estimating the confidence of a recognition result while jointly detecting the causes of recognition errors based on a discriminative model. The confidence of a recognition result and the nonexistence/existence of error causes are naturally correlated. By directly capturing these correlations between the confidence and the error causes, the proposed method complementarily enhances its estimation performance for the confidence and for each error cause. In initial speech recognition experiments, the proposed method provided higher confidence estimation accuracy than a state-of-the-art confidence estimation method based on a discriminative model. Moreover, the effective estimation mechanism of the proposed method was confirmed through detailed analyses.


spoken language technology workshop | 2012

Recognition rate estimation based on word alignment network and discriminative error type classification

Atsunori Ogawa; Takaaki Hori; Atsushi Nakamura

Techniques for estimating recognition rates without using reference transcriptions are essential if we are to judge whether or not speech recognition technology is applicable to a new task. This paper proposes two recognition rate estimation methods for continuous speech recognition. The first is an easy-to-use method based on a word alignment network (WAN) obtained from a word confusion network through simple conversion procedures. A WAN contains the correct (C), substitution error (S), insertion error (I) and deletion error (D) probabilities word-by-word for a recognition result. By summing these CSID probabilities individually, the percent correct and word accuracy (WACC) can be estimated without using a reference transcription. The second, more advanced method refines the CSID probabilities provided by a WAN based on discriminative error type classification (ETC) and estimates the recognition rates more accurately. In experiments on the MIT lecture speech corpus, we obtained a correlation coefficient of 0.97 between the true WACCs calculated by a scoring tool using reference transcriptions and the WACCs estimated from the discriminative ETC results.
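For intuition, the first (WAN-based) estimation step described above can be sketched as summing the per-word CSID probabilities. The dictionary layout, the estimate of the number of reference words, and the toy values below are illustrative assumptions, not the paper's exact formulation.

```python
def estimate_rates(word_posteriors):
    """Estimate percent correct and word accuracy from per-word C/S/I/D
    probabilities taken from a word alignment network (WAN).
    word_posteriors: list of dicts with keys 'C', 'S', 'I', 'D'."""
    C = sum(p['C'] for p in word_posteriors)
    S = sum(p['S'] for p in word_posteriors)
    I = sum(p['I'] for p in word_posteriors)
    D = sum(p['D'] for p in word_posteriors)
    n_ref = C + S + D                      # expected number of reference words
    percent_correct = 100.0 * C / n_ref
    word_accuracy = 100.0 * (C - I) / n_ref
    return percent_correct, word_accuracy

# Toy WAN with three word slots.
wan = [{'C': 0.9, 'S': 0.05, 'I': 0.05, 'D': 0.0},
       {'C': 0.6, 'S': 0.2,  'I': 0.1,  'D': 0.1},
       {'C': 0.8, 'S': 0.1,  'I': 0.0,  'D': 0.1}]
print(estimate_rates(wan))
```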

Collaboration


Dive into Atsunori Ogawa's collaborations.

Top Co-Authors

Keisuke Kinoshita
Nippon Telegraph and Telephone

Takaaki Hori
Mitsubishi Electric Research Laboratories

Shoko Araki
Nippon Telegraph and Telephone

Shinji Watanabe
Mitsubishi Electric Research Laboratories