Atsuhiko Kai
Shizuoka University
Publication
Featured research published by Atsuhiko Kai.
Life-like characters | 2004
Shinichi Kawamoto; Hiroshi Shimodaira; Tsuneo Nitta; Takuya Nishimoto; Satoshi Nakamura; Katsunobu Itou; Shigeo Morishima; Tatsuo Yotsukura; Atsuhiko Kai; Akinobu Lee; Yoichi Yamashita; Takao Kobayashi; Keiichi Tokuda; Keikichi Hirose; Nobuaki Minematsu; Atsushi Yamada; Yasuharu Den; Takehito Utsuro; Shigeki Sagayama
Galatea is a software toolkit for developing human-like spoken dialog agents. To easily integrate modules with different characteristics, including a speech recognizer, a speech synthesizer, a facial animation synthesizer, and a dialog controller, each module is modeled as a virtual machine with a simple common interface and connected to the others through a broker (communication manager). Galatea employs model-based speech and facial animation synthesizers whose model parameters can easily be adapted to those of an existing person, given his or her training data. The software toolkit, which runs on both UNIX/Linux and Windows operating systems, will be made publicly available in the middle of 2003 [7, 6].
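A minimal sketch of the broker-style architecture described above: each module exposes the same small interface and exchanges messages only through the broker. The interface, message format, and module names here are illustrative assumptions, not the actual Galatea protocol.

```python
class Module:
    """Common interface every module implements (assumed)."""
    name = "module"

    def handle(self, command: str, broker: "Broker") -> None:
        raise NotImplementedError


class Broker:
    """Communication manager: modules never talk to each other directly."""

    def __init__(self) -> None:
        self.modules: dict[str, Module] = {}

    def register(self, module: Module) -> None:
        self.modules[module.name] = module

    def send(self, target: str, command: str) -> None:
        self.modules[target].handle(command, self)


class Recognizer(Module):
    name = "asr"

    def handle(self, command: str, broker: "Broker") -> None:
        if command == "recognize":
            broker.send("dialog", "result: hello")  # forward a dummy result


class DialogManager(Module):
    name = "dialog"

    def handle(self, command: str, broker: "Broker") -> None:
        print("dialog manager received:", command)


broker = Broker()
broker.register(Recognizer())
broker.register(DialogManager())
broker.send("asr", "recognize")  # prints: dialog manager received: result: hello
```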
EURASIP Journal on Advances in Signal Processing | 2012
Longbiao Wang; Kyohei Odani; Atsuhiko Kai
A blind dereverberation method based on power spectral subtraction (SS) using a multi-channel least mean squares algorithm was previously proposed to suppress reverberant speech without additive noise. The results of isolated word speech recognition experiments showed that this method achieved significant improvements over conventional cepstral mean normalization (CMN) in a reverberant environment. In this paper, we propose a blind dereverberation method based on generalized spectral subtraction (GSS), which has been shown to be effective for noise reduction, instead of power SS. Furthermore, we extend the missing feature theory (MFT), which was initially proposed to enhance robustness against additive noise, to dereverberation. A one-stage dereverberation and denoising method based on GSS is presented to simultaneously suppress both the additive noise and the nonstationary multiplicative noise (reverberation). The proposed dereverberation method based on GSS with MFT is evaluated on a large vocabulary continuous speech recognition task. When additive noise is absent, the GSS-based dereverberation method with MFT using only two microphones achieves relative word error reduction rates of 11.4% and 32.6% compared to the power SS-based dereverberation method and conventional CMN, respectively. For reverberant and noisy speech, the GSS-based dereverberation and denoising method achieves a relative word error reduction rate of 12.8% compared to conventional CMN combined with a GSS-based additive noise reduction method. We also analyze the factors affecting compensation parameter estimation for the SS-based dereverberation method, such as the number of channels (the number of microphones), the length of reverberation to be suppressed, and the length of the utterance used for parameter estimation. The experimental results show that the SS-based method is robust in a variety of reverberant environments for both isolated and continuous speech recognition and under various parameter estimation conditions.
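The core operation described above can be illustrated with a minimal spectral-subtraction sketch: the late-reverberation spectrum is estimated from preceding frames and subtracted from the observed (generalized) power spectrum. The frame delay D, weight delta, exponent gamma, and spectral floor below are illustrative assumptions; the paper estimates its compensation parameters blindly with a multi-channel LMS algorithm, which is not reproduced here.

```python
import numpy as np
from scipy.signal import stft, istft


def ss_dereverberate(x, fs, gamma=1.0, delta=0.5, D=4, floor=0.05):
    """Suppress late reverberation by subtracting a delayed-frame estimate.

    gamma = 1.0 corresponds to power spectral subtraction; 0 < gamma < 1
    gives a generalized-SS variant (all values here are assumptions).
    """
    f, t, X = stft(x, fs, nperseg=512)
    mag = np.abs(X) ** (2 * gamma)                  # (generalized) power spectrum
    reverb = np.zeros_like(mag)
    reverb[:, D:] = delta * mag[:, :-D]             # late-reverberation estimate
    clean = np.maximum(mag - reverb, floor * mag)   # subtract and apply a floor
    S = clean ** (1.0 / (2 * gamma)) * np.exp(1j * np.angle(X))
    _, y = istft(S, fs, nperseg=512)
    return y
```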
international conference on acoustics, speech, and signal processing | 2013
Longbiao Wang; Zhaofeng Zhang; Atsuhiko Kai
A dereverberation method based on generalized spectral subtraction (GSS) using a multi-channel least mean squares (MCLMS) approach achieved significantly improved results in speech recognition experiments compared with conventional methods. In this study, we employ this method for hands-free speaker identification. The GSS-based dereverberation method using clean speech models degrades speaker identification performance, although it is very effective for speech recognition. One reason may be that the GSS-based dereverberation method introduces distortion between clean speech and dereverberant speech. In this study, we address this problem by training speaker models on dereverberant speech, which is obtained by suppressing reverberation from arbitrary artificial reverberant speech. We also propose a method that combines various compensation parameter sets to improve speaker identification, and provide an efficient computational method. The speaker identification experiment was performed on large-scale far-field speech, with reverberant environments different from the training environments. The proposed method achieved a relative error reduction of 87.5% compared with conventional cepstral mean normalization with beamforming using clean speech models, and 44.8% compared with reverberant speech models.
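A minimal sketch of the model-training idea described above: speaker models are fitted on dereverberated speech (artificial reverberant speech passed through the GSS front end) rather than on clean speech, so the models see the same residual distortion as the test data. The helpers `dereverberate_gss` and `extract_mfcc`, and the GMM configuration, are hypothetical placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def train_speaker_model(reverberant_utterances, dereverberate_gss, extract_mfcc):
    """Fit a speaker GMM on features from dereverberated (not clean) training speech."""
    feats = np.vstack(
        [extract_mfcc(dereverberate_gss(u)) for u in reverberant_utterances]
    )
    gmm = GaussianMixture(n_components=64, covariance_type="diag")
    gmm.fit(feats)
    return gmm
```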
international symposium on chinese spoken language processing | 2014
Yuma Ueda; Longbiao Wang; Atsuhiko Kai; Xiong Xiao; Eng Siong Chng; Haizhou Li
In this paper, we propose a robust distant-talking speech recognition method that combines a cepstral-domain denoising autoencoder (DAE) with a temporal structure normalization (TSN) filter. Because a DAE has a deep structure and nonlinear processing steps, it is flexible enough to model the highly nonlinear mapping between input and output spaces. In this paper, we train a DAE to map reverberant and noisy speech features to the underlying clean speech features in the cepstral domain. In the proposed method, after applying the DAE in the cepstral domain to suppress reverberation, we apply a post-processing step based on the TSN filter to reduce noise and reverberation effects by normalizing the modulation spectra to reference spectra of clean speech. The proposed method was evaluated using speech in simulated and real reverberant environments. By combining the cepstral-domain DAE and TSN, the average Word Error Rate (WER) was reduced from 25.2% for the baseline system to 21.2% in simulated environments, and from 47.5% to 41.3% in real environments.
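A minimal sketch of a cepstral-domain denoising autoencoder in the spirit of the description above: an MLP trained to map reverberant MFCC frames (with context) to the corresponding clean MFCC frame. The layer sizes, context width, and optimizer settings are illustrative assumptions, not the configuration used in the paper; the TSN post-filter is not shown.

```python
import torch
import torch.nn as nn

context = 5   # +/- 5 frames of context (assumed)
n_ceps = 13   # MFCC dimension (assumed)

dae = nn.Sequential(
    nn.Linear(n_ceps * (2 * context + 1), 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, n_ceps),          # predict the clean centre frame
)
optimizer = torch.optim.Adam(dae.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()


def train_step(reverberant_batch, clean_batch):
    """One MSE training step on paired reverberant/clean cepstral features."""
    optimizer.zero_grad()
    loss = loss_fn(dae(reverberant_batch), clean_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```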
EURASIP Journal on Audio, Speech, and Music Processing | 2014
Zhaofeng Zhang; Longbiao Wang; Atsuhiko Kai
Previously, a dereverberation method based on generalized spectral subtraction (GSS) using multi-channel least mean squares (MCLMS) was proposed. The results of speech recognition experiments showed that this method achieved a significant improvement over conventional methods. In this paper, we apply this method to distant-talking (far-field) speaker recognition. However, for far-field speech, the GSS-based dereverberation method using clean speech models degrades speaker recognition performance. This may be because GSS-based dereverberation causes some distortion between clean speech and dereverberant speech. In this paper, we address this problem by training speaker models on dereverberant speech obtained by suppressing reverberation from arbitrary artificial reverberant speech. Furthermore, we propose an efficient computational method for combining the likelihoods of dereverberant speech obtained with multiple compensation parameter sets. This addresses the problem of determining optimal compensation parameters for GSS. We report the results of a speaker recognition experiment performed on large-scale far-field speech with reverberant environments different from the training environments. The proposed GSS-based dereverberation method achieves a recognition rate of 92.2%, which compares well with conventional cepstral mean normalization with delay-and-sum beamforming using a clean speech model (49.0%) and a reverberant speech model (88.4%). We also compare the proposed method with another dereverberation technique, multi-step linear prediction-based spectral subtraction (MSLP-GSS). The proposed method achieves a better recognition rate than the 90.6% of MSLP-GSS. The use of multiple compensation parameter sets further improves speaker recognition performance, giving our approach a recognition rate of 93.6%. We implement this method in a real environment using the optimal compensation parameters estimated from an artificial environment. The results show a recognition rate of 87.8%, compared with 72.5% for delay-and-sum beamforming using a reverberant speech model.
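A minimal sketch of combining likelihoods over multiple compensation parameter sets, as outlined above. The helpers `dereverberate` and `extract_mfcc` and the per-speaker GMMs are hypothetical placeholders, and the combination rule (averaging per-frame log-likelihoods) is an illustrative assumption rather than the paper's exact efficient formulation.

```python
import numpy as np


def identify_speaker(x, param_sets, speaker_gmms, dereverberate, extract_mfcc):
    """Score every speaker model on each dereverberated version of x and
    return the speaker with the highest combined log-likelihood."""
    scores = {spk: [] for spk in speaker_gmms}
    for params in param_sets:
        feats = extract_mfcc(dereverberate(x, params))
        for spk, gmm in speaker_gmms.items():
            scores[spk].append(gmm.score(feats))   # average log-likelihood per frame
    combined = {spk: np.mean(values) for spk, values in scores.items()}
    return max(combined, key=combined.get)
```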
Multimedia Tools and Applications | 2016
Bo Ren; Longbiao Wang; Liang Lu; Yuma Ueda; Atsuhiko Kai
The performance of speech recognition in distant-talking environments is severely degraded by the reverberation that can occur in enclosed spaces (e.g., meeting rooms). To mitigate this degradation, dereverberation techniques such as network structure-based denoising autoencoders and multi-step linear prediction are used to improve the recognition accuracy of reverberant speech. In addition, a novel discriminative bottleneck feature extraction approach has been demonstrated to be effective for speech recognition under a range of conditions, regardless of the reverberant conditions. As bottleneck feature extraction is not primarily designed for dereverberation, we are interested in whether it is complementary to other carefully designed dereverberation approaches. In this paper, we propose three schemes covering both front-end processing (cascaded combination and parallel combination) and back-end processing (system combination). Each of these schemes integrates bottleneck feature extraction with dereverberation. The effectiveness of these schemes is evaluated via a series of experiments using the REVERB challenge dataset.
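A minimal sketch of the two front-end schemes named above: "cascaded" feeds dereverberated features into the bottleneck extractor, while "parallel" concatenates bottleneck features with the dereverberated features. `bottleneck_net` and `dereverberate` are hypothetical stand-ins for the trained bottleneck DNN and the dereverberation front end; the back-end system combination is not shown.

```python
import numpy as np


def cascaded_features(reverberant_feats, dereverberate, bottleneck_net):
    """Dereverberation first, then bottleneck extraction on the cleaned features."""
    return bottleneck_net(dereverberate(reverberant_feats))


def parallel_features(reverberant_feats, dereverberate, bottleneck_net):
    """Dereverberated features and bottleneck features concatenated per frame."""
    derev = dereverberate(reverberant_feats)
    return np.concatenate([derev, bottleneck_net(reverberant_feats)], axis=-1)
```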
EURASIP Journal on Advances in Signal Processing | 2015
Yuma Ueda; Longbiao Wang; Atsuhiko Kai; Bo Ren
In this paper, we propose an environment-dependent denoising autoencoder (DAE) and automatic environment identification based on a deep neural network (DNN) with blind reverberation estimation for robust distant-talking speech recognition. Recently, DAEs have been shown to be effective in many noise reduction and reverberation suppression applications because higher-level representations and increased flexibility of the feature mapping function can be learned. However, a DAE does not perform adequately when the training and test environments are mismatched. In a conventional DAE, parameters are trained using pairs of reverberant speech and clean speech under various acoustic conditions (that is, an environment-independent DAE). To address this problem, we propose two environment-dependent DAEs to reduce the influence of mismatches between training and test environments. In the first approach, we train various DAEs using speech from different acoustic environments, and the DAE for the condition that best matches the test condition is automatically selected (that is, a two-step environment-dependent DAE). To improve environment identification performance, we propose a DNN that uses both reverberant speech and estimated reverberation. In the second approach, we add estimated reverberation features to the input of the DAE (that is, a one-step environment-dependent DAE, or reverberation-aware DAE). The proposed method is evaluated using speech in simulated and real reverberant environments. Experimental results show that the environment-dependent DAE outperforms the environment-independent one in both simulated and real reverberant environments. For the two-step environment-dependent DAE, the performance of environment identification based on the proposed DNN approach is also better than that of the conventional DNN approach, in which only reverberant speech is used and reverberation is not blindly estimated. Furthermore, the one-step environment-dependent DAE significantly outperforms the two-step environment-dependent DAE.
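A minimal sketch of the control flow of the two approaches described above. The environment classifier, the per-environment DAEs, the reverberation-aware DAE, and the blind reverberation estimator are hypothetical placeholders; only the overall structure follows the abstract.

```python
import numpy as np


def two_step_dae(feats, reverb_estimate, env_classifier, env_daes):
    """Step 1: identify the environment from speech plus estimated reverberation.
    Step 2: apply the DAE trained for that environment."""
    env_id = env_classifier(np.concatenate([feats, reverb_estimate], axis=-1))
    return env_daes[env_id](feats)


def one_step_dae(feats, reverb_estimate, reverberation_aware_dae):
    """Feed the estimated reverberation features directly into a single DAE."""
    return reverberation_aware_dae(np.concatenate([feats, reverb_estimate], axis=-1))
```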
asia pacific signal and information processing association annual summit and conference | 2014
Longbiao Wang; Bo Ren; Yuma Ueda; Atsuhiko Kai; Shunta Teraoka; Taku Fukushima
In this paper, we propose a robust distant-talking speech recognition system with asynchronous speech recording. This is implemented by combining denoising autoencoder-based cepstral-domain dereverberation, automatic asynchronous speech (microphone or mobile terminal) selection, and environment adaptation. Although applications using mobile terminals have attracted increasing attention, few studies have focused on distant-talking speech recognition with asynchronous mobile terminals. In the system proposed in this paper, after applying a denoising autoencoder in the cepstral domain of speech to suppress reverberation and performing large vocabulary continuous speech recognition (LVCSR), we adopt automatic asynchronous mobile terminal selection and environment adaptation using speech segments from the optimal mobile terminals. The proposed method was evaluated using a reverberant WSJCAM0 corpus, which was emitted by a loudspeaker and recorded by multiple far-field mobile terminals in a meeting room with multiple speakers. By integrating a cepstral-domain denoising autoencoder and automatic mobile terminal selection with environment adaptation, the average Word Error Rate (WER) was reduced from 51.8% for the baseline system to 28.8%, i.e., a relative error reduction rate of 44.4% when using multi-condition acoustic models.
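A minimal sketch of the terminal-selection step described above: each asynchronously recorded channel is dereverberated and decoded, and the channel with the best decoder score is kept for the final hypothesis and for environment adaptation. The helpers `dereverberate_dae` and `decode`, and the use of a decoder confidence score as the selection criterion, are illustrative assumptions.

```python
def select_terminal(channels, dereverberate_dae, decode):
    """channels: list of per-terminal waveforms for the same speech segment.
    Returns the hypothesis from the terminal with the highest decoder score."""
    best = None
    for wav in channels:
        hyp, score = decode(dereverberate_dae(wav))   # hypothesis plus confidence
        if best is None or score > best[1]:
            best = (hyp, score)
    return best[0]
```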
text speech and dialogue | 2014
Yuta Kawakami; Longbiao Wang; Atsuhiko Kai; Seiichi Nakagawa
Previously, we proposed a speaker recognition system using a combination of an MFCC-based vocal tract feature and phase information, which includes rich vocal source information. In this paper, we investigate the efficiency of combining various vocal tract features (MFCC and LPCC) and vocal source features (phase and LPC residual) for normal-duration and short-duration utterances. The Japanese Newspaper Article Sentence (JNAS) database was used to evaluate the proposed method. The combination of various vocal tract and vocal source features achieved a remarkable improvement over the conventional MFCC-based vocal tract feature for both normal-duration and short-duration utterances.
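A minimal sketch of score-level combination of a vocal tract feature stream and a vocal source feature stream for one speaker model. The weight `w` and the use of separate per-stream GMMs (e.g., scikit-learn GaussianMixture models) are illustrative assumptions, not necessarily the combination rule used in the paper.

```python
def combined_score(tract_feats, source_feats, tract_gmm, source_gmm, w=0.7):
    """Weighted sum of per-stream average log-likelihoods for one speaker model."""
    return w * tract_gmm.score(tract_feats) + (1.0 - w) * source_gmm.score(source_feats)
```

In an identification setting, this score would be computed for every enrolled speaker's pair of stream models and the highest-scoring speaker selected.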
asia-pacific signal and information processing association annual summit and conference | 2013
Longbiao Wang; Kyohei Odani; Atsuhiko Kai; Weifeng Li
In this paper, we propose a method for non-stationary noise reduction and dereverberation. We use a blind dereverberation method based on spectral subtraction with a multi-channel least mean squares algorithm, which was proposed in our previous study. To suppress the non-stationary noise, we use blind source separation based on an efficient fast independent component analysis (FastICA) algorithm. The method is evaluated using a mixture of speech and music, and achieves average relative word error reduction rates of 41.9% and 7.9% compared with a baseline method and state-of-the-art multi-step linear prediction-based dereverberation, respectively, in a real environment.
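A minimal sketch of the blind source separation step mentioned above, using scikit-learn's FastICA on a multi-channel time-domain mixture for illustration. Real room recordings are convolutive mixtures, so this instantaneous-mixing example and the channel count are assumptions; the paper's efficient formulation and the subsequent MCLMS spectral-subtraction dereverberation are not reproduced here.

```python
import numpy as np
from sklearn.decomposition import FastICA


def separate_sources(mixture: np.ndarray) -> np.ndarray:
    """mixture: array of shape (n_samples, n_channels), e.g. speech plus music.
    Returns estimated independent sources of shape (n_samples, n_sources)."""
    ica = FastICA(n_components=mixture.shape[1], max_iter=1000)
    return ica.fit_transform(mixture)
```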