Taichi Asami
Tokyo Institute of Technology
Publications
Featured research published by Taichi Asami.
International Conference on Acoustics, Speech, and Signal Processing | 2006
Taichi Asami; Koji Iwano; Sadaoki Furui
This paper proposes an automatic stream-weight and threshold estimation method for noise-robust speaker verification using multi-stream HMMs that integrate segmental and prosodic information. The proposed method simultaneously optimizes the stream-weights and the decision threshold by combining linear discriminant analysis (LDA) with the AdaBoost technique. Experiments were conducted using Japanese connected digit speech contaminated by white noise at various SNRs. In these experiments, the target ratio of false acceptance rate (FAR) to false rejection rate (FRR) was set to 1:1 so that the two rates approach the equal error rate (EER). Experimental results show that the proposed method effectively estimates stream-weights and thresholds such that FARs and FRRs are adjusted to EERs in most SNR conditions.
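A minimal sketch of the LDA half of this idea (the AdaBoost refinement is omitted): given per-utterance log-likelihood ratios for the segmental and prosodic streams, LDA's projection direction yields relative stream-weights, and its decision boundary yields a threshold on the fused score. All data and names below are illustrative, not from the paper.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data: rows = utterances, columns = [segmental LLR, prosodic LLR];
# labels: 1 = target speaker, 0 = impostor.
rng = np.random.default_rng(0)
targets = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(100, 2))
impostors = rng.normal(loc=[-1.0, -0.5], scale=1.0, size=(100, 2))
X = np.vstack([targets, impostors])
y = np.array([1] * 100 + [0] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)

# LDA's projection direction gives the relative stream importance;
# normalizing it to sum to one yields stream-weight estimates.
w = lda.coef_[0]
weights = w / w.sum()

# LDA accepts when w @ x + b > 0; on the fused-score axis
# (fused = weights @ x) that boundary is a threshold (assumes w.sum() > 0).
threshold = -lda.intercept_[0] / w.sum()
print("stream weights:", weights, "threshold:", threshold)
```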
International Conference on Acoustics, Speech, and Signal Processing | 2011
Takaaki Fukutomi; Satoshi Kobashikawa; Taichi Asami; Tsubasa Shinozaki; Hirokazu Masataki; Satoshi Takahashi
To improve the performance of call-reason analysis at contact centers, we introduce a novel method for extracting call-reason segments from dialogs. It is based on two characteristics of contact center conversations: 1) customers state their requests at the beginning of the call, and 2) agents tend to use typical phrases at the end of the call-reason segment. Our method acquires these typical phrases automatically from stored speech data and extracts the call-reason segment precisely by detecting them. Experiments show that the method significantly improves the performance of call-reason information retrieval, since it allows the search scope to be limited to the call-reason segments of calls.
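The extraction step can be pictured with a short sketch: the call-reason segment runs from the start of the call to the first agent turn containing a typical phrase. The phrase list and dialog below are hypothetical; the paper acquires such phrases automatically from stored speech data.

```python
# Hypothetical typical phrases marking the end of a call-reason segment.
TYPICAL_CLOSING_PHRASES = [
    "let me check that for you",
    "i will look into it",
]

def extract_call_reason(turns):
    """turns: list of (speaker, text) tuples in temporal order."""
    segment = []
    for speaker, text in turns:
        segment.append((speaker, text))
        if speaker == "agent" and any(p in text.lower() for p in TYPICAL_CLOSING_PHRASES):
            break  # a typical phrase closes the call-reason segment
    return segment

dialog = [
    ("agent", "Thank you for calling, how can I help?"),
    ("customer", "My internet connection keeps dropping."),
    ("agent", "I see, let me check that for you."),
    ("customer", "Thanks."),
]
print(extract_call_reason(dialog))  # keeps only the first three turns
```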
IEICE Transactions on Information and Systems | 2008
Taichi Asami; Koji Iwano; Sadaoki Furui
We have previously proposed a noise-robust speaker verification method using fundamental frequency (F0) extracted with the Hough transform. The method also incorporates an automatic stream-weight and decision threshold estimation technique. It has been confirmed to be effective for white noise at various SNRs. This paper evaluates the proposed method under more practical in-car and elevator-hall noise conditions. The paper first describes the noise-robust F0 extraction method and the details of our robust speaker verification method using multi-stream HMMs to integrate the extracted F0 and cepstral features. Details of the automatic stream-weight and threshold estimation method for the multi-stream speaker verification framework are also explained. This method simultaneously optimizes the stream-weights and the decision threshold by combining linear discriminant analysis (LDA) and the AdaBoost technique. Experiments were conducted using Japanese connected digit speech contaminated by white, in-car, or elevator-hall noise at various SNRs. Experimental results show that the F0 features improve verification performance in various noisy environments, and that our stream-weight and threshold optimization method effectively estimates the control parameters so that FARs and FRRs are adjusted to achieve equal error rates (EERs) under various noisy conditions.
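At verification time, the framework reduces to a weighted fusion of the per-stream scores compared against the estimated threshold. A minimal sketch, with illustrative names and values:

```python
# Illustrative multi-stream verification decision: fuse the cepstral and
# F0 stream log-likelihood ratios with the estimated weights, then compare
# against the estimated threshold. All values below are made up.
def verify(cepstral_llr, f0_llr, w_cep, w_f0, threshold):
    fused = w_cep * cepstral_llr + w_f0 * f0_llr
    return fused > threshold  # True = accept the claimed identity

# Under heavy noise the F0 stream, extracted robustly via the Hough
# transform, would typically carry a larger weight.
print(verify(cepstral_llr=0.2, f0_llr=1.5, w_cep=0.4, w_f0=0.6, threshold=0.5))
```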
International Conference on Acoustics, Speech, and Signal Processing | 2017
Taichi Asami; Ryo Masumura; Yoshikazu Yamaguchi; Hirokazu Masataki; Yushi Aono
Constructing deep neural network (DNN) acoustic models from limited training data is an important issue for the development of automatic speech recognition (ASR) applications that will be used in various application-specific acoustic environments. To this end, domain adaptation techniques that train a domain-matched model without overfitting by leveraging pre-constructed source models are widely used. In this paper, we propose a novel domain adaptation method for DNN acoustic models based on the knowledge distillation framework. Knowledge distillation, originally proposed for model compression, transfers the knowledge of a teacher model to a student model and improves the student model's generalizability by controlling the shape of the teacher model's posterior probability distribution. We apply this framework to model adaptation. Our domain adaptation method avoids overfitting of the adapted model trained on limited data by transferring the knowledge of the source model to the adapted model through distillation. Experiments show that the proposed method effectively avoids overfitting of convolutional neural network based acoustic models and yields lower error rates than conventional adaptation methods.
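A minimal sketch of a distillation-based adaptation loss, assuming a trained source (teacher) model and limited target-domain data, written here in PyTorch; the temperature T and interpolation weight alpha are illustrative hyperparameters, not values from the paper.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, hard_labels, T=2.0, alpha=0.5):
    # Soft targets: teacher posteriors smoothed by temperature T; a higher T
    # flattens the distribution, which is the "shape control" in distillation.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        soft_targets,
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients match the hard-label term
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# During adaptation, the student is trained on the limited target data
# against both its hard labels and the source model's soft targets; the
# soft-target term is what keeps the adapted model from overfitting.
```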
Conference of the International Speech Communication Association | 2016
Yusuke Ijima; Taichi Asami; Hideyuki Mizuno
This paper presents a novel objective evaluation technique for statistical parametric speech synthesis. One of its novel features is that it focuses on the association between dimensions within the spectral features. We first use the maximal information coefficient to analyze the relationship between subjective scores and the associations of spectral features obtained from natural speech and various types of synthesized speech. The analysis indicates that the scores improve as the association becomes weaker. We then describe the proposed objective evaluation technique, which uses a voice conversion method to detect the associations within spectral features. We perform subjective and objective experiments to investigate the relationship between subjective and objective scores. The proposed objective scores are compared to mel-cepstral distortion. The results indicate that our objective scores achieve dramatically higher correlation with subjective scores than mel-cepstral distortion.
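A minimal sketch of the association measurement, using the absolute Pearson correlation as a stand-in so the sketch runs with NumPy alone; the paper itself uses the maximal information coefficient (MIC), for which a MIC implementation can be swapped in.

```python
import numpy as np

def mean_pairwise_association(cepstra):
    """cepstra: (num_frames, num_dims) mel-cepstral sequence."""
    corr = np.corrcoef(cepstra.T)                        # (dims, dims)
    off_diag = np.abs(corr[~np.eye(len(corr), dtype=bool)])
    return off_diag.mean()  # weaker association ~ higher subjective score

rng = np.random.default_rng(0)
natural_like = rng.normal(size=(500, 24))                # nearly independent dims
synthetic_like = natural_like + 0.5 * natural_like[:, [0]]  # dims share a component
print(mean_pairwise_association(natural_like))           # low association
print(mean_pairwise_association(synthetic_like))         # noticeably higher
```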
International Conference on Acoustics, Speech, and Signal Processing | 2017
Tsubasa Ochiai; Marc Delcroix; Keisuke Kinoshita; Atsunori Ogawa; Taichi Asami; Shigeru Katagiri; Tomohiro Nakatani
Adapting acoustic models to speakers has been shown to greatly improve performance for many tasks. Among adaptation approaches, exploiting auxiliary features that characterize speakers or environments has received great attention because it allows rapid adaptation, i.e., adaptation with a limited amount of speech data such as a single utterance. However, the auxiliary features are usually computed in batch mode, which causes some inevitable latency. In this paper we explore an extension of auxiliary-feature-based adaptation to online processing. We employ auxiliary features obtained from bottleneck speaker vectors and extend their computation to online processing using a cumulative moving average. We test our proposed approach on deep CNN-based acoustic models, using context adaptive networks to exploit the auxiliary features. Experimental results on the CHiME-3 task demonstrate that the proposed approach can realize online speaker adaptation.
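The online extension can be sketched as a running mean: the batch, utterance-level speaker vector is replaced by a cumulative moving average updated frame by frame, so an auxiliary feature is available before the utterance ends. Names and dimensions below are illustrative.

```python
import numpy as np

class OnlineSpeakerVector:
    """Cumulative moving average (CMA) over frame-level bottleneck features."""

    def __init__(self, dim):
        self.mean = np.zeros(dim)
        self.count = 0

    def update(self, frame_vector):
        """frame_vector: bottleneck speaker feature for the current frame."""
        self.count += 1
        self.mean += (frame_vector - self.mean) / self.count  # CMA update
        return self.mean  # fed to the context-adaptive layers at every frame

osv = OnlineSpeakerVector(dim=64)
for frame in np.random.default_rng(0).normal(size=(300, 64)):
    aux = osv.update(frame)  # auxiliary feature refines as frames arrive
```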
Conference of the International Speech Communication Association | 2016
Ryo Masumura; Taichi Asami; Hirokazu Masataki; Yushi Aono; Sumitaka Sakauchi
This paper aims to enhance spoken language identification methods based on direct discriminative modeling of language labels using deep neural networks (DNNs) and long short-term memory recurrent neural networks (LSTM-RNNs). In conventional methods, frame-by-frame DNNs or LSTM-RNNs are used for utterance-level classification. Although they have strong frame-level classification performance and real-time efficiency, they are not optimized for variable-length utterance-level classification, since classification is conducted by simply averaging frame-level prediction results. In addition, this simple classification methodology cannot fully exploit the combination of DNNs and LSTM-RNNs. To address these issues, our idea is to combine the frame-by-frame DNNs and LSTM-RNNs with a classifier based on a sequential generative model. In the proposed method, we regard the posteriorgram sequences generated by a frame-by-frame classifier as feature sequences, and model them for each language using language modeling technologies. The generative-model-based classifier does not model an identification boundary, so we can flexibly handle variable-length utterances without losing the conventional advantages. Furthermore, the proposed method can support the combination of DNNs and LSTM-RNNs using joint posteriorgram sequences, since generative modeling can capture differences between the two posteriorgram sequences. Experiments conducted using the GlobalPhone database demonstrate the proposed method's effectiveness.
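A minimal sketch of a generative back-end of this kind, under the assumption that posteriorgrams are quantized to their argmax class per frame and modeled with an add-alpha-smoothed bigram model; the paper's actual front-end and language modeling are richer.

```python
import numpy as np
from collections import defaultdict

def quantize(posteriorgram):
    """posteriorgram: (T, C) frame posteriors -> (T,) token sequence."""
    return np.argmax(posteriorgram, axis=-1)

class BigramLM:
    def __init__(self, vocab_size, alpha=1.0):
        # add-alpha smoothing: unseen bigrams keep nonzero probability
        self.counts = defaultdict(lambda: np.full(vocab_size, alpha))

    def train(self, token_seqs):
        for seq in token_seqs:
            for prev, cur in zip(seq[:-1], seq[1:]):
                self.counts[prev][cur] += 1

    def loglik(self, seq):
        return sum(
            np.log(self.counts[prev][cur] / self.counts[prev].sum())
            for prev, cur in zip(seq[:-1], seq[1:])
        )

def identify(posteriorgram, lms):
    """lms: dict mapping language name -> trained BigramLM."""
    tokens = quantize(posteriorgram)
    return max(lms, key=lambda lang: lms[lang].loglik(tokens))
```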
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference | 2016
Atsushi Ando; Taichi Asami; Yoshikazu Yamaguchi; Yushi Aono
This paper presents a novel speaker recognition framework that handles duration mismatch between registered and test utterances. The i-vectors extracted from short utterances exhibit high variance due to phoneme imbalance, which causes performance degradation under the duration mismatch condition. Most conventional methods attempt to decrease the variance by offsetting i-vectors or speaker similarity scores; however, the variances caused by duration differences are usually too complex to offset. Instead of the conventional offsetting approaches, our proposed method, inspired by ensemble learning, attains low-variance results by generating multiple fixed-length short utterances from the registered/test utterances and integrating their speaker similarities. Bootstrapped i-vectors are derived from the generated short utterances, and the average of the PLDA scores over the combinations of registered and test bootstrapped i-vectors is used for the speaker decision. Experiments show that the proposed method improves the equal error rate in trials with 60-second registered utterances and test utterances shorter than 5 seconds, with relative error reductions of 73.9%–90.6%. Moreover, the proposed method exhibits smaller score variance than the baseline.
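A minimal sketch of the ensemble scoring, where extract_ivector and plda_score are hypothetical stand-ins for a real i-vector extractor and PLDA back-end; the chunk length and hop are left as parameters.

```python
from itertools import product

def chunk(features, length, hop):
    """Cut a (num_frames, dim) feature sequence into fixed-length pieces."""
    return [features[s:s + length]
            for s in range(0, len(features) - length + 1, hop)]

def bootstrapped_score(enroll_feats, test_feats, length, hop,
                       extract_ivector, plda_score):
    # One bootstrapped i-vector per fixed-length piece of each utterance.
    enroll_ivs = [extract_ivector(c) for c in chunk(enroll_feats, length, hop)]
    test_ivs = [extract_ivector(c) for c in chunk(test_feats, length, hop)]
    # Averaging PLDA scores over all enroll/test pairs lowers score variance.
    scores = [plda_score(e, t) for e, t in product(enroll_ivs, test_ivs)]
    return sum(scores) / len(scores)
```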
Computer Speech & Language | 2013
Satoshi Kobashikawa; Atsunori Ogawa; Taichi Asami; Yoshikazu Yamaguchi; Hirokazu Masataki; Satoshi Takahashi
This paper proposes a fast unsupervised acoustic model adaptation technique with efficient statistics accumulation for speech recognition. Conventional adaptation techniques accumulate the acoustic statistics based on a forward-backward or Viterbi algorithm. Since both algorithms require a state sequence prior to statistics accumulation, the conventional techniques need time to determine the state sequence by transcribing the target speech in advance. Instead of pre-determining the state sequence, the proposed technique reduces the computation time by accumulating the statistics frame by frame, weighted by the state confidence within each monophone. It also rapidly selects the appropriate gender-dependent acoustic model before adaptation, and further increases accuracy by employing a power term after adaptation. Recognition experiments using spontaneous speech show that the proposed technique reduces computation time by 57.3% while providing the same accuracy as the conventional adaptation technique.
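A minimal sketch of single-pass, confidence-weighted accumulation: each frame contributes to every state in proportion to a frame-level state confidence, so no prior alignment pass is needed. The shapes and toy confidences below are illustrative.

```python
import numpy as np

def accumulate_stats(features, state_posteriors):
    """features: (T, D) acoustic frames; state_posteriors: (T, S) confidences."""
    gamma = state_posteriors.sum(axis=0)       # (S,) soft state occupancies
    first = state_posteriors.T @ features      # (S, D) confidence-weighted sums
    means = first / gamma[:, None]             # per-state mean statistics
    return gamma, means                        # would drive the model update

T, D, S = 200, 39, 40
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(S), size=T)       # toy per-frame state confidences
gamma, means = accumulate_stats(rng.normal(size=(T, D)), post)
```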
International Conference on Acoustics, Speech, and Signal Processing | 2017
Ryo Masumura; Taichi Asami; Hirokazu Masataki; Yushi Aono
This paper proposes parallel phonetically-aware deep neural networks (PPA-DNNs) and long short-term memory recurrent neural networks (PPA-LSTM-RNNs) to enhance frame-by-frame discriminative modeling for spoken language identification. The idea is inspired by traditional systems based on parallel phoneme recognition followed by language modeling (PPRLM). The proposed methods utilize multiple senone bottleneck features individually extracted from language-dependent senone-based DNNs in a frame-by-frame manner. The multiple senone bottleneck features lend phonetic awareness to frame-by-frame DNNs and LSTM-RNNs without losing compatibility with real-time applications. In the experiments, three senone-based DNNs are introduced to extract senone bottleneck features, and both their single and parallel use are examined. Furthermore, we also examine a combination of PPA-DNNs and PPA-LSTM-RNNs. The proposed methods' effectiveness is investigated by comparison with simple speech-aware modeling and traditional systems based on PPRLM.
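A minimal sketch of the PPA front-end, with hypothetical extractor and classifier callables: senone bottleneck features from parallel language-dependent DNNs are concatenated per frame, and utterance-level identification averages the frame posteriors.

```python
import numpy as np

def ppa_frame_features(acoustic_frames, bottleneck_extractors):
    """Concatenate senone bottleneck features from parallel DNNs per frame.

    acoustic_frames: (T, D) input features; each extractor maps them to
    (T, B_i) senone bottleneck features for one language-dependent DNN.
    """
    parts = [extract(acoustic_frames) for extract in bottleneck_extractors]
    return np.concatenate(parts, axis=-1)      # (T, sum of bottleneck dims)

def identify_language(frame_feats, frame_classifier):
    posteriors = frame_classifier(frame_feats)  # (T, num_languages)
    return np.argmax(posteriors.mean(axis=0))   # average frame posteriors
```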