Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Akinobu Lee is active.

Publication


Featured research published by Akinobu Lee.


IEEE Transactions on Audio, Speech, and Language Processing | 2006

Blind source separation based on a fast-convergence algorithm combining ICA and beamforming

Hiroshi Saruwatari; Toshiya Kawamura; Tsuyoki Nishikawa; Akinobu Lee; Kiyohiro Shikano

We propose a new algorithm for blind source separation (BSS) in which independent component analysis (ICA) and beamforming are combined to resolve the slow convergence of ICA-based optimization. The proposed method consists of three parts: (a) frequency-domain ICA with direction-of-arrival (DOA) estimation, (b) null beamforming based on the estimated DOAs, and (c) integration of (a) and (b) based on algorithm diversity in both the iteration and the frequency domain. The unmixing matrix obtained by ICA is temporarily substituted with the matrix given by null beamforming during iterative optimization, and this alternation between ICA and beamforming realizes fast, high-accuracy convergence. Signal separation experiments reveal that the separation performance of the proposed algorithm is superior to that of the conventional ICA-based BSS method, even under reverberant conditions.
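
To make the alternation concrete, here is a minimal Python sketch, assuming a 2-microphone, 2-source setup and per-frequency-bin STFT processing. The function names and the natural-gradient update are illustrative choices, the DOA estimation of part (a) is omitted (estimated DOAs are passed in), and this is a sketch of the technique rather than the authors' implementation.

```python
import numpy as np

def null_beamformer(doas, freq, mic_positions, c=343.0):
    """Build a 2x2 unmixing matrix from estimated DOAs: inverting the
    plane-wave mixing model places a null toward the other source."""
    A = np.exp(1j * 2 * np.pi * freq / c *
               np.outer(mic_positions, np.sin(doas)))  # (mics, sources)
    return np.linalg.pinv(A)

def ica_step(W, X, mu=0.1):
    """One natural-gradient ICA update for complex-valued signals."""
    Y = W @ X
    phi = np.tanh(np.abs(Y)) * np.exp(1j * np.angle(Y))  # nonlinearity
    dW = (np.eye(W.shape[0]) - (phi @ Y.conj().T) / X.shape[1]) @ W
    return W + mu * dW

def separate_bin(X, freq, mic_positions, doas, n_iter=50, substitute_every=10):
    """Separate one frequency bin X (2 x frames) by alternating ICA
    updates with DOA-based null-beamforming substitution."""
    W = np.eye(2, dtype=complex)
    for it in range(n_iter):
        W = ica_step(W, X)
        # Temporal substitution: periodically replace the ICA estimate
        # with the beamforming matrix to escape slow convergence.
        if (it + 1) % substitute_every == 0:
            W = null_beamformer(doas, freq, np.asarray(mic_positions))
    return W @ X
```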


international conference on acoustics, speech, and signal processing | 2000

A new phonetic tied-mixture model for efficient decoding

Akinobu Lee; Tatsuya Kawahara; Kazuya Takeda; Kiyohiro Shikano

A phonetic tied-mixture (PTM) model for efficient large-vocabulary continuous speech recognition is presented. It is synthesized from context-independent phone models with 64 mixture components per state by assigning different mixture weights according to the shared states of triphones; the mixtures are then re-estimated for optimization. The model achieves a word error rate of 7.0% on a 20,000-word newspaper dictation task, comparable to the best figure obtained with triphone models of much higher resolution. Compared with conventional PTMs in which Gaussians are shared by all states, the proposed model is easily trained and reliably estimated. Furthermore, the model enables the decoder to perform efficient Gaussian pruning: computing only two of the 64 components causes no loss of accuracy. Several pruning methods are proposed and compared, and the best reduces the computation to about 20%.
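
The pruning idea in the last two sentences can be sketched as follows; this is an illustrative Python fragment assuming diagonal covariances and hypothetical parameter names (`log_norms` holds each codebook Gaussian's precomputed log normalization constant), not the actual decoder code.

```python
import numpy as np

def log_gauss(x, means, inv_vars, log_norms):
    # Diagonal-covariance log N(x; mu, Sigma) for all 64 codebook Gaussians.
    d = x[None, :] - means                        # (64, dim)
    return log_norms - 0.5 * np.sum(d * d * inv_vars, axis=1)

def ptm_state_loglik(x, means, inv_vars, log_norms, log_weights, k=2):
    # Because triphone states differ only in mixture weights, the shared
    # Gaussian pool is scored once per frame and reused by every state.
    g = log_gauss(x, means, inv_vars, log_norms)
    top = np.argsort(g)[-k:]                      # Gaussian pruning: keep top-k
    # Mixture log-likelihood over the surviving components only; the paper
    # reports that k=2 of 64 causes no loss of accuracy.
    return np.logaddexp.reduce(log_weights[top] + g[top])
```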


international conference on acoustics, speech, and signal processing | 2001

Gaussian mixture selection using context-independent HMM

Akinobu Lee; Tatsuya Kawahara; Kiyohiro Shikano

We address a method to efficiently select Gaussian mixtures for fast acoustic likelihood computation. It uses context-independent models for the selection and back-off of the corresponding triphone models: for the k-best phone models in a preliminary evaluation, triphone models of higher resolution are applied, while the remaining states are assigned likelihoods from the monophone models. This selection scheme assigns more reliable back-off likelihoods to the unselected states than conventional Gaussian selection based on a VQ codebook. It can also incorporate efficient Gaussian pruning in the preliminary evaluation, which offsets the increased size of the pre-selection model. Experimental results show that the proposed method achieves performance comparable to standard Gaussian selection and performs much better under aggressive pruning conditions. Together with phonetic tied-mixture modeling, the acoustic matching cost is reduced to about 14% with little loss of accuracy.
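
A minimal sketch of the selection-and-back-off scheme, with hypothetical callables `mono_scores_fn` and `tri_score_fn` standing in for the real likelihood computations and a made-up triphone naming convention (`left-base+right`):

```python
def select_and_score(x, mono_scores_fn, tri_score_fn, triphones, k=8):
    """For frame x, score monophones first; triphones whose base phone
    ranks in the k best get the detailed likelihood, the rest back off
    to the monophone score."""
    mono = mono_scores_fn(x)                    # {phone: log-likelihood}
    best = set(sorted(mono, key=mono.get, reverse=True)[:k])
    scores = {}
    for tri in triphones:                       # e.g. "a-k+i" has base "k"
        base = tri.split("-")[1].split("+")[0]
        if base in best:
            scores[tri] = tri_score_fn(x, tri)  # high-resolution model
        else:
            scores[tri] = mono[base]            # reliable back-off likelihood
    return scores
```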


international conference on acoustics, speech, and signal processing | 2004

Real-time word confidence scoring using local posterior probabilities on tree trellis search

Akinobu Lee; Kiyohiro Shikano; Tatsuya Kawahara

Confidence scoring based on word posterior probability is usually performed as a post-process of speech recognition decoding and requires a large number of word hypotheses to achieve sufficient confidence quality. We propose a simple way of computing word confidence from estimated posterior probabilities while decoding. At each word expansion of the stack decoding search, the local sentence likelihoods, which contain heuristic scores for the unreached segment, are used directly to compute the posterior probabilities. Experimental results showed that, although these likelihoods are not optimal, they provide slightly better confidence measures than N-best lists, while the computation is faster than the 100-best method because no N-best decoding is required.
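
The core computation can be sketched as follows; the interface and the likelihood `scale` value (a typical acoustic down-scaling) are assumptions, not the paper's exact formulation.

```python
import numpy as np

def local_word_confidence(log_scores, scale=1.0 / 15.0):
    """log_scores: partial-sentence log scores of the word hypotheses
    competing at one expansion point (each already includes a heuristic
    score for the unreached segment). Returns their local posteriors;
    the confidence of an expanded word is its entry."""
    s = scale * np.asarray(log_scores)   # flatten the acoustic dynamic range
    s -= np.max(s)                       # numerical stability
    p = np.exp(s)
    return p / p.sum()                   # normalize into posteriors
```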


Life-like characters | 2004

Galatea: Open-Source Software for Developing Anthropomorphic Spoken Dialog Agents

Shinichi Kawamoto; Hiroshi Shimodaira; Tsuneo Nitta; Takuya Nishimoto; Satoshi Nakamura; Katsunobu Itou; Shigeo Morishima; Tatsuo Yotsukura; Atsuhiko Kai; Akinobu Lee; Yoichi Yamashita; Takao Kobayashi; Keiichi Tokuda; Keikichi Hirose; Nobuaki Minematsu; Atsushi Yamada; Yasuharu Den; Takehito Utsuro; Shigeki Sagayama

Galatea is a software toolkit for developing human-like spoken dialog agents. To make it easy to integrate modules of differing characteristics, including a speech recognizer, speech synthesizer, facial animation synthesizer, and dialog controller, each module is modeled as a virtual machine with a simple common interface and connected to the others through a broker (communication manager). Galatea employs model-based speech and facial animation synthesizers whose model parameters can easily be adapted to those of an existing person if his or her training data is given. The software toolkit, which runs on both UNIX/Linux and Windows operating systems, will be publicly available in the middle of 2003 [7, 6].
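
As an illustration of the virtual-machine-plus-broker design (not the actual Galatea API), here is a toy Python broker in which modules communicate only through a common `receive(command, payload)` interface:

```python
class Broker:
    """Communication manager: modules never call each other directly."""
    def __init__(self):
        self.modules = {}

    def register(self, name, module):
        self.modules[name] = module
        module.broker = self

    def send(self, to, command, payload):
        self.modules[to].receive(command, payload)

class Module:
    """Common interface shared by every virtual-machine module."""
    broker = None
    def receive(self, command, payload):
        raise NotImplementedError

class SpeechRecognizer(Module):
    def receive(self, command, payload):
        if command == "RECOGNIZE":
            # A real module would decode audio; here we forward text.
            self.broker.send("dialog", "USER_SAID", payload)

class DialogController(Module):
    def receive(self, command, payload):
        if command == "USER_SAID":
            self.broker.send("synthesizer", "SPEAK", f"You said: {payload}")

class SpeechSynthesizer(Module):
    def receive(self, command, payload):
        if command == "SPEAK":
            print("TTS:", payload)

broker = Broker()
broker.register("asr", SpeechRecognizer())
broker.register("dialog", DialogController())
broker.register("synthesizer", SpeechSynthesizer())
broker.send("asr", "RECOGNIZE", "hello")   # -> TTS: You said: hello
```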


ieee automatic speech recognition and understanding workshop | 2003

Accurate hidden Markov models for non-audible murmur (NAM) recognition based on iterative supervised adaptation

Panikos Heracleous; Yoshitaka Nakajima; Akinobu Lee; Hiroshi Saruwatari; Kiyohiro Shikano

In previous work, we introduced a special device (the Non-Audible Murmur (NAM) microphone) able to detect very quietly uttered speech (murmur) that cannot be heard by listeners near the talker. Experimental results showed the efficiency of the device in NAM recognition: using normal-speech monophone hidden Markov models (HMMs) retrained with NAM data from a specific speaker, we could recognize NAM with high accuracy. Although the results were very promising, a serious problem is the HMM retraining, which requires a large amount of training data. In this paper, we introduce a new method for NAM recognition that requires only a small amount of NAM data for training. The proposed method is based on supervised adaptation; the main difference from other adaptation approaches is that instead of single-iteration adaptation, we use iterative adaptation (iterative supervised MLLR). Experiments prove the efficiency of the proposed method: using clean normal-speech initial models and only 350 NAM adaptation utterances, we achieved a recognition accuracy of 88.62%, a very promising result. Thus, with a small amount of adaptation data, we were able to create accurate individual HMMs. We also present experimental results showing the effects of the number of iterations, the amount of adaptation data, and the regression-tree classes.
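
To illustrate the iterative idea, here is a simplified Python sketch. It substitutes a least-squares affine fit with a single regression class for the full maximum-likelihood MLLR estimation, and `align_fn` is a hypothetical stand-in for supervised forced alignment; it is a sketch of the loop structure, not the paper's method.

```python
import numpy as np

def estimate_mean_transform(means, frames, assignments):
    """Least-squares affine transform (A, b) mapping the current model
    means toward the adaptation frames assigned to them: a simplified
    stand-in for one MLLR mean-transform estimation."""
    X = np.array([means[g] for g in assignments])   # predicted means
    Y = np.asarray(frames)                          # observed frames
    Xa = np.hstack([X, np.ones((len(X), 1))])       # append bias column
    Wt, *_ = np.linalg.lstsq(Xa, Y, rcond=None)     # solve Xa @ Wt ~= Y
    return Wt[:-1].T, Wt[-1]                        # A, b

def iterative_adapt(means, frames, align_fn, n_iter=5):
    """Iterative supervised adaptation: re-align, re-estimate, repeat,
    so each pass starts from the models adapted in the previous pass."""
    for _ in range(n_iter):
        assignments = align_fn(means, frames)       # supervised alignment
        A, b = estimate_mean_transform(means, frames, assignments)
        means = {g: A @ m + b for g, m in means.items()}
    return means
```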


international conference on acoustics, speech, and signal processing | 2013

MMDAgent: A fully open-source toolkit for voice interaction systems

Akinobu Lee; Keiichiro Oura; Keiichi Tokuda

This paper describes the development of an open-source toolkit that makes it possible to explore a wide variety of aspects of speech interaction in spoken dialog systems and speech interfaces. The toolkit tightly integrates recent speech recognition and synthesis technologies with a 3-D CG rendering module that can manipulate expressive embodied agent characters. The software design and its interfaces were carefully crafted to make it a fully open toolkit. Ongoing public demonstration experiments indicate that it is promoting related research on and development of voice interaction systems in various settings.


intelligent robots and systems | 2002

ASKA: receptionist robot with speech dialogue system

Ryuichi Nisimura; Takashi Uchida; Akinobu Lee; Hiroshi Saruwatari; Kiyohiro Shikano; Yoshio Matsumoto

We implemented a humanoid robot, ASKA, at our university reception desk to provide computerized university guidance. ASKA can recognize a user's spoken question and answer it with text-to-speech output, hand gestures, and head movements. This paper describes the speech-related parts of ASKA. ASKA handles a wide task domain with a 20k-word vocabulary using a word trigram model and an elaborate speaker-independent acoustic model, and it generates responses by keyword and key-phrase detection over the N-best speech recognition results. The word recognition rate is 90.9% for the reception task and 78.9% for the out-of-domain task, and the correct response rate for the reception task is 61.7%. Users can enjoy question-answering with ASKA.
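
The keyword/key-phrase response mechanism can be sketched like this; the rules and phrasing are invented examples, not ASKA's actual reception grammar:

```python
# Response selection by keyword spotting over N-best recognition results:
# the first hypothesis (in rank order) matching a rule picks the answer.
RULES = [
    (("library", "where"), "The library is on the second floor."),
    (("open", "hours"),    "The reception desk is open 9:00 to 17:00."),
]

def respond(nbest):
    for hyp in nbest:                       # scan hypotheses in rank order
        words = set(hyp.lower().split())
        for keywords, answer in RULES:
            if all(k in words for k in keywords):
                return answer
    return "Sorry, could you rephrase your question?"

print(respond(["where is the library", "wear is the library"]))
```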


international conference on acoustics, speech, and signal processing | 2009

Voice conversion based on simultaneous modelling of spectrum and F0

Kaori Yutani; Yosuke Uto; Yoshihiko Nankaku; Akinobu Lee; Keiichi Tokuda

This paper proposes simultaneous modeling of spectrum and F0 for voice conversion based on multi-space probability distribution (MSD) models. As a conventional technique, spectral conversion based on a Gaussian mixture model (GMM) has been proposed; although it converts spectral feature sequences nonlinearly, F0 sequences are usually converted by a simple linear function, because F0 is undefined in unvoiced segments. To overcome this problem, we apply MSD models: the MSD-GMM models continuous F0 values in voiced frames and a discrete symbol representing unvoiced frames within a unified framework. Furthermore, the MSD-HMM is adopted to model long-term correlations in F0 sequences.
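
The key point, that voiced (continuous) and unvoiced (discrete) observations live in one model, can be sketched as follows; the parameterization (per-component `voiced_probs` splitting each mixture weight between the two spaces) is an illustrative simplification:

```python
import numpy as np

def msd_loglik(f0, weights, voiced_probs, means, variances):
    """MSD-GMM-style log-likelihood of one frame.
    f0: observed F0 value, or None for an unvoiced frame."""
    if f0 is None:
        # Discrete subspace: total probability of the unvoiced symbol.
        return np.log(np.sum(weights * (1.0 - voiced_probs)))
    # Continuous subspace: voiced prior times Gaussian density of log-F0.
    g = np.exp(-0.5 * (np.log(f0) - means) ** 2 / variances) \
        / np.sqrt(2 * np.pi * variances)
    return np.log(np.sum(weights * voiced_probs * g))
```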


IEICE Transactions on Information and Systems | 2008

A Fully Consistent Hidden Semi-Markov Model-Based Speech Recognition System

Keiichiro Oura; Heiga Zen; Yoshihiko Nankaku; Akinobu Lee; Keiichi Tokuda

In a hidden Markov model (HMM), state duration probabilities decrease exponentially with time, which fails to adequately represent the temporal structure of speech. One solution to this problem is to integrate state duration probability distributions explicitly into the HMM; this form is known as a hidden semi-Markov model (HSMM). However, although a number of attempts to use HSMMs in speech recognition systems have been made, they are not consistent, because various approximations were used in both training and decoding. By avoiding these approximations through a generalized forward-backward algorithm, a context-dependent duration modeling technique, and weighted finite-state transducers (WFSTs), we construct a fully consistent HSMM-based speech recognition system. In a speaker-dependent continuous speech recognition experiment, our system achieved about 9.1% relative error reduction over the corresponding HMM-based system.
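
The duration problem in the first sentence can be made concrete with a small comparison; the Gaussian duration distribution and its parameters below are illustrative choices (the paper's context-dependent duration modeling is richer):

```python
import numpy as np

def hmm_duration_prob(d, self_loop=0.9):
    # Implicit HMM duration: staying d frames via a self-loop yields
    # a**(d-1) * (1-a), i.e. a geometric distribution that always decays.
    return self_loop ** (d - 1) * (1.0 - self_loop)

def hsmm_duration_prob(d, mean=8.0, var=4.0):
    # Explicit HSMM duration: an arbitrary per-state distribution;
    # here a Gaussian that can peak at a typical duration.
    return np.exp(-0.5 * (d - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

for d in (2, 8, 20):
    print(d, hmm_duration_prob(d), hsmm_duration_prob(d))
```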

Collaboration


Top co-authors of Akinobu Lee:

Kiyohiro Shikano (Nara Institute of Science and Technology)

Keiichi Tokuda (Nagoya Institute of Technology)

Yoshihiko Nankaku (Nagoya Institute of Technology)

Akira Baba (Nara Institute of Science and Technology)

Heiga Zen (Nagoya Institute of Technology)