Publication


Featured research published by Koji Iwano.


Speech Communication | 2006

New approach to the polyglot speech generation by means of an HMM-based speaker adaptable synthesizer

Javier Latorre; Koji Iwano; Sadaoki Furui

In this paper we present a new method for synthesizing multiple languages with the same voice, using HMM-based speech synthesis. Our approach, which we call HMM-based polyglot synthesis, consists of mixing speech data from several speakers in different languages to create a speaker- and language-independent (SI) acoustic model. We then adapt the resulting SI model to a specific speaker in order to create a speaker-dependent (SD) acoustic model. Using the SD model it is possible to synthesize any of the languages used to train the SI model with the voice of the speaker, regardless of the speaker's language. We show that the performance obtained with our method is better than that of methods based on phone mapping, for both adaptation and synthesis. Furthermore, for languages not included during training, the performance of our approach also equals or surpasses that of any monolingual synthesizer based on the languages used to train the multilingual one. This means that our method can be used to create synthesizers for languages where no speech resources are available.


International Conference on Acoustics, Speech, and Signal Processing | 2005

A stream-weight optimization method for multi-stream HMMs based on likelihood value normalization

Satoshi Tamura; Koji Iwano; Sadaoki Furui

In the field of audio-visual speech recognition, multi-stream HMMs are widely used, so how to automatically and properly determine the stream-weight factors from a small data set has become an important research issue. This paper proposes a new stream-weight optimization method based on an output likelihood normalization criterion, in which the stream weights are adjusted to equalize the mean log-likelihood values of all HMMs. This contrasts with our previous method based on likelihood-ratio maximization, which achieved significant improvement only when a large optimization data set was used. The new method is evaluated using Japanese connected digit speech recorded in real-world environments. Using 10 seconds of speech data for stream-weight optimization, a 10% absolute accuracy improvement is achieved compared to the result before optimization. By additionally applying MLLR (maximum likelihood linear regression) adaptation, a 23% improvement is obtained over the audio-only scheme.
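
A minimal sketch of the likelihood-value normalization idea (the two-stream setup, function names, and the inverse-mean weighting rule are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def normalized_stream_weights(audio_loglik, visual_loglik):
    """Illustrative stream-weight estimation by likelihood-value normalization.

    audio_loglik, visual_loglik: per-frame log-likelihoods (1-D arrays) produced
    by the audio and visual HMM streams on a small optimization data set.
    Weights are chosen inversely proportional to each stream's mean
    log-likelihood magnitude, so the two streams contribute comparable values,
    and are normalized to sum to one.
    """
    mean_a = abs(float(np.mean(audio_loglik)))
    mean_v = abs(float(np.mean(visual_loglik)))
    w_audio = mean_v / (mean_a + mean_v)   # larger-magnitude stream gets the smaller weight
    w_visual = mean_a / (mean_a + mean_v)
    return w_audio, w_visual

def multistream_score(audio_loglik, visual_loglik, w_audio, w_visual):
    """Frame-level multi-stream score: weighted sum of the stream log-likelihoods."""
    return w_audio * np.asarray(audio_loglik) + w_visual * np.asarray(visual_loglik)
```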


Signal Processing Systems | 2004

Multi-Modal Speech Recognition Using Optical-Flow Analysis for Lip Images

Satoshi Tamura; Koji Iwano; Sadaoki Furui

This paper proposes a multi-modal speech recognition method using optical-flow analysis for lip images. Optical flow is defined as the distribution of apparent velocities in the movement of brightness patterns in an image. Since the optical flow is computed without extracting the speaker's lip contours and location, robust visual features can be obtained for lip movements. Our method calculates two kinds of visual feature sets in each frame. The first feature set consists of the variances of the vertical and horizontal components of the optical-flow vectors. These are useful for estimating silence/pause periods in noisy conditions since they represent the movement of the speaker's mouth. The second feature set consists of the maximum and minimum values of the integral of the optical flow. These are expected to be more effective than the first set since they capture not only silence/pause information but also the open/closed status of the speaker's mouth. Each of the feature sets is combined with an acoustic feature set in the framework of HMM-based recognition. Triphone HMMs are trained using the combined parameter sets extracted from clean speech data. Noise-corrupted speech recognition experiments have been carried out using audio-visual data from 11 male speakers uttering connected digits. The following improvements in digit accuracy over the audio-only recognition scheme have been achieved when the visual information was used only for the silence HMM: 4% at SNR = 5 dB and 13% at SNR = 10 dB, using the integral information of the optical flow as the visual feature set.
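
For illustration only, a dense optical-flow feature extraction along the lines described above could look like this; OpenCV's Farnebäck flow is used as a stand-in, since the paper does not specify an implementation:

```python
import cv2
import numpy as np

def lip_flow_features(prev_gray, curr_gray):
    """Visual features from two consecutive grayscale lip-region frames.

    Returns the first feature set (variances of the horizontal and vertical
    flow components) and this frame's contribution to the integral of the
    optical flow, whose running maximum/minimum over time would form the
    second feature set.
    """
    # Dense optical flow (args: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    u, v = flow[..., 0], flow[..., 1]      # horizontal / vertical velocity fields
    feature_set_1 = (float(np.var(u)), float(np.var(v)))  # mouth-movement strength
    frame_integral = (float(u.sum()), float(v.sum()))     # accumulate across frames
    return feature_set_1, frame_integral
```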


International Conference on Acoustics, Speech, and Signal Processing | 2005

Sentence extraction-based presentation summarization techniques and evaluation metrics

Makoto Hirohata; Yousuke Shinnaka; Koji Iwano; Sadaoki Furui

This paper presents automatic speech summarization techniques and their evaluation metrics, focusing on sentence extraction-based summarization methods for making abstracts from spontaneous presentations. Since humans tend to summarize presentations by extracting important sentences from the introduction and conclusion parts, this paper proposes a method using sentence location. Experimental results show that the proposed method significantly improves automatic speech summarization performance at a 10% summarization ratio. Results of a correlation analysis between subjective and objective evaluation scores confirm that objective evaluation metrics, including summarization accuracy, sentence F-measure, and ROUGE-N, are effective for evaluating summarization techniques.
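
As a reference point, the ROUGE-N metric mentioned above can be sketched as a simple n-gram recall; this generic form is an assumption about the exact variant used:

```python
from collections import Counter

def rouge_n(candidate_tokens, reference_tokens, n=2):
    """Generic ROUGE-N: n-gram recall of a candidate summary against one reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate_tokens), ngrams(reference_tokens)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())
```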


EURASIP Journal on Audio, Speech, and Music Processing | 2007

Audio-visual speech recognition using lip information extracted from side-face images

Koji Iwano; Satoshi Tamura; Sadaoki Furui

This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images, as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multi-stream HMM technique. Experiments conducted using Japanese connected digit speech contaminated with white noise at various SNR conditions show the effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.


Journal of the Acoustical Society of America | 2004

Noise-robust speech recognition using multi-band spectral features

Yoshitaka Nishimura; Takahiro Shinozaki; Koji Iwano; Sadaoki Furui

In most state-of-the-art automatic speech recognition (ASR) systems, speech is converted into a time sequence of MFCC (Mel Frequency Cepstrum Coefficient) vectors. However, the problem with using MFCCs is that noise effects spread over all the coefficients, even when the noise is limited to a narrow frequency band. If a spectral feature is used directly, this problem can be avoided and robustness against noise can therefore be expected to increase. Although various studies on spectral-domain features have been conducted, improvements in recognition performance have been reported only under limited noise conditions. This paper proposes a novel multi-band ASR method using a new log-spectral domain feature. In order to increase robustness, the log-spectrum features are normalized by applying three processes: subtracting the mean log-energy for each frame, emphasizing spectral peaks, and subtracting the log-spectral mean averaged over an utterance. Spectral component likelihood values ...
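
The three normalization steps listed above can be sketched with numpy as follows; the peak-emphasis step in particular is a simplified stand-in, since the paper's exact processing is not given here:

```python
import numpy as np

def normalize_log_spectrum(log_spec, peak_gain=1.0):
    """Normalize a log-spectrogram (frames x frequency bins) in three steps:
    1. subtract the mean log-energy of each frame,
    2. emphasize spectral peaks (here: scale up components above the frame mean),
    3. subtract the log-spectral mean averaged over the utterance.
    """
    x = log_spec - log_spec.mean(axis=1, keepdims=True)    # step 1: per-frame mean
    x = np.where(x > 0, x * (1.0 + peak_gain), x)          # step 2: crude peak emphasis
    return x - x.mean(axis=0, keepdims=True)               # step 3: utterance-level mean
```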


International Conference on Acoustics, Speech, and Signal Processing | 2005

Polyglot synthesis using a mixture of monolingual corpora

Javier Latorre; Koji Iwano; Sadaoki Furui

The paper proposes a new approach to multilingual synthesis based on an HMM synthesis technique. The idea consists of combining data from different monolingual speakers in different languages to create a single polyglot average voice. This average voice is then transformed into the voice of any real speaker of one of these languages. The speech synthesized in this way has the same intelligibility and retains the same individuality for all the languages mixed to create the average voice, regardless of the target speaker's own language.


Speech Communication | 2005

Analysis and recognition of spontaneous speech using Corpus of Spontaneous Japanese

Sadaoki Furui; Masanobu Nakamura; Tomohisa Ichiba; Koji Iwano

Although speech is spontaneous in almost any situation, recognition of spontaneous speech is an area that has only recently emerged in the field of automatic speech recognition. Broadening the application of speech recognition depends crucially on raising recognition performance for spontaneous speech. For this purpose, it is necessary to analyze and model spontaneous speech using spontaneous speech databases, since spontaneous speech and read speech are significantly different. This paper reports analysis and recognition of spontaneous speech using a large-scale spontaneous speech database, the “Corpus of Spontaneous Japanese (CSJ)”. Recognition results in this experiment show that recognition accuracy significantly increases as a function of the size of the acoustic as well as language model training data, and the improvement levels off at approximately 7M words of training data. This means that the acoustic and linguistic variation of spontaneous speech is so large that a very large corpus is needed in order to encompass the variations. Spectral analysis using various styles of utterances in the CSJ shows that the spectral distribution/difference of phonemes is significantly reduced in spontaneous speech compared to read speech. It has also been observed that the speaking rates of both vowels and consonants in spontaneous speech are significantly faster than those in read speech.


Text, Speech and Dialogue | 2005

Why is the recognition of spontaneous speech so hard?

Sadaoki Furui; Masanobu Nakamura; Tomohisa Ichiba; Koji Iwano

Although speech read from texts, and similar styles of speech such as reading newspapers or broadcasting news, can be recognized with high accuracy, recognition accuracy drastically decreases for spontaneous speech. This is due to the fact that spontaneous speech and read speech are significantly different both acoustically and linguistically. This paper reports analysis and recognition of spontaneous speech using a large-scale spontaneous speech database, the “Corpus of Spontaneous Japanese (CSJ)”. Recognition results in this experiment show that recognition accuracy significantly increases as a function of the size of the acoustic as well as language model training data, and the improvement levels off at approximately 7M words of training data. This means that the acoustic and linguistic variation of spontaneous speech is so large that a very large corpus is needed in order to encompass the variations. Spectral analysis using various styles of utterances in the CSJ shows that the spectral distribution/difference of phonemes is significantly reduced in spontaneous speech compared to read speech. Experimental results also show that there is a strong correlation between the mean spectral distance between phonemes and phoneme recognition accuracy. This indicates that spectral reduction is one major reason for the decreased recognition accuracy of spontaneous speech.
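
One way to make the reported correlation concrete, purely as an illustration (the paper's actual distance measure is not specified here), is to compare the average pairwise distance between per-phoneme mean spectral vectors with phoneme accuracy across speaking styles:

```python
import numpy as np
from itertools import combinations

def mean_spectral_distance(phoneme_means):
    """Average pairwise Euclidean distance between per-phoneme mean spectral
    (e.g. cepstral) vectors; phoneme_means maps phoneme label -> mean vector."""
    pairs = combinations(phoneme_means.values(), 2)
    dists = [np.linalg.norm(np.asarray(a) - np.asarray(b)) for a, b in pairs]
    return float(np.mean(dists)) if dists else 0.0

def distance_accuracy_correlation(distances, accuracies):
    """Pearson correlation between per-style spectral distances and phoneme accuracies."""
    return float(np.corrcoef(distances, accuracies)[0, 1])
```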


International Conference on Acoustics, Speech, and Signal Processing | 2004

A stream-weight optimization method for audio-visual speech recognition using multi-stream HMMs

Satoshi Tamura; Koji Iwano; Sadaoki Furui

For multi-stream HMMs, which are widely used in audio-visual speech recognition, it is important to adjust the stream weights automatically and properly. This paper proposes a stream-weight optimization technique based on a likelihood-ratio maximization criterion. In our audio-visual speech recognition system, video signals are captured and converted into visual features using HMM-based techniques. Extracted acoustic and visual features are concatenated into an audio-visual vector, and a multi-stream HMM is obtained from the audio and visual HMMs. Experiments are conducted using Japanese connected digit speech recorded in real-world environments. Applying MLLR (maximum likelihood linear regression) adaptation together with our optimization method, we achieve a 29% absolute accuracy improvement and a 76% relative error rate reduction compared with the audio-only scheme.
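
A minimal sketch of a likelihood-ratio maximization criterion of this kind (the grid search, array layout, and two-stream setup are illustrative assumptions, not the paper's implementation): the audio-stream weight is chosen to maximize the average margin between the correct hypothesis and its strongest competitor on a small optimization set.

```python
import numpy as np

def optimize_stream_weight(correct_ll, competitor_ll, grid=np.linspace(0.0, 1.0, 101)):
    """Pick the audio-stream weight w (visual weight 1 - w) that maximizes the mean
    log-likelihood ratio between correct and competing hypotheses.

    correct_ll, competitor_ll: arrays of shape (num_utterances, 2) holding the
    per-stream (audio, visual) log-likelihoods of the correct hypothesis and of
    the strongest competing hypothesis for each utterance.
    """
    correct_ll, competitor_ll = np.asarray(correct_ll), np.asarray(competitor_ll)
    best_w, best_ratio = 0.5, -np.inf
    for w in grid:
        weights = np.array([w, 1.0 - w])
        ratio = float(np.mean(correct_ll @ weights - competitor_ll @ weights))
        if ratio > best_ratio:
            best_w, best_ratio = w, ratio
    return best_w
```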

Collaboration


Dive into Koji Iwano's collaborations.

Top Co-Authors

Sadaoki Furui
Tokyo Institute of Technology

Koichi Shinoda
Tokyo Institute of Technology

Ryoichi Kurose
Japan Agency for Marine-Earth Science and Technology